The ARIMA model for time series analysis and forecasting can be tricky to configure.

There are 3 parameters that require estimation by iterative trial and error from reviewing diagnostic plots and using 40-year-old heuristic rules.

We can automate the process of evaluating a large number of hyperparameters for the ARIMA model by using a grid search procedure.

In this tutorial, you will discover how to tune the ARIMA model using a grid search of hyperparameters in Python.

After completing this tutorial, you will know:

- A general procedure that you can use to tune the ARIMA hyperparameters for a rolling one-step forecast.
- How to apply ARIMA hyperparameter optimization on a standard univariate time series dataset.
- Ideas for extending the procedure for more elaborate and robust models.

Let’s get started.

## Grid Searching Method

Diagnostic plots of the time series can be used along with heuristic rules to determine the hyperparameters of the ARIMA model.

These are good in most, but perhaps not all, situations.

We can automate the process of training and evaluating ARIMA models on different combinations of model hyperparameters. In machine learning this is called a grid search or model tuning.

In this tutorial, we will develop a method to grid search ARIMA hyperparameters for a one-step rolling forecast.

The approach is broken down into two parts:

- Evaluate an ARIMA model.
- Evaluate sets of ARIMA parameters.

The code in this tutorial makes use of the scikit-learn, Pandas, and the statsmodels Python libraries.

### Stop learning Time Series Forecasting the *slow way*!

Take my free 7-day email course and discover data prep, modeling and more (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

### 1. Evaluate ARIMA Model

We can evaluate an ARIMA model by preparing it on a training dataset and evaluating predictions on a test dataset.

This approach involves the following steps:

- Split the dataset into training and test sets.
- Walk the time steps in the test dataset.
- Train an ARIMA model.
- Make a one-step prediction.
- Store prediction; get and store actual observation.

- Calculate error score for predictions compared to expected values.

We can implement this in Python as a new standalone function called *evaluate_arima_model()* that takes a time series dataset as input as well as a tuple with the *p*, *d*, and *q* parameters for the model to be evaluated.

The dataset is split in two: 66% for the initial training dataset and the remaining 34% for the test dataset.

Each time step of the test set is iterated. Just one iteration provides a model that you could use to make predictions on new data. The iterative approach allows a new ARIMA model to be trained each time step.

A prediction is made each iteration and stored in a list. This is so that at the end of the test set, all predictions can be compared to the list of expected values and an error score calculated. In this case, a mean squared error score is calculated and returned.

The complete function is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
# evaluate an ARIMA model for a given order (p,d,q) def evaluate_arima_model(X, arima_order): # prepare training dataset train_size = int(len(X) * 0.66) train, test = X[0:train_size], X[train_size:] history = [x for x in train] # make predictions predictions = list() for t in range(len(test)): model = ARIMA(history, order=arima_order) model_fit = model.fit(disp=0) yhat = model_fit.forecast()[0] predictions.append(yhat) history.append(test[t]) # calculate out of sample error error = mean_squared_error(test, predictions) return error |

Now that we know how to evaluate one set of ARIMA hyperparameters, let’s see how we can call this function repeatedly for a grid of parameters to evaluate.

### 2. Iterate ARIMA Parameters

Evaluating a suite of parameters is relatively straightforward.

The user must specify a grid of *p*, *d*, and *q* ARIMA parameters to iterate. A model is created for each parameter and its performance evaluated by calling the *evaluate_arima_model()* function described in the previous section.

The function must keep track of the lowest error score observed and the configuration that caused it. This can be summarized at the end of the function with a print to standard out.

We can implement this function called *evaluate_models()* as a series of four loops.

There are two additional considerations. The first is to ensure the input data are floating point values (as opposed to integers or strings), as this can cause the ARIMA procedure to fail.

Second, the statsmodels ARIMA procedure internally uses numerical optimization procedures to find a set of coefficients for the model. These procedures can fail, which in turn can throw an exception. We must catch these exceptions and skip those configurations that cause a problem. This happens more often then you would think.

Additionally, it is recommended that warnings be ignored for this code to avoid a lot of noise from running the procedure. This can be done as follows:

1 2 |
import warnings warnings.filterwarnings("ignore") |

Finally, even with all of these protections, the underlying C and Fortran libraries may still report warnings to standard out, such as:

1 |
** On entry to DLASCL, parameter number 4 had an illegal value |

These have been removed from the results reported in this tutorial for brevity.

The complete procedure for evaluating a grid of ARIMA hyperparameters is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# evaluate combinations of p, d and q values for an ARIMA model def evaluate_models(dataset, p_values, d_values, q_values): dataset = dataset.astype('float32') best_score, best_cfg = float("inf"), None for p in p_values: for d in d_values: for q in q_values: order = (p,d,q) try: mse = evaluate_arima_model(dataset, order) if mse < best_score: best_score, best_cfg = mse, order print('ARIMA%s MSE=%.3f' % (order,mse)) except: continue print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score)) |

Now that we have a procedure to grid search ARIMA hyperparameters, let’s test the procedure on two univariate time series problems.

We will start with the Shampoo Sales dataset.

## Shampoo Sales Case Study

The Shampoo Sales dataset describes the monthly number of sales of shampoo over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

Learn more about the dataset from here.

Download the dataset and place it into your current working directory with the filename “*shampoo-sales.csv*“.

The timestamps in the time series do not contain an absolute year component. We can use a custom date-parsing function when loading the data and baseline the year from 1900, as follows:

1 2 3 4 |
# load dataset def parser(x): return datetime.strptime('190'+x, '%Y-%m') series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser) |

Once loaded, we can specify a site of *p*, *d*, and *q* values to search and pass them to the *evaluate_models()* function.

We will try a suite of lag values (*p*) and just a few difference iterations (*d*) and residual error lag values (*q*).

1 2 3 4 5 6 |
# evaluate parameters p_values = [0, 1, 2, 4, 6, 8, 10] d_values = range(0, 3) q_values = range(0, 3) warnings.filterwarnings("ignore") evaluate_models(series.values, p_values, d_values, q_values) |

Putting this all together with the generic procedures defined in the previous section, we can grid search ARIMA hyperparameters in the Shampoo Sales dataset.

The complete code example is listed below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
import warnings from pandas import read_csv from pandas import datetime from statsmodels.tsa.arima_model import ARIMA from sklearn.metrics import mean_squared_error # evaluate an ARIMA model for a given order (p,d,q) def evaluate_arima_model(X, arima_order): # prepare training dataset train_size = int(len(X) * 0.66) train, test = X[0:train_size], X[train_size:] history = [x for x in train] # make predictions predictions = list() for t in range(len(test)): model = ARIMA(history, order=arima_order) model_fit = model.fit(disp=0) yhat = model_fit.forecast()[0] predictions.append(yhat) history.append(test[t]) # calculate out of sample error error = mean_squared_error(test, predictions) return error # evaluate combinations of p, d and q values for an ARIMA model def evaluate_models(dataset, p_values, d_values, q_values): dataset = dataset.astype('float32') best_score, best_cfg = float("inf"), None for p in p_values: for d in d_values: for q in q_values: order = (p,d,q) try: mse = evaluate_arima_model(dataset, order) if mse < best_score: best_score, best_cfg = mse, order print('ARIMA%s MSE=%.3f' % (order,mse)) except: continue print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score)) # load dataset def parser(x): return datetime.strptime('190'+x, '%Y-%m') series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser) # evaluate parameters p_values = [0, 1, 2, 4, 6, 8, 10] d_values = range(0, 3) q_values = range(0, 3) warnings.filterwarnings("ignore") evaluate_models(series.values, p_values, d_values, q_values) |

Running the example prints the ARIMA parameters and MSE for each successful evaluation completed.

The best parameters of ARIMA(4, 2, 1) are reported at the end of the run with a mean squared error of 4,694.873.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
ARIMA(0, 0, 0) MSE=52425.268 ARIMA(0, 0, 1) MSE=38145.167 ARIMA(0, 0, 2) MSE=23989.567 ARIMA(0, 1, 0) MSE=18003.173 ARIMA(0, 1, 1) MSE=9558.410 ARIMA(0, 2, 0) MSE=67339.808 ARIMA(0, 2, 1) MSE=18323.163 ARIMA(1, 0, 0) MSE=23112.958 ARIMA(1, 1, 0) MSE=7121.373 ARIMA(1, 1, 1) MSE=7003.683 ARIMA(1, 2, 0) MSE=18607.980 ARIMA(2, 1, 0) MSE=5689.932 ARIMA(2, 1, 1) MSE=7759.707 ARIMA(2, 2, 0) MSE=9860.948 ARIMA(4, 1, 0) MSE=6649.594 ARIMA(4, 1, 1) MSE=6796.279 ARIMA(4, 2, 0) MSE=7596.332 ARIMA(4, 2, 1) MSE=4694.873 ARIMA(6, 1, 0) MSE=6810.080 ARIMA(6, 2, 0) MSE=6261.107 ARIMA(8, 0, 0) MSE=7256.028 ARIMA(8, 1, 0) MSE=6579.403 Best ARIMA(4, 2, 1) MSE=4694.873 |

## Daily Female Births Case Study

The Daily Female Births dataset describes the number of daily female births in California in 1959.

The units are a count and there are 365 observations. The source of the dataset is credited to Newton (1988).

Learn more about the dataset here.

Download the dataset and place it in your current working directory with the filename “*daily-total-female-births.csv*“.

This dataset can be easily loaded directly as a Pandas Series.

1 2 |
# load dataset series = Series.from_csv('daily-total-female-births.csv', header=0) |

To keep things simple, we will explore the same grid of ARIMA hyperparameters as in the previous section.

1 2 3 4 5 6 |
# evaluate parameters p_values = [0, 1, 2, 4, 6, 8, 10] d_values = range(0, 3) q_values = range(0, 3) warnings.filterwarnings("ignore") evaluate_models(series.values, p_values, d_values, q_values) |

Putting this all together, we can grid search ARIMA parameters on the Daily Female Births dataset. The complete code listing is provided below.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
import warnings from pandas import Series from statsmodels.tsa.arima_model import ARIMA from sklearn.metrics import mean_squared_error # evaluate an ARIMA model for a given order (p,d,q) def evaluate_arima_model(X, arima_order): # prepare training dataset train_size = int(len(X) * 0.66) train, test = X[0:train_size], X[train_size:] history = [x for x in train] # make predictions predictions = list() for t in range(len(test)): model = ARIMA(history, order=arima_order) model_fit = model.fit(disp=0) yhat = model_fit.forecast()[0] predictions.append(yhat) history.append(test[t]) # calculate out of sample error error = mean_squared_error(test, predictions) return error # evaluate combinations of p, d and q values for an ARIMA model def evaluate_models(dataset, p_values, d_values, q_values): dataset = dataset.astype('float32') best_score, best_cfg = float("inf"), None for p in p_values: for d in d_values: for q in q_values: order = (p,d,q) try: mse = evaluate_arima_model(dataset, order) if mse < best_score: best_score, best_cfg = mse, order print('ARIMA%s MSE=%.3f' % (order,mse)) except: continue print('Best ARIMA%s MSE=%.3f' % (best_cfg, best_score)) # load dataset series = Series.from_csv('daily-total-female-births.csv', header=0) # evaluate parameters p_values = [0, 1, 2, 4, 6, 8, 10] d_values = range(0, 3) q_values = range(0, 3) warnings.filterwarnings("ignore") evaluate_models(series.values, p_values, d_values, q_values) |

Running the example prints the ARIMA parameters and mean squared error for each configuration successfully evaluated.

The best mean parameters are reported as ARIMA(6, 1, 0) with a mean squared error of 53.187.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
ARIMA(0, 0, 0) MSE=67.063 ARIMA(0, 0, 1) MSE=62.165 ARIMA(0, 0, 2) MSE=60.386 ARIMA(0, 1, 0) MSE=84.038 ARIMA(0, 1, 1) MSE=56.653 ARIMA(0, 1, 2) MSE=55.272 ARIMA(0, 2, 0) MSE=246.414 ARIMA(0, 2, 1) MSE=84.659 ARIMA(1, 0, 0) MSE=60.876 ARIMA(1, 1, 0) MSE=65.928 ARIMA(1, 1, 1) MSE=55.129 ARIMA(1, 1, 2) MSE=55.197 ARIMA(1, 2, 0) MSE=143.755 ARIMA(2, 0, 0) MSE=59.251 ARIMA(2, 1, 0) MSE=59.487 ARIMA(2, 1, 1) MSE=55.013 ARIMA(2, 2, 0) MSE=107.600 ARIMA(4, 0, 0) MSE=59.189 ARIMA(4, 1, 0) MSE=57.428 ARIMA(4, 1, 1) MSE=55.862 ARIMA(4, 2, 0) MSE=80.207 ARIMA(6, 0, 0) MSE=58.773 ARIMA(6, 1, 0) MSE=53.187 ARIMA(6, 1, 1) MSE=57.055 ARIMA(6, 2, 0) MSE=69.753 ARIMA(8, 0, 0) MSE=56.984 ARIMA(8, 1, 0) MSE=57.290 ARIMA(8, 2, 0) MSE=66.034 ARIMA(8, 2, 1) MSE=57.884 ARIMA(10, 0, 0) MSE=57.470 ARIMA(10, 1, 0) MSE=57.359 ARIMA(10, 2, 0) MSE=65.503 ARIMA(10, 2, 1) MSE=57.878 ARIMA(10, 2, 2) MSE=58.309 Best ARIMA(6, 1, 0) MSE=53.187 |

## Extensions

The grid search method used in this tutorial is simple and can easily be extended.

This section lists some ideas to extend the approach you may wish to explore.

**Seed Grid**. The classical diagnostic tools of ACF and PACF plots can still be used with the results used to seed the grid of ARIMA parameters to search.**Alternate Measures**. The search seeks to optimize the out-of-sample mean squared error. This could be changed to another out-of-sample statistic, an in-sample statistic, such as AIC or BIC, or some combination of the two. You can choose a metric that is most meaningful on your project.**Residual Diagnostics**. Statistics can automatically be calculated on the residual forecast errors to provide an additional indication of the quality of the fit. Examples include statistical tests for whether the distribution of residuals is Gaussian and whether there is an autocorrelation in the residuals.**Update Model**. The ARIMA model is created from scratch for each one-step forecast. With careful inspection of the API, it may be possible to update the internal data of the model with new observations rather than recreating it from scratch.**Preconditions**. The ARIMA model can make assumptions about the time series dataset, such as normality and stationarity. These could be checked and a warning raised for a given of a dataset prior to a given model being trained.

## Summary

In this tutorial, you discovered how to grid search the hyperparameters for the ARIMA model in Python.

Specifically, you learned:

- A procedure that you can use to grid search ARIMA hyperparameters for a one-step rolling forecast.
- How to apply ARIMA hyperparameters tuning on standard univariate time series datasets.
- Ideas on how to further improve grid searching of ARIMA hyperparameters.

Now it’s your turn.

Try this procedure on your favorite time series dataset. What results did you get?

Report your results in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

If you are willing to consider an R solution , then I can point you to the function auto.arima in the R package ‘forecast’ : https://cran.r-project.org/web/packages/forecast/forecast.pdf

This will do all the gridsearch you need without writing a single line of code .

Now , in general , the use of gridsearch for solving the hyperparameters optimization problem in machine learning models is a poor inefficient choice . It has been proven that random search is faster and Bayesian search is even faster . See this : https://www.youtube.com/watch?v=cWQDeB9WqvU (lecture by Geoff Hinton) . For Python , there is a package called hyperopt that provides this functionality : https://github.com/hyperopt/hyperopt

An intro to hyperopt is here : https://www.youtube.com/watch?v=Mp1xnPfE4PY

Thanks for the links Gerrit.

A noted difference is the optimizaiton of an out of sample statistics, i.e. test performance.

Re grid vs random search, the ARIMA grid is small enough that it can be enumerated. When working with small grids with low compute times, random search would be much less efficient.

hello I have used the evaluate model function to chose the best configuration, but it skipped those configurations that I expect the best according to the Box-Jenkins Method. what that means? and is there any way to check that configurations?

Great question Abdallah, I am frustrated by this as well.

I believe you may be able to tinker with the ARIMA configuration further, such as configuring it use or not use a trend constant.

The issue is caused by instabilities in the linalg and optimization libraries used under the covers.

You could try an alternate implementation (R?), try implementing the method from scratch by hand or perhaps try fitting a linear regression model on a version of the dataset transformed using the same ARIMA operations.

Does that help?

You are doing here one-step rolling forecast for tuning ARIMA parameters. Will the resulting model behave best for forecasting the next observation only? Let’s assume that I would like to get the best possible prediction for the period of next 30 observations. Should the parameters tuning be changed for 30 steps rolling forecast in this case?

Yes Andres, spot on. It is critically important to optimize for the outcome you require.

Amazing stuff here, man. Love it. Keep up the good work!

Thanks for your support Stuart.

Hi Jason,

Thanks for the post. However, I encountered problems while trying to parse the date column in the Shampoo Sales example. I had downloaded the data from the following link(csv format) and am using Python 3:

https://datamarket.com/data/set/22r0/sales-of-shampoo-over-a-three-year-period#!ds=22r0&display=line

I faced 2 problems during parsing:

1) The format of Month column was “1-Jan”, indicating that we need to specify “%Y-%b” instead of “%Y-%m”

2) For values >9, that is , 10-Jan, 11-Jan and so on, the parsed date will be rendered invalid. Since it will be in the format : “19010-Jan” and similar

Please find the modified function which worked for me:

def parser(x):

#the following code chunk will take care of parsing for two conditions:

#1. for dates 10

test = int(x.split(‘-‘)[0])

#print(test)

if(test < 10):

return(datetime.strptime("190"+str(x),"%Y-%b"))

else:

return(datetime.strptime("19"+str(x),"%Y-%b"))

series = read_csv('sales-of-shampoo-over-a-three-ye.csv', header=0, parse_dates=[0], index_col=0,

squeeze=True, date_parser=parser)

Please correct me if there is a mistake in the approach. Hope this helps. Thanks again for the article. Have a good day 🙂

I have tested and confirm that the example works in Python3.

Perhaps confirm that you have the same dataset, that you have removed the footer from the file, and that you have copied the code from the post exactly?

On my computer the first example script breaks with:

** On entry to DLASCL, parameter number 4 had an illegal value

so I get no best settings.

The second script breaks with “Best ArimaNone MSE=inf”

I have already removed the footer line. Any hints available?

Hey Jason

What is the difference (or benefit) of doing the grid search this way vs. using SARIMAX? (reference: https://www.digitalocean.com/community/tutorials/a-guide-to-time-series-forecasting-with-arima-in-python-3)

I have not read that post, but skimming it suggests that are using a for loop just the same as in my tutorial.

“Each time step of the test set is iterated. Just one iteration provides a model that you could use to make predictions on new data. The iterative approach allows a new ARIMA model to be trained each time step.”

First of all, thank you for this tutorial ! I am a bit confused about using your iterative approach above. My questions are:

1. Why are you adding the test example to the training set (in history) and retraining the ARIMA model ? This way each subsequent test prediction is trained on the original training set plus an element added from the prior test example. Is this to improve the test predictions by adding more training data to the model (which now includes original training + test examples )?

2. Using the predict function, can I just train an ARIMA on the training set and use the in-built predict function on the test example set aside ? What are the pitfalls using this approach ?

Thank you again !

I am using walk-forward validation, learn more here:

http://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

Yes. The walk-forward validation lets you re-fit the model each step, predict does not.

What does this error mean – Best ARIMANone RMSE=inf?

No good result found Sam. Did you run the code as-is or adapt it to your problem? Perhaps debug the example?

Can the amount of input data affect to the forecast? I mean, maybe the oldest lagged data is not quite correlated with the current one. If so, wouldn’t be better to limit the length of history to 500 rows, for instance? How do I find the optimal amount of training data?

It can affect the forecast. I recommend testing the amount of forecast data used. Here’s an example:

http://machinelearningmastery.com/sensitivity-analysis-history-size-forecast-skill-arima-python/

Thanks Jason, very useful! But now I have another dilemma: should I performance the gridsearch before or after finding the optimal amount of data?

That is a good question. It is also an open question.

A purist might treat the amount of training data as yet another variable to grid search.

This model is taking forever to load – is there something I can do to optimize performance?

Remove some data?