How to Create an ARIMA Model for Time Series Forecasting with Python

A popular and widely used statistical method for time series forecasting is the ARIMA model.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data.

In this tutorial, you will discover how to develop an ARIMA model for time series data with Python.

After completing this tutorial, you will know:

  • About the ARIMA model, the parameters it uses, and the assumptions it makes.
  • How to fit an ARIMA model to data and use it to make forecasts.
  • How to configure the ARIMA model on your time series problem.

Let’s get started.

Autoregressive Integrated Moving Average Model

ARIMA is a class of statistical models for analyzing and forecasting time series data.

It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skillful time series forecasts.

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.

This acronym is descriptive, capturing the key aspects of the model itself. Briefly, they are:

  • AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
  • I: Integrated. The use of differencing of raw observations (e.g. subtracting the observation at the previous time step from the current observation) in order to make the time series stationary.
  • MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
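The differencing behind the "I" component can be sketched in a few lines of plain Python. This is a minimal illustration, not the statsmodels implementation; the function name difference and the sample values are hypothetical.

```python
def difference(series, interval=1):
    """Return the differenced series: value at t minus value at t - interval."""
    return [series[i] - series[i - interval] for i in range(interval, len(series))]

# Placeholder values with an upward trend, for illustration only.
series = [10, 12, 15, 19]

first_order = difference(series)         # one round of differencing (d=1)
second_order = difference(first_order)   # differencing again (d=2)

print(first_order)
print(second_order)
```

Each round of differencing shortens the series by one observation, which is worth keeping in mind on short datasets.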

Each of these components is explicitly specified in the model as a parameter. The standard notation ARIMA(p,d,q) is used, where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:

  • p: The number of lag observations included in the model, also called the lag order.
  • d: The number of times that the raw observations are differenced, also called the degree of differencing.
  • q: The size of the moving average window, also called the order of moving average.

A linear regression model is constructed including the specified number and type of terms, and the data is prepared by a degree of differencing in order to make it stationary, i.e. to remove trend and seasonal structures that negatively affect the regression model.
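For readers who prefer a formula, one common way to write an ARIMA(p,d,q) model uses the lag operator L, where L y_t = y_{t-1}. The symbols below are not used elsewhere in this tutorial and are introduced only for illustration: phi_i are the autoregressive coefficients, theta_j are the moving average coefficients, c is a constant, and epsilon_t is the error term.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t
```

The factor (1 - L)^d is the differencing step, the sum on the left is the AR part, and the sum on the right is the MA part.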

A value of 0 can be used for a parameter, which indicates that the corresponding element of the model is not used. This way, the ARIMA model can be configured to perform the function of an ARMA model, and even a simple AR, I, or MA model.

Adopting an ARIMA model for a time series assumes that the underlying process that generated the observations is an ARIMA process. This may seem obvious, but helps to motivate the need to confirm the assumptions of the model in the raw observations and in the residual errors of forecasts from the model.

Next, let’s take a look at how we can use the ARIMA model in Python. We will start with loading a simple univariate time series.


Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3 year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

Learn more about the dataset and download it from here.

Download the dataset and place it in your current working directory with the filename “shampoo-sales.csv”.

Below is an example of loading the Shampoo Sales dataset with Pandas with a custom function to parse the date-time field. The dataset is baselined in an arbitrary year, in this case 1900.

Running the example prints the first 5 rows of the dataset.

The data is also plotted as a time series with the month along the x-axis and sales figures on the y-axis.
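The custom date parsing can be sketched with the standard library alone. This is a hedged illustration rather than the Pandas-based loading described above: it assumes the date column holds values such as "1-01" (a year offset and a month), and the rows below are placeholders in that layout, not the real sales figures.

```python
import csv
import io
from datetime import datetime

def parser(x):
    # Prefix "190" so that "1-01" becomes "1901-01", i.e. January 1901.
    return datetime.strptime('190' + x, '%Y-%m')

# Placeholder rows in the assumed two-column layout; NOT the real dataset.
sample = io.StringIO(
    '"Month","Sales"\n'
    '"1-01",100.0\n'
    '"1-02",110.5\n'
    '"1-03",95.2\n'
)

reader = csv.reader(sample)
header = next(reader)  # skip the header row
series = [(parser(month), float(sales)) for month, sales in reader]

for when, value in series:
    print(when.strftime('%Y-%m'), value)
```

If the parser fails on your file, check for extra header or footer lines in the CSV, since any non-date text in the first column will not match the expected format.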

Shampoo Sales Dataset Plot


We can see that the Shampoo Sales dataset has a clear trend.

This suggests that the time series is not stationary and will require differencing to make it stationary, at least a difference order of 1.

Let’s also take a quick look at an autocorrelation plot of the time series. This is also built into Pandas. The example below plots the autocorrelation for a large number of lags in the time series.

Running the example, we can see that there is a positive correlation with the first 10-to-12 lags that is perhaps significant for the first 5 lags.

A good starting point for the AR parameter of the model may be 5.
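To make the numbers behind an autocorrelation plot concrete, the sample autocorrelation at a given lag can be computed by hand. The sketch below uses the usual estimator (lagged covariance divided by the lag-0 variance); the function name autocorrelation and the trending placeholder series are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Lag-0 sum of squared deviations (the normalizing term).
    denom = sum((x - mean) ** 2 for x in series)
    # Sum of products of deviations that are `lag` steps apart.
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return numer / denom

# Placeholder series with a steady upward trend.
series = [float(v) for v in range(1, 11)]

for lag in range(4):
    print(lag, round(autocorrelation(series, lag), 3))
```

A trending series like this one shows positive autocorrelation that decays slowly as the lag grows, which is the pattern described for the shampoo data.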

Autocorrelation Plot of Shampoo Sales Data


ARIMA with Python

The statsmodels library provides the capability to fit an ARIMA model.

An ARIMA model can be created using the statsmodels library as follows:

  1. Define the model by calling ARIMA() and passing in the p, d, and q parameters.
  2. The model is prepared on the training data by calling the fit() function.
  3. Predictions can be made by calling the predict() function and specifying the index of the time or times to be predicted.

Let’s start off with something simple. We will fit an ARIMA model to the entire Shampoo Sales dataset and review the residual errors.

First, we fit an ARIMA(5,1,0) model. This sets the lag value to 5 for autoregression, uses a difference order of 1 to make the time series stationary, and uses a moving average model of 0.

When fitting the model, a lot of debug information is provided about the fit of the linear regression model. We can turn this off by setting the disp argument to 0.

Running the example prints a summary of the fit model. This summarizes the coefficient values used as well as the skill of the fit on the in-sample observations.

First, we get a line plot of the residual errors, suggesting that there may still be some trend information not captured by the model.

ARMA Fit Residual Error Line Plot


Next, we get a density plot of the residual error values, suggesting the errors are Gaussian, but may not be centered on zero.

ARMA Fit Residual Error Density Plot


The distribution of the residual errors is displayed. The results show that indeed there is a bias in the prediction (a non-zero mean in the residuals).
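A quick numeric check for this kind of bias is to look at the mean of the residuals alongside their spread. The sketch below uses only the standard library; the helper name residual_summary is hypothetical and the residual values are placeholders, not output from the model in this tutorial.

```python
from statistics import mean, stdev

def residual_summary(residuals):
    """Return the count, mean and sample standard deviation of residuals."""
    return {
        'count': len(residuals),
        'mean': mean(residuals),
        'std': stdev(residuals),
    }

# Placeholder residuals for illustration only.
residuals = [4.0, -2.0, 7.0, 1.0, 5.0]

summary = residual_summary(residuals)
print(summary)
```

A mean that sits well away from zero relative to the spread suggests the model is systematically over- or under-predicting.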

Note that although we used the entire dataset for time series analysis above, ideally we would perform this analysis on just the training dataset when developing a predictive model.

Next, let’s look at how we can use the ARIMA model to make forecasts.

Rolling Forecast ARIMA Model

The ARIMA model can be used to forecast future time steps.

We can use the predict() function on the ARIMAResults object to make predictions. It accepts the index of the time steps to make predictions as arguments. These indexes are relative to the start of the training dataset used to make predictions.

If we used 100 observations in the training dataset to fit the model, then the index of the next time step for making a prediction would be specified to the prediction function as start=101, end=101. This would return an array with one element containing the prediction.

We would also prefer the forecasted values to be in the original scale, in case we performed any differencing (d>0 when configuring the model). This can be specified by setting the typ argument to the value 'levels': typ='levels'.
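To see what "original scale" means for a difference order of 1, the sketch below converts predicted changes back into levels by adding them cumulatively to the last known observation. This is a plain-Python illustration of the idea, not the library's internal code; the function name to_levels and the numbers are hypothetical.

```python
def to_levels(last_observation, predicted_changes):
    """Invert first-order differencing for a run of forecasted changes."""
    levels = []
    current = last_observation
    for change in predicted_changes:
        # Each forecasted change is added to the most recent level.
        current = current + change
        levels.append(current)
    return levels

# Placeholder values: last observed level and three forecasted differences.
levels = to_levels(100.0, [5.0, -2.0, 3.0])
print(levels)
```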

Alternately, we can avoid all of these specifications by using the forecast() function, which performs a one-step forecast using the model.

We can split the training dataset into train and test sets, use the train set to fit the model, and generate a prediction for each element on the test set.

A rolling forecast is required given the dependence on observations in prior time steps for differencing and the AR model. A crude way to perform this rolling forecast is to re-create the ARIMA model after each new observation is received.

We manually keep track of all observations in a list called history that is seeded with the training data and to which new observations are appended each iteration.

Putting this all together, below is an example of a rolling forecast with the ARIMA model in Python.

Running the example prints the prediction and expected value each iteration.

We can also calculate a final mean squared error score (MSE) for the predictions, providing a point of comparison for other ARIMA configurations.
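The structure of the rolling forecast and the MSE calculation can be sketched independently of any particular model. In the sketch below, persistence_forecast is a stand-in that simply repeats the last observation; in practice you would replace it with a function that fits an ARIMA model on history and returns its one-step forecast. The function names, the 0.66 train fraction and the series values are all assumptions for illustration.

```python
def persistence_forecast(history):
    # Stand-in model: predict that the next value equals the last one.
    return history[-1]

def walk_forward(series, forecast_fn, train_fraction=0.66):
    """Rolling one-step forecast over the test portion of series."""
    size = int(len(series) * train_fraction)
    train, test = series[:size], series[size:]
    history = list(train)  # seeded with the training data
    predictions = []
    for obs in test:
        yhat = forecast_fn(history)  # forecast the next step
        predictions.append(yhat)
        history.append(obs)          # then reveal the real observation
        print('predicted=%f, expected=%f' % (yhat, obs))
    # Mean squared error between forecasts and expected values.
    mse = sum((p - o) ** 2 for p, o in zip(predictions, test)) / len(test)
    return predictions, test, mse

# Placeholder series for illustration only.
series = [10.0, 12.0, 11.0, 14.0, 13.0, 16.0]
predictions, expected, mse = walk_forward(series, persistence_forecast)
print('Test MSE: %.3f' % mse)
```

A persistence forecast like this also makes a handy baseline: an ARIMA configuration that cannot beat it on MSE probably needs more tuning.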

A line plot is created showing the expected values (blue) compared to the rolling forecast predictions (red). We can see the values show some trend and are in the correct scale.

ARIMA Rolling Forecast Line Plot


The model could use further tuning of the p, d, and maybe even the q parameters.

Configuring an ARIMA Model

The classical approach for fitting an ARIMA model is to follow the Box-Jenkins Methodology.

This is a process that uses time series analysis and diagnostics to discover good parameters for the ARIMA model.

In summary, the steps of this process are as follows:

  1. Model Identification. Use plots and summary statistics to identify trends, seasonality, and autoregression elements to get an idea of the amount of differencing and the size of the lag that will be required.
  2. Parameter Estimation. Use a fitting procedure to find the coefficients of the regression model.
  3. Model Checking. Use plots and statistical tests of the residual errors to determine the amount and type of temporal structure not captured by the model.

The process is repeated until a desirable level of fit is achieved on either the in-sample or out-of-sample observations (e.g. training or test datasets).

The process was described in the classic 1970 textbook on the topic titled Time Series Analysis: Forecasting and Control by George Box and Gwilym Jenkins. An updated 5th edition is now available if you are interested in going deeper into this type of model and methodology.

Given that the model can be fit efficiently on modest-sized time series datasets, grid searching parameters of the model can be a valuable approach.
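A grid search over (p,d,q) can be sketched as a loop over all combinations that keeps the order with the lowest error. In the sketch below, evaluate is a hypothetical callback that would fit an ARIMA model with the given order and return an error score such as out-of-sample MSE; toy_score is a placeholder used only so the example runs without any modeling library.

```python
from itertools import product

def grid_search(series, p_values, d_values, q_values, evaluate):
    """Return the (p, d, q) order with the lowest score, and that score."""
    best_order, best_score = None, float('inf')
    for order in product(p_values, d_values, q_values):
        try:
            score = evaluate(series, order)
        except Exception:
            # Some orders may fail to fit; skip them and carry on.
            continue
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score

def toy_score(series, order):
    # Placeholder scoring function, NOT a real model evaluation.
    p, d, q = order
    return abs(p - 2) + abs(d - 1) + q

best_order, best_score = grid_search(
    series=[],  # unused by the placeholder scorer
    p_values=[0, 1, 2, 4],
    d_values=[0, 1, 2],
    q_values=[0, 1, 2],
    evaluate=toy_score,
)
print(best_order, best_score)
```

Scoring each order with an out-of-sample measure, such as the MSE from a rolling forecast, helps avoid picking a configuration that only fits the training data well.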

Summary

In this tutorial, you discovered how to develop an ARIMA model for time series forecasting in Python.

Specifically, you learned:

  • About the ARIMA model, how it can be configured, and assumptions made by the model.
  • How to perform a quick time series analysis using the ARIMA model.
  • How to use an ARIMA model to make out-of-sample forecasts.

Do you have any questions about ARIMA, or about this tutorial?
Ask your questions in the comments below and I will do my best to answer.


104 Responses to How to Create an ARIMA Model for Time Series Forecasting with Python

  1. SalemAmeen January 9, 2017 at 7:22 am #

    Many thank

  2. Blessing Ojeme January 9, 2017 at 1:20 pm #

    Much appreciated, Jason. Keep them coming, please.

    • Jason Brownlee January 10, 2017 at 8:55 am #

      Sure thing! I’m glad you’re finding them useful.

      What else would you like to see?

      • Utkarsh July 22, 2017 at 10:31 pm #

        Hi Jason ,can you suggest how one can solve time series problem if the target variable is categorical having around 500 categories.

        Thanks

        • Jason Brownlee July 23, 2017 at 6:24 am #

          That is a lot of categories.

          Perhaps moving to a neural network type model with a lot of capacity. You may also require a vast amount of data to learn this problem.

  3. Chow Xixi January 9, 2017 at 6:00 pm #

    good,Has been paid close attention to your blog.

  4. Kevin January 17, 2017 at 12:58 am #

    Gives me loads of errors:

    Traceback (most recent call last):
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 2276, in converter
    date_parser(*date_cols), errors=’ignore’)
    File “/Users/kevinoost/PycharmProjects/ARIMA/main.py”, line 6, in parser
    return datetime.strptime(‘190’+x, ‘%Y-%m’)
    TypeError: strptime() argument 1 must be str, not numpy.ndarray

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 2285, in converter
    dayfirst=dayfirst),
    File “pandas/src/inference.pyx”, line 841, in pandas.lib.try_parse_dates (pandas/lib.c:57884)
    File “pandas/src/inference.pyx”, line 838, in pandas.lib.try_parse_dates (pandas/lib.c:57802)
    File “/Users/kevinoost/PycharmProjects/ARIMA/main.py”, line 6, in parser
    return datetime.strptime(‘190’+x, ‘%Y-%m’)
    File “/Users/kevinoost/anaconda/lib/python3.5/_strptime.py”, line 510, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
    File “/Users/kevinoost/anaconda/lib/python3.5/_strptime.py”, line 343, in _strptime
    (data_string, format))
    ValueError: time data ‘190Sales of shampoo over a three year period’ does not match format ‘%Y-%m’

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “/Users/kevinoost/PycharmProjects/ARIMA/main.py”, line 8, in
    series = read_csv(‘shampoo-sales.csv’, header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 325, in _read
    return parser.read()
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 815, in read
    ret = self._engine.read(nrows)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 1387, in read
    index, names = self._make_index(data, alldata, names)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 1030, in _make_index
    index = self._agg_index(index)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 1111, in _agg_index
    arr = self._date_conv(arr)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py”, line 2288, in converter
    return generic_parser(date_parser, *date_cols)
    File “/Users/kevinoost/anaconda/lib/python3.5/site-packages/pandas/io/date_converters.py”, line 38, in generic_parser
    results[i] = parse_func(*args)
    File “/Users/kevinoost/PycharmProjects/ARIMA/main.py”, line 6, in parser
    return datetime.strptime(‘190’+x, ‘%Y-%m’)
    File “/Users/kevinoost/anaconda/lib/python3.5/_strptime.py”, line 510, in _strptime_datetime
    tt, fraction = _strptime(data_string, format)
    File “/Users/kevinoost/anaconda/lib/python3.5/_strptime.py”, line 343, in _strptime
    (data_string, format))
    ValueError: time data ‘190Sales of shampoo over a three year period’ does not match format ‘%Y-%m’

    Process finished with exit code 1

    Help would be much appreciated.

    • Jason Brownlee January 17, 2017 at 7:39 am #

      It looks like there might be an issue with your data file.

      Open the csv in a text editor and confirm the header line looks sensible.

      Also confirm that you have no extra data at the end of the file. Sometimes the datamarket files download with footer data that you need to delete.

  5. NGUYEN Quang Anh January 19, 2017 at 6:28 pm #

    Let say I have a time series data with many attribute. For example a row will have (speed, fuel, tire_pressure), how could we made a model out of this ? the value of each column may affect each other, so we cannot do forecasting on solely 1 column. I google a lot but all the example I’ve found so far only work on time series of 1 attribute.

    • Jason Brownlee January 20, 2017 at 10:19 am #

      This is called multivariate time series forecasting. Linear models like ARIMA were not designed for this type of problem.

      generally, you can use the lag-based representation of each feature and then apply a standard machine learning algorithm.

      I hope to have some tutorials on this soon.

      • rchesak May 30, 2017 at 12:37 pm #

        Wanted to check in on this, do you have any tutorials on multivariate time series forecasting?

        Also, when you say standard machine learning algorithm, would a random forest model work?

        Thanks!

        • rchesak May 30, 2017 at 12:52 pm #

          Update: the statsmodels.tsa.arima_model.ARIMA() function documentation says it takes the optional parameter exog, which is described in the documentation as ‘an optional array of exogenous variables’. This sounds like multivariate analysis to me, would you agree?

          I am trying to predict number of cases of a mosquito-borne disease, over time, given weather data. So I believe the ARIMA model should work for this, correct?

          Thank you!

          • Jason Brownlee June 2, 2017 at 12:32 pm #

            I have not experimented with this argument.

        • Jason Brownlee June 2, 2017 at 12:32 pm #

          No multivariate examples at this stage.

          Yes, any supervised learning method.

    • Muyi Ibidun February 7, 2017 at 9:36 am #

      Hello Ng,

      Your problem fits what VAR (Vector Autoregression) models is designed for. See the following links for more information. I hope this helps your work.

      https://en.wikipedia.org/wiki/Vector_autoregression
      http://statsmodels.sourceforge.net/devel/vector_ar.html

  6. Kelvid January 20, 2017 at 11:55 am #

    Hi, would you have a example for the seasonal ARIMA post? I have installed latest statsmodels module, but there is an error of import the SARIMAX. Do help if you manage to figure it out. Thanks.

    • Jason Brownlee January 21, 2017 at 10:23 am #

      Hi Kelvid, I don’t have one at the moment. I ‘ll prepare an example of SARIMAX and post it soon.

  7. Muhammad Arsalan January 29, 2017 at 10:13 pm #

    It is so informative..thankyou

  8. Sebastian January 31, 2017 at 3:33 am #

    Great post Jason!

    I have a couple of questions:

    – Just to be sure. model_fit.forecast() is single step ahead forecasts and model_fit.predict() is for multiple step ahead forecasts?

    – I am working with a series that seems at least quite similar to the shampoo series (by inspection). When I use predict on the training data, I get this zig-zag pattern in the prediction as well. But for the test data, the prediction is much smoother and seems to saturate at some level. Would you expect this? If not, what could be wrong?

    • Jason Brownlee February 1, 2017 at 10:28 am #

      Hi Sebastian,

      Yes, forecast() is for one step forecasts. You can do one step forecasts with predict() also, but it is more work.

      I would not expect prediction beyond a few time steps to be very accurate, if that is your question?

      • Sebastian February 3, 2017 at 9:25 am #

        Thanks for the reply!

        Concerning the second question. Yes, you are right the prediction is not very accurate. But moreover, the predicted time series has a totally different frequency content. As I said, it is smooth and not zig-zaggy as the original data. Is this normal or am I doing something wrong. I also tried the multiple step prediction (model_fit.predict()) on the training data and then the forecast seem to have more or less the same frequency content (more zig-zaggy) as the data I am trying to predict.

        • Jason Brownlee February 3, 2017 at 10:22 am #

          Hi Sebastian, I see.

          In the case of predicting on the training dataset, the model has access to real observations. For example, if you predict the next 5 obs somewhere in the training dataset, it will use obs(t+4) to predict t+5 rather than prediction(t+4).

          In the case of predicting beyond the end of the model data, it does not have obs to make predictions (unless you provide them), it only has access to the predictions it made for prior time steps. The result is the errors compound and things go off the rails fast (flat forecast).

          Does that make sense/help?

          • Sebastian February 3, 2017 at 6:34 pm #

            That helped!

            Thanks!

          • Jason Brownlee February 4, 2017 at 10:00 am #

            Glad to hear it Sebastian.

          • satya May 22, 2017 at 9:19 pm #

            Hi Jason,

            suppose my training set is 1949 to 1961. Can I get the data for 1970 with using Forecast or Predict function

            Thanks
            Satya

          • Jason Brownlee May 23, 2017 at 7:51 am #

            Yes, you would have to predict 10 years worth of data though. The predictions after 10 years would likely have a lot of error.

  9. Elliot January 31, 2017 at 10:07 am #

    So this is building a model and then checking it off of the given data right?

    -How can I predict what would come next after the last data point? Am I misunderstanding the code?

  10. Muyi Ibidun February 7, 2017 at 9:38 am #

    Thanks Jason for this post!

    It was really useful. And your blogs are becoming a must read for me because of the applicable and piecemeal nature of your tutorials.

    Keep up the good work!

  11. Kalin Stoyanov February 8, 2017 at 9:30 pm #

    Hi,
    This is not the first post on ARIMA, but it is the best so far. Thank you.

  12. James Zhang February 10, 2017 at 7:42 pm #

    Hey Jason,

    thank you very much for the post, very good written! I have a question: so I used your approach to build the model, but when I try to forecast the data that are out of sample, I commented out the obs = test[t] and change history.append(obs) to history.append(yhat), and I got a flat prediction… so what could be the reason? and how do you actually do the out-of-sample predictions based on the model fitted on train dataset? Thank you very much!

    • Jason Brownlee February 11, 2017 at 5:00 am #

      Hi james,

      Each loop in the rolling forecast shows you how to make a one-step out of sample forecast.

      Train your ARIMA on all available data and call forecast().

      If you want to perform a multi-step forecast, indeed, you will need to treat prior forecasts as “observations” and use them for subsequent forecasts. You can do this automatically using the predict() function. Depending on the problem, this approach is often not skillful (e.g. a flat forecast).

      Does that help?

      • James February 16, 2017 at 2:03 am #

        Hi Jason,

        thank you for you reply! so what could be the reason a flat forecast occurs and how to avoid it?

        • Jason Brownlee February 16, 2017 at 11:09 am #

          Hi James,

          The model may not have enough information to make a good forecast.

          Consider exploring alternate methods that can perform multi-step forecasts in one step – like neural nets or recurrent neural nets.

          • James February 16, 2017 at 7:41 pm #

            Hi Jason,

            thanks a lot for your information! still need to learn a lot from people like you! 😀 nice day!

          • Jason Brownlee February 17, 2017 at 9:53 am #

            I’m here to help James!

  13. Supriya February 16, 2017 at 1:27 am #

    when i calculate train and test error , train rmse is greater than test rmse.. why is it so?

    • Jason Brownlee February 16, 2017 at 11:08 am #

      I see this happen sometimes Supriya.

      It suggests the model may not be well suited for the data.

  14. Matias T February 18, 2017 at 12:04 am #

    Hello Jason, thanks for this amazing post.
    I was wondering how does the “size” work here. For example lets say i want to forecast only 30 days ahead. I keep getting problems with the degrees of freedom.
    Could you please explain this to me.

    Thanks

    • Jason Brownlee February 18, 2017 at 8:40 am #

      Hi Matias, the “size” in the example is used to split the data into train/test sets for model evaluation using walk forward validation.

      You can set this any way you like or evaluate your model different ways.

      To forecast 30 days ahead, you are going to need a robust model and enough historic data to evaluate this model effectively.

      • Matias R February 21, 2017 at 6:39 am #

        I get it. Thanks Jason.

        I was thinking, in this particular example, ¿will the prediction change if we keep adding data?

        • Jason Brownlee February 21, 2017 at 9:41 am #

          Great question Matias.

          The amount of history is one variable to test with your model.

          Design experiments to test if having more or less history improves performance.

  15. ubald kuijpers February 24, 2017 at 10:05 pm #

    Dear Jason,

    Thank you for explaining the ARIMA model in such clear detail.
    It helped me to make my own model to get numerical forrcasts and store it in a database.
    So nice that we live in an era where knowledge is de-mystified .

  16. Jacques Sauve February 25, 2017 at 6:41 am #

    Hi Jason. Very good work!
    It would be great to see how forecasting models can be used to detect anomalies in time series. thanks.

  17. Mehran March 1, 2017 at 12:56 am #

    Hi there. Many thanks. I think you need to change the way you parse the datetime to:

    datetime.strptime(’19’+x, ‘%Y-%b’)

    Many thanks

    • Jason Brownlee March 1, 2017 at 8:41 am #

      Are you sure?

      See this list of abbreviations:
      https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

      The “%m” refers to “Month as a zero-padded decimal number.” which is exactly what we have here.

      See a sample of the raw data file:

      The “%b” refers to “Month as locale’s abbreviated name.” which we do not have here.

  18. Niirkshith March 6, 2017 at 4:49 pm #

    Hi Jason,
    Lucky i found this at the begining of my project.. Its a great start point and enriching.
    Keep it coming :).
    This can also be used for non linear time series as well?

    Thanks,
    niri

  19. Anthony of Sydney March 8, 2017 at 9:00 am #

    Dear Dr Jason,

    In the above example of the rolling forecast, you used the rmse of the predicted and the actual value.

    Another way of getting the residuals of the model is to get the std devs of the residuals of the fitted model

    Question, is the std dev of the residuals the same as the root_mean_squared(actual, predicted)?
    Thank you
    Anthony of Sydney NSW

    what is the difference between measuring the std deviation of the residuals of a fitted model and the rmse of the rolling forecast will

  20. Niirkshith March 10, 2017 at 1:28 pm #

    Hi Jason,
    Great writeup, had a query, when u have a seasonal data and do seasonal differencing. i.e for exy(t)=y(t)-y(t-12) for yearly data. What will be the value of d in ARIMA(p,d,q).

    • Niirkshith March 10, 2017 at 1:29 pm #

      typo, ex y(t)=y(t)-y(t-12) for monthly data not yearly

    • Jason Brownlee March 11, 2017 at 7:56 am #

      Great question Niirkshith.

      ARIMA will not do seasonal differencing (there is a version that will called SARIMA). The d value on ARIMA will be unrelated to the seasonal differencing and will assume the input data is already seasonally adjusted.

  21. Niirkshith March 13, 2017 at 1:09 pm #

    Thanks for getting back.

  22. ivan March 19, 2017 at 5:17 am #

    Hi, Jason

    thanks for this example. My question how is chosen the parameter q ?
    best Ivan

  23. Narbukra March 30, 2017 at 4:21 am #

    Hi Jason, I am wondering if you did a similar tutorial on multi-variate time series forecasting?

    • Jason Brownlee March 30, 2017 at 8:57 am #

      Not yet, I am working on some.

      • Nirikshith May 12, 2017 at 1:02 pm #

        Hi Jason,
        any updates on the same

  24. David March 30, 2017 at 8:53 am #

    Hi Jason,

    Thanks for the great post! It was very helpful. I’m currently trying to forecast with the ARIMA model using order (4, 1, 5) and I’m getting an error message “The computed initial MA coefficients are not invertible. You should induce invertibility, choose a different model order, or you can pass your own start_params.” The model works when fitting, but seems to error out when I move to model_fit = model.fit(disp=0). The forecast works well when using your parameters of (0, 1, 5) and I used ACF and PACF plots to find my initial p and q parameters. Any ideas on the cause/fix for the error? Any tips would be much appreciated.

  25. tom reilly April 27, 2017 at 6:39 am #

    It’s a great blog that you have, but the PACF determines the AR order not the ACF.

  26. Evgeniy May 2, 2017 at 1:22 am #

    Good afternoon!
    Is there an analog to the function auto.arima in the package for python from the package of the language R.
    For automatic selection of ARIMA parameters?
    Thank you!

  27. timer May 18, 2017 at 7:23 pm #

    Hi. Great one. Suppose I have multiple airlines data number of passengers for two years recorded on daily basis. Now I want to predict for each airline number of possible passangers on next few months. How can I fit these time series models. Separate model for each airline or one single model?

    • Jason Brownlee May 19, 2017 at 8:16 am #

      Try both approaches and double down on what works best.

      • Kashif May 26, 2017 at 2:06 am #

        Hi Jason, if in my dataset, my first column is date (YYYYMMDD) and second column is time (hhmmss) and third column is value at given date and time. So could I use ARIMA model for forecasting such type of time series ?

        • Jason Brownlee June 2, 2017 at 11:47 am #

          Yes, use a custom parse function to combine the date and time into one index column.

  28. Kashif May 25, 2017 at 6:30 pm #

    Hi Sir, Do you have tutorial about vector auto regression model (for multi-variate time series forecasting?)

  29. Ebrahim Aly May 30, 2017 at 5:03 am #

    Thanks a lot, Dr. Jason. This tutorial explained a lot. But I tried to run it on an oil prices data set from Bp and I get the following error:

    SVD did not converge

    I used (p,d,q) = (5, 1, 0)

    Would you please help me on solving or at least understanding this error?

    • Jason Brownlee June 2, 2017 at 12:29 pm #

      Perhaps consider rescaling your input data and explore other configurations?

  30. Alex June 9, 2017 at 8:01 am #

    Hi Jason,
    I have a general question about ARIMA model in the case of multiple Time Series:
    suppose you have not only one time series but many (i.e. the power generated per hour at 1000 different wind farms). So you have a dataset of 1000 time series of N points each and you want to predict the next N+M points for each of the time series.
    Analyzing each time series separately with the ARIMA could be a waste. Maybe there are similarities in the time evolution of these 1000 different patterns which could help my predictions. What approach would you suggest in this case?

    • Jason Brownlee June 10, 2017 at 8:11 am #

      You could use a method other than ARIMA.

      For linear models, you could use vector autoregressions (VAR).

      For nonlinear methods, I’d recommend a neural network.

      I hope that helps as a start.

  31. Donato June 13, 2017 at 10:23 pm #

    Hi Jason, is it possible to train the ARIMA model with more than one file? Thanks!

  32. TaeWoo Kim June 23, 2017 at 3:22 am #

    “First, we get a line plot of the residual errors, suggesting that there may still be some trend information not captured by the model.”

    So are you looking for a smooth flat line in the curve?

    • Jason Brownlee June 23, 2017 at 6:47 am #

      No, the upward trend that appears to exist in the plot of residuals.

  33. Ukesh June 24, 2017 at 12:37 am #

    At the end of the code, when I tried to print the predictions, they were printed as a list of arrays. How do I convert them to plain data points?

    print(predictions)

    [array([ 309.59070719]), array([ 388.64159699]), array([ 348.77807261]), array([ 383.60202178]), array([ 360.99214813]), array([ 449.34210105]), array([ 395.44928401]), array([ 434.86484106]), array([ 512.30201612]), array([ 428.59722583]), array([ 625.99359188]), array([ 543.53887362])]

  34. Ukesh June 24, 2017 at 12:53 am #

    Never mind.. I figured it out…

    forecasts = numpy.array(predictions)

    [[ 309.59070719]
    [ 388.64159699]
    [ 348.77807261]
    [ 383.60202178]
    [ 360.99214813]
    [ 449.34210105]
    [ 395.44928401]
    [ 434.86484106]
    [ 512.30201612]
    [ 428.59722583]
    [ 625.99359188]
    [ 543.53887362]]

    Keep up the good work Jason.. Your blogs are extremely helpful and easy to follow.. Loads of appreciation..
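`numpy.array(predictions)` as used above yields a 2-D column of shape (12, 1). If a flat sequence is preferred, the following sketch shows two options. It assumes NumPy is installed and that each entry is a one-element array, as in the printed output; only the first three values are reproduced here.

```python
import numpy

# Stand-in for the list built in the forecast loop: one 1-element array per step.
predictions = [
    numpy.array([309.59070719]),
    numpy.array([388.64159699]),
    numpy.array([348.77807261]),
]

# Option 1: a flat 1-D NumPy array.
flat_array = numpy.array(predictions).ravel()

# Option 2: a plain list of Python floats.
flat_list = [float(p[0]) for p in predictions]

print(flat_array)
print(flat_list)
```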

  35. Vincent June 29, 2017 at 6:53 pm #

    Hi Jason, and thank you for this post, it’s really helpful!

    I have one question regarding ARIMA computation time.

    I’m working on a dataset of 10K samples, and I’ve tried rolling and “non rolling” (where coefficients are estimated only once, or at least not for every new sample) forecasting with ARIMA:
    – rolling forecast produces good results but takes a long time (I’m working with an old computer, around 3/6h depending on the ARMA model);
    – “non rolling” doesn’t forecast well at all.

    Is re-estimating the coefficients for each new sample the only way to get proper ARIMA forecasts?

    Thanks for your help!

    • Jason Brownlee June 30, 2017 at 8:11 am #

      I would focus on the approach that gives the best results on your problem and is robust. Don’t get caught up on “proper”.

  36. Kashif July 12, 2017 at 11:29 pm #

    Dear Respected Sir, I have tried to use an ARIMA model on my dataset. Some samples from my dataset follow:
    YYYYMMDD hhmmss Duration
    20100916 130748 18
    20100916 131131 99
    20100916 131324 214
    20100916 131735 72
    20100916 135342 37
    20100916 144059 250
    20100916 150148 87
    20100916 150339 0
    20100916 150401 180
    20100916 154652 248
    20100916 183403 0
    20100916 210148 0
    20100917 71222 179
    20100917 73320 0
    20100917 81718 25
    20100917 93715 15
    But when I used an ARIMA model on this dataset, the predictions were very bad and the test MSE was very high. My dataset has an irregular pattern and the autocorrelation is also very low. Could an ARIMA model be used for this type of dataset, or do I have to modify my dataset before using ARIMA?
    Looking forward to your reply.
    Thanks

  37. Vaibhav Agarwal July 14, 2017 at 6:53 am #

    Hi Jason,

    from datetime import datetime
    from pandas import read_csv

    def parser(x):
        return datetime.strptime('190'+x, '%Y-%m')
    series = read_csv('/home/administrator/Downloads/shampoo.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
    print(series.head())

    For these lines of code, I’m getting the following error:

    ValueError: time data ‘190Sales of shampoo over a three year period’ does not match format ‘%Y-%m’

    Please help.

    Thanks

    • Jason Brownlee July 14, 2017 at 8:37 am #

      Check that you have deleted the footer in the raw data file.
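If editing the file by hand is not convenient, the footer can also be skipped while loading. The sketch below uses only the standard library and an inline stand-in for the CSV file; the sample content, including the header names, is illustrative. It keeps only rows with exactly two fields and a date that the tutorial's `parser` accepts.

```python
import csv
import io
from datetime import datetime

# Inline stand-in for shampoo.csv, including a trailing footer line.
raw = '''"Month","Sales"
"1-01",266.0
"1-02",145.9
"1-03",183.1

Sales of shampoo over a three year period
'''


def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')


rows = []
for record in csv.reader(io.StringIO(raw)):
    # Blank lines and the one-field footer are dropped here.
    if len(record) != 2:
        continue
    try:
        rows.append((parser(record[0]), float(record[1])))
    except ValueError:
        # The header row fails date parsing and is skipped.
        continue

print(rows)
```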

  38. Kushal July 14, 2017 at 6:53 pm #

    Hi Jason

    Does ARIMA have any limitations on the size of the sample? I have a dataset with 18k rows of data, and ARIMA just doesn’t complete.

    Thanks

    Kushal

    • Jason Brownlee July 15, 2017 at 9:41 am #

      Yes, it does not work well with lots of data (the linalg methods under the covers blow up) and it can take forever, as you have seen.

      You could fit the model using gradient descent, but not with statsmodels. You may need to code it yourself.

  39. Olivia July 18, 2017 at 4:51 am #

    Love this. The code is very straightforward and the explanations are nice.
    I would like to see an HMM model on here. I have been struggling with a few different packages (pomegranate and hmmlearn) for some time now. I would like to see what you can do with it (particularly a stock market example)!

    • Jason Brownlee July 18, 2017 at 8:48 am #

      Thanks Olivia, I hope to cover HMMs in the future.

  40. Ben July 19, 2017 at 11:27 am #

    Good evening,
    In what I am doing, I have a training set and a test set. I am fitting an ARIMA model, let’s say ARIMA(0,1,1), to the training set. What I want to do is apply this model to the test set to get the residuals.
    So far I have:
    model = ARIMA(data,order = (0,1,1))
    model_fit = model.fit(disp=0)
    res = model_fit.resid
    This gives me the residuals for the training set. So I want to apply the ARIMA model in ‘model’ to the test data.
    Is there a function to do this?
    Thank you

    • Jason Brownlee July 19, 2017 at 4:09 pm #

      Hi Ben,

      You could use your fitted model to make predictions for the test dataset, then compare the predictions with the real values to calculate the residual errors.
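The comparison step can be sketched in a few lines of plain Python. The helper name `residual_errors` and the numbers are made up for illustration; in practice `test_predictions` would hold the forecasts produced by the fitted model for the test period.

```python
def residual_errors(expected, predicted):
    """Return expected minus predicted for each time step."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return [e - p for e, p in zip(expected, predicted)]


# Made-up numbers purely for illustration.
test_values = [10.0, 12.0, 11.0]
test_predictions = [9.5, 12.5, 11.0]

residuals = residual_errors(test_values, test_predictions)
print(residuals)
```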
