How to Make Out-of-Sample Forecasts with ARIMA in Python

Making out-of-sample forecasts can be confusing when getting started with time series data.

The statsmodels Python API provides functions for performing one-step and multi-step out-of-sample forecasts.

In this tutorial, you will clear up any confusion you have about making out-of-sample forecasts with time series data in Python.

After completing this tutorial, you will know:

  • How to make a one-step out-of-sample forecast.
  • How to make a multi-step out-of-sample forecast.
  • The difference between the forecast() and predict() functions.

Let’s get started.

Photo by dziambel, some rights reserved.

Tutorial Overview

This tutorial is broken down into the following 5 steps:

  1. Minimum Daily Temperatures Dataset
  2. Split Dataset
  3. Develop Model
  4. One-Step Out-of-Sample Forecast
  5. Multi-Step Out-of-Sample Forecast


1. Minimum Daily Temperatures Dataset

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.

Learn more about the dataset on Data Market.

Download the Minimum Daily Temperatures dataset to your current working directory with the filename “daily-minimum-temperatures.csv”.

Note: The downloaded file contains some question mark (“?”) characters that must be removed before you can use the dataset. Open the file in a text editor and remove the “?” characters. Also, remove any footer information in the file.

The example below loads the dataset as a Pandas Series.
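The original listing is not reproduced here; below is a sketch of the loading step using the current pandas API (the post predates the removal of Series.from_csv, which it used). A tiny inline CSV stands in for daily-minimum-temperatures.csv so the sketch is self-contained:

```python
from io import StringIO

import pandas as pd

# a few rows standing in for daily-minimum-temperatures.csv
csv_text = """Date,Temp
1981-01-01,20.7
1981-01-02,17.9
1981-01-03,18.8
1981-01-04,14.6
1981-01-05,15.8
"""

# for the real file: pd.read_csv('daily-minimum-temperatures.csv', index_col=0, parse_dates=True)
series = pd.read_csv(StringIO(csv_text), index_col=0, parse_dates=True).squeeze('columns')
print(series.head(20))
# series.plot(); pyplot.show()  # draws the line plot
```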

Running the example prints the first 20 rows of the loaded dataset.

A line plot of the time series is also created.

Minimum Daily Temperatures Dataset Line Plot

2. Split Dataset

We can split the dataset into two parts.

The first part is the training dataset that we will use to prepare an ARIMA model. The second part is the test dataset that we will pretend is not available. It is these time steps that we will treat as out of sample.

The dataset contains data from January 1st 1981 to December 31st 1990.

We will hold back the last 7 days of the dataset from December 1990 as the test dataset and treat those time steps as out of sample.

Specifically, 1990-12-25 to 1990-12-31.

The code below will load the dataset, split it into the training and validation datasets, and save them to files dataset.csv and validation.csv respectively.
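The original listing is not reproduced here; a sketch of the split, with a synthetic daily series standing in for the loaded dataset so it runs on its own (the file names dataset.csv and validation.csv follow the post):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the loaded Minimum Daily Temperatures series
idx = pd.date_range('1981-01-01', '1990-12-31', freq='D')
series = pd.Series(np.random.default_rng(1).normal(11, 4, len(idx)), index=idx)

# hold back the last 7 observations as the out-of-sample set
split_point = len(series) - 7
dataset, validation = series[:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv', header=False)
validation.to_csv('validation.csv', header=False)
```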

Run the example and you should now have two files to work with.

The last observation in dataset.csv is Christmas Eve 1990.

That means Christmas Day 1990 and onwards are out-of-sample time steps for a model trained on dataset.csv.

3. Develop Model

In this section, we are going to make the data stationary and develop a simple ARIMA model.

The data has a strong seasonal component. We can neutralize this and make the data stationary by taking the seasonal difference. That is, we can take the observation for a day and subtract the observation from the same day one year ago.

This will result in a stationary dataset to which we can fit a model.

We can invert this operation by adding the value of the observation one year ago. We will need to do this to any forecasts made by a model trained on the seasonally adjusted data.
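These two operations can be sketched as a pair of helper functions (the names difference and inverse_difference match those used in the post's listings; the toy series with a "season" of 3 steps is just for illustration):

```python
import numpy as np

def difference(dataset, interval=1):
    # subtract the observation from `interval` time steps earlier
    diff = [dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))]
    return np.array(diff)

def inverse_difference(history, yhat, interval=1):
    # add back the observation from `interval` time steps earlier
    return yhat + history[-interval]

# round-trip check on a toy series with a "season" of 3 steps
X = [10.0, 20.0, 30.0, 12.0, 22.0, 33.0]
diffed = difference(X, interval=3)
print(diffed)  # [2. 2. 3.]
restored = inverse_difference(X[:3], diffed[0], interval=3)
print(restored)  # 12.0
```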

We can fit an ARIMA model.

Fitting a strong ARIMA model to the data is not the focus of this post, so rather than going through the analysis of the problem or grid searching parameters, I will choose a simple ARIMA(7,0,1) configuration.

We can put all of this together as follows:

Running the example loads the dataset, takes the seasonal difference, then fits an ARIMA(7,0,1) model and prints a summary of the fit model.

We are now ready to explore making out-of-sample forecasts with the model.

4. One-Step Out-of-Sample Forecast

ARIMA models are great for one-step forecasts.

A one-step forecast is a forecast of the very next time step in the sequence from the available data used to fit the model.

In this case, we are interested in a one-step forecast of Christmas Day 1990:

Forecast Function

The statsmodels ARIMAResults object provides a forecast() function for making predictions.

By default, this function makes a one-step out-of-sample forecast, so we can call it directly to make our forecast. The result of forecast() contains the forecast value, the standard error of the forecast, and the confidence interval information; here, we are only interested in the first element, the forecast value.

Once made, we can invert the seasonal difference and convert the value back into the original scale.

The complete example is listed below:

Running the example prints 14.8 degrees, which is close to the expected 12.9 degrees in the validation.csv file.

Predict Function

The statsmodels ARIMAResults object also provides a predict() function for making forecasts.

The predict function can be used to predict arbitrary in-sample and out-of-sample time steps, including the next out-of-sample forecast time step.

The predict() function requires a start and an end to be specified; these can be the indexes of the time steps relative to the beginning of the training data used to fit the model, for example:

The start and end can also be a datetime string or a “datetime” type; for example:

and

Using anything other than the time step indexes results in an error on my system, as follows:

Perhaps you will have more luck; for now, I am sticking with the time step indexes.

The complete example is listed below:

Running the example prints the same forecast as above when using the forecast() function.

You can see that the predict function is more flexible. You can specify any point or contiguous forecast interval in or out of sample.

Now that we know how to make a one-step forecast, we can make some multi-step forecasts.

5. Multi-Step Out-of-Sample Forecast

We can also make multi-step forecasts using the forecast() and predict() functions.

It is common with weather data to make one week (7-day) forecasts, so in this section we will look at predicting the minimum daily temperature for the next 7 out-of-sample time steps.

Forecast Function

The forecast() function has an argument called steps that allows you to specify the number of time steps to forecast.

By default, this argument is set to 1 for a one-step out-of-sample forecast. We can set it to 7 to get a forecast for the next 7 days.

We can then invert each forecast time step, one at a time, and print the values. Note that to invert the forecast value for t+2, we need the inverted forecast value for t+1. Here, we add each inverted value to the end of a list called history for use when calling inverse_difference().

The complete example is listed below:

Running the example prints the forecast for the next 7 days.

Predict Function

The predict() function can also forecast the next 7 out-of-sample time steps.

Using time step indexes, we can specify the end index as 6 more time steps in the future; for example:

The complete example is listed below.

Running the example produces the same results as calling the forecast() function in the previous section, as you would expect.

Summary

In this tutorial, you discovered how to make out-of-sample forecasts in Python using statsmodels.

Specifically, you learned:

  • How to make a one-step out-of-sample forecast.
  • How to make a 7-day multi-step out-of-sample forecast.
  • How to use both the forecast() and predict() functions when forecasting.

Do you have any questions about out-of-sample forecasts, or about this post? Ask your questions in the comments and I will do my best to answer.


117 Responses to How to Make Out-of-Sample Forecasts with ARIMA in Python

  1. Steve March 24, 2017 at 10:44 pm #

    Your tutorials are the most helpful machine learning resources I have found on the Internet and have been hugely helpful in work and personal side projects. I don’t know if you take requests but I’d love to see a series of posts on recommender systems one of these days!

  2. Tim April 27, 2017 at 12:43 pm #

    Hi,

    This is a really nice example. Do you know if the ARIMA class allows to define the specification of the model without going through the fitting procedure. Let’s say I have parameters that were estimated using a dataset that I no longer have but I still want to produce a forecast.

    Thanks

  3. masum May 11, 2017 at 8:32 pm #

    sir,

    would it be possible to do the same using LSTM RNN ?

    if it is would you please come up with a blog?

    Thanking you

  4. masum May 12, 2017 at 8:29 pm #

    I tried to run the above example without any seasonal difference with given below code.

    from pandas import Series
    from matplotlib import pyplot
    from pandas import Series
    from statsmodels.tsa.arima_model import ARIMA
    # load dataset
    series = Series.from_csv('daily-minimum-temperatures.csv', header=0)
    print(series.head(20))
    series.plot()
    pyplot.show()

    split_point = len(series) - 7
    dataset, validation = series[0:split_point], series[split_point:]
    print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
    dataset.to_csv('dataset.csv')
    validation.to_csv('validation.csv')

    series = Series.from_csv('dataset.csv', header=None)
    model = ARIMA(series, order=(7,0,1))
    model_fit = model.fit(disp=0)

    forecast = model_fit.forecast(steps=7)[0]
    print('Forecast: %f' % forecast)

    for the code i am getting an error:

    TypeError: only length-1 arrays can be converted to Python scalars

    how can i solve this? it does well for single step forecast

    • Jason Brownlee May 13, 2017 at 6:13 am #

      I would recommend double checking your data, make sure any footer information was deleted.

  5. Hans June 1, 2017 at 12:58 am #

    What does ‘seasonal difference’ mean?

    And what are the details of:

    ‘Once made, we can invert the seasonal difference and convert the value back into the original scale.’

    Is it worth to test this code with non-seasonal data or is there another ARIMA-tutorial for non-seasonal approaches on this site?

  6. Hans June 15, 2017 at 11:27 am #

    If I pretend data in test-partition is not given, does this tutorial do the same except of the seasonal cleaning?

    http://machinelearningmastery.com/tune-arima-parameters-python/

  7. Hans June 15, 2017 at 11:29 am #

    Can I obtain a train RMSE from this example. Is training involved?

    • Jason Brownlee June 16, 2017 at 7:47 am #

      The model is trained, then the trained model is used to make a forecast.

      Consider reading and working through the tutorial.

      • Hans June 16, 2017 at 12:16 pm #

        I did so several times.
        How can I obtain a train RMSE from the model?

        • Jason Brownlee June 17, 2017 at 7:20 am #

          See this post on how to estimate the skill of a model prior to using it to make out of sample predictions:
          http://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

          See this post to understand the difference between evaluating a model and using a final model to make predictions:
          http://machinelearningmastery.com/train-final-machine-learning-model/

          • Hans June 19, 2017 at 5:35 am #

            I actually meant obtain a train RMSE from the model in the example.
            As I understand the model was trained before making an out of sample prediction.
            If we place a

            print(model_fit.summary())

            right after fitting/training it prints some information’s, but no train RMSE.

            A)
            Is there a way to use the summery-information to obtain a train RMSE?
            B)
            Is there a way in Python to obtain all properties and methods from the model_fit object- like in other languages?

          • Jason Brownlee June 19, 2017 at 8:47 am #

            Yes, this tutorial assumes you have already estimated the skill of your model and are now ready to use it to make forecasts.

            Estimating the skill of the model is a different task. You can do this using walk forward validation or a train/test split evaluation.

      • Hans June 16, 2017 at 3:06 pm #

        Is this the line where the training happens?

        model = ARIMA(differenced, order=(7,0,1))

        • Jason Brownlee June 17, 2017 at 7:22 am #

          No here:

        • Hans June 25, 2017 at 12:29 pm #

          Yes I know. I actually thought there could be a direct answer to A) and B).
          I would use it for archiving.

  8. Hans June 15, 2017 at 12:40 pm #

    If I write: ‘split_point = len(series) – 0’ while my last datapoint in dataset is from today.

    Would I have a valid forecast for tomorrow?

  9. M.Swefy June 22, 2017 at 12:39 am #

    thanks a lot for the nice detailed article, i followed all steps and they all seem working properly, i seek your support Dr. to help me organize my project.

    i have a raw data for temperature readings for some nodes (hourly readings), i selected the training set and divided them to test and training sets.
    i used ARINA model to train and test and i got Test MSE: 3.716.

    now i need to expose the mass raw data to the trained model, then get the forecased values vs. the actual values in the same csv file.

    what should i do

  10. AMU June 23, 2017 at 5:33 am #

    Thank you Jason for this wonderful post… It is very detailed and easy to understand..

    Do you also have something similar for LSTM Neural Network algorithm as well? something like – How to Make Out-of-Sample Forecasts with LSTM in Python.

    If not, will you write one blog like this with detail explanation? I am sure there are lot of people have the same question.

    • Jason Brownlee June 23, 2017 at 6:45 am #

      Almost every post I have on LSTMs shows how to make out of sample forecasts. The code is wrapped up in the walk-forward validation.

  11. Franklin July 1, 2017 at 1:09 am #

    Hi Jason,

    Thanks a lot for this lesson. It was pretty straightforward and easy to follow. It would have been a nice bonus to show how to evaluate the forecasts though with standard metrics. We separated the validation set out and forecasted values for that week, but didn’t compare to see how accurate the forecast was.

    On that note, I want to ask, does it make sense to use R^2 to score a time series forecast against test data? I’m trying to create absolute benchmarks for a time series that I’m analyzing and want to report unit-independent metrics, i.e. not standard RMSE that is necessarily expressed in the problem’s unit scale. What about standardizing the data using zero mean and unit variance, fitting ARIMA, forecasting, and reporting that RMSE? I’ve been doing this and taking the R^2 and the results are pretty interpretable. RMSE: 0.149 / R^2: 0.8732, but I’m just wondering if doing things this way doesn’t invalidate something along the way. Just want to be correct in my process.

    Thanks!

    • Jason Brownlee July 1, 2017 at 6:37 am #

      We do that in other posts. Tens of other posts in fact.

      This post was laser focused on “how do I make a prediction when I don’t know the real answer”.

      Yes, if R^2 is meaningful to you, that you can interpret it in your domain.

      Generally, I recommend inverting all transforms on the prediction and then evaluating model skill at least for RMSE or MAE where you want apples-to-apples. This may be less of a concern for an R^2.

  12. Vishanth July 19, 2017 at 6:56 am #

    Seriously amazing. Thanks a lot professor

  13. Kirui July 20, 2017 at 5:15 pm #

    I get this error from your code

    Traceback (most recent call last):
    File "..", line 22, in
    differenced = difference(X, days_in_year)
    File "..", line 9, in difference
    value = dataset[i] - dataset[i - interval]
    TypeError: unsupported operand type(s) for -: 'str' and 'str'

    Cant tell where the problem is.

    • Jason Brownlee July 21, 2017 at 9:31 am #

      Perhaps check that you have loaded your data correct (as real values) and that you have copied all of the code from the post without extra white space.

  14. Antoine August 23, 2017 at 1:00 am #

    Hi Jason,
    Thanks for this detailled explanation. Very clear.

    Do you know if it is possible to use the fitted parameters of an ARMA model (ARMAResults.params) and apply it on an other data set ?

    I have an online process that compute a forecasting and I would like to have only one learning process (one usage of the fit() function). The rest of the time, I would like to applied the previously found parameters to the data.

    Thanks in advance !

  15. Bob October 6, 2017 at 11:53 pm #

    Ciao Jason,
    Thanks for this tutorial and all the time series related ones. There is always a sense of order in how you write both posts and code.
    I’m by the way still confused about something which is probably more conceptual about ARIMA.
    The ARIMA parameters specify the lag which it uses to forecast.
    In your case you used p=7 for example so that you would take into consideration the previous week.
    A first silly question is why do I need to fit an entire year of data if Im only looking at my window/lags ?
    The second question is that fitting my model I get an error which is really minimal even if I use a short training (2 days vs 1 year) which would reinforce my first point.
    What am I missing?
    Thanks

    • Jason Brownlee October 7, 2017 at 5:56 am #

      The model needs lots of examples in order to generalize to new cases.

      More data is often better, to a point of diminishing returns in terms of model skill.

  16. Kai October 31, 2017 at 12:02 pm #

    Hi Jason. Thanks for this awesome post.
    But I have a question that is it possible to fit a multivariable time series using ARIMA model? Let’s say we have a 312-dimension at each time step in the dataset.
    Thanks!

    • Jason Brownlee October 31, 2017 at 2:51 pm #

      Yes, but you will need to use an extension of ARIMA called ARIMAX. I do not have an example, sorry.

  17. Dave J November 5, 2017 at 7:12 am #

    Hi Dr Brownlee, thanks so much for the tutorials!

    I’ve searched but didn’t find anyhting – perhaps my fault…

    But do you have any tutorials or suggestions about forecasting with limited historical observations? Specifically, I’m in a position where some sensors may have a very limited set of historical observations (complete, but short, say it’s only been online for a month), but I have many sensors which could possibly be used as historical analogies (multiple years of data).

    I’ve considered constructing a process that uses each large-history sensor as the “Training” set, and iterating over each sensor and finding which sensor best predicts the observed readings for the newer sensors.

    However I’m struggling to find any established best practices for this type of thing. Do you have any suggestions for me?

    If not I understand, but I really appreciate all the insight you’ve given over these tutorials and in your book!

    • Jason Brownlee November 6, 2017 at 4:45 am #

      Great question.

      You might be able to use the historical data or models for different but similar sensors (one some dimension). Get creative!

      • Dave J November 6, 2017 at 10:53 am #

        I would likely just be looking at the RMSE and MAE to gauge accuracy, correct? Is there another measure of fitness I would be wise to consider?

        • Jason Brownlee November 7, 2017 at 9:45 am #

          No MSE and RMSE are error scores for regression problems. Accuracy is for classification problems (predicting a label).

  18. Debola November 11, 2017 at 5:28 am #

    Hi, Geat tutorial. A question about the difference function. How is it accounting for leap years?

    • Jason Brownlee November 11, 2017 at 9:24 am #

      It doesn’t, that would be a good extension to this tutorial.

      • Debola November 12, 2017 at 12:37 am #

        Is it possible to apply seasonal_decompose on the dataset used in this tutorial since it’s a daily forecast. Most applications of seasonal_decompose i have seen are usually on monthly and quarterly data

  19. Akanksha November 19, 2017 at 4:32 am #

    Thank you for an amazing tutorial. I wanted to ask if I can store the multiple step values that are predicted in the end of your tutorial into a variable for comparison with actual/real values?

    • Jason Brownlee November 19, 2017 at 11:10 am #

      Sure, you can assign them to a variable or save them to file.

      • Jonathon July 29, 2018 at 10:45 am #

        Thank you for the amazing blog!, I am finding it difficult to assign multi-step values to variable, Could you please help me with the same.

        Thanks in Advance!

    • Kapil July 29, 2018 at 10:36 pm #

      Hi Jason, Thank you for the amazing blog, could you please help me with assigning multi-step predict values to variable.

      • Jason Brownlee July 30, 2018 at 5:48 am #

        You can use the forecast() function and specify the number of steps.

        • kapil August 8, 2018 at 2:31 am #

          Thank you for your response Jason, I am getting different values with forecast() function and with predict() function, Predict function values are more accurate so I want them to assigned to variable, Can that be done? If yes what changes can I make.

          Thanks in Advance!

          • Jason Brownlee August 8, 2018 at 6:23 am #

            That is surprising, if not impossible.

            Perhaps confirm that you are providing the same arguments/data/model in both cases?

          • Kapil August 8, 2018 at 6:56 am #

            No Worries, I got it – Thank you

  20. Satyajit Pattnaik December 21, 2017 at 5:01 pm #

    @Jason, Thanks for this, but my dataset is in a different format, it’s in YYYY-MM-DD HH:MI:SS, and the data is hourly data, let say if we have data till 11/25/2017 23:00 5.486691952

    And we need to predict the next day’s data, so we need to predict our next 24 steps, what needs to be done?

    Need a help on this.

    • Jason Brownlee December 22, 2017 at 5:31 am #

      Sure, you can specify the date-time format when loading the Pandas Series.

      You can predict multiple steps using the predict() function.

  21. Satyajit Pattnaik December 21, 2017 at 8:02 pm #

    One more question on top of my previous question,
    let say my data is hourly data, and i have one week’s data as of now, as per your code do i have to take the days_in_year parameter as 7 for my case?

    And as per my data’s ACF & PACF, my model should be ARIMA(xyz, order=(4,1,2))
    and taking the days_in_year parameter as 7, is giving my results, but not sure how correct is that.. please elaborate a bit @Jason

    • Jason Brownlee December 22, 2017 at 5:32 am #

      I would recommend tuning the model to your specific data.

  22. Satyajit Pattnaik January 3, 2018 at 11:47 pm #

    Hi Jason,

    I am bugging you, but here’s my last question, my model is ready and i have predicted the p,d,q values as per the ACF, PACF plots.

    Now my code looks like this:

    Here, as i am appending obs to the history data, what if i add my prediction to history and then pass it to the model, do i have to run this in a loop to predict pdq values again in a loop?

    My question is, if we are doing Recursive multi step forecast do we have to run the history data to multiple ARIMA models, or can we just use history.append(yhat) in the above code and get my results?

    • Jason Brownlee January 4, 2018 at 8:12 am #

      Recursive multi-step means you will use predictions as history when you re-fit the model.

      • Satyajit Pattnaik January 4, 2018 at 4:48 pm #

        Reply to my previous response, so predictions to be added as history, that’s fine, we will be doing history.append(yhat) instead of history.append(obs), but do we have to run the above code using the same ARIMA model i.e. 6,1,2 or for each history we will determine the pdq values and run on multiple ARIMA models to get the next predictions?

        I hope, you are getting my point.

  23. Olagot Andree January 7, 2018 at 1:06 pm #

    Hello,
    I am actually working on a project for implicit volatility forecasting. My forecast is multi-output Your tutorial has been a lot of help but i just want to clarify something please.
    1. Is it okay to train on the all dataset and not divide it in train/test?
    2. What is the sample of data selected for the forecast function? I mean is it the 7 last values of the original dataset?

    Thank you

  24. Sooraj February 2, 2018 at 2:04 pm #

    How do we add more input parameters? Like for example, i would like to predict the weather forecast based on historic forecast but i would also like to consider, say the total number of rainy days last 10 years and have both influence my prediction?

    • Jason Brownlee February 3, 2018 at 8:32 am #

      You may have to use a different linear model such as ARIMAX.

      • Sooraj February 7, 2018 at 9:13 am #

        Thank you.

        Do you have any samples that I could learn from or use as a base to build my own forecast? Similar to the article that you shared above?

        • Jason Brownlee February 7, 2018 at 9:34 am #

          Perhaps try searching the blog and see if there is a tutorial that is a good fit?

          • Sooraj February 19, 2018 at 6:55 am #

            Will do that. Thanks!

  25. Daphne February 5, 2018 at 1:51 am #

    Hey Jason, let’s say if I wanted to forecast the value in the next 365 days, so I just simply change the line below to:

    forecast = model_fit.forecast(steps=365)[0]

    Will it works? Thanks!

  26. Chuck February 18, 2018 at 12:23 pm #

    Hi Jason,

    Thank you for sharing a such wonderful article with us which I am looking for a while.

    However, I got an error of “ValueError: The computed initial AR coefficients are not stationary.” when run your code block 5 beneath “We can put all of this together as follows:”

    If I run it under Sypder, I got “cannot import name ‘recarray_select'”.

    It would be appreciated if you could give me some clue how to fix it.

    Thank you!

    Chuck

  27. masum March 9, 2018 at 12:59 pm #

    how can we calculate the total RMSE?

    • Jason Brownlee March 10, 2018 at 6:16 am #

      The square root of the mean squared differences between predicted and expected values.

  28. Rishabh Agrawal March 30, 2018 at 3:19 am #

    Hi Jason,

    Thanks for the wonderful post.

    One thing which I can’t understand is that we are forecasting for the next 7 days in the same dataset (dataset.csv) that we have trained the model on.

    In other words, in the initial steps we had split the data into ‘dataset.csv’ and ‘validation.csv’ and then we fit the ARIMA on ‘dataset.csv’ but we never called ‘validation.csv’ before making a forecast. How does it wok?

    • Jason Brownlee March 30, 2018 at 6:44 am #

      No, we are forecasting beyond the end of dataset.csv as though validation.csv does not exist. We can then look in validation.csv and see how our forecasts compare.

      Perhaps re-read the tutorial?

      • Rishabh Agrawal March 30, 2018 at 5:04 pm #

        yep! got it. Actually I have exogenous inputs as well. So, I had to use ‘validation’ dataset as well.

  29. aadi April 19, 2018 at 9:14 pm #

    Hi jason
    Can you tell why did we leave the test data as it is?
    and what if so in the above method we dont separate the training and testing data?

    • Jason Brownlee April 20, 2018 at 5:49 am #

      In the above tutorial we are pretending we are making an out of sample forecast, e.g. that we do not know the true outcome values.

  30. Serkan May 17, 2018 at 6:34 pm #

    Could you please tell about what should be changed in the code if multivariate analysis is done, i.e, if we have extra 3 features in dataset.

    • Jason Brownlee May 18, 2018 at 6:21 am #

      Different methods will need to be used. I hope to have examples soon.

  31. Piyasi Choudhury May 30, 2018 at 8:27 am #

    Hi Jason, Thanks for the post..very intuitive. I am at Step3: Developing Model. I ran through the other doc on: how to choose your grid params for ARIMA configuration and came up with (10,0,0) with the lowest MSerror. I do the following:

    # seasonal difference
    X = series.values
    days_in_year = 365
    differenced = difference(X, days_in_year)

    # fit model
    model = ARIMA(differenced, order=(10,0,0))

    and get error: Insufficient degrees of freedom to estimate.

    My data is on monthly level (e.g. 1/31/2014, 2/28/2014, 3/31/2014)..I have 12 readings from each year of 2014-2017+3 readings from 2018 making it 52 readings. Do I have to change the #seasonal difference based on this?

    Thanks

    • Jason Brownlee May 30, 2018 at 3:07 pm #

      It is a good idea to seasonally adjust if you have a seasonal component or model it directly via SARIMA.

    • vamshi December 4, 2018 at 9:24 pm #

      i am getting same problem what should i do to rectify it

  32. SJ June 17, 2018 at 6:00 am #

    @ Jason

    Thank you for your article, this is helpful.
    I used Shampo sales dataset and used ARIMA Forecast & Predict function for next 12 months but i get different results.

    • Jason Brownlee June 18, 2018 at 6:36 am #

      Perhaps you have done something different to the tutorial?

  33. Rasangika June 23, 2018 at 8:42 pm #

    Hello sir,

    Can you please tell me how i can take the predicted output to a CSV ?
    Thank you!

  34. Kay July 10, 2018 at 6:34 am #

    Hi, @Jason
    I am trying to use predict(start, end), and I found only integer parameter will work. I want to specify the start and end by a date, but it gives me an error:
    ‘only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices’
    I have searched a lot online, but none of them work. Thank you so much!

    • Jason Brownlee July 10, 2018 at 6:54 am #

      The API says it does support dates, and I assume your data must be a pandas Series. I have not tried it though, sorry.

  35. Shivaprasad July 20, 2018 at 5:23 pm #

    If my dataset is less than 365 days it is showng an error in the below code:If my dataset is of just 50rows how that can be perfomed?

    from pandas import Series
    from statsmodels.tsa.arima_model import ARIMA
    import numpy

    # create a differenced series
    def difference(dataset, interval=1):
        diff = list()
        for i in range(interval, len(dataset)):
            value = dataset[i] - dataset[i - interval]
            diff.append(value)
        return numpy.array(diff)

    # invert differenced value
    def inverse_difference(history, yhat, interval=1):
        return yhat + history[-interval]

    # load dataset
    series = Series.from_csv('dataset.csv', header=None)
    # seasonal difference
    X = series.values
    days_in_year = 365
    differenced = difference(X, days_in_year)
    # fit model
    model = ARIMA(differenced, order=(7,0,1))
    model_fit = model.fit(disp=0)
    # multi-step out-of-sample forecast
    forecast = model_fit.forecast(steps=7)[0]
    # invert the differenced forecast to something usable
    history = [x for x in X]
    day = 1
    for yhat in forecast:
        inverted = inverse_difference(history, yhat, days_in_year)
        print('Day %d: %f' % (day, inverted))
        history.append(inverted)
        day += 1
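With only 50 rows and interval=365, the difference() call returns an empty array, which is what breaks the fit. A minimal guard (plain Python; the safe_difference helper is hypothetical, not from the tutorial) would fall back to a first difference for short series:

```python
# Fall back to a first difference when the series is shorter
# than the requested seasonal interval (e.g. 50 rows vs. 365 days).
def safe_difference(dataset, interval):
    if len(dataset) <= interval:
        interval = 1
    diff = [dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))]
    return diff, interval

values = list(range(50))  # stand-in for a 50-row dataset
diff, used_interval = safe_difference(values, 365)
print(used_interval, len(diff))  # → 1 49
```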

  36. Fel September 2, 2018 at 8:23 am #

    I am trying to apply this code to other dataset, but I get this error. Please, any help?

    C:\Users\Fel\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:676: RuntimeWarning: divide by zero encountered in true_divide
    invmacoefs = -np.log((1-macoefs)/(1+macoefs))
    C:\Users\Fel\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:650: RuntimeWarning: invalid value encountered in true_divide
    newparams = ((1-np.exp(-params))/(1+np.exp(-params))).copy()
    C:\Users\Fel\Anaconda3\lib\site-packages\statsmodels\tsa\tsatools.py:651: RuntimeWarning: invalid value encountered in true_divide
    tmp = ((1-np.exp(-params))/(1+np.exp(-params))).copy()
    ---------------------------------------------------------------------------
    LinAlgError Traceback (most recent call last)
    in ()
    24 # fit model
    25 model = ARIMA(differenced, order=(7,0,1))
    ---> 26 model_fit = model.fit(disp=0)
    27 # multi-step out-of-sample forecast
    28 forecast = model_fit.forecast(steps=period_forecast)[0]

    ~\Anaconda3\lib\site-packages\statsmodels\tsa\arima_model.py in fit(self, start_params, trend, method, transparams, solver, maxiter, full_output, disp, callback, start_ar_lags, **kwargs)
    957 maxiter=maxiter,
    958 full_output=full_output, disp=disp,
    ---> 959 callback=callback, **kwargs)
    960 params = mlefit.params
    961

    ~\Anaconda3\lib\site-packages\statsmodels\base\model.py in fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    464 callback=callback,
    465 retall=retall,
    ---> 466 full_output=full_output)
    467
    468 # NOTE: this is for fit_regularized and should be generalized

    ~\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit(self, objective, gradient, start_params, fargs, kwargs, hessian, method, maxiter, full_output, disp, callback, retall)
    189 disp=disp, maxiter=maxiter, callback=callback,
    190 retall=retall, full_output=full_output,
    ---> 191 hess=hessian)
    192
    193 optim_settings = {'optimizer': method, 'start_params': start_params,

    ~\Anaconda3\lib\site-packages\statsmodels\base\optimizer.py in _fit_lbfgs(f, score, start_params, fargs, kwargs, disp, maxiter, callback, retall, full_output, hess)
    408 callback=callback, args=fargs,
    409 bounds=bounds, disp=disp,
    ---> 410 **extra_kwargs)
    411
    412 if full_output:

    ~\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in fmin_l_bfgs_b(func, x0, fprime, args, approx_grad, bounds, m, factr, pgtol, epsilon, iprint, maxfun, maxiter, disp, callback, maxls)
    197
    198 res = _minimize_lbfgsb(fun, x0, args=args, jac=jac, bounds=bounds,
    ---> 199 **opts)
    200 d = {'grad': res['jac'],
    201 'task': res['message'],

    ~\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
    333 # until the completion of the current minimization iteration.
    334 # Overwrite f and g:
    ---> 335 f, g = func_and_grad(x)
    336 elif task_str.startswith(b'NEW_X'):
    337 # new iteration

    ~\Anaconda3\lib\site-packages\scipy\optimize\lbfgsb.py in func_and_grad(x)
    278 if jac is None:
    279 def func_and_grad(x):
    ---> 280 f = fun(x, *args)
    281 g = _approx_fprime_helper(x, fun, epsilon, args=args, f0=f)
    282 return f, g

    ~\Anaconda3\lib\site-packages\scipy\optimize\optimize.py in function_wrapper(*wrapper_args)
    291 def function_wrapper(*wrapper_args):
    292 ncalls[0] += 1
    ---> 293 return function(*(wrapper_args + args))
    294
    295 return ncalls, function_wrapper

    ~\Anaconda3\lib\site-packages\statsmodels\base\model.py in f(params, *args)
    438
    439 def f(params, *args):
    ---> 440 return -self.loglike(params, *args) / nobs
    441
    442 if method == 'newton':

    ~\Anaconda3\lib\site-packages\statsmodels\tsa\arima_model.py in loglike(self, params, set_sigma2)
    778 method = self.method
    779 if method in ['mle', 'css-mle']:
    ---> 780 return self.loglike_kalman(params, set_sigma2)
    781 elif method == 'css':
    782 return self.loglike_css(params, set_sigma2)

    ~\Anaconda3\lib\site-packages\statsmodels\tsa\arima_model.py in loglike_kalman(self, params, set_sigma2)
    788 Compute exact loglikelihood for ARMA(p,q) model by the Kalman Filter.
    789 """
    ---> 790 return KalmanFilter.loglike(params, self, set_sigma2)
    791
    792 def loglike_css(self, params, set_sigma2=True):

    ~\Anaconda3\lib\site-packages\statsmodels\tsa\kalmanf\kalmanfilter.py in loglike(cls, params, arma_model, set_sigma2)
    647 loglike, sigma2 = kalman_loglike.kalman_loglike_double(y, k,
    648 k_ar, k_ma, k_lags, int(nobs), Z_mat,
    ---> 649 R_mat, T_mat)
    650 elif issubdtype(paramsdtype, np.complex128):
    651 loglike, sigma2 = kalman_loglike.kalman_loglike_complex(y, k,

    kalman_loglike.pyx in statsmodels.tsa.kalmanf.kalman_loglike.kalman_loglike_double()

    kalman_loglike.pyx in statsmodels.tsa.kalmanf.kalman_loglike.kalman_filter_double()

    ~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in pinv(a, rcond)
    1722 return wrap(res)
    1723 a = a.conjugate()
    ---> 1724 u, s, vt = svd(a, full_matrices=False)
    1725
    1726 # discard small singular values

    ~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in svd(a, full_matrices, compute_uv)
    1442
    1443 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
    ---> 1444 u, s, vh = gufunc(a, signature=signature, extobj=extobj)
    1445 u = u.astype(result_t, copy=False)
    1446 s = s.astype(_realType(result_t), copy=False)

    ~\Anaconda3\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
    96
    97 def _raise_linalgerror_svd_nonconvergence(err, flag):
    ---> 98 raise LinAlgError("SVD did not converge")
    99
    100 def get_linalg_error_extobj(callback):

    LinAlgError: SVD did not converge

    • Jason Brownlee September 3, 2018 at 6:09 am #

      Perhaps try some other configurations of the model?
      Perhaps try to scale or difference your data first?
      Perhaps try more or less data?
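As a concrete example of the scaling suggestion, a minimal min-max rescaling in plain Python (a sketch; the values are made up):

```python
# Rescale values to [0, 1] before fitting, which can help the optimizer
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([20.0, 25.0, 30.0])
print(scaled)  # → [0.0, 0.5, 1.0]
```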

  37. Tejas Haritsa V K September 7, 2018 at 8:10 pm #

    Truly an outstanding work. I had been searching all over the net for the forecast and predict functions and this made my day. Thank you for this wonderful knowledge.

    Do share your YouTube channel link if you have a channel, I would love to subscribe.

    • Jason Brownlee September 8, 2018 at 6:04 am #

      Thanks.

      I don’t make videos. Developers learn by doing, not watching.

  38. Ashutosh Sharma September 17, 2018 at 7:09 am #

    I get this error from your code

    Traceback (most recent call last):
    File "..", line 22, in
    differenced = difference(X, days_in_year)
    File "..", line 9, in difference
    value = dataset[i] - dataset[i - interval]
    TypeError: unsupported operand type(s) for -: 'str' and 'str'

    Can't tell where the problem is.

    • Jason Brownlee September 17, 2018 at 2:07 pm #

      Ensure that you copy the complete example and preserve indenting.

  39. Bhadri October 1, 2018 at 3:42 am #

    Thanks Jason. this is very helpful.

    When I run the original dataset, train it, and test it, I get an MSE of 0.09, which is very good, using (p,d,q) of (2,1,0).

    My dataset contains 60 observations, out of which I push 12 to a validation set.

    When I forecasted using steps=12 and computed the MSE against the validation set, I got an MSE of 0.42.
    Is this expected, and is it a good measure?

    regards
    Bhadri.

  40. SN October 9, 2018 at 10:47 pm #

    Hi Jason,

    Thanks ever so much for this post! Your posts are all very clear and easy to follow. I cannot study the heavily mathematical stuff; it just confuses me.

    I have a question. If my daily data is for Mondays-Fridays, should I adjust the number of days in a year to 194 instead of 365? That is the total number of days in this year excluding holidays and weekends in Germany.

    Regards,

    S:N

  41. PyTom October 11, 2018 at 11:39 pm #

    Dear Jason, thank you very much for the tutorial. Is it normal that the performance of the predictor degrades if I make a long-term prediction (for instance, 200 steps)? In particular, I observe that the prediction converges to a certain value. What can I do to perform a long-term out-of-sample prediction?

    • Jason Brownlee October 12, 2018 at 6:40 am #

      Yes, the further into the future you predict, the worse the performance. Predicting the future is very hard.

  42. Raghu October 15, 2018 at 12:46 am #

    Hi Jason, Thank you very much for the post.
    I ran a stationarity test on the provided dataset with the Augmented Dickey-Fuller method, and below are the results:

    Test Statistic -4.445747
    p-value 0.000246
    #Lags Used 20.000000
    Number of Observation Used 3629.000000
    Critical Value (1%) -3.432153
    Critical Value (5%) -2.862337
    Critical Value (10%) -2.567194
    The results show that the data looks stationary. So my questions are:

    1. Even though the data is stationary, why did you apply seasonal differencing?
    2. You took a seasonal difference of the data, yet the d parameter of the ARIMA model is still 0 (ARIMA(7,0,1)). Isn't it required to set d > 0 (the number of differences taken) when differencing has been applied to the data?

    • Jason Brownlee October 15, 2018 at 7:30 am #

      The problem is easier to model with the seasonality removed.

      The d parameter is intended to counter any trend, there is no trend, therefore d can remain zero.

  43. July October 24, 2018 at 1:06 pm #

    Hi, this is wonderful.
    I have a small question about out-of-sample one-step forecasts for several days. For example, I need to predict data from 1990-12-25 to 1990-12-31, and I want to use a one-step forecast for every day. How can I do that using predict() or forecast()? Thanks.

    • Jason Brownlee October 24, 2018 at 2:48 pm #

      I believe the example in the tutorial above does this. Perhaps I don’t understand your question?

      • July October 25, 2018 at 1:38 am #

        Well, thanks for the reply.
        Let's talk about the 7 data points from 1990-12-25 to 1990-12-31 that need to be forecasted. In your tutorial, you use forecast(steps=7) to get the forecasts in one call, but I want to call forecast(steps=1) 7 times instead. With forecast(steps=7), each newly predicted value affects the next value to be predicted (for example, the prediction for 1990-12-25 affects the prediction for 1990-12-26). With forecast(steps=1), every prediction can be based on real data. That is to say, when predicting 1990-12-26, the real value for 1990-12-25 would be added to the model, not the predicted value as in forecast(steps=7). My question is how to program this dynamic data update using statsmodels.
        Forgive my unskilled expression.

        • Jason Brownlee October 25, 2018 at 8:02 am #

          Ahh, I see, thanks.

          I assume that real observations are made available after each prediction, so that they can be used as input.

          The simplest answer is to re-fit the model with the new obs and make a 1-step prediction.

          The complex answer is to study the API/code and figure out how to provide the dynamic input, I’m not sure off the cuff if the statsmodel API supports this usage.

          Also, this may help for the latter:
          https://machinelearningmastery.com/make-manual-predictions-arima-models-python/
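The simple re-fit approach can be sketched as a walk-forward loop. To keep the sketch self-contained, a persistence forecast stands in for the ARIMA re-fit; in practice the yhat line would be replaced by fitting ARIMA on history and taking a one-step forecast:

```python
# Walk-forward one-step forecasting: each real observation is appended
# to the history before the next forecast is made.
def walk_forward(train, test):
    history = list(train)
    predictions = []
    for obs in test:
        yhat = history[-1]  # persistence stand-in; re-fit ARIMA here
        predictions.append(yhat)
        history.append(obs)  # the real value becomes available
    return predictions

print(walk_forward([10, 12, 11], [13, 14]))  # → [11, 13]
```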

  44. July October 27, 2018 at 7:38 pm #

    Thanks for your reply again.
    I have been working with the first method you mentioned. It is the correct method and meets my needs, but it has a very high time cost. I tested it on stock index data such as the DJI (NYSE), including 3000+ data points. It is very hard for the ARIMA method to make a good fit. Maybe stock data cannot be predicted.
