Time Series Forecast Study with Python: Monthly Sales of French Champagne

Time series forecasting is a process, and the only way to get good forecasts is to practice this process.

In this tutorial, you will discover how to forecast the monthly sales of French champagne with Python.

Working through this tutorial will provide you with a framework for the steps and the tools for working through your own time series forecasting problems.

After completing this tutorial, you will know:

  • How to confirm your Python environment and carefully define a time series forecasting problem.
  • How to create a test harness for evaluating models, develop a baseline forecast, and better understand your problem with the tools of time series analysis.
  • How to develop an autoregressive integrated moving average model, save it to file, and later load it to make predictions for new time steps.

Let’s get started.

  • Update Mar/2017: Fixed a typo in the code example in the “Review Residual Errors” section where the wrong model was run.
  • Updated Apr/2019: Updated the link to dataset.
Time Series Forecast Study with Python - Monthly Sales of French Champagne

Time Series Forecast Study with Python – Monthly Sales of French Champagne
Photo by Basheer Tome, some rights reserved.

Overview

In this tutorial, we will work through a time series forecasting project from end-to-end, from downloading the dataset and defining the problem to training a final model and making predictions.

This project is not exhaustive, but shows how you can get good results quickly by working through a time series forecasting problem systematically.

The steps of this project that we will through are as follows.

  1. Environment.
  2. Problem Description.
  3. Test Harness.
  4. Persistence.
  5. Data Analysis.
  6. ARIMA Models.
  7. Model Validation.

This will provide a template for working through a time series prediction problem that you can use on your own dataset.

Stop learning Time Series Forecasting the slow way!

Take my free 7-day email course and discover how to get started (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Start Your FREE Mini-Course Now!

1. Environment

This tutorial assumes an installed and working SciPy environment and dependencies, including:

  • SciPy
  • NumPy
  • Matplotlib
  • Pandas
  • scikit-learn
  • statsmodels

If you need help installing Python and the SciPy environment on your workstation, consider the Anaconda distribution that manages much of it for you.

This script will help you check your installed versions of these libraries.

The results on my workstation used to write this tutorial are as follows:

2. Problem Description

The problem is to predict the number of monthly sales of champagne for the Perrin Freres label (named for a region in France).

The dataset provides the number of monthly sales of champagne from January 1964 to September 1972, or just under 10 years of data.

The values are a count of millions of sales and there are 105 observations.

The dataset is credited to Makridakis and Wheelwright, 1989.

Download the dataset as a CSV file and place it in your current working directory with the filename “champagne.csv“.

3. Test Harness

We must develop a test harness to investigate the data and evaluate candidate models.

This involves two steps:

  1. Defining a Validation Dataset.
  2. Developing a Method for Model Evaluation.

3.1 Validation Dataset

The dataset is not current. This means that we cannot easily collect updated data to validate the model.

Therefore we will pretend that it is September 1971 and withhold the last one year of data from analysis and model selection.

This final year of data will be used to validate the final model.

The code below will load the dataset as a Pandas Series and split into two, one for model development (dataset.csv) and the other for validation (validation.csv).

Running the example creates two files and prints the number of observations in each.

The specific contents of these files are:

  • dataset.csv: Observations from January 1964 to September 1971 (93 observations)
  • validation.csv: Observations from October 1971 to September 1972 (12 observations)

The validation dataset is about 11% of the original dataset.

Note that the saved datasets do not have a header line, therefore we do not need to cater for this when working with these files later.

3.2. Model Evaluation

Model evaluation will only be performed on the data in dataset.csv prepared in the previous section.

Model evaluation involves two elements:

  1. Performance Measure.
  2. Test Strategy.

3.2.1 Performance Measure

The observations are a count of champagne sales in millions of units.

We will evaluate the performance of predictions using the root mean squared error (RMSE). This will give more weight to predictions that are grossly wrong and will have the same units as the original data.

Any transforms to the data must be reversed before the RMSE is calculated and reported to make the performance between different methods directly comparable.

We can calculate the RMSE using the helper function from the scikit-learn library mean_squared_error() that calculates the mean squared error between a list of expected values (the test set) and the list of predictions. We can then take the square root of this value to give us an RMSE score.

For example:

3.2.2 Test Strategy

Candidate models will be evaluated using walk-forward validation.

This is because a rolling-forecast type model is required from the problem definition. This is where one-step forecasts are needed given all available data.

The walk-forward validation will work as follows:

  • The first 50% of the dataset will be held back to train the model.
  • The remaining 50% of the dataset will be iterated and test the model.
  • For each step in the test dataset:
    • A model will be trained.
    • A one-step prediction made and the prediction stored for later evaluation.
    • The actual observation from the test dataset will be added to the training dataset for the next iteration.
  • The predictions made during the iteration of the test dataset will be evaluated and an RMSE score reported.

Given the small size of the data, we will allow a model to be re-trained given all available data prior to each prediction.

We can write the code for the test harness using simple NumPy and Python code.

Firstly, we can split the dataset into train and test sets directly. We’re careful to always convert a loaded dataset to float32 in case the loaded data still has some String or Integer data types.

Next, we can iterate over the time steps in the test dataset. The train dataset is stored in a Python list as we need to easily append a new observation each iteration and NumPy array concatenation feels like overkill.

The prediction made by the model is called yhat for convention, as the outcome or observation is referred to as y and yhat (a ‘y‘ with a mark above) is the mathematical notation for the prediction of the y variable.

The prediction and observation are printed each observation for a sanity check prediction in case there are issues with the model.

4. Persistence

The first step before getting bogged down in data analysis and modeling is to establish a baseline of performance.

This will provide both a template for evaluating models using the proposed test harness and a performance measure by which all more elaborate predictive models can be compared.

The baseline prediction for time series forecasting is called the naive forecast, or persistence.

This is where the observation from the previous time step is used as the prediction for the observation at the next time step.

We can plug this directly into the test harness defined in the previous section.

The complete code listing is provided below.

Running the test harness prints the prediction and observation for each iteration of the test dataset.

The example ends by printing the RMSE for the model.

In this case, we can see that the persistence model achieved an RMSE of 3186.501. This means that on average, the model was wrong by about 3,186 million sales for each prediction made.

We now have a baseline prediction method and performance; now we can start digging into our data.

5. Data Analysis

We can use summary statistics and plots of the data to quickly learn more about the structure of the prediction problem.

In this section, we will look at the data from five perspectives:

  1. Summary Statistics.
  2. Line Plot.
  3. Seasonal Line Plots
  4. Density Plots.
  5. Box and Whisker Plot.

5.1 Summary Statistics

Summary statistics provide a quick look at the limits of observed values. It can help to get a quick idea of what we are working with.

The example below calculates and prints summary statistics for the time series.

Running the example provides a number of summary statistics to review.

Some observations from these statistics include:

  • The number of observations (count) matches our expectation, meaning we are handling the data correctly.
  • The mean is about 4,641, which we might consider our level in this series.
  • The standard deviation (average spread from the mean) is relatively large at 2,486 sales.
  • The percentiles along with the standard deviation do suggest a large spread to the data.

5.2 Line Plot

A line plot of a time series can provide a lot of insight into the problem.

The example below creates and shows a line plot of the dataset.

Run the example and review the plot. Note any obvious temporal structures in the series.

Some observations from the plot include:

  • There may be an increasing trend of sales over time.
  • There appears to be systematic seasonality to the sales for each year.
  • The seasonal signal appears to be growing over time, suggesting a multiplicative relationship (increasing change).
  • There do not appear to be any obvious outliers.
  • The seasonality suggests that the series is almost certainly non-stationary.
Champagne Sales Line Plot

Champagne Sales Line Plot

There may be benefit in explicitly modeling the seasonal component and removing it. You may also explore using differencing with one or two levels in order to make the series stationary.

The increasing trend or growth in the seasonal component may suggest the use of a log or other power transform.

5.3 Seasonal Line Plots

We can confirm the assumption that the seasonality is a yearly cycle by eyeballing line plots of the dataset by year.

The example below takes the 7 full years of data as separate groups and creates one line plot for each. The line plots are aligned vertically to help spot any year-to-year pattern.

Running the example creates the stack of 7 line plots.

We can clearly see a dip each August and a rise from each August to December. This pattern appears the same each year, although at different levels.

This will help with any explicitly season-based modeling later.

Seasonal Per Year Line Plots

Seasonal Per Year Line Plots

It might have been easier if all season line plots were added to the one graph to help contrast the data for each year.

5.4 Density Plot

Reviewing plots of the density of observations can provide further insight into the structure of the data.

The example below creates a histogram and density plot of the observations without any temporal structure.

Run the example and review the plots.

Some observations from the plots include:

  • The distribution is not Gaussian.
  • The shape has a long right tail and may suggest an exponential distribution

This lends more support to exploring some power transforms of the data prior to modeling.

5.5 Box and Whisker Plots

We can group the monthly data by year and get an idea of the spread of observations for each year and how this may be changing.

We do expect to see some trend (increasing mean or median), but it may be interesting to see how the rest of the distribution may be changing.

The example below groups the observations by year and creates one box and whisker plot for each year of observations. The last year (1971) only contains 9 months and may not be a useful comparison with the 12 months of observations for other years. Therefore, only data between 1964 and 1970 was plotted.

Running the example creates 7 box and whisker plots side-by-side, one for each of the 7 years of selected data.

Some observations from reviewing the plots include:

  • The median values for each year (red line) may show an increasing trend.
  • The spread or middle 50% of the data (blue boxes) does appear reasonably stable.
  • There are outliers each year (black crosses); these may be the tops or bottoms of the seasonal cycle.
  • The last year, 1970, does look different from the trend in prior years
Champagne Sales Box and Whisker Plots

Champagne Sales Box and Whisker Plots

The observations suggest perhaps some growth trend over the years and outliers that may be a part of the seasonal cycle.

This yearly view of the data is an interesting avenue and could be pursued further by looking at summary statistics from year-to-year and changes in summary stats from year-to-year.

6. ARIMA Models

In this section, we will develop Autoregressive Integrated Moving Average, or ARIMA, models for the problem.

We will approach modeling by both manual and automatic configuration of the ARIMA model. This will be followed by a third step of investigating the residual errors of the chosen model.

As such, this section is broken down into 3 steps:

  1. Manually Configure the ARIMA.
  2. Automatically Configure the ARIMA.
  3. Review Residual Errors.

6.1 Manually Configured ARIMA

The ARIMA(p,d,q) model requires three parameters and is traditionally configured manually.

Analysis of the time series data assumes that we are working with a stationary time series.

The time series is almost certainly non-stationary. We can make it stationary this by first differencing the series and using a statistical test to confirm that the result is stationary.

The seasonality in the series is seemingly year-to-year. Seasonal data can be differenced by subtracting the observation from the same time in the previous cycle, in this case the same month in the previous year. This does mean that we will lose the first year of observations as there is no prior year to difference with.

The example below creates a deseasonalized version of the series and saves it to file stationary.csv.

Running the example outputs the result of a statistical significance test of whether the differenced series is stationary. Specifically, the augmented Dickey-Fuller test.

The results show that the test statistic value -7.134898 is smaller than the critical value at 1% of -3.515. This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke).

Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.

For reference, the seasonal difference operation can be inverted by adding the observation for the same month the year before. This is needed in the case that predictions are made by a model fit on seasonally differenced data. The function to invert the seasonal difference operation is listed below for completeness.

A plot of the differenced dataset is also created.

The plot does not show any obvious seasonality or trend, suggesting the seasonally differenced dataset is a good starting point for modeling.

We will use this dataset as an input to the ARIMA model. It also suggests that no further differencing may be required, and that the d parameter may be set to 0.

Seasonal Differenced Champagne Sales Line Plot

Seasonal Differenced Champagne Sales Line Plot

The next first step is to select the lag values for the Autoregression (AR) and Moving Average (MA) parameters, p and q respectively.

We can do this by reviewing Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots.

Note, we are now using the seasonally differenced stationary.csv as our dataset.

The example below creates ACF and PACF plots for the series.

Run the example and review the plots for insights into how to set the p and q variables for the ARIMA model.

Below are some observations from the plots.

  • The ACF shows a significant lag for 1 month.
  • The PACF shows a significant lag for 1 month, with perhaps some significant lag at 12 and 13 months.
  • Both the ACF and PACF show a drop-off at the same point, perhaps suggesting a mix of AR and MA.

A good starting point for the p and q values is also 1.

The PACF plot also suggests that there is still some seasonality present in the differenced data.

We may consider a better model of seasonality, such as modeling it directly and explicitly removing it from the model rather than seasonal differencing.

ACF and PACF Plots of Seasonally Differenced Champagne Sales

ACF and PACF Plots of Seasonally Differenced Champagne Sales

This quick analysis suggests an ARIMA(1,0,1) on the stationary data may be a good starting point.

The historic observations will be seasonally differenced prior to the fitting of each ARIMA model. The differencing will be inverted for all predictions made to make them directly comparable to the expected observation in the original sale count units.

Experimentation shows that this configuration of ARIMA does not converge and results in errors by the underlying library. Further experimentation showed that adding one level of differencing to the stationary data made the model more stable. The model can be extended to ARIMA(1,1,1).

We will also disable the automatic addition of a trend constant from the model by setting the ‘trend‘ argument to ‘nc‘ for no constant in the call to fit(). From experimentation, I find that this can result in better forecast performance on some problems.

The example below demonstrates the performance of this ARIMA model on the test harness.

Note, you may see a warning message from the underlying linear algebra library; this can be ignored for now.

Running this example results in an RMSE of 956.942, which is dramatically better than the persistence RMSE of 3186.501.

This is a great start, but we may be able to get improved results with a better configured ARIMA model.

6.2 Grid Search ARIMA Hyperparameters

The ACF and PACF plots suggest that an ARIMA(1,0,1) or similar may be the best that we can do.

To confirm this analysis, we can grid search a suite of ARIMA hyperparameters and check that no models result in better out of sample RMSE performance.

In this section, we will search values of p, d, and q for combinations (skipping those that fail to converge), and find the combination that results in the best performance on the test set. We will use a grid search to explore all combinations in a subset of integer values.

Specifically, we will search all combinations of the following parameters:

  • p: 0 to 6.
  • d: 0 to 2.
  • q: 0 to 6.

This is (7 * 3 * 7), or 147, potential runs of the test harness and will take some time to execute.

It may be interesting to evaluate MA models with a lag of 12 or 13 as were noticed as potentially interesting from reviewing the ACF and PACF plots. Experimentation suggested that these models may not be stable, resulting in errors in the underlying mathematical libraries.

The complete worked example with the grid search version of the test harness is listed below.

Running the example runs through all combinations and reports the results on those that converge without error. The example takes a little over 2 minutes to run on modern hardware.

The results show that the best configuration discovered was ARIMA(0, 0, 1) with an RMSE of 939.464, slightly lower than the manually configured ARIMA from the previous section. This difference may or may not be statistically significant.

We will select this ARIMA(0, 0, 1) model going forward.

6.3 Review Residual Errors

A good final check of a model is to review residual forecast errors.

Ideally, the distribution of residual errors should be a Gaussian with a zero mean.

We can check this by using summary statistics and plots to investigate the residual errors from the ARIMA(0, 0, 1) model. The example below calculates and summarizes the residual forecast errors.

Running the example first describes the distribution of the residuals.

We can see that the distribution has a right shift and that the mean is non-zero at 165.904728.

This is perhaps a sign that the predictions are biased.

The distribution of residual errors is also plotted.

The graphs suggest a Gaussian-like distribution with a bumpy left tail, providing further evidence that perhaps a power transform might be worth exploring.

Residual Forecast Error Density Plots

Residual Forecast Error Density Plots

We could use this information to bias-correct predictions by adding the mean residual error of 165.904728 to each forecast made.

The example below performs this bias correlation.

The performance of the predictions is improved very slightly from 939.464 to 924.699, which may or may not be significant.

The summary of the forecast residual errors shows that the mean was indeed moved to a value very close to zero.

Finally, density plots of the residual error do show a small shift towards zero.

It is debatable whether this bias correction is worth it, but we will use it for now.

Bias Corrected Residual Forecast Error Density Plots

Bias-Corrected Residual Forecast Error Density Plots

It is also a good idea to check the time series of the residual errors for any type of autocorrelation. If present, it would suggest that the model has more opportunity to model the temporal structure in the data.

The example below re-calculates the residual errors and creates ACF and PACF plots to check for any significant autocorrelation.

The results suggest that what little autocorrelation is present in the time series has been captured by the model.

Residual Forecast Error ACF and PACF Plots

Residual Forecast Error ACF and PACF Plots

7. Model Validation

After models have been developed and a final model selected, it must be validated and finalized.

Validation is an optional part of the process, but one that provides a ‘last check’ to ensure we have not fooled or misled ourselves.

This section includes the following steps:

  1. Finalize Model: Train and save the final model.
  2. Make Prediction: Load the finalized model and make a prediction.
  3. Validate Model: Load and validate the final model.

7.1 Finalize Model

Finalizing the model involves fitting an ARIMA model on the entire dataset, in this case on a transformed version of the entire dataset.

Once fit, the model can be saved to file for later use.

The example below trains an ARIMA(0,0,1) model on the dataset and saves the whole fit object and the bias to file.

There is a bug in the current stable version of the statsmodels library (v0.6.1) that results in an error when you try to load a saved ARIMA model from file. The error reported is:

This bug also seems present in the 0.8 release candidate 1 of statsmodels when I tested it. For more details, see Zae Myung Kim‘s discussion and fix in this GitHub issue.

We can work around this with a monkey patch that adds a __getnewargs__() instance function to the ARIMA class before saving.

The example below saves the fit model to file in the correct state so that it can be loaded successfully later.

Running the example creates two local files:

  • model.pkl This is the ARIMAResult object from the call to ARIMA.fit(). This includes the coefficients and all other internal data returned when fitting the model.
  • model_bias.npy This is the bias value stored as a one-row, one-column NumPy array.

7.2 Make Prediction

A natural case may be to load the model and make a single forecast.

This is relatively straightforward and involves restoring the saved model and the bias and calling the forecast() method. To invert the seasonal differencing, the historical data must also be loaded.

The example below loads the model, makes a prediction for the next time step, and prints the prediction.

Running the example prints the prediction of about 6794.

If we peek inside validation.csv, we can see that the value on the first row for the next time period is 6981.

The prediction is in the right ballpark.

7.3 Validate Model

We can load the model and use it in a pretend operational manner.

In the test harness section, we saved the final 12 months of the original dataset in a separate file to validate the final model.

We can load this validation.csv file now and use it see how well our model really is on “unseen” data.

There are two ways we might proceed:

  • Load the model and use it to forecast the next 12 months. The forecast beyond the first one or two months will quickly start to degrade in skill.
  • Load the model and use it in a rolling-forecast manner, updating the transform and model for each time step. This is the preferred method as it is how one would use this model in practice as it would achieve the best performance.

As with model evaluation in previous sections, we will make predictions in a rolling-forecast manner. This means that we will step over lead times in the validation dataset and take the observations as an update to the history.

Running the example prints each prediction and expected value for the time steps in the validation dataset.

The final RMSE for the validation period is predicted at 361.110 million sales.

This is much better than the expectation of an error of a little more than 924 million sales per month.

A plot of the predictions compared to the validation dataset is also provided.

At this scale on the plot, the 12 months of forecast sales figures look fantastic.

Forecast for Champagne Sales Validation Dataset

Forecast for Champagne Sales Validation Dataset

Summary

In this tutorial, you discovered the steps and the tools for a time series forecasting project with Python.

We have covered a lot of ground in this tutorial; specifically:

  • How to develop a test harness with a performance measure and evaluation method and how to quickly develop a baseline forecast and skill.
  • How to use time series analysis to raise ideas for how to best model the forecast problem.
  • How to develop an ARIMA model, save it, and later load it to make predictions on new data.

How did you do? Do you have any questions about this tutorial?
Ask your questions in the comments below and I will do my best to answer.

Want to Develop Time Series Forecasts with Python?

Introduction to Time Series Forecasting With Python

Develop Your Own Forecasts in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Introduction to Time Series Forecasting With Python

It covers self-study tutorials and end-to-end projects on topics like:
Loading data, visualization, modeling, algorithm tuning, and much more...

Finally Bring Time Series Forecasting to
Your Own Projects

Skip the Academics. Just Results.

Click to learn more.

151 Responses to Time Series Forecast Study with Python: Monthly Sales of French Champagne

  1. Benson Dube February 20, 2017 at 8:46 pm #

    Thank You Jason.

    Would you recommend this example to a starter. I am totally new to programming and Machine Learning, and I am currently learning Python sytnax and understanding the basic terms around algorithms used.

    Regards,

    Benson

    • Jason Brownlee February 21, 2017 at 9:34 am #

      Hi Benson, this tutorial is an example for a beginner in time series forecasting, but does assume you know your way around Python first (or you can pick it up fast).

      • Benson March 1, 2017 at 6:45 am #

        A surely steep learning curve taking you out of your comfort zone, but that’s the way to learn. ?

      • laurenciatitan November 14, 2018 at 1:57 am #

        helo jason will i be able to contact you and discuss, need explaination..

  2. Viral Mehta February 24, 2017 at 4:56 am #

    Jason, thanks for very detailed instructions on how to construct the model. When I add a few additional periods in the validation set (manually) for a short-term forecast beyond the dataset, the model won’t run until I provide some ‘fake’ targets (expected y). However, when I provide some fake targets, I see that model quickly adjusts to those targets. I tried different levels of y values and I see model predicted accordingly, shouldn’t the predictions be independent of what targets it sees?

    • Jason Brownlee February 24, 2017 at 10:14 am #

      You should not need fake targets Viral.

      You can forecast a new out of sample data point by training the model on all of the available data and predicting one-step. E.g.:

      • Viral Mehta February 25, 2017 at 2:14 am #

        Jason, much appreciate your response. What I am trying to convey is that the model should predict same outputs regardless of the observations. For example, when I change y values in my validation set (last 12 months that the model hasn’t seen), my predictions change and the model gives me a very nice RMSE every time. If the model was trained properly, it should have same output for next twelve months regardless of my y values in the validation set. I think you will see this effect very quickly if you also change your y values from validation set. Unless the model is only good for one period forward and needs to continuously adjust based on observed values of last period.

        • Jason Brownlee February 25, 2017 at 6:00 am #

          In this specific case we are using walk-forward validation with a model trained for one-step forecasts.

          This means as we walk over the validation dataset, observations are moved from validation into training and the model is refit. This is to simulate a new observation being available after a prediction is made for a time step and us being able to update our model in response.

          I think what you are talking about is a multi-step forecast, e.g. fitting the model on all of the training data, then forecasting the next n time steps. This is not the experimental setup here, but you are welcome to try it.

      • Benson Dube March 15, 2017 at 9:11 am #

        Hello Jason,

        If you don’t mind me asking, where exactly should the line below fit?:

        yhat = model_fit.forecast()[0]

        • Jason Brownlee March 16, 2017 at 7:57 am #

          Hi Benson,

          Fit the ARIMA model on all of your data.

          After that, make a forecast for the next time step by calling the forecast() function.

  3. Juanlu February 27, 2017 at 10:15 am #

    Time to upgrade to matplotlib 2.0, the colors are nicer 🙂

    • Jason Brownlee February 28, 2017 at 8:09 am #

      I agree Juanlu! I have recently upgraded myself.

  4. Hugo March 24, 2017 at 2:38 am #

    Dear Jason,
    Thanks for the post. Clear, concise and working.

    Have you considered to forecast the TS using a SARIMA model instead of substracting the seasonality and adding it latter? As a matter of fact, statsmodel has it integrated in its dev version. (http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)

    I am wondering if it is possible to take into account hierarchy, like the forecast of monthly sales divided in sub-forecasts of french champagne destination markets, using python.

    Thanks and keep posting!

    • Jason Brownlee March 24, 2017 at 7:59 am #

      Hi Hugo, yes I am familiar with SARIMA, but I decided to keep the example limited to ARIMA.

      Yes, as long as you have the data to train/verify the model. I would consider starting with independent models for each case, using all available data and tease out how to optimize it further after that.

  5. Luis Ashurei May 2, 2017 at 7:25 pm #

    Thanks for the wonderful post Jason,

    Two quick question:

    1.
    When I’m doing Grid Search ARIMA Hyperparameters with your example, it’s quite slow on my machine, spent about 1 minutes. However the parameter range and data are not large at all. Is it slow specific on my end? I’m concerning the model is too slow to use.

    2.
    Does ARIMA support multi-step forecast? E.g. What if I keep use predicted value into next forecast, will it overfitting?

    Thanks!

  6. Pramod Gupta July 3, 2017 at 7:58 pm #

    Hello Jason

    Thanks for this awesome hands-on on Time series. However I have a query.

    I extracted year and month from date column as my features for model. Then I built a Linear Regression model and a Random Forest model. My final prediction was a weighted average of both these models. I got an rmse of 366 (similar to yours i.e. 361 on validation data).

    Can this approach be the alternative for an ARIMA model? What can be the possible drawbacks of this approach?

    Appreciate your comments
    Thanks

    • Jason Brownlee July 6, 2017 at 10:01 am #

      Nice work. ARIMA may be a simpler model if model complexity is important on a project.

  7. Roberto July 5, 2017 at 7:26 am #

    Hi just a silly note have you consider using itertools for your evaluate_models() function?
    https://pastebin.com/w0apb6Tg

    in python nested loops are not so readable.

  8. Rishi Patil July 8, 2017 at 9:15 am #

    Jason, I just finished reading “Introduction to Time Series Forecasting in Python” in detail and had two questions:
    * Are you planning to come out with a “Intermediate course in Time Series Forecasting in Python” soon?
    * You spend the first few chapters discussing how to reframe a time-series problem into a typical Machine Learning problem using Lag features. However, in later chapters, you only use ARIMA models, that obviate the necessity of using explicitly generated lag features. What’s the best way to use explicitly generated timeseries features with other ML algorithms (such as SVM or XGBoost)?
    (I tried out a few AR lags with XGB and got decent results using but couldn’t figure out how to incorporate the MA parts).
    Would appreciate any insights.

    • Jason Brownlee July 9, 2017 at 10:50 am #

      I may have a more advanced TS book in the future.

      Great question. You would have to calculate the MA features manually.

  9. Jan August 2, 2017 at 8:34 pm #

    Hi Jason,

    can it be that TimeGrouper() does not work if there are months missing in a year? Also there are no docs available for TimeGrouper(). May you can use pd.Grouper in your future examples?

    Regards

  10. Sambit August 17, 2017 at 5:09 am #

    Hi,
    Thanks for this great link on time series.

    I am not able to acces the dataset stored at:- https://datamarket.com/data/set/22r5/perrin-freres-monthly-champagne-sales-millions-64-72#!ds=22r5&display=line

    Regards,
    Sambit

    • Jason Brownlee August 17, 2017 at 6:49 am #

      I’m sorry to hear that, here is the full dataset:

  11. sarrauste August 24, 2017 at 1:15 am #

    hi,
    i am having the following error:

    X = X.astype(‘float32’)
    ValueError: could not convert string to float: ‘1972-09’

    can you please help me
    ty

  12. sandip August 28, 2017 at 8:09 pm #

    Hi Jason , Nice approach for time series forecasting.Thanks for this Article.

    Just had a one small doubt,Here we are tuning our ARIMA parameter on test data using RMSE as a measure but generally we tune our parameter on the training(train+validation)data and then we check whether those parameters are well generalized or not by using test data. is it right?
    Because my concern is that if we choose our parameter on that particular test data,whether those parameter will generalized to coming new test data or not? I feel that there will be biasness while choosing parameter because we are specifically choosing parameter those giving less RMSE for that test data.Here we are not checking whether our model is working/fitted well for our train data or not?
    If i am wrong just correct me.Let me know the logic behind this parameter tuning on the basis of test data?

  13. sandip August 29, 2017 at 5:58 pm #

    Thank you so much Jason.

  14. Anthy August 31, 2017 at 2:24 am #

    Hi Jason,
    Thank you for this detailed and clear tutorial.
    I have a question concerning fiding ARIMA’s parameters. I didn’t understand why we have an imbricated “for” loop and we do consider all the p values and not all the q values.

    Please ask if the question is not that clear.

    Thank you in advance.

    • Jason Brownlee August 31, 2017 at 6:23 am #

      Please review the example again, we grid search p, d and q values.

  15. Ella Zhao September 2, 2017 at 4:26 am #

    Hi Jason,

    Thanks for this awesome example! It helped me a lot! However if you don’t mind, I have a question on the prediction loop.

    Why did you use history.append(obs) instead of history.append(yhat) when doing the loop for prediction? Sees like you add the observation (real data in test) to history. And yhat = bias + inverse_difference(history, yhat, months_in_year) is based on the history data. but actually we don’t have those observation data when solving practical problem.

    I’ve tried to use history.append(yhat) in my model, but the result is worse than using history.append(obs).

    Appreciate your comments

    Thanks,

    Ella

    • Jason Brownlee September 2, 2017 at 6:16 am #

      Good question, because in this scenario, we are assuming that each new real observation is available after we make a prediction.

      Pretend we are running this loop one iteration per day, and we get yesterdays observation this morning so we can accurately predict tomorrow.

      • Ella Zhao September 2, 2017 at 11:46 pm #

        Thank you so much for your explanation!!

  16. Alaoui September 3, 2017 at 12:27 am #

    Hello,

    Thanks a lot for this work.
    I have a point to discuss on concerning the walk forward validation.
    I saw that you fill the history part by a value from the test set at each iteration :
    …..
    # observation
    obs = test[i]
    history.append(obs)
    ….

    Do we really have to do this or we have to use the new yhat computed for the next predictions?
    Indeed when we have to do a future prediction on some range date we don’t have the historical value….
    Regards and thanks in advance for your feedback

    Simo

    • Jason Brownlee September 3, 2017 at 5:48 am #

      It depends on your model and project goals.

      In this case, we are assuming that the true observations are available after each modeling step and can be used in the formulation of the next prediction, a reasonable assumption, but not required in general.

      If you use outputs as inputs (e.g. recursive multi-step forecasting), error will compound and things will go more crazy, sooner (e.g. model skill will drop). See this post:
      https://machinelearningmastery.com/multi-step-time-series-forecasting/

  17. Alaoui September 3, 2017 at 12:49 am #

    Sorry, I just read the Ella question…So for long range future prediction we have ti use the yhat…
    Thanks

  18. Sujeet September 16, 2017 at 12:22 pm #

    Hi,

    I am getting NAN after these steps:

    from pandas import Series
    series = Series.from_csv(‘champagne.csv’, header=0)
    print(len(series))
    split_point = len(series) – 12
    print(split_point)
    dataset, validation = series[0:split_point], series[split_point:]
    print(dataset)
    print(‘Dataset %d, Validation %d’ % (len(dataset), len(validation)))
    dataset.to_csv(‘dataset.csv’)
    validation.to_csv(‘validation.csv’)

    • Jason Brownlee September 17, 2017 at 5:22 am #

      I’m sorry to hear that.

      Ensure that you have copied all of the code and that you have removed the footer from the data file. Also confirm that your environment is up to date.

  19. Jaryn September 20, 2017 at 3:24 am #

    Hi Jason,

    This is extremely useful. Thank you very much!

    I was wondering if you could provide the full code to extend the forecast for 1 year ahead. I know you mentioned above that there is a forecast function but when I run:
    yhat = model_fit.forecast()[0]
    pyplot.plot(y)
    pyplot.plot(yhat, color=’green’)
    pyplot.plot(predictions, color=’red’)
    pyplot.show()

    I don’t get any green lines in my code. I’m sure I’m missing a lot of things here, but I don’t know what.

    Thank you again, much appreciated!

  20. Kunal Shrivastava September 20, 2017 at 4:54 pm #

    Hi Jason,
    I am using this modelling steps to model my problem. When I am passing best parameter which i have got from Grid Search for doing prediction, I am getting Linear Algebra error and sometimes not converging
    Due to this I am not able to predict the values.

    My second concern is if this is happening for Grid search, how can i automate the script in production environment. Lets say i have to forecast some values for 50000 different sites then what is best way to achieve this goal ?

  21. Tri Bien September 24, 2017 at 7:57 pm #

    Hi Jason,

    Thank for your post, it is really helpful for me. It is great!
    I have a small question:
    In the post, you use 2 data set: dataset.csv and validation.csv
    + In dataset.csv, you split it and use Walk-forward validation -> I totally agree.
    + In validation.csv, you still re-train ARIMA model with walk-forward validation (model_fit = model.fit(trend=’nc’, disp=0) – line 45 in section 7.3) -> I think it is unseen data for testing, so we don’t train model here and only test the model’s performance. Is it correct?

    • Jason Brownlee September 25, 2017 at 5:38 am #

      In the case of walk forward validation, we are assuming that the real observations are available after the time of prediction.

      You can make different assumptions when evaluating your model.

  22. Navid September 30, 2017 at 1:50 am #

    Hi, Jason
    First of all, thank you for your wonderful tutorial.
    Until now (part 5.4) I just have one doubt: why you plotted a histogram/kde without removing the trend first? If the goal is to see how the distribution shape is similar to Gaussian distribution, doesn’t a trend changes the distribution of data?

    Thanks
    Navid

    • Jason Brownlee September 30, 2017 at 7:44 am #

      Great question, at that point we are early in the analysis. Generally, I’d recommend looking at density plots before and after making the series stationary.

      Trend removal generally won’t impact the form of the distribution if we are doing simple linear operations.

  23. Nick October 19, 2017 at 12:46 am #

    Hi Jason,
    Thanks for your awesome post. Could you explain how to set the ‘interval’ in the function difference if I only have 1-year data? My dataset is recorded in half an hour, from June 2016 to June 2017. These data number are large in summer and small in winter, i.e. in winter it is between 0-2000, but in summer it is between 5000-14000.

  24. Shashank Hegde October 22, 2017 at 3:46 pm #

    Hi Jason,
    Thank you for the post.I want to implement ‘ARIMA’ function instead of using built-in function.
    Do you know where I can find the algorithm to implement ‘ARIMA’ function as well understanding that in detail manner?

  25. Hara October 28, 2017 at 3:38 am #

    Hi Jason,

    Thanks for the brief tutorial on Time Series forecasting. I am receiving the error “Given a pandas object and the index does not contain dates” when running the ARIMA model code snippet.

    • Jason Brownlee October 28, 2017 at 5:16 am #

      Be sure you copied all of the sample code exactly including indenting, and also be sure you have prepared the data, including removing the footer information from the file.

  26. Makis December 18, 2017 at 8:27 pm #

    Dear Jason,

    Thanks for your amazing tut!

    I have one problem only when i run the rolling forecasts.

    I use a different dataset which has temperature values for every minute of the day. I train the model with the first four days and i use the last day as a test dataset. My problem might be because i add the test observation to the history and my arima returns the following error: raise ValueError(“The computed initial AR coefficients are not ”
    ValueError: The computed initial AR coefficients are not stationary
    You should induce stationarity, choose a different model order, or you can
    pass your own start_params.

    for i in range(0, len(x_test)):

    # difference data
    diff = difference(history, minutes)
    # predict
    model = importer.ARIMA(diff, order=(5, 0, 1))
    model_fit = model.fit(trend=’nc’, disp=0)
    yhat = model_fit.forecast()[0]
    yhat = inverse_difference(history, yhat, minutes)
    predictions.append(yhat)
    # observation
    obs = x_test[i]
    history.append(obs)
    print(‘Predicted=%.6f, Expected=%.6f’ % (yhat, obs))

    My x_test dataset contains 1440 rows and i get the error on the 1423 iteration of the loop. Until iteration 1423 the each arima model does not have issues.

    Your help is precious to me.

    Thanks again!

    Kindest Regards,
    Makis

    • Jason Brownlee December 19, 2017 at 5:17 am #

      Have you confirmed that your data set is stationary? Perhaps there is still some trend or seasonal structure?

      I have many posts on detrending and seasonal adjustment that may help.

      • Makis December 20, 2017 at 12:54 am #

        Dear Jason,

        Thanks for your response!

        I have used a different hyperparameter (Arima 5,1,1) instead, and everything worked. I don’t know if this is the right thing, since 5,0,1 and 5,1,1 had exactly the same RMSE. What is your opinion about that? And please, one final question, my results now are extremely good, with an RMSE less than 0,01. The prediction line in the plot is almost over the the real data line. Does this has to do with the history.append(obs)? I’m not sure i understand correctly why the test observation is added to the history. And what is the difference of doing a prediction in a for loop with a new model for every step compared by using the steps parameter in 1 model?

        Sorry for the long questions!

        Your tutorials are even better than the books i’m currently reading!

        Cheers from Greece.

        • Jason Brownlee December 20, 2017 at 5:47 am #

          Well done. I’d recommend using the model that is skillful and stable.

          • Makis December 20, 2017 at 8:22 am #

            Hey Jason! What i’m trying to say is, maybe the good predictions are because i append the y observation to the history. Why we do this? what is the purpose? If we do not append the y observation then the results are going to be still valid?

          • Jason Brownlee December 20, 2017 at 3:49 pm #

            In the test setup we are assuming that real observations are made after each prediction that we can in turn use to make the next prediction.

            Your specific framing of the problem may differ and I would encourage you to design a test harness to capture that.

          • Makis December 20, 2017 at 10:42 pm #

            Hey Jason, for once more thanks for your feedback!
            I will follow your suggestion right now 🙂

            Thanks for everything

            Kind Regards,
            Makis

          • Makis December 20, 2017 at 10:57 pm #

            Hey Jason, one last question.

            Since i do the rolling forecast manner and i introduce a new model for every iteration, why do i need to save a model and do a the first prediction based on that?

            Can i simply do the rolling forecast loop to start from 0 instead of 1?

          • Jason Brownlee December 21, 2017 at 5:26 am #

            Sure.

  27. Amir December 19, 2017 at 6:51 pm #

    Hi Jason,

    Thank you so much for this amazing tutorial. It really helps me to learn time series forecasting and prepare for my new job. I’m a total beginner in data analysis/science as I’m currently making transition from engineering to data analysis.

    But the problem that I’m facing with this tutorial is as I’m running the code, I’m stuck with the final step, “Validate the Model”. I’ve tried to re-copy and re-run the sample code many times but it didn’t seem to work. Here is the error.

    >Predicted=10101.763, Expected=9851
    >Predicted=13219.067, Expected=12670
    >Predicted=3996.535, Expected=4348
    >Predicted=3465.934, Expected=3564
    >Predicted=4522.683, Expected=4577
    >Predicted=4901.336, Expected=4788
    >Predicted=5190.094, Expected=4618
    >Predicted=4930.190, Expected=5312
    >Predicted=4944.785, Expected=4298
    >Predicted=1699.409, Expected=1413
    >Predicted=6085.324, Expected=5877
    >Predicted=7135.720, Expected=nan

    ValueError Traceback (most recent call last)
    in ()
    56
    57 # Report performance
    —> 58 mse = mean_squared_error(y, predictions)
    59 rmse = sqrt(mse)
    60 print(‘RMSE: %.3f’ % rmse)

    /Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/metrics/regression.py in mean_squared_error(y_true, y_pred, sample_weight, multioutput)
    229 “””
    230 y_type, y_true, y_pred, multioutput = _check_reg_targets(
    –> 231 y_true, y_pred, multioutput)
    232 output_errors = np.average((y_true – y_pred) ** 2, axis=0,
    233 weights=sample_weight)

    /Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
    73 “””
    74 check_consistent_length(y_true, y_pred)
    —> 75 y_true = check_array(y_true, ensure_2d=False)
    76 y_pred = check_array(y_pred, ensure_2d=False)
    77

    /Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    405 % (array.ndim, estimator_name))
    406 if force_all_finite:
    –> 407 _assert_all_finite(array)
    408
    409 shape_repr = _shape_repr(array.shape)

    /Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
    56 and not np.isfinite(X).all()):
    57 raise ValueError(“Input contains NaN, infinity”
    —> 58 ” or a value too large for %r.” % X.dtype)
    59
    60

    ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).

    Could you please help me with this?

    Thank you so much

    Amir

    • Jason Brownlee December 20, 2017 at 5:41 am #

      I’m not sure of the cause of your fault, sorry.

      Ensure your libraries are up to date and that you have copied all of the code exactly?

    • Ankit Tripathi April 30, 2018 at 7:24 pm #

      The csv file contains some jargon text which should be deleted before reading the file as list.

  28. Satyajit Pattnaik December 28, 2017 at 5:03 pm #

    Hi Jason,

    In the prediction area, where you have added the observation to history and then running the loop to find the ARIMA results.
    i.e.

    model = ARIMA(history, order=(4,1,2))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)

    Here, what if we append yhat to history, when i am appending yhat to history, my results are really bad, pls help..

    Because as per my model, we need to predict the test data by using only the training data, so we cannot use obs to be appended in history, hope you get my point.

    • Jason Brownlee December 29, 2017 at 5:18 am #

      Yes, that would be called a recursive multi-step forecast. It is challenging. You can learn a little more about it here:
      https://machinelearningmastery.com/multi-step-time-series-forecasting/

      • Satyajit Pattnaik December 29, 2017 at 4:06 pm #

        Concept wise i understand what is Recursive Multi step forecasting, that’s the same i have used in my earlier reply code, i have appended obs in history, so that means everytime my loop runs, it takes the existing training data + next observation, so that should work and should predict correctly, but my results are really bad..

        Or do you mean to say, for this we need to plot the ACF, PACF graph in a a loop to determine the pdq values, and then run the ARIMA function on the pdq values, if that so, pls help us in finding a way to determine pdq values in loop

        • Jason Brownlee December 30, 2017 at 5:18 am #

          Generally using forecasts in a recursive model is challenging, you may need to get creative with your model/models.

  29. Udi January 8, 2018 at 8:57 pm #

    Hello Jason,

    Thank you for your clear and straightforward post!
    I have the same question as Nick above – about the choice of differentiation interval.
    In your example you set it to 12 according to the expected cycle of the data.
    However, in a more problematic case the data does not seem to imply a clear cycle (ACF and PACF graphs notwithstanding).
    In my case, I’ve found out that setting the interval to 12 yielded better results than my default, which was 1. I can understand why choosing a small interval would be generally bad – random noise is too dominant. I have more difficulty understanding how to calibrate the ideal interval value for my data (except brute force, that is. Maybe I shouldn’t calibrate? That could induce overfit. Anyway, I still should find a generally decent value).

    • Jason Brownlee January 9, 2018 at 5:29 am #

      If the amount of data is relatively small, consider using a grid search across difference values to see what works.

  30. Udi January 8, 2018 at 9:09 pm #

    And another question. I applied a grid search in order to choose the hyperparameters of my ARIMA model. I fear that bias correction could be counter productive in this case.

    Do you have any insights on this subject? Would you perform both (grid search followed by bias correction)? Is the answer data dependent?

    • Jason Brownlee January 9, 2018 at 5:29 am #

      I would only perform bias correction if it lifted model skill (e.g. if the model was biased).

  31. Nirmal Kanti Sahoo January 29, 2018 at 11:33 am #

    Hi Jason,

    I have around more than 1000 of products. I have sales yearly history for last 7 years for each 1000 product. I want to develop a model by which I can predict sales amount for next 3 years.
    Could you please help me how do I proceed with evaluate, prediction, validation and interpret the same.

    Thanking in advance.

  32. Udi February 18, 2018 at 9:32 pm #

    (Noob’s parallelization question, probably)

    Does anybody know whether statmodels ARIMA uses multi-threading or any other kind of parallelization?

    I’m trying to run an analysis based on multiple ARIMA fits on my laptop and thread parallelization increases total runtime rather than decrease it. It seems a single ARIMA fit (part of a single thread) uses several processors at once….

    Thanks,
    Udi

  33. Raul February 28, 2018 at 3:49 am #

    Hi Jason,

    This is an excellent article, and it is super helpful. I had two questions for extension that may have been asked but after reading through the comments, I’m not sure if the same advice applies to me. My question is most similar to Nirmals.

    1. I have a dataset with multiple “wines”, each with their own historical sales data

    2. This dataset has other variables that aren’t just time related.

    For example,

    Month 1 Sales Month 2 Sales Month 3 Sales online Reviews,
    Wine A
    Wine B
    Wine C
    Wine D

    I’m wondering if there’s an extension of this Time Series model that can take into account other variables and other instances of historical data. Let me know if I should clarify.

    Thank you for taking the time to clarify.

    • Jason Brownlee February 28, 2018 at 6:10 am #

      Yes, perhaps you could model each series separately, or groups of series or all together, or all 3 approaches and ensemble the results.

      Also, I’d recommend moving on to other models, such as ML or even MLP models.

      I hope to cover this topic in more detail soon.

      • Raul February 28, 2018 at 7:01 am #

        Thanks for the quick reply.

        I’ll try out your first method.
        But do you have any good ML/MLP models/tutorials to start with? It’s okay if you don’t! I noticed you have a nice article on multivariate time series here:
        https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

        I haven’t read through it yet, but I think it only takes into account historical data of one instance of multiple variables.

        I think it would be interesting to find out how to ensemble the results of my
        #1. one instance of historical data (Wine A’s last 3 month sales)
        #2. one instance of non-historical variables (Wine A’s online reviews, type, etc.)

        To me, # 1’s output is a “variable” to be used in the ML model for #2. If that makes sense, do you think that’s the right way of going about things?

        Thanks for being so responsive!

        • Jason Brownlee March 1, 2018 at 6:01 am #

          I hope to cover this topic in more detail soon – maybe write a book on the topic working one dataset from many different directions.

  34. Anchal April 6, 2018 at 4:20 am #

    Thank you Jason! for really making ML practitioners like me being awesome in ML.

    In this blog you have mentioned that the results suggest that what little autocorrelation is present in the time series has been captured by the model.

    What will be the next steps if there is high autocorrelation?

    Thanks in advance.

    • Jason Brownlee April 6, 2018 at 6:35 am #

      Great question!

      Some ideas:
      – Perhaps try alternate models.
      – Perhaps improve the data preparation.
      – Perhaps try downstream models to further correct the forecast.

  35. Ankit Tripathi May 10, 2018 at 5:55 pm #

    Hey Jason,

    That was a very well informed article! I am trying to forecast on a weekly data. Any tips for improving the model since weekly data is hard to forecast. Also should “months_in_year = 12” used in differencing be changed to “weeks in month=4” to accommodate for weekly frequency?

    Thanks

    • Jason Brownlee May 11, 2018 at 6:34 am #

      My best tips are to try lots of data preparation techniques and tune the ARIMA using a grid search.

  36. Maria Dossin May 26, 2018 at 9:47 am #

    Hi Jason,
    Thank you for an amazing tutorial!
    Just one quick question: can you please provide solution for the first scenario (not rolling forward forecast):
    – Load the model and use it to forecast the next 12 months. The forecast beyond the first one or two months will quickly start to degrade in skill.
    Thank you~

  37. Piyasi Choudhury June 2, 2018 at 11:38 am #

    Hi Jason

    My time series data is non stationary as well and I tried upto 3rd degree of differentiation after which I cant proceed any more as the data is exhausted. It has 3 calculated points and on doing Fuller test for checking stationarity, it errors out “maxlag should be < nobs". What can I do here?

    • Jason Brownlee June 3, 2018 at 6:20 am #

      Sounds like you might be running out of data in your series. Perhaps you need more observations?

  38. KACEM June 3, 2018 at 11:54 am #

    Hello, thank you.

    i wonder if we can say that python give good forecasts than other forecast logiciels like Aperia.

    To forecast sales we have to base ourselves in historic and baseline but in fact, in industry there are many factors that complexify the calculation of our forecast like the marketing actions, exceptional customer order, … How can we consider these perturbation in our forecasts to ensure good value of accuracy.

    Thank you for answering.

    • Jason Brownlee June 4, 2018 at 6:21 am #

      What is Aperia?

      Additional elements could be used as exogenous variables in the model.

  39. cathy June 23, 2018 at 10:00 am #

    Jason,

    The dataset I have requires to double difference in order to make it stationary, so i used:

    diff1 = difference(history, 12)
    diff2 = difference(diff1, 12),

    it worked and made it stationary with ADF test, however, how do I reverse it back please?

    Thank you

    • Jason Brownlee June 24, 2018 at 7:27 am #

      Add the values back. It requires that you keep the original and the first diffed data.

      • Cathy June 24, 2018 at 10:28 am #

        Thanks, so would the function look like this?

        yhat = inverse_difference(diff1, yhat, interval)
        yhat = inverse_difference(history, yhat, interval) ?

        Thank you!

  40. Rui July 27, 2018 at 4:11 am #

    First of all, this is a great article, well written and detailed. I have a question regarding on the use of the test set on all of the analysis. What about putting this model in production? You wouldn’t have the ‘test[i]’ (or ‘y[i]’) for each iteration to add to the list ‘history’ to have a truly generalized prediction.
    My point is that that instead of adding the values from the test set (‘y[i]’, ‘test[i]’) you’d rather add the predictions being made to the training set in order to do a true random walk.
    Thanks again for the resource and all the help putting out this content.

  41. James Lucas July 27, 2018 at 2:21 pm #

    I keep getting errors
    —————————————————————————
    TypeError Traceback (most recent call last)
    in ()
    9 obs = test[i]
    10 history.append(obs)
    —> 11 print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))

    TypeError: must be real number, not ellipsis
    IE: # predict
    yhat = …
    those 3 dots (ellipsis) are throwing errors.

    The code has … throughout. Any idea how to fix this error?

    • Jason Brownlee July 28, 2018 at 6:28 am #

      The code with “…” is just example code, it is not for execution. Skip it.

      Perhaps a more careful read of the tutorial is in order James?

  42. Ben Kesby August 3, 2018 at 2:53 am #

    I’m having an issue with the PACF plot for the residuals in part 6.3 Plotting The Residuals. I have followed your code exactly with the ACF plot resulting in a perfect match of the one displayed here tower the PACF is plotting values nearing 8 at around 36 – 38 lags. Any idea what might be causing this?

  43. JP August 24, 2018 at 8:17 pm #

    Hi Jason,
    Thanks so much for this tutorial. One thing though. As series.from_csv is depricated, the date format gets lost when opening the dataset, for exemple when trying to generate the seasonal line plots. Are you aware of a workaround that would keep the date format while using read_csv instead of from_csv?
    Thanks!

    • Jason Brownlee August 25, 2018 at 5:48 am #

      Yes you can use pandas.read_csv() with the same arguments.

  44. Alexandra September 12, 2018 at 8:40 pm #

    How can I improve the readability of Seasonal Per Year Line Plots (especially when it comes to axes) ? I would be grateful for your help.

    • Jason Brownlee September 13, 2018 at 8:01 am #

      Perhaps create one plot per figure?

      • Alexandra September 13, 2018 at 4:44 pm #

        I’d rather have a comparison between all the subplots.

        • Jason Brownlee September 14, 2018 at 6:33 am #

          Perhaps plot them all on the same figure?
          Perhaps calculate the different and plot that?

  45. Chintan September 18, 2018 at 2:16 am #

    Hi Jason,
    I was able to predict and chart looks nice. Just wondering how can we predict for the future, I mean to say if I wanted to see for next month prediction how are we able to do that?

    Thanks,
    Chintan

  46. Pranay October 9, 2018 at 9:26 pm #

    Hi Jason, I deal with daily set data. What is the issue when I try months_in_year = 364?? I mean it’s not throwing best_arima out when I set months_in_year = 364. May I know the reason?

    • Jason Brownlee October 10, 2018 at 6:09 am #

      Why are you making that change? I don’t follow?

  47. Patrick October 30, 2018 at 2:07 am #

    Dear Jason,

    thank you so much for your tutorial. I would like to apply it to a process, forecasting some process variables. The time interval is not monthly based, as in your example, but is much shorter, like some days and the data collection is about every three seconds.

    Now I would like to ask some questions.

    Do you think that this model could work equally well ?

    What should I put in place of ‘months_in_year’ ? For now I put 7, which is the number of different days in my dataset.

    Is it normal that with 2000+ points the matrix analysis regarding the seeking of the optimal parameters is taking ages ? (e.g. one triplet after 30 min) If yes, how could I improve the code speed ?

    Thank you again,
    Best wishes,
    Patrick.

    • Jason Brownlee October 30, 2018 at 6:08 am #

      Try it and see.

      2K points may be too many for the method, perhaps try reducing to a few hundred at max.

  48. Markus December 25, 2018 at 12:20 am #

    Hi

    Where does the column ‘A’comes from by the code above which is:

    groups = series[‘1964′:’1970’].groupby(TimeGrouper(‘A’))
    y

    And printing out groups contains only the first 6 months of each year, why?

  49. Markus December 25, 2018 at 3:10 am #

    Hi

    What would be wrong if instead of defining the difference function by yourself, we would pass over d with the value of 12:

    model = ARIMA(diff, order=(0,12,1))

    Wouldn’t that be the same?

    Thanks!

  50. Anthony January 10, 2019 at 10:33 pm #

    Hi Jeson in running your code I get this error:
    raise ValueError(“The computed initial MA coefficients are not ”
    ValueError: The computed initial MA coefficients are not invertible
    You should induce invertibility, choose a different model order, or you can
    pass your own start_params.
    How can I solve it? Thank you

    • Jason Brownlee January 11, 2019 at 7:47 am #

      Sorry to hear that.

      Perhaps confirm your statsmodels and other libraries are up to date?
      Perhaps confirm that you copied all of the code in order?
      Perhaps try an alternative model configuration?

      • Anthony January 15, 2019 at 9:50 pm #

        Ok I changed ARIMA parameters and now it works thank you!

  51. aravind January 12, 2019 at 9:29 pm #

    This is my data: like….

    Name Month Qty Unit
    Wire Rods Total 2007-JAN 93798 t
    Wire Rods Total 2007-FEB 86621 t
    Wire Rods Total 2007-MAR 93118 t

    My code is :

    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    # load data
    path_to_file = “C:/Users\ARAVIND\Desktop\jupyter notebook\project\datasets.csv”
    data = pd.read_csv(path_to_file, encoding=’utf-8′)
    # prepare data
    X = data.values
    X = X.astype(‘float32’)
    train_size = int(len(X) * 0.50)
    train, test = X[0:train_size], X[train_size:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
    # predict
    yhat = history[-1]
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))
    # report performance
    mse = mean_squared_error(test, predictions)
    rmse = sqrt(mse)
    print(‘RMSE: %.3f’ % rmse)

    when i run this command it shows “Value error: could not convert string to float”… so could anyone tell how to convert string to float according to my dataset.. i want to convert the columns which is “Name, Month and Unit” to float.

  52. Mridul March 20, 2019 at 5:41 pm #

    Hi Jason, Thanks for this amazing tutorial.

    However, I get the below error when I am trying to run it on a time series:

    ValueError: The computed initial MA coefficients are not invertible
    You should induce invertibility, choose a different model order, or you can
    pass your own start_params.

    • Mridul March 20, 2019 at 5:45 pm #

      I just saw that this has already been answered and it worked when I followed that answer. Thanks!

    • Jason Brownlee March 21, 2019 at 7:59 am #

      Perhaps try a different configuration of the q/d/p variables for the model?

  53. Kaws March 20, 2019 at 7:42 pm #

    This is really a helpful tutorial. Thank you Jason!!

    And I have a small question. I got TypeError: a float is required error after i executed this code –

    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
    # predict
    yhat = …
    predictions.append(yhat)
    # observation
    obs = test[i]
    history.append(obs)
    print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))

    Can you help me out?

    • Jason Brownlee March 21, 2019 at 8:02 am #

      Perhaps confirm that you have loaded the data correctly as a float?

  54. Varun April 15, 2019 at 4:23 pm #

    Hi Jason,

    What should be the approach when we need to provide long term forecasts ~ 12 months with exogenous variables using a technique like ARIMAX? Should we forecast the covariates and then add it in the model?

    Regards,
    Varun

    • Jason Brownlee April 16, 2019 at 6:45 am #

      Perhaps you can frame your model to predict +12 months given only the observations available?

      • Varun April 26, 2019 at 7:22 pm #

        Hi Jason,

        Just to be clear, if I add regressors and train the model, I would require future values right? E.g. xreg argument in auto.arima etc. How can I forecast +12 months without utilizing regressors that I have used for training?

        • Jason Brownlee April 27, 2019 at 6:28 am #

          You could train a new productive model that only requires t-12 data to make predictions.

  55. Park1 April 24, 2019 at 8:14 pm #

    Hi!

    It’s amazing, thank you!

    Could you give all files of this project. Cannot build it on my own.

  56. Prince Tiwari May 15, 2019 at 9:31 pm #

    Hi!

    It’s amazing, thank you!

    Actually i want develop a model which determine how many calls we can expect to come into our call center on a daily basis ?

  57. Prince Tiwari May 22, 2019 at 10:12 pm #

    Hi Jason Brownlee ,

    Thank you so much for responded, i need to one more help

    Could you please explain how to use Seasonality in ARIMA with some example

    Thanks

  58. Adi June 27, 2019 at 2:37 am #

    Hi Jason,

    That’s a very helpful article. My time series is a daily Point of sale data (eg. how many pepsi bottles get bought everyday in a walmart). There are missing dates when pepsi was not sold at all. What I want to forecast is the bottles of pepsi sold on each day for the next 3/4 days. What might be the best approach/ algorithm?

    Thanks.

  59. Nagaraj July 9, 2019 at 4:21 pm #

    Hi Jason,
    I’m getting the error like this, what does this error mean?

    TypeError Traceback (most recent call last)
    in
    test = …
    predictions = …
    mse = mean_squared_error(test, predictions)
    rmse = sqrt(mse)
    print(‘RMSE: %.3f’ % rmse)

    TypeError: Expected sequence or array-like, got

    • Jason Brownlee July 10, 2019 at 8:03 am #

      That is surprising, did you copy of all of the code exactly?

Leave a Reply