Time Series Forecast Study with Python: Monthly Sales of French Champagne

By Jason Brownlee on December 10, 2020 in Time Series 222

Time series forecasting is a process, and the only way to get good forecasts is to practice this process.

In this tutorial, you will discover how to forecast the monthly sales of French champagne with Python.

Working through this tutorial will provide you with a framework for the steps and the tools for working through your own time series forecasting problems.

After completing this tutorial, you will know:

How to confirm your Python environment and carefully define a time series forecasting problem.
How to create a test harness for evaluating models, develop a baseline forecast, and better understand your problem with the tools of time series analysis.
How to develop an autoregressive integrated moving average model, save it to file, and later load it to make predictions for new time steps.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Updated Mar/2017: Fixed a typo in the code example.
Updated Apr/2019: Updated the link to dataset.
Updated Aug/2019: Updated data loading and date grouping to use new API.
Updated Feb/2020: Updated to_csv() to remove warnings.
Updated Feb/2020: Fixed data preparation and loading.
Updated May/2020: Fixed small typo when making a prediction.
Updated Dec/2020: Updated modeling for changes to the API.

Time Series Forecast Study with Python – Monthly Sales of French Champagne
Photo by Basheer Tome, some rights reserved.

Overview

In this tutorial, we will work through a time series forecasting project from end-to-end, from downloading the dataset and defining the problem to training a final model and making predictions.

This project is not exhaustive, but shows how you can get good results quickly by working through a time series forecasting problem systematically.

The steps of this project that we will through are as follows.

Environment.
Problem Description.
Test Harness.
Persistence.
Data Analysis.
ARIMA Models.
Model Validation.

This will provide a template for working through a time series prediction problem that you can use on your own dataset.

Stop learning Time Series Forecasting the slow way!

Take my free 7-day email course and discover how to get started (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

1. Environment

This tutorial assumes an installed and working SciPy environment and dependencies, including:

SciPy
NumPy
Matplotlib
Pandas
scikit-learn
statsmodels

If you need help installing Python and the SciPy environment on your workstation, consider the Anaconda distribution that manages much of it for you.

This script will help you check your installed versions of these libraries.

# check the versions of key python libraries
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)

# check the versions of key python libraries

# scipy

import scipy

print('scipy: %s' % scipy.__version__)

# numpy

import numpy

print('numpy: %s' % numpy.__version__)

# matplotlib

import matplotlib

print('matplotlib: %s' % matplotlib.__version__)

# pandas

import pandas

print('pandas: %s' % pandas.__version__)

# statsmodels

import statsmodels

print('statsmodels: %s' % statsmodels.__version__)

# scikit-learn

import sklearn

print('sklearn: %s' % sklearn.__version__)

The results on my workstation used to write this tutorial are as follows:

scipy: 1.5.4
numpy: 1.18.5
matplotlib: 3.3.3
pandas: 1.1.4
statsmodels: 0.12.1
sklearn: 0.23.2

scipy: 1.5.4

numpy: 1.18.5

matplotlib: 3.3.3

pandas: 1.1.4

statsmodels: 0.12.1

sklearn: 0.23.2

2. Problem Description

The problem is to predict the number of monthly sales of champagne for the Perrin Freres label (named for a region in France).

The dataset provides the number of monthly sales of champagne from January 1964 to September 1972, or just under 10 years of data.

The values are a count of millions of sales and there are 105 observations.

The dataset is credited to Makridakis and Wheelwright, 1989.

Download the dataset.

Download the dataset as a CSV file and place it in your current working directory with the filename “champagne.csv“.

3. Test Harness

We must develop a test harness to investigate the data and evaluate candidate models.

This involves two steps:

Defining a Validation Dataset.
Developing a Method for Model Evaluation.

3.1 Validation Dataset

The dataset is not current. This means that we cannot easily collect updated data to validate the model.

Therefore we will pretend that it is September 1971 and withhold the last one year of data from analysis and model selection.

This final year of data will be used to validate the final model.

The code below will load the dataset as a Pandas Series and split into two, one for model development (dataset.csv) and the other for validation (validation.csv).

# separate out a validation dataset
from pandas import read_csv
series = read_csv('champagne.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
split_point = len(series) - 12
dataset, validation = series[0:split_point], series[split_point:]
print('Dataset %d, Validation %d' % (len(dataset), len(validation)))
dataset.to_csv('dataset.csv', header=False)
validation.to_csv('validation.csv', header=False)

# separate out a validation dataset

from pandas import read_csv

series = read_csv('champagne.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

split_point = len(series) - 12

dataset, validation = series[0:split_point], series[split_point:]

print('Dataset %d, Validation %d' % (len(dataset), len(validation)))

dataset.to_csv('dataset.csv', header=False)

validation.to_csv('validation.csv', header=False)

Running the example creates two files and prints the number of observations in each.

Dataset 93, Validation 12

1	Dataset 93, Validation 12

The specific contents of these files are:

dataset.csv: Observations from January 1964 to September 1971 (93 observations)
validation.csv: Observations from October 1971 to September 1972 (12 observations)

The validation dataset is about 11% of the original dataset.

Note that the saved datasets do not have a header line, therefore we do not need to cater for this when working with these files later.

3.2. Model Evaluation

Model evaluation will only be performed on the data in dataset.csv prepared in the previous section.

Model evaluation involves two elements:

Performance Measure.
Test Strategy.

3.2.1 Performance Measure

The observations are a count of champagne sales in millions of units.

We will evaluate the performance of predictions using the root mean squared error (RMSE). This will give more weight to predictions that are grossly wrong and will have the same units as the original data.

Any transforms to the data must be reversed before the RMSE is calculated and reported to make the performance between different methods directly comparable.

We can calculate the RMSE using the helper function from the scikit-learn library mean_squared_error() that calculates the mean squared error between a list of expected values (the test set) and the list of predictions. We can then take the square root of this value to give us an RMSE score.

For example:

from sklearn.metrics import mean_squared_error
from math import sqrt
...
test = ...
predictions = ...
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)

from sklearn.metrics import mean_squared_error

from math import sqrt

...

test = ...

predictions = ...

mse = mean_squared_error(test, predictions)

rmse = sqrt(mse)

print('RMSE: %.3f' % rmse)

3.2.2 Test Strategy

Candidate models will be evaluated using walk-forward validation.

This is because a rolling-forecast type model is required from the problem definition. This is where one-step forecasts are needed given all available data.

The walk-forward validation will work as follows:

The first 50% of the dataset will be held back to train the model.
The remaining 50% of the dataset will be iterated and test the model.
For each step in the test dataset:
- A model will be trained.
- A one-step prediction made and the prediction stored for later evaluation.
- The actual observation from the test dataset will be added to the training dataset for the next iteration.
The predictions made during the iteration of the test dataset will be evaluated and an RMSE score reported.

Given the small size of the data, we will allow a model to be re-trained given all available data prior to each prediction.

We can write the code for the test harness using simple NumPy and Python code.

Firstly, we can split the dataset into train and test sets directly. We’re careful to always convert a loaded dataset to float32 in case the loaded data still has some String or Integer data types.

# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

Next, we can iterate over the time steps in the test dataset. The train dataset is stored in a Python list as we need to easily append a new observation each iteration and NumPy array concatenation feels like overkill.

The prediction made by the model is called yhat for convention, as the outcome or observation is referred to as y and yhat (a ‘y‘ with a mark above) is the mathematical notation for the prediction of the y variable.

The prediction and observation are printed each observation for a sanity check prediction in case there are issues with the model.

# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
	# predict
	yhat = ...
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
	print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))

# walk-forward validation

history = [x for x in train]

predictions = list()

for i in range(len(test)):

# predict

yhat = ...

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))

4. Persistence

The first step before getting bogged down in data analysis and modeling is to establish a baseline of performance.

This will provide both a template for evaluating models using the proposed test harness and a performance measure by which all more elaborate predictive models can be compared.

The baseline prediction for time series forecasting is called the naive forecast, or persistence.

This is where the observation from the previous time step is used as the prediction for the observation at the next time step.

We can plug this directly into the test harness defined in the previous section.

The complete code listing is provided below.

from pandas import read_csv
from sklearn.metrics import mean_squared_error
from math import sqrt
# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
	# predict
	yhat = history[-1]
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
	print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))
# report performance
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)

from pandas import read_csv

from sklearn.metrics import mean_squared_error

from math import sqrt

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

# walk-forward validation

history = [x for x in train]

predictions = list()

for i in range(len(test)):

# predict

yhat = history[-1]

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

print('>Predicted=%.3f, Expected=%3.f' % (yhat, obs))

# report performance

mse = mean_squared_error(test, predictions)

rmse = sqrt(mse)

print('RMSE: %.3f' % rmse)

Running the test harness prints the prediction and observation for each iteration of the test dataset.

The example ends by printing the RMSE for the model.

In this case, we can see that the persistence model achieved an RMSE of 3186.501. This means that on average, the model was wrong by about 3,186 million sales for each prediction made.

...
>Predicted=4676.000, Expected=5010
>Predicted=5010.000, Expected=4874
>Predicted=4874.000, Expected=4633
>Predicted=4633.000, Expected=1659
>Predicted=1659.000, Expected=5951
RMSE: 3186.501

...

>Predicted=4676.000, Expected=5010

>Predicted=5010.000, Expected=4874

>Predicted=4874.000, Expected=4633

>Predicted=4633.000, Expected=1659

>Predicted=1659.000, Expected=5951

RMSE: 3186.501

We now have a baseline prediction method and performance; now we can start digging into our data.

5. Data Analysis

We can use summary statistics and plots of the data to quickly learn more about the structure of the prediction problem.

In this section, we will look at the data from five perspectives:

Summary Statistics.
Line Plot.
Seasonal Line Plots
Density Plots.
Box and Whisker Plot.

5.1 Summary Statistics

Summary statistics provide a quick look at the limits of observed values. It can help to get a quick idea of what we are working with.

The example below calculates and prints summary statistics for the time series.

from pandas import read_csv
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
print(series.describe())

from pandas import read_csv

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

print(series.describe())

Running the example provides a number of summary statistics to review.

Some observations from these statistics include:

The number of observations (count) matches our expectation, meaning we are handling the data correctly.
The mean is about 4,641, which we might consider our level in this series.
The standard deviation (average spread from the mean) is relatively large at 2,486 sales.
The percentiles along with the standard deviation do suggest a large spread to the data.

count       93.000000
mean      4641.118280
std       2486.403841
min       1573.000000
25%       3036.000000
50%       4016.000000
75%       5048.000000
max      13916.000000

count 93.000000

mean 4641.118280

std 2486.403841

min 1573.000000

25% 3036.000000

50% 4016.000000

75% 5048.000000

max 13916.000000

5.2 Line Plot

A line plot of a time series can provide a lot of insight into the problem.

The example below creates and shows a line plot of the dataset.

from pandas import read_csv
from matplotlib import pyplot
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
series.plot()
pyplot.show()

from pandas import read_csv

from matplotlib import pyplot

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

series.plot()

pyplot.show()

Run the example and review the plot. Note any obvious temporal structures in the series.

Some observations from the plot include:

There may be an increasing trend of sales over time.
There appears to be systematic seasonality to the sales for each year.
The seasonal signal appears to be growing over time, suggesting a multiplicative relationship (increasing change).
There do not appear to be any obvious outliers.
The seasonality suggests that the series is almost certainly non-stationary.

Champagne Sales Line Plot

There may be benefit in explicitly modeling the seasonal component and removing it. You may also explore using differencing with one or two levels in order to make the series stationary.

The increasing trend or growth in the seasonal component may suggest the use of a log or other power transform.

5.3 Seasonal Line Plots

We can confirm the assumption that the seasonality is a yearly cycle by eyeballing line plots of the dataset by year.

The example below takes the 7 full years of data as separate groups and creates one line plot for each. The line plots are aligned vertically to help spot any year-to-year pattern.

from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
groups = series['1964':'1970'].groupby(Grouper(freq='A'))
years = DataFrame()
pyplot.figure()
i = 1
n_groups = len(groups)
for name, group in groups:
	pyplot.subplot((n_groups*100) + 10 + i)
	i += 1
	pyplot.plot(group)
pyplot.show()

from pandas import read_csv

from pandas import DataFrame

from pandas import Grouper

from matplotlib import pyplot

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

groups = series['1964':'1970'].groupby(Grouper(freq='A'))

years = DataFrame()

pyplot.figure()

i = 1

n_groups = len(groups)

for name, group in groups:

pyplot.subplot((n_groups*100) + 10 + i)

i += 1

pyplot.plot(group)

pyplot.show()

Running the example creates the stack of 7 line plots.

We can clearly see a dip each August and a rise from each August to December. This pattern appears the same each year, although at different levels.

This will help with any explicitly season-based modeling later.

Seasonal Per Year Line Plots

It might have been easier if all season line plots were added to the one graph to help contrast the data for each year.

5.4 Density Plot

Reviewing plots of the density of observations can provide further insight into the structure of the data.

The example below creates a histogram and density plot of the observations without any temporal structure.

from pandas import read_csv
from matplotlib import pyplot
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
pyplot.figure(1)
pyplot.subplot(211)
series.hist()
pyplot.subplot(212)
series.plot(kind='kde')
pyplot.show()

from pandas import read_csv

from matplotlib import pyplot

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

pyplot.figure(1)

pyplot.subplot(211)

series.hist()

pyplot.subplot(212)

series.plot(kind='kde')

pyplot.show()

Run the example and review the plots.

Some observations from the plots include:

The distribution is not Gaussian.
The shape has a long right tail and may suggest an exponential distribution

This lends more support to exploring some power transforms of the data prior to modeling.

5.5 Box and Whisker Plots

We can group the monthly data by year and get an idea of the spread of observations for each year and how this may be changing.

We do expect to see some trend (increasing mean or median), but it may be interesting to see how the rest of the distribution may be changing.

The example below groups the observations by year and creates one box and whisker plot for each year of observations. The last year (1971) only contains 9 months and may not be a useful comparison with the 12 months of observations for other years. Therefore, only data between 1964 and 1970 was plotted.

from pandas import read_csv
from pandas import DataFrame
from pandas import Grouper
from matplotlib import pyplot
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
groups = series['1964':'1970'].groupby(Grouper(freq='A'))
years = DataFrame()
for name, group in groups:
	years[name.year] = group.values
years.boxplot()
pyplot.show()

from pandas import read_csv

from pandas import DataFrame

from pandas import Grouper

from matplotlib import pyplot

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

groups = series['1964':'1970'].groupby(Grouper(freq='A'))

years = DataFrame()

for name, group in groups:

years[name.year] = group.values

years.boxplot()

pyplot.show()

Running the example creates 7 box and whisker plots side-by-side, one for each of the 7 years of selected data.

Some observations from reviewing the plots include:

The median values for each year (red line) may show an increasing trend.
The spread or middle 50% of the data (blue boxes) does appear reasonably stable.
There are outliers each year (black crosses); these may be the tops or bottoms of the seasonal cycle.
The last year, 1970, does look different from the trend in prior years

Champagne Sales Box and Whisker Plots

The observations suggest perhaps some growth trend over the years and outliers that may be a part of the seasonal cycle.

This yearly view of the data is an interesting avenue and could be pursued further by looking at summary statistics from year-to-year and changes in summary stats from year-to-year.

6. ARIMA Models

In this section, we will develop Autoregressive Integrated Moving Average, or ARIMA, models for the problem.

We will approach modeling by both manual and automatic configuration of the ARIMA model. This will be followed by a third step of investigating the residual errors of the chosen model.

As such, this section is broken down into 3 steps:

Manually Configure the ARIMA.
Automatically Configure the ARIMA.
Review Residual Errors.

6.1 Manually Configured ARIMA

The ARIMA(p,d,q) model requires three parameters and is traditionally configured manually.

Analysis of the time series data assumes that we are working with a stationary time series.

The time series is almost certainly non-stationary. We can make it stationary this by first differencing the series and using a statistical test to confirm that the result is stationary.

The seasonality in the series is seemingly year-to-year. Seasonal data can be differenced by subtracting the observation from the same time in the previous cycle, in this case the same month in the previous year. This does mean that we will lose the first year of observations as there is no prior year to difference with.

The example below creates a deseasonalized version of the series and saves it to file stationary.csv.

from pandas import read_csv
from pandas import Series
from statsmodels.tsa.stattools import adfuller
from matplotlib import pyplot

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return Series(diff)

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
X = series.values
X = X.astype('float32')
# difference data
months_in_year = 12
stationary = difference(X, months_in_year)
stationary.index = series.index[months_in_year:]
# check if stationary
result = adfuller(stationary)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))
# save
stationary.to_csv('stationary.csv', header=False)
# plot
stationary.plot()
pyplot.show()

from pandas import read_csv

from pandas import Series

from statsmodels.tsa.stattools import adfuller

from matplotlib import pyplot

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return Series(diff)

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

X = series.values

X = X.astype('float32')

# difference data

months_in_year = 12

stationary = difference(X, months_in_year)

stationary.index = series.index[months_in_year:]

# check if stationary

result = adfuller(stationary)

print('ADF Statistic: %f' % result[0])

print('p-value: %f' % result[1])

print('Critical Values:')

for key, value in result[4].items():

print('\t%s: %.3f' % (key, value))

# save

stationary.to_csv('stationary.csv', header=False)

# plot

stationary.plot()

pyplot.show()

Running the example outputs the result of a statistical significance test of whether the differenced series is stationary. Specifically, the augmented Dickey-Fuller test.

The results show that the test statistic value -7.134898 is smaller than the critical value at 1% of -3.515. This suggests that we can reject the null hypothesis with a significance level of less than 1% (i.e. a low probability that the result is a statistical fluke).

Rejecting the null hypothesis means that the process has no unit root, and in turn that the time series is stationary or does not have time-dependent structure.

ADF Statistic: -7.134898
p-value: 0.000000
Critical Values:
	5%: -2.898
	1%: -3.515
	10%: -2.586

ADF Statistic: -7.134898

p-value: 0.000000

Critical Values:

5%: -2.898

1%: -3.515

10%: -2.586

For reference, the seasonal difference operation can be inverted by adding the observation for the same month the year before. This is needed in the case that predictions are made by a model fit on seasonally differenced data. The function to invert the seasonal difference operation is listed below for completeness.

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

A plot of the differenced dataset is also created.

The plot does not show any obvious seasonality or trend, suggesting the seasonally differenced dataset is a good starting point for modeling.

We will use this dataset as an input to the ARIMA model. It also suggests that no further differencing may be required, and that the d parameter may be set to 0.

Seasonal Differenced Champagne Sales Line Plot

The next first step is to select the lag values for the Autoregression (AR) and Moving Average (MA) parameters, p and q respectively.

We can do this by reviewing Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots.

Note, we are now using the seasonally differenced stationary.csv as our dataset.

The example below creates ACF and PACF plots for the series.

from pandas import read_csv
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot
series = read_csv('stationary.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
pyplot.figure()
pyplot.subplot(211)
plot_acf(series, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(series, ax=pyplot.gca())
pyplot.show()

from pandas import read_csv

from statsmodels.graphics.tsaplots import plot_acf

from statsmodels.graphics.tsaplots import plot_pacf

from matplotlib import pyplot

series = read_csv('stationary.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

pyplot.figure()

pyplot.subplot(211)

plot_acf(series, ax=pyplot.gca())

pyplot.subplot(212)

plot_pacf(series, ax=pyplot.gca())

pyplot.show()

Run the example and review the plots for insights into how to set the p and q variables for the ARIMA model.

Below are some observations from the plots.

The ACF shows a significant lag for 1 month.
The PACF shows a significant lag for 1 month, with perhaps some significant lag at 12 and 13 months.
Both the ACF and PACF show a drop-off at the same point, perhaps suggesting a mix of AR and MA.

A good starting point for the p and q values is also 1.

The PACF plot also suggests that there is still some seasonality present in the differenced data.

We may consider a better model of seasonality, such as modeling it directly and explicitly removing it from the model rather than seasonal differencing.

ACF and PACF Plots of Seasonally Differenced Champagne Sales

This quick analysis suggests an ARIMA(1,0,1) on the stationary data may be a good starting point.

The historic observations will be seasonally differenced prior to the fitting of each ARIMA model. The differencing will be inverted for all predictions made to make them directly comparable to the expected observation in the original sale count units.

Experimentation shows that this configuration of ARIMA does not converge and results in errors by the underlying library. Further experimentation showed that adding one level of differencing to the stationary data made the model more stable. The model can be extended to ARIMA(1,1,1).

We will also disable the automatic addition of a trend constant from the model by setting the ‘trend‘ argument to ‘nc‘ for no constant in the call to fit(). From experimentation, I find that this can result in better forecast performance on some problems.

The example below demonstrates the performance of this ARIMA model on the test harness.

# evaluate manually configured ARIMA model
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA
from math import sqrt

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
	# difference data
	months_in_year = 12
	diff = difference(history, months_in_year)
	# predict
	model = ARIMA(diff, order=(1,1,1))
	model_fit = model.fit()
	yhat = model_fit.forecast()[0]
	yhat = inverse_difference(history, yhat, months_in_year)
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
	print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

# evaluate manually configured ARIMA model

from pandas import read_csv

from sklearn.metrics import mean_squared_error

from statsmodels.tsa.arima.model import ARIMA

from math import sqrt

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

# walk-forward validation

history = [x for x in train]

predictions = list()

for i in range(len(test)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

# predict

model = ARIMA(diff, order=(1,1,1))

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))

# report performance

rmse = sqrt(mean_squared_error(test, predictions))

print('RMSE: %.3f' % rmse)

Note, you may see a warning message from the underlying linear algebra library; this can be ignored for now.

Running this example results in an RMSE of 956.942, which is dramatically better than the persistence RMSE of 3186.501.

...
>Predicted=3157.018, Expected=5010
>Predicted=4615.082, Expected=4874
>Predicted=4624.998, Expected=4633
>Predicted=2044.097, Expected=1659
>Predicted=5404.428, Expected=5951
RMSE: 956.942

...

>Predicted=3157.018, Expected=5010

>Predicted=4615.082, Expected=4874

>Predicted=4624.998, Expected=4633

>Predicted=2044.097, Expected=1659

>Predicted=5404.428, Expected=5951

RMSE: 956.942

This is a great start, but we may be able to get improved results with a better configured ARIMA model.

6.2 Grid Search ARIMA Hyperparameters

The ACF and PACF plots suggest that an ARIMA(1,0,1) or similar may be the best that we can do.

To confirm this analysis, we can grid search a suite of ARIMA hyperparameters and check that no models result in better out of sample RMSE performance.

In this section, we will search values of p, d, and q for combinations (skipping those that fail to converge), and find the combination that results in the best performance on the test set. We will use a grid search to explore all combinations in a subset of integer values.

Specifically, we will search all combinations of the following parameters:

p: 0 to 6.
d: 0 to 2.
q: 0 to 6.

This is (7 * 3 * 7), or 147, potential runs of the test harness and will take some time to execute.

It may be interesting to evaluate MA models with a lag of 12 or 13 as were noticed as potentially interesting from reviewing the ACF and PACF plots. Experimentation suggested that these models may not be stable, resulting in errors in the underlying mathematical libraries.

The complete worked example with the grid search version of the test harness is listed below.

# grid search ARIMA parameters for time series
import warnings
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return numpy.array(diff)

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# evaluate an ARIMA model for a given order (p,d,q) and return RMSE
def evaluate_arima_model(X, arima_order):
	# prepare training dataset
	X = X.astype('float32')
	train_size = int(len(X) * 0.50)
	train, test = X[0:train_size], X[train_size:]
	history = [x for x in train]
	# make predictions
	predictions = list()
	for t in range(len(test)):
		# difference data
		months_in_year = 12
		diff = difference(history, months_in_year)
		model = ARIMA(diff, order=arima_order)
		model_fit = model.fit()
		yhat = model_fit.forecast()[0]
		yhat = inverse_difference(history, yhat, months_in_year)
		predictions.append(yhat)
		history.append(test[t])
	# calculate out of sample error
	rmse = sqrt(mean_squared_error(test, predictions))
	return rmse

# evaluate combinations of p, d and q values for an ARIMA model
def evaluate_models(dataset, p_values, d_values, q_values):
	dataset = dataset.astype('float32')
	best_score, best_cfg = float("inf"), None
	for p in p_values:
		for d in d_values:
			for q in q_values:
				order = (p,d,q)
				try:
					rmse = evaluate_arima_model(dataset, order)
					if rmse < best_score:
						best_score, best_cfg = rmse, order
					print('ARIMA%s RMSE=%.3f' % (order,rmse))
				except:
					continue
	print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))

# load dataset
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# evaluate parameters
p_values = range(0, 7)
d_values = range(0, 3)
q_values = range(0, 7)
warnings.filterwarnings("ignore")
evaluate_models(series.values, p_values, d_values, q_values)

# grid search ARIMA parameters for time series

import warnings

from pandas import read_csv

from statsmodels.tsa.arima.model import ARIMA

from sklearn.metrics import mean_squared_error

from math import sqrt

import numpy

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return numpy.array(diff)

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# evaluate an ARIMA model for a given order (p,d,q) and return RMSE

def evaluate_arima_model(X, arima_order):

# prepare training dataset

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

history = [x for x in train]

# make predictions

predictions = list()

for t in range(len(test)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

model = ARIMA(diff, order=arima_order)

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

history.append(test[t])

# calculate out of sample error

rmse = sqrt(mean_squared_error(test, predictions))

return rmse

# evaluate combinations of p, d and q values for an ARIMA model

def evaluate_models(dataset, p_values, d_values, q_values):

dataset = dataset.astype('float32')

best_score, best_cfg = float("inf"), None

for p in p_values:

for d in d_values:

for q in q_values:

order = (p,d,q)

try:

rmse = evaluate_arima_model(dataset, order)

if rmse < best_score:

best_score, best_cfg = rmse, order

print('ARIMA%s RMSE=%.3f' % (order,rmse))

except:

continue

print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))

# load dataset

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# evaluate parameters

p_values = range(0, 7)

d_values = range(0, 3)

q_values = range(0, 7)

warnings.filterwarnings("ignore")

evaluate_models(series.values, p_values, d_values, q_values)

Running the example runs through all combinations and reports the results on those that converge without error. The example takes a little over 2 minutes to run on modern hardware.

The results show that the best configuration discovered was ARIMA(0, 0, 1) with an RMSE of 939.464, slightly lower than the manually configured ARIMA from the previous section. This difference may or may not be statistically significant.

...
ARIMA(5, 1, 2) RMSE=1003.200
ARIMA(5, 2, 1) RMSE=1053.728
ARIMA(6, 0, 0) RMSE=996.466
ARIMA(6, 1, 0) RMSE=1018.211
ARIMA(6, 1, 1) RMSE=1023.762
Best ARIMA(0, 0, 1) RMSE=939.464

...

ARIMA(5, 1, 2) RMSE=1003.200

ARIMA(5, 2, 1) RMSE=1053.728

ARIMA(6, 0, 0) RMSE=996.466

ARIMA(6, 1, 0) RMSE=1018.211

ARIMA(6, 1, 1) RMSE=1023.762

Best ARIMA(0, 0, 1) RMSE=939.464

We will select this ARIMA(0, 0, 1) model going forward.

6.3 Review Residual Errors

A good final check of a model is to review residual forecast errors.

Ideally, the distribution of residual errors should be a Gaussian with a zero mean.

We can check this by using summary statistics and plots to investigate the residual errors from the ARIMA(0, 0, 1) model. The example below calculates and summarizes the residual forecast errors.

# summarize ARIMA forecast residuals
from pandas import read_csv
from pandas import DataFrame
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
	# difference data
	months_in_year = 12
	diff = difference(history, months_in_year)
	# predict
	model = ARIMA(diff, order=(0,0,1))
	model_fit = model.fit()
	yhat = model_fit.forecast()[0]
	yhat = inverse_difference(history, yhat, months_in_year)
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
# errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = DataFrame(residuals)
print(residuals.describe())
# plot
pyplot.figure()
pyplot.subplot(211)
residuals.hist(ax=pyplot.gca())
pyplot.subplot(212)
residuals.plot(kind='kde', ax=pyplot.gca())
pyplot.show()

# summarize ARIMA forecast residuals

from pandas import read_csv

from pandas import DataFrame

from statsmodels.tsa.arima.model import ARIMA

from matplotlib import pyplot

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

# walk-forward validation

history = [x for x in train]

predictions = list()

for i in range(len(test)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

# predict

model = ARIMA(diff, order=(0,0,1))

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

# errors

residuals = [test[i]-predictions[i] for i in range(len(test))]

residuals = DataFrame(residuals)

print(residuals.describe())

# plot

pyplot.figure()

pyplot.subplot(211)

residuals.hist(ax=pyplot.gca())

pyplot.subplot(212)

residuals.plot(kind='kde', ax=pyplot.gca())

pyplot.show()

Running the example first describes the distribution of the residuals.

We can see that the distribution has a right shift and that the mean is non-zero at 165.904728.

This is perhaps a sign that the predictions are biased.

count    47.000000
mean    165.904728
std     934.696199
min   -2164.247449
25%    -289.651596
50%     191.759548
75%     732.992187
max    2367.304748

count 47.000000

mean 165.904728

std 934.696199

min -2164.247449

25% -289.651596

50% 191.759548

75% 732.992187

max 2367.304748

The distribution of residual errors is also plotted.

The graphs suggest a Gaussian-like distribution with a bumpy left tail, providing further evidence that perhaps a power transform might be worth exploring.

Residual Forecast Error Density Plots

We could use this information to bias-correct predictions by adding the mean residual error of 165.904728 to each forecast made.

The example below performs this bias correlation.

# plots of residual errors of bias corrected forecasts
from pandas import read_csv
from pandas import DataFrame
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
from math import sqrt

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
bias = 165.904728
for i in range(len(test)):
	# difference data
	months_in_year = 12
	diff = difference(history, months_in_year)
	# predict
	model = ARIMA(diff, order=(0,0,1))
	model_fit = model.fit()
	yhat = model_fit.forecast()[0]
	yhat = bias + inverse_difference(history, yhat, months_in_year)
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)
# errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = DataFrame(residuals)
print(residuals.describe())
# plot
pyplot.figure()
pyplot.subplot(211)
residuals.hist(ax=pyplot.gca())
pyplot.subplot(212)
residuals.plot(kind='kde', ax=pyplot.gca())
pyplot.show()

# plots of residual errors of bias corrected forecasts

from pandas import read_csv

from pandas import DataFrame

from statsmodels.tsa.arima.model import ARIMA

from matplotlib import pyplot

from sklearn.metrics import mean_squared_error

from math import sqrt

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

# walk-forward validation

history = [x for x in train]

predictions = list()

bias = 165.904728

for i in range(len(test)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

# predict

model = ARIMA(diff, order=(0,0,1))

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = bias + inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

# report performance

rmse = sqrt(mean_squared_error(test, predictions))

print('RMSE: %.3f' % rmse)

# errors

residuals = [test[i]-predictions[i] for i in range(len(test))]

residuals = DataFrame(residuals)

print(residuals.describe())

# plot

pyplot.figure()

pyplot.subplot(211)

residuals.hist(ax=pyplot.gca())

pyplot.subplot(212)

residuals.plot(kind='kde', ax=pyplot.gca())

pyplot.show()

The performance of the predictions is improved very slightly from 939.464 to 924.699, which may or may not be significant.

The summary of the forecast residual errors shows that the mean was indeed moved to a value very close to zero.

RMSE: 924.699

count  4.700000e+01
mean   4.965016e-07
std    9.346962e+02
min   -2.330152e+03
25%   -4.555563e+02
50%    2.585482e+01
75%    5.670875e+02
max    2.201400e+03

RMSE: 924.699

count 4.700000e+01

mean 4.965016e-07

std 9.346962e+02

min -2.330152e+03

25% -4.555563e+02

50% 2.585482e+01

75% 5.670875e+02

max 2.201400e+03

Finally, density plots of the residual error do show a small shift towards zero.

It is debatable whether this bias correction is worth it, but we will use it for now.

Bias-Corrected Residual Forecast Error Density Plots

It is also a good idea to check the time series of the residual errors for any type of autocorrelation. If present, it would suggest that the model has more opportunity to model the temporal structure in the data.

The example below re-calculates the residual errors and creates ACF and PACF plots to check for any significant autocorrelation.

# ACF and PACF plots of residual errors of bias corrected forecasts
from pandas import read_csv
from pandas import DataFrame
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
	# difference data
	months_in_year = 12
	diff = difference(history, months_in_year)
	# predict
	model = ARIMA(diff, order=(0,0,1))
	model_fit = model.fit()
	yhat = model_fit.forecast()[0]
	yhat = inverse_difference(history, yhat, months_in_year)
	predictions.append(yhat)
	# observation
	obs = test[i]
	history.append(obs)
# errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = DataFrame(residuals)
print(residuals.describe())
# plot
pyplot.figure()
pyplot.subplot(211)
plot_acf(residuals, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(residuals, ax=pyplot.gca())
pyplot.show()

# ACF and PACF plots of residual errors of bias corrected forecasts

from pandas import read_csv

from pandas import DataFrame

from statsmodels.tsa.arima.model import ARIMA

from matplotlib import pyplot

from statsmodels.graphics.tsaplots import plot_acf

from statsmodels.graphics.tsaplots import plot_pacf

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

train_size = int(len(X) * 0.50)

train, test = X[0:train_size], X[train_size:]

# walk-forward validation

history = [x for x in train]

predictions = list()

for i in range(len(test)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

# predict

model = ARIMA(diff, order=(0,0,1))

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

# observation

obs = test[i]

history.append(obs)

# errors

residuals = [test[i]-predictions[i] for i in range(len(test))]

residuals = DataFrame(residuals)

print(residuals.describe())

# plot

pyplot.figure()

pyplot.subplot(211)

plot_acf(residuals, ax=pyplot.gca())

pyplot.subplot(212)

plot_pacf(residuals, ax=pyplot.gca())

pyplot.show()

The results suggest that what little autocorrelation is present in the time series has been captured by the model.

Residual Forecast Error ACF and PACF Plots

7. Model Validation

After models have been developed and a final model selected, it must be validated and finalized.

Validation is an optional part of the process, but one that provides a ‘last check’ to ensure we have not fooled or misled ourselves.

This section includes the following steps:

Finalize Model: Train and save the final model.
Make Prediction: Load the finalized model and make a prediction.
Validate Model: Load and validate the final model.

7.1 Finalize Model

Finalizing the model involves fitting an ARIMA model on the entire dataset, in this case on a transformed version of the entire dataset.

Once fit, the model can be saved to file for later use.

The example below trains an ARIMA(0,0,1) model on the dataset and saves the whole fit object and the bias to file.

The example below saves the fit model to file in the correct state so that it can be loaded successfully later.

# save finalized model
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMA
import numpy

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# load data
series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype('float32')
# difference data
months_in_year = 12
diff = difference(X, months_in_year)
# fit model
model = ARIMA(diff, order=(0,0,1))
model_fit = model.fit()
# bias constant, could be calculated from in-sample mean residual
bias = 165.904728
# save model
model_fit.save('model.pkl')
numpy.save('model_bias.npy', [bias])

# save finalized model

from pandas import read_csv

from statsmodels.tsa.arima.model import ARIMA

import numpy

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# load data

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

# prepare data

X = series.values

X = X.astype('float32')

# difference data

months_in_year = 12

diff = difference(X, months_in_year)

# fit model

model = ARIMA(diff, order=(0,0,1))

model_fit = model.fit()

# bias constant, could be calculated from in-sample mean residual

bias = 165.904728

# save model

model_fit.save('model.pkl')

numpy.save('model_bias.npy', [bias])

Running the example creates two local files:

model.pkl This is the ARIMAResult object from the call to ARIMA.fit(). This includes the coefficients and all other internal data returned when fitting the model.
model_bias.npy This is the bias value stored as a one-row, one-column NumPy array.

7.2 Make Prediction

A natural case may be to load the model and make a single forecast.

This is relatively straightforward and involves restoring the saved model and the bias and calling the forecast() method. To invert the seasonal differencing, the historical data must also be loaded.

The example below loads the model, makes a prediction for the next time step, and prints the prediction.

# load finalized model and make a prediction
from pandas import read_csv
from statsmodels.tsa.arima.model import ARIMAResults
import numpy

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
months_in_year = 12
model_fit = ARIMAResults.load('model.pkl')
bias = numpy.load('model_bias.npy')
yhat = float(model_fit.forecast()[0])
yhat = bias + inverse_difference(series.values, yhat, months_in_year)
print('Predicted: %.3f' % yhat)

# load finalized model and make a prediction

from pandas import read_csv

from statsmodels.tsa.arima.model import ARIMAResults

import numpy

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

series = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

months_in_year = 12

model_fit = ARIMAResults.load('model.pkl')

bias = numpy.load('model_bias.npy')

yhat = float(model_fit.forecast()[0])

yhat = bias + inverse_difference(series.values, yhat, months_in_year)

print('Predicted: %.3f' % yhat)

Running the example prints the prediction of about 6794.

Predicted: 6794.773

1	Predicted: 6794.773

If we peek inside validation.csv, we can see that the value on the first row for the next time period is 6981.

The prediction is in the right ballpark.

7.3 Validate Model

We can load the model and use it in a pretend operational manner.

In the test harness section, we saved the final 12 months of the original dataset in a separate file to validate the final model.

We can load this validation.csv file now and use it see how well our model really is on “unseen” data.

There are two ways we might proceed:

Load the model and use it to forecast the next 12 months. The forecast beyond the first one or two months will quickly start to degrade in skill.
Load the model and use it in a rolling-forecast manner, updating the transform and model for each time step. This is the preferred method as it is how one would use this model in practice as it would achieve the best performance.

As with model evaluation in previous sections, we will make predictions in a rolling-forecast manner. This means that we will step over lead times in the validation dataset and take the observations as an update to the history.

# load and evaluate the finalized model on the validation dataset
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima.model import ARIMAResults
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy

# create a differenced series
def difference(dataset, interval=1):
	diff = list()
	for i in range(interval, len(dataset)):
		value = dataset[i] - dataset[i - interval]
		diff.append(value)
	return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
	return yhat + history[-interval]

# load and prepare datasets
dataset = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
X = dataset.values.astype('float32')
history = [x for x in X]
months_in_year = 12
validation = read_csv('validation.csv', header=None, index_col=0, parse_dates=True, squeeze=True)
y = validation.values.astype('float32')
# load model
model_fit = ARIMAResults.load('model.pkl')
bias = numpy.load('model_bias.npy')
# make first prediction
predictions = list()
yhat = float(model_fit.forecast()[0])
yhat = bias + inverse_difference(history, yhat, months_in_year)
predictions.append(yhat)
history.append(y[0])
print('>Predicted=%.3f, Expected=%.3f' % (yhat, y[0]))
# rolling forecasts
for i in range(1, len(y)):
	# difference data
	months_in_year = 12
	diff = difference(history, months_in_year)
	# predict
	model = ARIMA(diff, order=(0,0,1))
	model_fit = model.fit()
	yhat = model_fit.forecast()[0]
	yhat = bias + inverse_difference(history, yhat, months_in_year)
	predictions.append(yhat)
	# observation
	obs = y[i]
	history.append(obs)
	print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))
# report performance
rmse = sqrt(mean_squared_error(y, predictions))
print('RMSE: %.3f' % rmse)
pyplot.plot(y)
pyplot.plot(predictions, color='red')
pyplot.show()

# load and evaluate the finalized model on the validation dataset

from pandas import read_csv

from matplotlib import pyplot

from statsmodels.tsa.arima.model import ARIMA

from statsmodels.tsa.arima.model import ARIMAResults

from sklearn.metrics import mean_squared_error

from math import sqrt

import numpy

# create a differenced series

def difference(dataset, interval=1):

diff = list()

for i in range(interval, len(dataset)):

value = dataset[i] - dataset[i - interval]

diff.append(value)

return diff

# invert differenced value

def inverse_difference(history, yhat, interval=1):

return yhat + history[-interval]

# load and prepare datasets

dataset = read_csv('dataset.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

X = dataset.values.astype('float32')

history = [x for x in X]

months_in_year = 12

validation = read_csv('validation.csv', header=None, index_col=0, parse_dates=True, squeeze=True)

y = validation.values.astype('float32')

# load model

model_fit = ARIMAResults.load('model.pkl')

bias = numpy.load('model_bias.npy')

# make first prediction

predictions = list()

yhat = float(model_fit.forecast()[0])

yhat = bias + inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

history.append(y[0])

print('>Predicted=%.3f, Expected=%.3f' % (yhat, y[0]))

# rolling forecasts

for i in range(1, len(y)):

# difference data

months_in_year = 12

diff = difference(history, months_in_year)

# predict

model = ARIMA(diff, order=(0,0,1))

model_fit = model.fit()

yhat = model_fit.forecast()[0]

yhat = bias + inverse_difference(history, yhat, months_in_year)

predictions.append(yhat)

# observation

obs = y[i]

history.append(obs)

print('>Predicted=%.3f, Expected=%.3f' % (yhat, obs))

# report performance

rmse = sqrt(mean_squared_error(y, predictions))

print('RMSE: %.3f' % rmse)

pyplot.plot(y)

pyplot.plot(predictions, color='red')

pyplot.show()

Running the example prints each prediction and expected value for the time steps in the validation dataset.

The final RMSE for the validation period is predicted at 361.110 million sales.

This is much better than the expectation of an error of a little more than 924 million sales per month.

>Predicted=6794.773, Expected=6981
>Predicted=10101.763, Expected=9851
>Predicted=13219.067, Expected=12670
>Predicted=3996.535, Expected=4348
>Predicted=3465.934, Expected=3564
>Predicted=4522.683, Expected=4577
>Predicted=4901.336, Expected=4788
>Predicted=5190.094, Expected=4618
>Predicted=4930.190, Expected=5312
>Predicted=4944.785, Expected=4298
>Predicted=1699.409, Expected=1413
>Predicted=6085.324, Expected=5877
RMSE: 361.110

>Predicted=6794.773, Expected=6981

>Predicted=10101.763, Expected=9851

>Predicted=13219.067, Expected=12670

>Predicted=3996.535, Expected=4348

>Predicted=3465.934, Expected=3564

>Predicted=4522.683, Expected=4577

>Predicted=4901.336, Expected=4788

>Predicted=5190.094, Expected=4618

>Predicted=4930.190, Expected=5312

>Predicted=4944.785, Expected=4298

>Predicted=1699.409, Expected=1413

>Predicted=6085.324, Expected=5877

RMSE: 361.110

A plot of the predictions compared to the validation dataset is also provided.

At this scale on the plot, the 12 months of forecast sales figures look fantastic.

Forecast for Champagne Sales Validation Dataset

Summary

In this tutorial, you discovered the steps and the tools for a time series forecasting project with Python.

We have covered a lot of ground in this tutorial; specifically:

How to develop a test harness with a performance measure and evaluation method and how to quickly develop a baseline forecast and skill.
How to use time series analysis to raise ideas for how to best model the forecast problem.
How to develop an ARIMA model, save it, and later load it to make predictions on new data.

How did you do? Do you have any questions about this tutorial?
Ask your questions in the comments below and I will do my best to answer.

222 Responses to Time Series Forecast Study with Python: Monthly Sales of French Champagne

Benson Dube February 20, 2017 at 8:46 pm #

Thank You Jason.

Would you recommend this example to a starter. I am totally new to programming and Machine Learning, and I am currently learning Python sytnax and understanding the basic terms around algorithms used.

Regards,

Benson

Reply
- Jason Brownlee February 21, 2017 at 9:34 am #
  
  Hi Benson, this tutorial is an example for a beginner in time series forecasting, but does assume you know your way around Python first (or you can pick it up fast).
  
  Reply
  - Benson March 1, 2017 at 6:45 am #
    
    A surely steep learning curve taking you out of your comfort zone, but that’s the way to learn. ?
    
    Reply
    - Jason Brownlee March 1, 2017 at 8:48 am #
      
      For sure. Dive in!
      
      Reply
  - laurenciatitan November 14, 2018 at 1:57 am #
    
    helo jason will i be able to contact you and discuss, need explaination..
    
    Reply
    - Jason Brownlee November 14, 2018 at 7:33 am #
      
      You can use the contact page to contact me any time via email:
      https://machinelearningmastery.com/contact/
      
      Reply
Viral Mehta February 24, 2017 at 4:56 am #

Jason, thanks for very detailed instructions on how to construct the model. When I add a few additional periods in the validation set (manually) for a short-term forecast beyond the dataset, the model won’t run until I provide some ‘fake’ targets (expected y). However, when I provide some fake targets, I see that model quickly adjusts to those targets. I tried different levels of y values and I see model predicted accordingly, shouldn’t the predictions be independent of what targets it sees?

Reply
- Jason Brownlee February 24, 2017 at 10:14 am #
  
  You should not need fake targets Viral.
  
  You can forecast a new out of sample data point by training the model on all of the available data and predicting one-step. E.g.:
  
  yhat = model_fit.forecast()[0]
  
  1
  
  yhat = model_fit.forecast()[0]
  
  Reply
  - Viral Mehta February 25, 2017 at 2:14 am #
    
    Jason, much appreciate your response. What I am trying to convey is that the model should predict same outputs regardless of the observations. For example, when I change y values in my validation set (last 12 months that the model hasn’t seen), my predictions change and the model gives me a very nice RMSE every time. If the model was trained properly, it should have same output for next twelve months regardless of my y values in the validation set. I think you will see this effect very quickly if you also change your y values from validation set. Unless the model is only good for one period forward and needs to continuously adjust based on observed values of last period.
    
    Reply
    - Jason Brownlee February 25, 2017 at 6:00 am #
      
      In this specific case we are using walk-forward validation with a model trained for one-step forecasts.
      
      This means as we walk over the validation dataset, observations are moved from validation into training and the model is refit. This is to simulate a new observation being available after a prediction is made for a time step and us being able to update our model in response.
      
      I think what you are talking about is a multi-step forecast, e.g. fitting the model on all of the training data, then forecasting the next n time steps. This is not the experimental setup here, but you are welcome to try it.
      
      Reply
  - Benson Dube March 15, 2017 at 9:11 am #
    
    Hello Jason,
    
    If you don’t mind me asking, where exactly should the line below fit?:
    
    yhat = model_fit.forecast()[0]
    
    Reply
    - Jason Brownlee March 16, 2017 at 7:57 am #
      
      Hi Benson,
      
      Fit the ARIMA model on all of your data.
      
      After that, make a forecast for the next time step by calling the forecast() function.
      
      Reply
Juanlu February 27, 2017 at 10:15 am #

Time to upgrade to matplotlib 2.0, the colors are nicer 🙂

Reply
- Jason Brownlee February 28, 2017 at 8:09 am #
  
  I agree Juanlu! I have recently upgraded myself.
  
  Reply
Hugo March 24, 2017 at 2:38 am #

Dear Jason,
Thanks for the post. Clear, concise and working.

Have you considered to forecast the TS using a SARIMA model instead of substracting the seasonality and adding it latter? As a matter of fact, statsmodel has it integrated in its dev version. (http://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)

I am wondering if it is possible to take into account hierarchy, like the forecast of monthly sales divided in sub-forecasts of french champagne destination markets, using python.

Thanks and keep posting!

Reply
- Jason Brownlee March 24, 2017 at 7:59 am #
  
  Hi Hugo, yes I am familiar with SARIMA, but I decided to keep the example limited to ARIMA.
  
  Yes, as long as you have the data to train/verify the model. I would consider starting with independent models for each case, using all available data and tease out how to optimize it further after that.
  
  Reply
Luis Ashurei May 2, 2017 at 7:25 pm #

Thanks for the wonderful post Jason,

Two quick question:

1.
When I’m doing Grid Search ARIMA Hyperparameters with your example, it’s quite slow on my machine, spent about 1 minutes. However the parameter range and data are not large at all. Is it slow specific on my end? I’m concerning the model is too slow to use.

2.
Does ARIMA support multi-step forecast? E.g. What if I keep use predicted value into next forecast, will it overfitting?

Thanks!

Reply
- Jason Brownlee May 3, 2017 at 7:34 am #
  
  Your timing sounds fine. If it is too slow for you, consider working with a sub-sample of your data.
  
  You can do multi-step forecasting directly (forecast() function) or by using the ARIMA model recursively:
  https://machinelearningmastery.com/multi-step-time-series-forecasting/
  
  Reply
  - Luis May 3, 2017 at 6:50 pm #
    
    Thanks in advice.
    
    Reply
Pramod Gupta July 3, 2017 at 7:58 pm #

Hello Jason

Thanks for this awesome hands-on on Time series. However I have a query.

I extracted year and month from date column as my features for model. Then I built a Linear Regression model and a Random Forest model. My final prediction was a weighted average of both these models. I got an rmse of 366 (similar to yours i.e. 361 on validation data).

Can this approach be the alternative for an ARIMA model? What can be the possible drawbacks of this approach?

Appreciate your comments
Thanks

Reply
- Jason Brownlee July 6, 2017 at 10:01 am #
  
  Nice work. ARIMA may be a simpler model if model complexity is important on a project.
  
  Reply
Roberto July 5, 2017 at 7:26 am #

Hi just a silly note have you consider using itertools for your evaluate_models() function?
https://pastebin.com/w0apb6Tg

in python nested loops are not so readable.

Reply
- Jason Brownlee July 6, 2017 at 10:21 am #
  
  Thanks for the suggestion.
  
  Reply
Rishi Patil July 8, 2017 at 9:15 am #

Jason, I just finished reading “Introduction to Time Series Forecasting in Python” in detail and had two questions:
* Are you planning to come out with a “Intermediate course in Time Series Forecasting in Python” soon?
* You spend the first few chapters discussing how to reframe a time-series problem into a typical Machine Learning problem using Lag features. However, in later chapters, you only use ARIMA models, that obviate the necessity of using explicitly generated lag features. What’s the best way to use explicitly generated timeseries features with other ML algorithms (such as SVM or XGBoost)?
(I tried out a few AR lags with XGB and got decent results using but couldn’t figure out how to incorporate the MA parts).
Would appreciate any insights.

Reply
- Jason Brownlee July 9, 2017 at 10:50 am #
  
  I may have a more advanced TS book in the future.
  
  Great question. You would have to calculate the MA features manually.
  
  Reply
Jan August 2, 2017 at 8:34 pm #

Hi Jason,

can it be that TimeGrouper() does not work if there are months missing in a year? Also there are no docs available for TimeGrouper(). May you can use pd.Grouper in your future examples?

Regards

Reply
- Jason Brownlee August 3, 2017 at 6:49 am #
  
  I think it does work with missing data, allowing you to resample.
  
  TimeGrouper does not have docs, but Grouper does and is a good start:
  https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Grouper.html
  
  Reply

Sambit August 17, 2017 at 5:09 am #

Hi,
Thanks for this great link on time series.

I am not able to acces the dataset stored at:- https://datamarket.com/data/set/22r5/perrin-freres-monthly-champagne-sales-millions-64-72#!ds=22r5&display=line

Regards,
Sambit

Jason Brownlee August 17, 2017 at 6:49 am #

I’m sorry to hear that, here is the full dataset:

"Month","Sales"
"1964-01",2815
"1964-02",2672
"1964-03",2755
"1964-04",2721
"1964-05",2946
"1964-06",3036
"1964-07",2282
"1964-08",2212
"1964-09",2922
"1964-10",4301
"1964-11",5764
"1964-12",7312
"1965-01",2541
"1965-02",2475
"1965-03",3031
"1965-04",3266
"1965-05",3776
"1965-06",3230
"1965-07",3028
"1965-08",1759
"1965-09",3595
"1965-10",4474
"1965-11",6838
"1965-12",8357
"1966-01",3113
"1966-02",3006
"1966-03",4047
"1966-04",3523
"1966-05",3937
"1966-06",3986
"1966-07",3260
"1966-08",1573
"1966-09",3528
"1966-10",5211
"1966-11",7614
"1966-12",9254
"1967-01",5375
"1967-02",3088
"1967-03",3718
"1967-04",4514
"1967-05",4520
"1967-06",4539
"1967-07",3663
"1967-08",1643
"1967-09",4739
"1967-10",5428
"1967-11",8314
"1967-12",10651
"1968-01",3633
"1968-02",4292
"1968-03",4154
"1968-04",4121
"1968-05",4647
"1968-06",4753
"1968-07",3965
"1968-08",1723
"1968-09",5048
"1968-10",6922
"1968-11",9858
"1968-12",11331
"1969-01",4016
"1969-02",3957
"1969-03",4510
"1969-04",4276
"1969-05",4968
"1969-06",4677
"1969-07",3523
"1969-08",1821
"1969-09",5222
"1969-10",6872
"1969-11",10803
"1969-12",13916
"1970-01",2639
"1970-02",2899
"1970-03",3370
"1970-04",3740
"1970-05",2927
"1970-06",3986
"1970-07",4217
"1970-08",1738
"1970-09",5221
"1970-10",6424
"1970-11",9842
"1970-12",13076
"1971-01",3934
"1971-02",3162
"1971-03",4286
"1971-04",4676
"1971-05",5010
"1971-06",4874
"1971-07",4633
"1971-08",1659
"1971-09",5951
"1971-10",6981
"1971-11",9851
"1971-12",12670
"1972-01",4348
"1972-02",3564
"1972-03",4577
"1972-04",4788
"1972-05",4618
"1972-06",5312
"1972-07",4298
"1972-08",1413
"1972-09",5877

100

101

102

103

104

105

106

"Month","Sales"

"1964-01",2815

"1964-02",2672

"1964-03",2755

"1964-04",2721

"1964-05",2946

"1964-06",3036

"1964-07",2282

"1964-08",2212

"1964-09",2922

"1964-10",4301

"1964-11",5764

"1964-12",7312

"1965-01",2541

"1965-02",2475

"1965-03",3031

"1965-04",3266

"1965-05",3776

"1965-06",3230

"1965-07",3028

"1965-08",1759

"1965-09",3595

"1965-10",4474

"1965-11",6838

"1965-12",8357

"1966-01",3113

"1966-02",3006

"1966-03",4047

"1966-04",3523

"1966-05",3937

"1966-06",3986

"1966-07",3260

"1966-08",1573

"1966-09",3528

"1966-10",5211

"1966-11",7614

"1966-12",9254

"1967-01",5375

"1967-02",3088

"1967-03",3718

"1967-04",4514

"1967-05",4520

"1967-06",4539

"1967-07",3663

"1967-08",1643

"1967-09",4739

"1967-10",5428

"1967-11",8314

"1967-12",10651

"1968-01",3633

"1968-02",4292

"1968-03",4154

"1968-04",4121

"1968-05",4647

"1968-06",4753

"1968-07",3965

"1968-08",1723

"1968-09",5048

"1968-10",6922

"1968-11",9858

"1968-12",11331

"1969-01",4016

"1969-02",3957

"1969-03",4510

"1969-04",4276

"1969-05",4968

"1969-06",4677

"1969-07",3523

"1969-08",1821

"1969-09",5222

"1969-10",6872

"1969-11",10803

"1969-12",13916

"1970-01",2639

"1970-02",2899

"1970-03",3370

"1970-04",3740

"1970-05",2927

"1970-06",3986

"1970-07",4217

"1970-08",1738

"1970-09",5221

"1970-10",6424

"1970-11",9842

"1970-12",13076

"1971-01",3934

"1971-02",3162

"1971-03",4286

"1971-04",4676

"1971-05",5010

"1971-06",4874

"1971-07",4633

"1971-08",1659

"1971-09",5951

"1971-10",6981

"1971-11",9851

"1971-12",12670

"1972-01",4348

"1972-02",3564

"1972-03",4577

"1972-04",4788

"1972-05",4618

"1972-06",5312

"1972-07",4298

"1972-08",1413

"1972-09",5877

sarrauste August 24, 2017 at 1:15 am #

hi,
i am having the following error:

X = X.astype(‘float32’)
ValueError: could not convert string to float: ‘1972-09’

can you please help me
ty

Reply
- Jason Brownlee August 24, 2017 at 6:44 am #
  
  Looks like you may have missed some lines of code. Confirm you have them all.
  
  Reply
  - Nico October 29, 2018 at 5:02 am #
    
    Hi Jason, I am having the same error. I checked the lines and I have them all. How can python convert ‘yyyy-mm’ to float32?
    I dowload the csv from the link of datamarket you gave above.
    What am i missing?
    Thnanx
    
    Reply
    - Jason Brownlee October 29, 2018 at 6:04 am #
      
      I have some suggestions here:
      https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
      
      Reply
sandip August 28, 2017 at 8:09 pm #

Hi Jason , Nice approach for time series forecasting.Thanks for this Article.

Just had a one small doubt,Here we are tuning our ARIMA parameter on test data using RMSE as a measure but generally we tune our parameter on the training(train+validation)data and then we check whether those parameters are well generalized or not by using test data. is it right?
Because my concern is that if we choose our parameter on that particular test data,whether those parameter will generalized to coming new test data or not? I feel that there will be biasness while choosing parameter because we are specifically choosing parameter those giving less RMSE for that test data.Here we are not checking whether our model is working/fitted well for our train data or not?
If i am wrong just correct me.Let me know the logic behind this parameter tuning on the basis of test data?

Reply
- Jason Brownlee August 29, 2017 at 5:03 pm #
  
  Yes, that is a less biased way to prepare a model.
  
  More here:
  https://machinelearningmastery.com/difference-test-validation-datasets/
  
  Reply
sandip August 29, 2017 at 5:58 pm #

Thank you so much Jason.

Reply
- Jason Brownlee August 30, 2017 at 6:13 am #
  
  You’re welcome sandip.
  
  Reply
Anthy August 31, 2017 at 2:24 am #

Hi Jason,
Thank you for this detailed and clear tutorial.
I have a question concerning fiding ARIMA’s parameters. I didn’t understand why we have an imbricated “for” loop and we do consider all the p values and not all the q values.

Please ask if the question is not that clear.

Thank you in advance.

Reply
- Jason Brownlee August 31, 2017 at 6:23 am #
  
  Please review the example again, we grid search p, d and q values.
  
  Reply
Ella Zhao September 2, 2017 at 4:26 am #

Hi Jason,

Thanks for this awesome example! It helped me a lot! However if you don’t mind, I have a question on the prediction loop.

Why did you use history.append(obs) instead of history.append(yhat) when doing the loop for prediction? Sees like you add the observation (real data in test) to history. And yhat = bias + inverse_difference(history, yhat, months_in_year) is based on the history data. but actually we don’t have those observation data when solving practical problem.

I’ve tried to use history.append(yhat) in my model, but the result is worse than using history.append(obs).

Appreciate your comments

Thanks,

Ella

Reply
- Jason Brownlee September 2, 2017 at 6:16 am #
  
  Good question, because in this scenario, we are assuming that each new real observation is available after we make a prediction.
  
  Pretend we are running this loop one iteration per day, and we get yesterdays observation this morning so we can accurately predict tomorrow.
  
  Reply
  - Ella Zhao September 2, 2017 at 11:46 pm #
    
    Thank you so much for your explanation!!
    
    Reply
    - Jason Brownlee September 3, 2017 at 5:45 am #
      
      You’re welcome.
      
      Reply
Alaoui September 3, 2017 at 12:27 am #

Hello,

Thanks a lot for this work.
I have a point to discuss on concerning the walk forward validation.
I saw that you fill the history part by a value from the test set at each iteration :
…..
# observation
obs = test[i]
history.append(obs)
….

Do we really have to do this or we have to use the new yhat computed for the next predictions?
Indeed when we have to do a future prediction on some range date we don’t have the historical value….
Regards and thanks in advance for your feedback

Simo

Reply
- Jason Brownlee September 3, 2017 at 5:48 am #
  
  It depends on your model and project goals.
  
  In this case, we are assuming that the true observations are available after each modeling step and can be used in the formulation of the next prediction, a reasonable assumption, but not required in general.
  
  If you use outputs as inputs (e.g. recursive multi-step forecasting), error will compound and things will go more crazy, sooner (e.g. model skill will drop). See this post:
  https://machinelearningmastery.com/multi-step-time-series-forecasting/
  
  Reply
Alaoui September 3, 2017 at 12:49 am #

Sorry, I just read the Ella question…So for long range future prediction we have ti use the yhat…
Thanks

Reply
- Jason Brownlee September 3, 2017 at 5:48 am #
  
  You can if you choose.
  
  Reply
Sujeet September 16, 2017 at 12:22 pm #

Hi,

I am getting NAN after these steps:

from pandas import Series
series = Series.from_csv(‘champagne.csv’, header=0)
print(len(series))
split_point = len(series) – 12
print(split_point)
dataset, validation = series[0:split_point], series[split_point:]
print(dataset)
print(‘Dataset %d, Validation %d’ % (len(dataset), len(validation)))
dataset.to_csv(‘dataset.csv’)
validation.to_csv(‘validation.csv’)

Reply
- Jason Brownlee September 17, 2017 at 5:22 am #
  
  I’m sorry to hear that.
  
  Ensure that you have copied all of the code and that you have removed the footer from the data file. Also confirm that your environment is up to date.
  
  Reply
Jaryn September 20, 2017 at 3:24 am #

Hi Jason,

This is extremely useful. Thank you very much!

I was wondering if you could provide the full code to extend the forecast for 1 year ahead. I know you mentioned above that there is a forecast function but when I run:
yhat = model_fit.forecast()[0]
pyplot.plot(y)
pyplot.plot(yhat, color=’green’)
pyplot.plot(predictions, color=’red’)
pyplot.show()

I don’t get any green lines in my code. I’m sure I’m missing a lot of things here, but I don’t know what.

Thank you again, much appreciated!

Reply
- Jason Brownlee September 20, 2017 at 6:01 am #
  
  This post will help you with making longer-term forecasts:
  https://machinelearningmastery.com/make-sample-forecasts-arima-python/
  
  Reply
Kunal Shrivastava September 20, 2017 at 4:54 pm #

Hi Jason,
I am using this modelling steps to model my problem. When I am passing best parameter which i have got from Grid Search for doing prediction, I am getting Linear Algebra error and sometimes not converging
Due to this I am not able to predict the values.

My second concern is if this is happening for Grid search, how can i automate the script in production environment. Lets say i have to forecast some values for 50000 different sites then what is best way to achieve this goal ?

Reply
- Jason Brownlee September 21, 2017 at 5:36 am #
  
  Sorry to hear that, perhaps you have not set all of the parameters the same as you used in the grid search?
  
  I have some notes about models in production here that may help:
  https://machinelearningmastery.com/deploy-machine-learning-model-to-production/
  
  Reply
Tri Bien September 24, 2017 at 7:57 pm #

Hi Jason,

Thank for your post, it is really helpful for me. It is great!
I have a small question:
In the post, you use 2 data set: dataset.csv and validation.csv
+ In dataset.csv, you split it and use Walk-forward validation -> I totally agree.
+ In validation.csv, you still re-train ARIMA model with walk-forward validation (model_fit = model.fit(trend=’nc’, disp=0) – line 45 in section 7.3) -> I think it is unseen data for testing, so we don’t train model here and only test the model’s performance. Is it correct?

Reply
- Jason Brownlee September 25, 2017 at 5:38 am #
  
  In the case of walk forward validation, we are assuming that the real observations are available after the time of prediction.
  
  You can make different assumptions when evaluating your model.
  
  Reply
Navid September 30, 2017 at 1:50 am #

Hi, Jason
First of all, thank you for your wonderful tutorial.
Until now (part 5.4) I just have one doubt: why you plotted a histogram/kde without removing the trend first? If the goal is to see how the distribution shape is similar to Gaussian distribution, doesn’t a trend changes the distribution of data?

Thanks
Navid

Reply
- Jason Brownlee September 30, 2017 at 7:44 am #
  
  Great question, at that point we are early in the analysis. Generally, I’d recommend looking at density plots before and after making the series stationary.
  
  Trend removal generally won’t impact the form of the distribution if we are doing simple linear operations.
  
  Reply
Nick October 19, 2017 at 12:46 am #

Hi Jason,
Thanks for your awesome post. Could you explain how to set the ‘interval’ in the function difference if I only have 1-year data? My dataset is recorded in half an hour, from June 2016 to June 2017. These data number are large in summer and small in winter, i.e. in winter it is between 0-2000, but in summer it is between 5000-14000.

Reply
Shashank Hegde October 22, 2017 at 3:46 pm #

Hi Jason,
Thank you for the post.I want to implement ‘ARIMA’ function instead of using built-in function.
Do you know where I can find the algorithm to implement ‘ARIMA’ function as well understanding that in detail manner?

Reply
- Jason Brownlee October 23, 2017 at 5:41 am #
  
  A good textbook on the math is this one: http://amzn.to/2zvfIcB
  
  Reply
Hara October 28, 2017 at 3:38 am #

Hi Jason,

Thanks for the brief tutorial on Time Series forecasting. I am receiving the error “Given a pandas object and the index does not contain dates” when running the ARIMA model code snippet.

Reply
- Jason Brownlee October 28, 2017 at 5:16 am #
  
  Be sure you copied all of the sample code exactly including indenting, and also be sure you have prepared the data, including removing the footer information from the file.
  
  Reply
Makis December 18, 2017 at 8:27 pm #

Dear Jason,

Thanks for your amazing tut!

I have one problem only when i run the rolling forecasts.

I use a different dataset which has temperature values for every minute of the day. I train the model with the first four days and i use the last day as a test dataset. My problem might be because i add the test observation to the history and my arima returns the following error: raise ValueError(“The computed initial AR coefficients are not ”
ValueError: The computed initial AR coefficients are not stationary
You should induce stationarity, choose a different model order, or you can
pass your own start_params.

for i in range(0, len(x_test)):

# difference data
diff = difference(history, minutes)
# predict
model = importer.ARIMA(diff, order=(5, 0, 1))
model_fit = model.fit(trend=’nc’, disp=0)
yhat = model_fit.forecast()[0]
yhat = inverse_difference(history, yhat, minutes)
predictions.append(yhat)
# observation
obs = x_test[i]
history.append(obs)
print(‘Predicted=%.6f, Expected=%.6f’ % (yhat, obs))

My x_test dataset contains 1440 rows and i get the error on the 1423 iteration of the loop. Until iteration 1423 the each arima model does not have issues.

Your help is precious to me.

Thanks again!

Kindest Regards,
Makis

Reply
- Jason Brownlee December 19, 2017 at 5:17 am #
  
  Have you confirmed that your data set is stationary? Perhaps there is still some trend or seasonal structure?
  
  I have many posts on detrending and seasonal adjustment that may help.
  
  Reply
  - Makis December 20, 2017 at 12:54 am #
    
    Dear Jason,
    
    Thanks for your response!
    
    I have used a different hyperparameter (Arima 5,1,1) instead, and everything worked. I don’t know if this is the right thing, since 5,0,1 and 5,1,1 had exactly the same RMSE. What is your opinion about that? And please, one final question, my results now are extremely good, with an RMSE less than 0,01. The prediction line in the plot is almost over the the real data line. Does this has to do with the history.append(obs)? I’m not sure i understand correctly why the test observation is added to the history. And what is the difference of doing a prediction in a for loop with a new model for every step compared by using the steps parameter in 1 model?
    
    Sorry for the long questions!
    
    Your tutorials are even better than the books i’m currently reading!
    
    Cheers from Greece.
    
    Reply
    - Jason Brownlee December 20, 2017 at 5:47 am #
      
      Well done. I’d recommend using the model that is skillful and stable.
      
      Reply
      - Makis December 20, 2017 at 8:22 am #
        
        Hey Jason! What i’m trying to say is, maybe the good predictions are because i append the y observation to the history. Why we do this? what is the purpose? If we do not append the y observation then the results are going to be still valid?
      - Jason Brownlee December 20, 2017 at 3:49 pm #
        
        In the test setup we are assuming that real observations are made after each prediction that we can in turn use to make the next prediction.
        
        Your specific framing of the problem may differ and I would encourage you to design a test harness to capture that.
      - Makis December 20, 2017 at 10:42 pm #
        
        Hey Jason, for once more thanks for your feedback!
        I will follow your suggestion right now 🙂
        
        Thanks for everything
        
        Kind Regards,
        Makis
      - Makis December 20, 2017 at 10:57 pm #
        
        Hey Jason, one last question.
        
        Since i do the rolling forecast manner and i introduce a new model for every iteration, why do i need to save a model and do a the first prediction based on that?
        
        Can i simply do the rolling forecast loop to start from 0 instead of 1?
      - Jason Brownlee December 21, 2017 at 5:26 am #
        
        Sure.
Amir December 19, 2017 at 6:51 pm #

Hi Jason,

Thank you so much for this amazing tutorial. It really helps me to learn time series forecasting and prepare for my new job. I’m a total beginner in data analysis/science as I’m currently making transition from engineering to data analysis.

But the problem that I’m facing with this tutorial is as I’m running the code, I’m stuck with the final step, “Validate the Model”. I’ve tried to re-copy and re-run the sample code many times but it didn’t seem to work. Here is the error.

>Predicted=10101.763, Expected=9851
>Predicted=13219.067, Expected=12670
>Predicted=3996.535, Expected=4348
>Predicted=3465.934, Expected=3564
>Predicted=4522.683, Expected=4577
>Predicted=4901.336, Expected=4788
>Predicted=5190.094, Expected=4618
>Predicted=4930.190, Expected=5312
>Predicted=4944.785, Expected=4298
>Predicted=1699.409, Expected=1413
>Predicted=6085.324, Expected=5877
>Predicted=7135.720, Expected=nan

ValueError Traceback (most recent call last)
in ()
56
57 # Report performance
—> 58 mse = mean_squared_error(y, predictions)
59 rmse = sqrt(mse)
60 print(‘RMSE: %.3f’ % rmse)

/Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/metrics/regression.py in mean_squared_error(y_true, y_pred, sample_weight, multioutput)
229 “””
230 y_type, y_true, y_pred, multioutput = _check_reg_targets(
–> 231 y_true, y_pred, multioutput)
232 output_errors = np.average((y_true – y_pred) ** 2, axis=0,
233 weights=sample_weight)

/Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
73 “””
74 check_consistent_length(y_true, y_pred)
—> 75 y_true = check_array(y_true, ensure_2d=False)
76 y_pred = check_array(y_pred, ensure_2d=False)
77

/Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
405 % (array.ndim, estimator_name))
406 if force_all_finite:
–> 407 _assert_all_finite(array)
408
409 shape_repr = _shape_repr(array.shape)

/Users/amir/Library/Enthought/Canopy/edm/envs/User/lib/python3.5/site-packages/sklearn/utils/validation.py in _assert_all_finite(X)
56 and not np.isfinite(X).all()):
57 raise ValueError(“Input contains NaN, infinity”
—> 58 ” or a value too large for %r.” % X.dtype)
59
60

ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).

Could you please help me with this?

Thank you so much

Amir

Reply
- Jason Brownlee December 20, 2017 at 5:41 am #
  
  I’m not sure of the cause of your fault, sorry.
  
  Ensure your libraries are up to date and that you have copied all of the code exactly?
  
  Reply
- Ankit Tripathi April 30, 2018 at 7:24 pm #
  
  The csv file contains some jargon text which should be deleted before reading the file as list.
  
  Reply
Satyajit Pattnaik December 28, 2017 at 5:03 pm #

Hi Jason,

In the prediction area, where you have added the observation to history and then running the loop to find the ARIMA results.
i.e.

model = ARIMA(history, order=(4,1,2))
model_fit = model.fit(disp=0)
output = model_fit.forecast()
yhat = output[0]
predictions.append(yhat)
obs = test[t]
history.append(obs)

Here, what if we append yhat to history, when i am appending yhat to history, my results are really bad, pls help..

Because as per my model, we need to predict the test data by using only the training data, so we cannot use obs to be appended in history, hope you get my point.

Reply
- Jason Brownlee December 29, 2017 at 5:18 am #
  
  Yes, that would be called a recursive multi-step forecast. It is challenging. You can learn a little more about it here:
  https://machinelearningmastery.com/multi-step-time-series-forecasting/
  
  Reply
  - Satyajit Pattnaik December 29, 2017 at 4:06 pm #
    
    Concept wise i understand what is Recursive Multi step forecasting, that’s the same i have used in my earlier reply code, i have appended obs in history, so that means everytime my loop runs, it takes the existing training data + next observation, so that should work and should predict correctly, but my results are really bad..
    
    Or do you mean to say, for this we need to plot the ACF, PACF graph in a a loop to determine the pdq values, and then run the ARIMA function on the pdq values, if that so, pls help us in finding a way to determine pdq values in loop
    
    Reply
    - Jason Brownlee December 30, 2017 at 5:18 am #
      
      Generally using forecasts in a recursive model is challenging, you may need to get creative with your model/models.
      
      Reply
Udi January 8, 2018 at 8:57 pm #

Hello Jason,

Thank you for your clear and straightforward post!
I have the same question as Nick above – about the choice of differentiation interval.
In your example you set it to 12 according to the expected cycle of the data.
However, in a more problematic case the data does not seem to imply a clear cycle (ACF and PACF graphs notwithstanding).
In my case, I’ve found out that setting the interval to 12 yielded better results than my default, which was 1. I can understand why choosing a small interval would be generally bad – random noise is too dominant. I have more difficulty understanding how to calibrate the ideal interval value for my data (except brute force, that is. Maybe I shouldn’t calibrate? That could induce overfit. Anyway, I still should find a generally decent value).

Reply
- Jason Brownlee January 9, 2018 at 5:29 am #
  
  If the amount of data is relatively small, consider using a grid search across difference values to see what works.
  
  Reply
Udi January 8, 2018 at 9:09 pm #

And another question. I applied a grid search in order to choose the hyperparameters of my ARIMA model. I fear that bias correction could be counter productive in this case.

Do you have any insights on this subject? Would you perform both (grid search followed by bias correction)? Is the answer data dependent?

Reply
- Jason Brownlee January 9, 2018 at 5:29 am #
  
  I would only perform bias correction if it lifted model skill (e.g. if the model was biased).
  
  Reply
Nirmal Kanti Sahoo January 29, 2018 at 11:33 am #

Hi Jason,

I have around more than 1000 of products. I have sales yearly history for last 7 years for each 1000 product. I want to develop a model by which I can predict sales amount for next 3 years.
Could you please help me how do I proceed with evaluate, prediction, validation and interpret the same.

Thanking in advance.

Reply
- Jason Brownlee January 30, 2018 at 9:45 am #
  
  Perhaps start here:
  https://machinelearningmastery.com/start-here/#timeseries
  
  Reply
Udi February 18, 2018 at 9:32 pm #

(Noob’s parallelization question, probably)

Does anybody know whether statmodels ARIMA uses multi-threading or any other kind of parallelization?

I’m trying to run an analysis based on multiple ARIMA fits on my laptop and thread parallelization increases total runtime rather than decrease it. It seems a single ARIMA fit (part of a single thread) uses several processors at once….

Thanks,
Udi

Reply
- Jason Brownlee February 19, 2018 at 9:05 am #
  
  I think it is single threaded.
  
  Reply
Raul February 28, 2018 at 3:49 am #

Hi Jason,

This is an excellent article, and it is super helpful. I had two questions for extension that may have been asked but after reading through the comments, I’m not sure if the same advice applies to me. My question is most similar to Nirmals.

1. I have a dataset with multiple “wines”, each with their own historical sales data

2. This dataset has other variables that aren’t just time related.

For example,

Month 1 Sales Month 2 Sales Month 3 Sales online Reviews,
Wine A
Wine B
Wine C
Wine D

I’m wondering if there’s an extension of this Time Series model that can take into account other variables and other instances of historical data. Let me know if I should clarify.

Thank you for taking the time to clarify.

Reply
- Jason Brownlee February 28, 2018 at 6:10 am #
  
  Yes, perhaps you could model each series separately, or groups of series or all together, or all 3 approaches and ensemble the results.
  
  Also, I’d recommend moving on to other models, such as ML or even MLP models.
  
  I hope to cover this topic in more detail soon.
  
  Reply
  - Raul February 28, 2018 at 7:01 am #
    
    Thanks for the quick reply.
    
    I’ll try out your first method.
    But do you have any good ML/MLP models/tutorials to start with? It’s okay if you don’t! I noticed you have a nice article on multivariate time series here:
    https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
    
    I haven’t read through it yet, but I think it only takes into account historical data of one instance of multiple variables.
    
    I think it would be interesting to find out how to ensemble the results of my
    #1. one instance of historical data (Wine A’s last 3 month sales)
    #2. one instance of non-historical variables (Wine A’s online reviews, type, etc.)
    
    To me, # 1’s output is a “variable” to be used in the ML model for #2. If that makes sense, do you think that’s the right way of going about things?
    
    Thanks for being so responsive!
    
    Reply
    - Jason Brownlee March 1, 2018 at 6:01 am #
      
      I hope to cover this topic in more detail soon – maybe write a book on the topic working one dataset from many different directions.
      
      Reply
Anchal April 6, 2018 at 4:20 am #

Thank you Jason! for really making ML practitioners like me being awesome in ML.

In this blog you have mentioned that the results suggest that what little autocorrelation is present in the time series has been captured by the model.

What will be the next steps if there is high autocorrelation?

Thanks in advance.

Reply
- Jason Brownlee April 6, 2018 at 6:35 am #
  
  Great question!
  
  Some ideas:
  – Perhaps try alternate models.
  – Perhaps improve the data preparation.
  – Perhaps try downstream models to further correct the forecast.
  
  Reply
Ankit Tripathi May 10, 2018 at 5:55 pm #

Hey Jason,

That was a very well informed article! I am trying to forecast on a weekly data. Any tips for improving the model since weekly data is hard to forecast. Also should “months_in_year = 12” used in differencing be changed to “weeks in month=4” to accommodate for weekly frequency?

Thanks

Reply
- Jason Brownlee May 11, 2018 at 6:34 am #
  
  My best tips are to try lots of data preparation techniques and tune the ARIMA using a grid search.
  
  Reply
Maria Dossin May 26, 2018 at 9:47 am #

Hi Jason,
Thank you for an amazing tutorial!
Just one quick question: can you please provide solution for the first scenario (not rolling forward forecast):
– Load the model and use it to forecast the next 12 months. The forecast beyond the first one or two months will quickly start to degrade in skill.
Thank you~

Reply
- Jason Brownlee May 27, 2018 at 6:43 am #
  
  Yes, this will help:
  https://machinelearningmastery.com/make-sample-forecasts-arima-python/
  
  Reply
Piyasi Choudhury June 2, 2018 at 11:38 am #

Hi Jason

My time series data is non stationary as well and I tried upto 3rd degree of differentiation after which I cant proceed any more as the data is exhausted. It has 3 calculated points and on doing Fuller test for checking stationarity, it errors out “maxlag should be < nobs". What can I do here?

Reply
- Jason Brownlee June 3, 2018 at 6:20 am #
  
  Sounds like you might be running out of data in your series. Perhaps you need more observations?
  
  Reply
KACEM June 3, 2018 at 11:54 am #

Hello, thank you.

i wonder if we can say that python give good forecasts than other forecast logiciels like Aperia.

To forecast sales we have to base ourselves in historic and baseline but in fact, in industry there are many factors that complexify the calculation of our forecast like the marketing actions, exceptional customer order, … How can we consider these perturbation in our forecasts to ensure good value of accuracy.

Thank you for answering.

Reply
- Jason Brownlee June 4, 2018 at 6:21 am #
  
  What is Aperia?
  
  Additional elements could be used as exogenous variables in the model.
  
  Reply
cathy June 23, 2018 at 10:00 am #

Jason,

The dataset I have requires to double difference in order to make it stationary, so i used:

diff1 = difference(history, 12)
diff2 = difference(diff1, 12),

it worked and made it stationary with ADF test, however, how do I reverse it back please?

Thank you

Reply
- Jason Brownlee June 24, 2018 at 7:27 am #
  
  Add the values back. It requires that you keep the original and the first diffed data.
  
  Reply
  - Cathy June 24, 2018 at 10:28 am #
    
    Thanks, so would the function look like this?
    
    yhat = inverse_difference(diff1, yhat, interval)
    yhat = inverse_difference(history, yhat, interval) ?
    
    Thank you!
    
    Reply
    - Jason Brownlee June 25, 2018 at 6:16 am #
      
      Correct.
      
      Reply
Rui July 27, 2018 at 4:11 am #

First of all, this is a great article, well written and detailed. I have a question regarding on the use of the test set on all of the analysis. What about putting this model in production? You wouldn’t have the ‘test[i]’ (or ‘y[i]’) for each iteration to add to the list ‘history’ to have a truly generalized prediction.
My point is that that instead of adding the values from the test set (‘y[i]’, ‘test[i]’) you’d rather add the predictions being made to the training set in order to do a true random walk.
Thanks again for the resource and all the help putting out this content.

Reply
- Jason Brownlee July 27, 2018 at 5:57 am #
  
  A final model will be created by training it on all available data and then using it to make predictions in the future.
  
  You can learn more about final models here:
  https://machinelearningmastery.com/train-final-machine-learning-model/
  
  You can learn more about making out of sample predictions here:
  https://machinelearningmastery.com/make-sample-forecasts-arima-python/
  
  Reply
James Lucas July 27, 2018 at 2:21 pm #

I keep getting errors
—————————————————————————
TypeError Traceback (most recent call last)
in ()
9 obs = test[i]
10 history.append(obs)
—> 11 print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))

TypeError: must be real number, not ellipsis
IE: # predict
yhat = …
those 3 dots (ellipsis) are throwing errors.

The code has … throughout. Any idea how to fix this error?

Reply
- Jason Brownlee July 28, 2018 at 6:28 am #
  
  The code with “…” is just example code, it is not for execution. Skip it.
  
  Perhaps a more careful read of the tutorial is in order James?
  
  Reply
Ben Kesby August 3, 2018 at 2:53 am #

I’m having an issue with the PACF plot for the residuals in part 6.3 Plotting The Residuals. I have followed your code exactly with the ACF plot resulting in a perfect match of the one displayed here tower the PACF is plotting values nearing 8 at around 36 – 38 lags. Any idea what might be causing this?

Reply
JP August 24, 2018 at 8:17 pm #

Hi Jason,
Thanks so much for this tutorial. One thing though. As series.from_csv is depricated, the date format gets lost when opening the dataset, for exemple when trying to generate the seasonal line plots. Are you aware of a workaround that would keep the date format while using read_csv instead of from_csv?
Thanks!

Reply
- Jason Brownlee August 25, 2018 at 5:48 am #
  
  Yes you can use pandas.read_csv() with the same arguments.
  
  Reply
Alexandra September 12, 2018 at 8:40 pm #

How can I improve the readability of Seasonal Per Year Line Plots (especially when it comes to axes) ? I would be grateful for your help.

Reply
- Jason Brownlee September 13, 2018 at 8:01 am #
  
  Perhaps create one plot per figure?
  
  Reply
  - Alexandra September 13, 2018 at 4:44 pm #
    
    I’d rather have a comparison between all the subplots.
    
    Reply
    - Jason Brownlee September 14, 2018 at 6:33 am #
      
      Perhaps plot them all on the same figure?
      Perhaps calculate the different and plot that?
      
      Reply
Chintan September 18, 2018 at 2:16 am #

Hi Jason,
I was able to predict and chart looks nice. Just wondering how can we predict for the future, I mean to say if I wanted to see for next month prediction how are we able to do that?

Thanks,
Chintan

Reply
- Jason Brownlee September 18, 2018 at 6:20 am #
  
  I show how here:
  https://machinelearningmastery.com/make-sample-forecasts-arima-python/
  
  Reply
Pranay October 9, 2018 at 9:26 pm #

Hi Jason, I deal with daily set data. What is the issue when I try months_in_year = 364?? I mean it’s not throwing best_arima out when I set months_in_year = 364. May I know the reason?

Reply
- Jason Brownlee October 10, 2018 at 6:09 am #
  
  Why are you making that change? I don’t follow?
  
  Reply
Patrick October 30, 2018 at 2:07 am #

Dear Jason,

thank you so much for your tutorial. I would like to apply it to a process, forecasting some process variables. The time interval is not monthly based, as in your example, but is much shorter, like some days and the data collection is about every three seconds.

Now I would like to ask some questions.

Do you think that this model could work equally well ?

What should I put in place of ‘months_in_year’ ? For now I put 7, which is the number of different days in my dataset.

Is it normal that with 2000+ points the matrix analysis regarding the seeking of the optimal parameters is taking ages ? (e.g. one triplet after 30 min) If yes, how could I improve the code speed ?

Thank you again,
Best wishes,
Patrick.

Reply
- Jason Brownlee October 30, 2018 at 6:08 am #
  
  Try it and see.
  
  2K points may be too many for the method, perhaps try reducing to a few hundred at max.
  
  Reply
Markus December 25, 2018 at 12:20 am #

Hi

Where does the column ‘A’comes from by the code above which is:

groups = series[‘1964′:’1970’].groupby(TimeGrouper(‘A’))
y

And printing out groups contains only the first 6 months of each year, why?

Reply
- Jason Brownlee December 25, 2018 at 7:23 am #
  
  Right there:
  https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
  
  Reply
Markus December 25, 2018 at 3:10 am #

Hi

What would be wrong if instead of defining the difference function by yourself, we would pass over d with the value of 12:

model = ARIMA(diff, order=(0,12,1))

Wouldn’t that be the same?

Thanks!

Reply
- Jason Brownlee December 25, 2018 at 7:24 am #
  
  It should be.
  
  Reply
Anthony January 10, 2019 at 10:33 pm #

Hi Jeson in running your code I get this error:
raise ValueError(“The computed initial MA coefficients are not ”
ValueError: The computed initial MA coefficients are not invertible
You should induce invertibility, choose a different model order, or you can
pass your own start_params.
How can I solve it? Thank you

Reply
- Jason Brownlee January 11, 2019 at 7:47 am #
  
  Sorry to hear that.
  
  Perhaps confirm your statsmodels and other libraries are up to date?
  Perhaps confirm that you copied all of the code in order?
  Perhaps try an alternative model configuration?
  
  Reply
  - Anthony January 15, 2019 at 9:50 pm #
    
    Ok I changed ARIMA parameters and now it works thank you!
    
    Reply
    - Jason Brownlee January 16, 2019 at 5:47 am #
      
      Glad it hear it!
      
      Reply
aravind January 12, 2019 at 9:29 pm #

This is my data: like….

Name Month Qty Unit
Wire Rods Total 2007-JAN 93798 t
Wire Rods Total 2007-FEB 86621 t
Wire Rods Total 2007-MAR 93118 t

My code is :

import pandas as pd
from sklearn.metrics import mean_squared_error
from math import sqrt
# load data
path_to_file = “C:/Users\ARAVIND\Desktop\jupyter notebook\project\datasets.csv”
data = pd.read_csv(path_to_file, encoding=’utf-8′)
# prepare data
X = data.values
X = X.astype(‘float32’)
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
# predict
yhat = history[-1]
predictions.append(yhat)
# observation
obs = test[i]
history.append(obs)
print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))
# report performance
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print(‘RMSE: %.3f’ % rmse)

when i run this command it shows “Value error: could not convert string to float”… so could anyone tell how to convert string to float according to my dataset.. i want to convert the columns which is “Name, Month and Unit” to float.

Reply
- Jason Brownlee January 13, 2019 at 5:41 am #
  
  Perhaps remove the text and date data?
  
  Reply
  - aravind January 14, 2019 at 3:36 pm #
    
    ok sir.. thank you
    
    Reply
Mridul March 20, 2019 at 5:41 pm #

Hi Jason, Thanks for this amazing tutorial.

However, I get the below error when I am trying to run it on a time series:

ValueError: The computed initial MA coefficients are not invertible
You should induce invertibility, choose a different model order, or you can
pass your own start_params.

Reply
- Mridul March 20, 2019 at 5:45 pm #
  
  I just saw that this has already been answered and it worked when I followed that answer. Thanks!
  
  Reply
- Jason Brownlee March 21, 2019 at 7:59 am #
  
  Perhaps try a different configuration of the q/d/p variables for the model?
  
  Reply
Kaws March 20, 2019 at 7:42 pm #

This is really a helpful tutorial. Thank you Jason!!

And I have a small question. I got TypeError: a float is required error after i executed this code –

history = [x for x in train]
predictions = list()
for i in range(len(test)):
# predict
yhat = …
predictions.append(yhat)
# observation
obs = test[i]
history.append(obs)
print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))

Can you help me out?

Reply
- Jason Brownlee March 21, 2019 at 8:02 am #
  
  Perhaps confirm that you have loaded the data correctly as a float?
  
  Reply
Varun April 15, 2019 at 4:23 pm #

Hi Jason,

What should be the approach when we need to provide long term forecasts ~ 12 months with exogenous variables using a technique like ARIMAX? Should we forecast the covariates and then add it in the model?

Regards,
Varun

Reply
- Jason Brownlee April 16, 2019 at 6:45 am #
  
  Perhaps you can frame your model to predict +12 months given only the observations available?
  
  Reply
  - Varun April 26, 2019 at 7:22 pm #
    
    Hi Jason,
    
    Just to be clear, if I add regressors and train the model, I would require future values right? E.g. xreg argument in auto.arima etc. How can I forecast +12 months without utilizing regressors that I have used for training?
    
    Reply
    - Jason Brownlee April 27, 2019 at 6:28 am #
      
      You could train a new productive model that only requires t-12 data to make predictions.
      
      Reply
Park1 April 24, 2019 at 8:14 pm #

Hi!

It’s amazing, thank you!

Could you give all files of this project. Cannot build it on my own.

Reply
- Jason Brownlee April 25, 2019 at 8:10 am #
  
  You can copy them from the article, here’s how:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
  
  Reply
Prince Tiwari May 15, 2019 at 9:31 pm #

Hi!

It’s amazing, thank you!

Actually i want develop a model which determine how many calls we can expect to come into our call center on a daily basis ?

Reply
- Jason Brownlee May 16, 2019 at 6:31 am #
  
  Sounds great.
  
  Perhaps start here:
  https://machinelearningmastery.com/start-here/#timeseries
  
  Reply
Prince Tiwari May 22, 2019 at 10:12 pm #

Hi Jason Brownlee ,

Thank you so much for responded, i need to one more help

Could you please explain how to use Seasonality in ARIMA with some example

Thanks

Reply
- Jason Brownlee May 23, 2019 at 6:03 am #
  
  You can use the blog search box.
  
  Here is an example:
  https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/
  
  Here is another:
  https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/
  
  Reply
Adi June 27, 2019 at 2:37 am #

Hi Jason,

That’s a very helpful article. My time series is a daily Point of sale data (eg. how many pepsi bottles get bought everyday in a walmart). There are missing dates when pepsi was not sold at all. What I want to forecast is the bottles of pepsi sold on each day for the next 3/4 days. What might be the best approach/ algorithm?

Thanks.

Reply
- Jason Brownlee June 27, 2019 at 7:59 am #
  
  Probably a linear model like SARIMA or ETS.
  
  I have some suggestions here:
  https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
  
  Reply
Nagaraj July 9, 2019 at 4:21 pm #

Hi Jason,
I’m getting the error like this, what does this error mean?

TypeError Traceback (most recent call last)
in
test = …
predictions = …
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print(‘RMSE: %.3f’ % rmse)

TypeError: Expected sequence or array-like, got

Reply
- Jason Brownlee July 10, 2019 at 8:03 am #
  
  That is surprising, did you copy of all of the code exactly?
  
  Reply
Rohit Singh Adhikari August 2, 2019 at 3:01 am #

Hi Jason,
I need to build a forecast having addition predictor, what should i use and it’s week forecasting i need to do?

Reply
- Jason Brownlee August 2, 2019 at 6:54 am #
  
  Perhaps follow the process outlined in the above tutorial?
  
  Reply
Ray August 7, 2019 at 9:58 am #

Hi Jason, nice article.
I have a question in the last code sample for validation.

1) What is the purpose of making the first prediction?
# load model
#make first prediction

2) I remove the above two sections of the code and got the exact result

Does mean only order (0,0,1) and Bias(165.904728) matters and there is no need to save and load the model?

Reply
- Jason Brownlee August 7, 2019 at 2:20 pm #
  
  It is an example of how we might use bias correction, in general.
  
  Reply
Kuba August 16, 2019 at 9:53 pm #

Hello,
I got exactly same results of the Augmented Dickey-Fuller test, however my PACF plot looks much different, It has a lot of spikes from lag 50 to 80, one even reaching -120.
Any ideas or previous experience why it can look that strange?
Thank you for sharing your knowledge with us.

Reply
- Jason Brownlee August 17, 2019 at 5:44 am #
  
  They changed the plot recently.
  
  Try scaling the number of time steps way down in the plot.
  
  Reply
Jamie August 26, 2019 at 3:22 pm #

I have tried with earnest to work through your tutorial. Unfortunately, I am running into errors.
One being this error. This could be the whole bone to my issues. As for the life of me, I could not resolve it using your code alone. I had to tinker with it – see explanation.

series = Series.from_csv(‘dataset.csv’)
AttributeError: type object ‘Series’ has no attribute ‘from_csv’

Which I resloved by implementing

panda import pd
series = pd.read_csv(‘dataset.csv’)
series.iloc[0]

Once I got over that hurdle – I then ran into this hurdle.

dataset = dataset.astype(‘float32’)
ValueError: could not convert string to float: ‘1964-01’

More than like its something I am doing wrong.

Would it be at all possible to have access to the full code? My attempts in copying and pasting obviously are not helping me.

I think I have possibly left some intrinsic part of the code out. Or I have totally confused myself.

Reply
- Jason Brownlee August 27, 2019 at 6:27 am #
  
  No problem, changed to:
  
  from pandas import read_csv series = read_csv('champagne.csv', header=0, index_col=0) ...
  
  1
  2
  3
  
  from pandas import read_csv
  series = read_csv('champagne.csv', header=0, index_col=0)
  ...
  
  I have updated all examples in the tutorial.
  
  Reply
  - Jamie August 27, 2019 at 9:42 am #
    
    That simple – gees… I will try that. Thanks for taking the time to answer my problem. Is your book available on amazon?
    
    Reply
    - Jason Brownlee August 27, 2019 at 2:08 pm #
      
      My books are not on Amazon, only on my website, I explain why here (they take a massive cut):
      https://machinelearningmastery.com/faq/single-faq/why-arent-your-books-on-amazon
      
      Reply
    - Jamie August 27, 2019 at 3:59 pm #
      
      It worked like a charm. Thanks.
      
      Another issue that I have found is with the code for 5.3 Seasonal Line Plots.
      
      from pandas import TimeGrouper
      from pandas import DataFrame
      
      It seems that pandas don’t support TimeGrouper,DataFrame any more!!
      https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#deprecations
      Removed the previously deprecated TimeGrouper (GH16942)
      Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH17842)
      
      Reply
      - Jason Brownlee August 28, 2019 at 6:29 am #
        
        Well done!
        
        You can use:
        
        ... groups = series['1964':'1970'].groupby(Grouper(freq='A'))
        
        1
        2
        
        ...
        groups = series['1964':'1970'].groupby(Grouper(freq='A'))
        
        I updated the example, thanks!
Arjun December 5, 2019 at 4:42 pm #

Hi jason,
I was wondering if you have used fbprophet for sales prediction. We were fetching data directly from postgresql and we seem to be running into an error

Out of bounds nanosecond timestamp: 1-08-11 00:00:00

This seems to be something related with pandas version compatibility.
Could you please look into it and try to find the problem behind it?

Reply
- Jason Brownlee December 6, 2019 at 5:12 am #
  
  I have not, sorry.
  
  Reply
Mitra February 6, 2020 at 7:40 pm #

hey Jason,

I’m running the
series=read_csv(r’D:\industrial engineering\Thesis\monthly_champagne_sales.csv’,header=0,index_col=0)
code and what I’m getting is a data frame, not a Series what should I do?

Reply
- Jason Brownlee February 7, 2020 at 8:13 am #
  
  Try adding squeeze=True argument.
  
  Reply
Aviral Kumar March 17, 2020 at 1:56 am #

dear sir
I am running following code in a rolling window framework however, i am not able to see results that come from the analysis. It displays only one value. Can you please let me know what and where i need to fix so that i can get those results:

##Entropy
from entropy import *
import numpy as np

np.random.seed(1234567)
x = np.random.rand(3000)
n = len(x)
result = list()
block = 250
for a in range(1, n-block+1):
DATA = x[a:a+249]
results = perm_entropy(DATA, order=3, normalize=True)
result.append(results)
return Series(result)

Reply
- Jason Brownlee March 17, 2020 at 8:17 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
Ron April 24, 2020 at 12:24 pm #

Hey Jason,

I would like to ask you about TS with repeating dates. For example, I have a dataset with different type of deals and the deals have different time periods depending on their agreement. Which means, when a stationary graph is plotted there can be more than one observation (revenue in my case) on each date. For example, 2/11/2018 could have multiple revenue from different deals. Therefore, the graph sketch is very confusing in the sense that there are many/few repetition on the same date. My idea is to sum the revenues up by date but would that affect the accuracy of the model as the prediction should be by deals? Your thoughts and reply would be highly appreciated. Regards

Reply
- Jason Brownlee April 24, 2020 at 1:22 pm #
  
  Hmmm, it’s hard to say without diving into the data, but perhaps explore preparing your own framing of the problem.
  
  This will help as a first step:
  https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
  
  And this:
  https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
  
  Reply
Jithu May 3, 2020 at 11:26 pm #

i was working on forecasting and this function was used for future forcasting and ended up with NotImplemented Error

def forcasting_future_months(monthly, no_of_months):
monthly_perdict = monthly.reset_index()
mon = monthly_perdict[‘Year_Month’]

mon = mon + pd.DateOffset(months = no_of_months)

future_dates = mon[-no_of_months -1:]
monthly_perdict = monthly_perdict.set_index(‘Year_Month’)
future = pd.DataFrame(index=future_dates, columns= monthly_perdict.columns)
monthly_perdict = pd.concat([monthly_perdict, future])
monthly_perdict[‘forecast’] = results.predict(start = 222, end = 233, dynamic= True)
monthly_perdict[[‘No of accidents’, ‘forecast’]].iloc[-no_of_months – 12:].plot(figsize=(12, 8))
plt.show()
return monthly_perdict[-no_of_months:]

Reply
- Jithu May 3, 2020 at 11:28 pm #
  
  My kaaggle note error is seen here
  
  https://www.kaggle.com/jithu10/kerala-road-accidents-kerala-police-2001-2019
  
  Reply
- Jason Brownlee May 4, 2020 at 6:23 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
Peter Lam May 6, 2020 at 1:28 pm #

X = dataset.values.astype(‘float32’)应该是：
X = series.values.astype(‘float32’)

Reply
Jonathan May 17, 2020 at 10:34 pm #

Hi Jason,
Thank you so much for publishing this article. Most articles that I have found simply tell a user what to enter into each field for their specific example, without really getting in what a parameter means or how it is used. As a result, I haven’t been able to get my arms around what to enter for my situation. Your article does a nice job of explaining how you came up with your parameters. I do have a question, though. What should I do differently for a small dataset? I only have about 30 months of data for what I am trying to predict. For the purposes of going through this tutorial, I faked some data. In doing so, it is providing me with negative sales predictions. Is that enough data to provide an accurate forecast?

One additional note, in 7.3 you have the code:

series = read_csv(‘dataset.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
X = dataset.values.astype(‘float32’)

I believe that it should be:

series = read_csv(‘dataset.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
X = series.values.astype(‘float32’)

Thanks again for the great tutorial!

Reply
- Jason Brownlee May 18, 2020 at 6:18 am #
  
  Thanks!
  
  Fixed.
  
  Not much would change, other than carefully chose how you evaluate models.
  
  Reply
Manish Bajpai June 3, 2020 at 6:21 pm #

Hi Jason,

Absolutely great exercise. Loved it…
Just one quick question:
How do I generate a forecast for next 12 intervals and generate a plot with training, test and predicted together

Reply
- Jason Brownlee June 4, 2020 at 6:12 am #
  
  Thanks!
  
  Call forecast() and specify 12 time steps.
  
  Reply
carlos June 17, 2020 at 1:34 am #

HI EveryBody

After read your great post (as usual) I have a couple of questions.

First is about residuals. You have calculated manually and then use a dataframe to work with them later. But, I have read arima.fit results have a member called residuals. I’ve been working in an different serie using those residuals, before read your article. Then I did that manually as you did. BUT both are different, and not such slightly different.. Should both be similar? Because I think they should and as many times I’m looking for an bug in my code

Second is about differences. when you make a diff to the serie you lose some elements, depending on lags. Well you can preserve the original data, so it is possible to integrate.It is academically perfect. But what to do when you have just only the final model? I mean when you recover new data and want to make a new prediction, the model will bring you predictions over transformed data you need to reescale. If you have used standardization you can get values to detransform but How you can integrate ? there information lost…

Thank you.

Reply
- Jason Brownlee June 17, 2020 at 6:26 am #
  
  Perhaps check what ARIMA is storing exactly in the residuals property, e.g. inspect the code or API.
  
  New data must be prepared in an identical manner as training data, e.g. differencing, scaling, etc.
  
  Reply
Vidya August 5, 2020 at 8:40 pm #

Hi Jason .

Couldn’t understand why bias is added to the predicted value.
“yhat = bias + inverse_difference(history, yhat, months_in_year)” .
Also , is there a possibility of not getting the exact values as shown in this post , when I run the code ? Any reasons why ?

Thanks so much!

Reply
- Jason Brownlee August 6, 2020 at 6:12 am #
  
  Yes, we can expect different results given the stochastic nature of the algorithm and the differences in precision given different hardware.
  
  Reply
Vidya August 6, 2020 at 12:22 pm #

Thanks . Can u pls let me know on why is bias added to the predicted value ?

Reply
- Jason Brownlee August 6, 2020 at 1:53 pm #
  
  In an attempt to lift skill of the predictions.
  
  Reply
Vidya August 8, 2020 at 8:02 pm #

Thanks Jason.

Reply
- Jason Brownlee August 9, 2020 at 5:39 am #
  
  You’re welcome.
  
  Reply
Brisco September 10, 2020 at 1:27 am #

Hey Jason I like your example I just have a question about putting in my own data. So my data has a count of visits for everyday of the year for 2019 and so far in 2020. I am wondering if since this does everyday instead of months months_in_year = 12 do I change months_in_year = 12 to days_in_year = 365?

Reply
- Jason Brownlee September 10, 2020 at 6:32 am #
  
  Perhaps try it and see?
  
  Reply
Teddy October 6, 2020 at 6:30 am #

Hey Jason ,am impressed by your commitment, and would you mind to show me about features/attributes of sales dataset to forecast Future sales of a supper market using deep learning/LSTM algorithm

Reply
- Jason Brownlee October 6, 2020 at 7:01 am #
  
  Sure, start here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
Sanjay Shukla November 7, 2020 at 3:22 am #

Hi Jason – Can the same python code be used if the dataset has historical data for multiple wines (or products)?

Reply
- Jason Brownlee November 7, 2020 at 6:30 am #
  
  Sure. But one model per series for ARIMA models.
  
  Reply
Tarun January 12, 2021 at 4:56 am #

Hi Jason,

Thanks for your tutorials. Very helpful indeed. After differencing, I made the series stationary and now for the ARIMA, I wanna plot the ACF and the PACF. These are my codes –

1) Code for differencing to make the data stationary

from pandas import read_csv
from statsmodels.tsa.stattools import adfuller
from numpy import diff
X = diff(X)
result = adfuller(X)
print(‘ADF Statistic: %f’ % result[0])
print(‘p-value: %f’ % result[1])
for key, value in result[4].items():
print(‘\t%s: %.3f’ % (key, value))

2) Code to plot the ACF and the PACF

from pandas import read_csv
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot
series = read_csv(‘X.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
pyplot.figure()
pyplot.subplot(211)
plot_acf(series, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(series, ax=pyplot.gca())
pyplot.show()

After running the second code, I am getting an error which is –

FileNotFoundError Traceback (most recent call last)
in
3 from statsmodels.graphics.tsaplots import plot_pacf
4 from matplotlib import pyplot
—-> 5 series = read_csv(‘X.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
6 pyplot.figure()
7 pyplot.subplot(211)

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
674 )
675
–> 676 return _read(filepath_or_buffer, kwds)
677
678 parser_f.__name__ = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
446
447 # Create the parser.
–> 448 parser = TextFileReader(fp_or_buf, **kwds)
449
450 if chunksize or iterator:

~\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
878 self.options[“has_index_names”] = kwds[“has_index_names”]
879
–> 880 self._make_engine(self.engine)
881
882 def close(self):

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
1112 def _make_engine(self, engine=”c”):
1113 if engine == “c”:
-> 1114 self._engine = CParserWrapper(self.f, **self.options)
1115 else:
1116 if engine == “python”:

~\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
1889 kwds[“usecols”] = self.usecols
1890
-> 1891 self._reader = parsers.TextReader(src, **kwds)
1892 self.unnamed_cols = self._reader.unnamed_cols
1893

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File X.csv does not exist: ‘X.csv’

Please let me know where am I wrong.

Reply
- Jason Brownlee January 12, 2021 at 7:53 am #
  
  Looks like your file cannot be loaded.
  
  Perhaps check that your file is in the same directory as your code or in the path you expect.
  
  Reply
Tarun January 14, 2021 at 9:28 am #

Hi Jason,

Please help me out. I am following your tutorials. These are my codes.

#1 (it worked)

from pandas import read_csv
from matplotlib import pyplot
series = read_csv(‘D:/Management Books/BSE Index Daily Closing.csv’, header=0, parse_dates=True, index_col=0, squeeze=True)
diff = series.diff()
pyplot.plot(diff)
pyplot.show()

#2 (it worked but with an error)

from pandas import read_csv
from numpy import diff
result = adfuller(diff(X))
print(‘ADF Statistic: %f’ % result[0])
print(‘p-value: %f’ % result[1])
for key, value in result[4].items():
print(‘\t%s: %.3f’ % (key, value))
diff.to_csv(‘diff.csv’, header = False)

AttributeError Traceback (most recent call last)
in
6 for key, value in result[4].items():
7 print(‘\t%s: %.3f’ % (key, value))
—-> 8 diff.to_csv(‘diff.csv’, header = False)

AttributeError: ‘function’ object has no attribute ‘to_csv’

#3 (showing an error)

from pandas import read_csv
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot
X.to_csv(‘X.csv’, header=False)
series = read_csv(‘X.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
pyplot.figure()
pyplot.subplot(211)
plot_acf(series, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(series, ax=pyplot.gca())
pyplot.show()

AttributeError Traceback (most recent call last)
in
3 from statsmodels.graphics.tsaplots import plot_pacf
4 from matplotlib import pyplot
—-> 5 X.to_csv(‘X.csv’, header=False)
6 series = read_csv(‘X.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
7 pyplot.figure()

AttributeError: ‘numpy.ndarray’ object has no attribute ‘to_csv’

I have seen the solved examples and then trying it out.

Please help me.

Reply
- Jason Brownlee January 14, 2021 at 9:48 am #
  
  Sorry to hear that you’re having trouble.
  
  Perhaps try coping the complete example listed at the end of a given section?
  
  Also perhaps these tips will help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
  - Tarun January 14, 2021 at 4:35 pm #
    
    I didn’t replicate your codes. Moreover, the first few codes worked. I wonder why is this not working.
    
    Reply
Sonia January 23, 2021 at 3:26 am #

Hi, I’m new to python and trying to learn from your blog.
Would you mind to explain your codes for the following? I don’t understand what is obs. Is yhat the predicted result?
obs = test[i]
history.append(obs)
print(‘>Predicted=%.3f, Expected=%3.f’ % (yhat, obs))

Thank you so much for this very educational step by step blog!!

Reply
- Jason Brownlee January 23, 2021 at 7:09 am #
  
  yhat is a prediction made by the predictive model.
  
  In the code – we are storing the real observation in history – we are pretending the real observation just became available after we made a prediction so we add it to training data on the next iteration.
  
  Reply
Tarun February 10, 2021 at 10:00 pm #

Hello,

I have a query. Can I forecast if I have a dataset of 10 years only. If yes, in that case how will I divide my dataset into Training and Testing dataset ?

Please let me know.

Reply
- Jason Brownlee February 11, 2021 at 5:55 am #
  
  Yes.
  
  Divide the data any way you want – that gives you confidence the model has enough data to train and evaluate.
  
  Reply
Tarun February 11, 2021 at 4:24 am #

Hi,

I have a request. I am looking for a Python code for implementing Ljung Box test for plotting ACF and PACF. Please let me know.

Reply
- Jason Brownlee February 11, 2021 at 5:58 am #
  
  Not off hand, perhaps this will help to get you started:
  https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
  
  Reply
Neha March 2, 2021 at 4:59 am #

Hi, thanks for such an informative article!!!

I have experienced that walk forward validation with range of p,d,q parameters is taking hours and google colab. Any tweak possible?

Reply
- Jason Brownlee March 2, 2021 at 5:47 am #
  
  You’re welcome.
  
  These suggestions may help:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-speed-up-the-training-of-my-model
  
  Reply
Michel March 29, 2021 at 12:42 pm #

Hi Hope you can guide me, i’m new to this ML.

in 4. Persistence when i’m applying the following Code i get an error:

Code Applied :

# report performance
mse = mean_squared_error(test, predictions)
rmse = sqrt(mse)
print(‘RMSE: %.3f’ % rmse)

Error :
—————————————————————————
NameError Traceback (most recent call last)
in
1 # report performance
—-> 2 mse = mean_squared_error(test, predictions)
3 rmse = sqrt(mse)
4 print(‘RMSE: %.3f’ % rmse)

NameError: name ‘mean_squared_error’ is not defined

All previous steps worked correctly. can you please advise what seem to be the issue ?

Thanks in advacne.

Reply
- Jason Brownlee March 30, 2021 at 5:55 am #
  
  The error suggests that you have not imported the function.
  
  Perhaps you skipped some lines of code?
  
  Reply
Anonymous April 13, 2021 at 10:55 am #

Hi, I don’t know a better way to get in touch with you. I think I found someone plagiarizing your work here: https://towardsdatascience.com/time-series-analysis-modeling-validation-386378cd3369. I will delete this comment after you read it. Thank you.

Reply
- Jason Brownlee April 14, 2021 at 6:16 am #
  
  Thanks for letting me know, it happens all the time.
  
  Reply
JR May 19, 2021 at 1:08 am #

Thanks Jason, great post

Quick question, I am not getting the same results when running the manual ARIMA model. Your results of 956.942 vs. my 961. Everythign else matched up until that point but I was having issues with autocorrelation plots only showing 21 observations vs 81 as well. Any help is greatly appreciated.

# evaluate manually configured ARIMA model
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA
from math import sqrt

# create a differenced series
def difference(dataset, interval=1):
diff = list()
for i in range(interval, len(dataset)):
value = dataset[i] – dataset[i – interval]
diff.append(value)
return diff

# invert differenced value
def inverse_difference(history, yhat, interval=1):
return yhat + history[-interval]

# load data
series = read_csv(‘dataset.csv’, header=None, index_col=0, parse_dates=True, squeeze=True)
# prepare data
X = series.values
X = X.astype(‘float32’)
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]
# walk-forward validation
history = [x for x in train]
predictions = list()
for i in range(len(test)):
# difference data
months_in_year = 12
diff = difference(history, months_in_year)
# predict
model = ARIMA(diff, order=(1,1,1))
model_fit = model.fit()
yhat = model_fit.forecast()[0]
yhat = inverse_difference(history, yhat, months_in_year)
predictions.append(yhat)
# observation
obs = test[i]
history.append(obs)
print(‘>Predicted=%.3f, Expected=%.3f’ % (yhat, obs))
# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print(‘RMSE: %.3f’ % rmse)

———- Results
>Predicted=8076.987, Expected=8314.000
>Predicted=9747.154, Expected=10651.000
>Predicted=5994.362, Expected=3633.000
>Predicted=3820.287, Expected=4292.000
>Predicted=4041.968, Expected=4154.000
>Predicted=4990.405, Expected=4121.000
>Predicted=5129.641, Expected=4647.000
>Predicted=5031.196, Expected=4753.000
>Predicted=4133.285, Expected=3965.000
>Predicted=2095.321, Expected=1723.000
>Predicted=5216.271, Expected=5048.000
>Predicted=5866.317, Expected=6922.000
>Predicted=8591.060, Expected=9858.000
>Predicted=11028.649, Expected=11331.000
>Predicted=4090.352, Expected=4016.000
>Predicted=4767.109, Expected=3957.000
>Predicted=4656.326, Expected=4510.000
>Predicted=4577.708, Expected=4276.000
>Predicted=5108.656, Expected=4968.000
>Predicted=5202.831, Expected=4677.000
>Predicted=4423.982, Expected=3523.000
>Predicted=2162.388, Expected=1821.000
>Predicted=5463.233, Expected=5222.000
>Predicted=7331.345, Expected=6872.000
>Predicted=10258.650, Expected=10803.000
>Predicted=11732.476, Expected=13916.000
>Predicted=4552.498, Expected=2639.000
>Predicted=4578.764, Expected=2899.000
>Predicted=4914.578, Expected=3370.000
>Predicted=4545.624, Expected=3740.000
>Predicted=5229.775, Expected=2927.000
>Predicted=4287.729, Expected=3986.000
>Predicted=3153.080, Expected=4217.000
>Predicted=1827.778, Expected=1738.000
>Predicted=5134.493, Expected=5221.000
>Predicted=6806.278, Expected=6424.000
>Predicted=10643.889, Expected=9842.000
>Predicted=13606.245, Expected=13076.000
>Predicted=2265.432, Expected=3934.000
>Predicted=2936.318, Expected=3162.000
>Predicted=3341.109, Expected=4286.000
>Predicted=3881.790, Expected=4676.000
>Predicted=3156.601, Expected=5010.000
>Predicted=4693.892, Expected=4874.000
>Predicted=4663.923, Expected=4633.000
>Predicted=2045.499, Expected=1659.000
>Predicted=5440.863, Expected=5951.000
RMSE: 961.548

Reply
- Jason Brownlee May 19, 2021 at 6:35 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
  
  Reply
Archie July 1, 2021 at 5:18 pm #

TypeError: only size-1 arrays can be converted to Python scalars

I am getting this error when running the persistence model for french champagne

Reply
- Jason Brownlee July 2, 2021 at 5:21 am #
  
  Sorry to hear that, perhaps these tips will help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
Archie July 1, 2021 at 5:55 pm #

no worries it was the way the data was imported

Reply
- Jason Brownlee July 2, 2021 at 5:21 am #
  
  Happy to hear that!
  
  Reply
Saheed February 7, 2022 at 7:07 pm #

Hi Jason, thanks a lot god this. I would like to know the reason for using seasonally differenced data for the ARIMA model. I have followed the previous example (robbery dataset) and the differences wasn’t used with ARIMA model.

What’s the reason behind this please??

Reply
- James Carmichael February 9, 2022 at 12:17 pm #
  
  Hi Saheed…Hopefully the following will add clarity:
  
  https://stats.stackexchange.com/questions/195038/assessing-seasonality-when-to-use-seasonal-arima-instead-of-non-seasonal-arima
  
  https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/
  
  Reply
Jishan March 24, 2022 at 5:55 am #

Nice tutorial. By the way, What if I have unequal time length in my data. Say I have sales data for 8 month from 2019, 6 months from 2020. and 10 months from 2021. How can I build model using unequal number of months? Thank you so much for your time!

Reply
- James Carmichael March 24, 2022 at 11:32 am #
  
  Hi Jishan…Some time series data is discontiguous.
  
  This means that the interval between the observations is not consistent, but may vary.
  
  You can learn more about contiguous vs discontiguous time series datasets in this post:
  
  Taxonomy of Time Series Forecasting Problems
  There are many ways to handle data in this form and you must discover the approach that works well or best for your specific dataset and chosen model.
  
  The most common approach is to frame the discontiguous time series as contiguous and the observations for the newly observation times as missing (e.g. a contiguous time series with missing values).
  
  Some ideas you may want to explore include:
  
  Ignore the discontiguous nature of the problem and model the data as-is.
  Resample the data (e.g. upsample) to have a consistent interval between observations.
  Impute the observations to form a consistent interval.
  Pad the observations for form a consistent interval and use a Masking layer to ignore the padded values.
  
  Reply
Jishan March 25, 2022 at 1:56 am #

Hi James, Thanks for your reply. Problem here that store opening depends on the weather condition of a year. It doesn’t operate same time of each year. It varies! It was open only r 8 month from 2019, 6 months from 2020. and 10 months from 2021. How can we treat this as missing value problem? Any workout tutorial would be great! I appreciate your time!Thanks!

Reply
jinYang April 27, 2022 at 12:42 pm #

Hello, I would like to ask a qion, because my data set is to record the plant harvest data, and the month is relatively fixed, so each year only two months of data, for example, the 2021 has 48 days of data, there are 50 days in 2022, so based on that, should I fill in all the dates that don’t have data with zeros? Does this also mean a seasonal figure? If I make a prediction using SARIMAX, what should I do with my cycle?

Reply

Navigation

Time Series Forecast Study with Python: Monthly Sales of French Champagne

Overview

Stop learning Time Series Forecasting the slow way!

1. Environment

2. Problem Description

3. Test Harness

3.1 Validation Dataset

3.2. Model Evaluation

3.2.1 Performance Measure

3.2.2 Test Strategy

4. Persistence

5. Data Analysis

5.1 Summary Statistics

5.2 Line Plot

5.3 Seasonal Line Plots

5.4 Density Plot

5.5 Box and Whisker Plots

6. ARIMA Models

6.1 Manually Configured ARIMA

6.2 Grid Search ARIMA Hyperparameters

6.3 Review Residual Errors

7. Model Validation

7.1 Finalize Model

7.2 Make Prediction

7.3 Validate Model

Summary

Want to Develop Time Series Forecasts with Python?

Develop Your Own Forecasts in Minutes

Finally Bring Time Series Forecasting to
Your Own Projects

More On This Topic

222 Responses to Time Series Forecast Study with Python: Monthly Sales of French Champagne

Leave a Reply Click here to cancel reply.

Navigation

Overview

Stop learning Time Series Forecasting the slow way!

1. Environment

2. Problem Description

3. Test Harness

3.1 Validation Dataset

3.2. Model Evaluation

3.2.1 Performance Measure

3.2.2 Test Strategy

4. Persistence

5. Data Analysis

5.1 Summary Statistics

5.2 Line Plot

5.3 Seasonal Line Plots

5.4 Density Plot

5.5 Box and Whisker Plots

6. ARIMA Models

6.1 Manually Configured ARIMA

6.2 Grid Search ARIMA Hyperparameters

6.3 Review Residual Errors

7. Model Validation

7.1 Finalize Model

7.2 Make Prediction

7.3 Validate Model

Summary

Want to Develop Time Series Forecasts with Python?

Develop Your Own Forecasts in Minutes

Finally Bring Time Series Forecasting to Your Own Projects

More On This Topic

222 Responses to Time Series Forecast Study with Python: Monthly Sales of French Champagne

Leave a Reply Click here to cancel reply.

Finally Bring Time Series Forecasting to
Your Own Projects