Time Series Forecasting with Python 7-Day Mini-Course

By Jason Brownlee on December 10, 2020 in Time Series

From Developer to Time Series Forecaster in 7 Days.

Python is one of the fastest-growing platforms for applied machine learning.

In this mini-course, you will discover how to get started, build accurate models, and confidently complete time series forecasting projects using Python in 7 days.

This is a big and important post. You might want to bookmark it.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Updated Apr/2019: Updated the links to datasets.
Updated Aug/2019: Updated data loading to use new API.
Updated Apr/2020: Changed AR to AutoReg due to API change.
Updated Dec/2020: Updated ARIMA API to the latest version of statsmodels.

Time Series Forecasting with Python 7-Day Mini-Course
Photo by Raquel M, some rights reserved.

Who Is This Mini-Course For?

Before we get started, let’s make sure you are in the right place.

The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly; you might just need to brush up in one area or another to keep up.

You’re a Developer: This is a course for developers. You are a developer of some sort. You know how to read and write code. You know how to develop and debug a program.
You know Python: This is a course for Python people. You know the Python programming language, or you’re a skilled enough developer that you can pick it up as you go along.
You know some Machine Learning: This is a course for novice machine learning practitioners. You know some basic practical machine learning, or you can figure it out quickly.

This mini-course is neither a textbook on Python nor a textbook on time series forecasting.

It will take you from a developer who knows a little machine learning to a developer who can get time series forecasting results using the Python ecosystem, the rising platform for professional machine learning.

Note: This mini-course assumes you have a working Python 3 SciPy environment with at least NumPy, Pandas, scikit-learn and statsmodels installed.

Mini-Course Overview

This mini-course is broken down into 7 lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below are 7 lessons that will get you started and productive with time series forecasting in Python:

Lesson 01: Time Series as Supervised Learning.
Lesson 02: Load Time Series Data.
Lesson 03: Data Visualization.
Lesson 04: Persistence Forecast Model.
Lesson 05: Autoregressive Forecast Model.
Lesson 06: ARIMA Forecast Model.
Lesson 07: Hello World End-to-End Project.

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the Python platform for time series (hint, I have all of the answers directly on this blog, use the search feature).

I do provide more help in the early lessons because I want you to build up some confidence and inertia.

Post your results in the comments, I’ll cheer you on!

Hang in there, don’t give up.

Lesson 01: Time Series as Supervised Learning

Time series problems are different to traditional prediction problems.

The addition of time adds an order to observations that must be preserved and that can provide additional information for learning algorithms.

A time series dataset may look like the following:

Time, Observation
day1, obs1
day2, obs2
day3, obs3

We can reframe this data as a supervised learning problem with inputs and outputs to be predicted. For example:

Input, Output
?, obs1
obs1, obs2
obs2, obs3
obs3, ?

You can see that the reframing means we have to discard some rows with missing data.

Once it is reframed, we can then apply all of our favorite learning algorithms like k-Nearest Neighbors and Random Forest.
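
The reframing above can be sketched with the Pandas shift() function. This is a minimal illustration on made-up numbers, not the only way to do it; the column names "input" and "output" are placeholders chosen for this example.

```python
import pandas as pd

# Toy observations standing in for obs1, obs2, ... (illustrative values only).
series = pd.Series([10.0, 20.0, 30.0, 40.0], name="obs")

# Input is the previous observation (t); output is the current one (t+1).
frame = pd.DataFrame({"input": series.shift(1), "output": series})

# The first row has no previous observation, so its input is NaN; drop it.
supervised = frame.dropna().reset_index(drop=True)
print(supervised)
```

Each remaining row is an ordinary (input, output) training example, so the order of rows must still be respected when splitting into train and test sets.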

For more help, see the post:

Time Series Forecasting as Supervised Learning

Lesson 02: Load Time Series Data

Before you can develop forecast models, you must load and work with your time series data.

Pandas provides tools to load data in CSV format.

In this lesson, you will download a standard time series dataset, load it in Pandas and explore it.

Download the daily female births dataset in CSV format and save it with the filename “daily-births.csv“.

Download the dataset

You can load a time series dataset as a Pandas DataFrame indexed by date, specifying the header row at line zero, as follows:

from pandas import read_csv
# parse_dates=True converts the index column to date-times
series = read_csv('daily-births.csv', header=0, index_col=0, parse_dates=True)

Get used to exploring loaded time series data in Python:

Print the first few rows using the head() function.
Print the dimensions of the dataset using the size attribute.
Query the dataset using a date-time string.
Print summary statistics of the observations.
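
As a rough sketch of those four exploration steps, the example below reads a few made-up rows from a string instead of a file so that it runs on its own. The column names and values are assumptions for illustration, not the real dataset.

```python
from io import StringIO
import pandas as pd

# A few made-up rows in the same two-column layout as a daily CSV file.
# The column names and values are placeholders, not the real dataset.
csv_text = """Date,Births
1959-01-01,35
1959-01-02,32
1959-01-03,30
1959-02-01,31
1959-02-02,44
"""

# parse_dates=True turns the index into a DatetimeIndex, which is what
# makes querying by a date-time string possible.
data = pd.read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)

print(data.head(3))             # first few rows
print(data.size)                # total number of values
january = data.loc["1959-01"]   # all rows in January 1959
print(january)
print(data.describe())          # summary statistics
```

With a real file, only the first argument to read_csv() would change.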

For more help, see the post:

How to Load and Explore Time Series Data in Python

Lesson 03: Data Visualization

Data visualization is a big part of time series forecasting.

Line plots of observations over time are popular, but there is a suite of other plots that you can use to learn more about your problem.

In this lesson, you must download a standard time series dataset and create 6 different types of plots.

Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv“.

Download the dataset

Now create the following 6 types of plots:

Line Plots.
Histograms and Density Plots.
Box and Whisker Plots by year or quarter.
Heat Maps.
Lag Plots or Scatter Plots.
Autocorrelation Plots.

Below is an example of a simple line plot to get you started:

from pandas import read_csv
from matplotlib import pyplot
series = read_csv('shampoo-sales.csv', header=0, index_col=0)
series.plot()
pyplot.show()
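
Box and whisker plots by year need the observations grouped by year first, and grouping by time only works when the index holds real date-times. The sketch below uses a made-up monthly series with a generated date index; the values and date range are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Three years of made-up monthly values with a proper DatetimeIndex.
index = pd.date_range("2001-01-01", periods=36, freq="MS")
series = pd.Series(np.arange(36, dtype=float), index=index)

# One column per year, one row per month, ready for DataFrame.boxplot().
years = pd.DataFrame(
    {year: group.to_numpy() for year, group in series.groupby(series.index.year)}
)
print(years.shape)
```

Calling years.boxplot() followed by pyplot.show() would then draw one box per year. Grouping on series.index.quarter instead gives the by-quarter variant, although the groups would then need to be plotted as separate lists if they differ in length.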

For more help, see the post:

Time Series Data Visualization with Python

Lesson 04: Persistence Forecast Model

It is important to establish a baseline forecast.

The simplest forecast you can make is to use the current observation (t) to predict the observation at the next time step (t+1).

This is called the naive forecast or the persistence forecast and may be the best possible model on some time series forecast problems.

In this lesson, you will make a persistence forecast for a standard time series forecast problem.

Download the daily female births dataset in CSV format and save it with the filename “daily-births.csv“.

Download the dataset

You can implement the persistence forecast as a single-line function, as follows:

# persistence model
def model_persistence(x):
	return x

Write code to load the dataset and use the persistence forecast to make a prediction for each time step in the dataset. Note that you will not be able to make a forecast for the first time step in the dataset as there is no previous observation to use.

Store all of the predictions in a list. You can calculate a Root Mean Squared Error (RMSE) for the predictions compared to the actual observations as follows:

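from sklearn.metrics import mean_squared_error
from math import sqrt
# fill this list with one forecast for every time step after the first
predictions = []
# the first observation has no forecast, so skip it
actual = series.values[1:]
rmse = sqrt(mean_squared_error(actual, predictions))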
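
To show how the pieces fit together, here is a small self-contained sketch on made-up numbers. It computes RMSE directly with NumPy rather than scikit-learn, purely to keep the example free of extra dependencies; the values are placeholders, not the births data.

```python
from math import sqrt
import numpy as np

def model_persistence(x):
    # The forecast for t+1 is simply the observation at t.
    return x

# Made-up observations used only to exercise the code.
values = np.array([35.0, 32.0, 30.0, 31.0, 44.0, 29.0])

# Walk forward: each observation except the last predicts its successor.
predictions = [model_persistence(x) for x in values[:-1]]
actual = values[1:]

# RMSE computed directly with NumPy; it is in the same units as the data.
rmse = sqrt(np.mean((actual - np.array(predictions)) ** 2))
print('RMSE: %.3f' % rmse)
```

Because RMSE is in the units of the observations, it can be read as a typical size of the forecast error, which makes it a convenient baseline number for later models to beat.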

For more help, see the post:

How to Make Baseline Predictions for Time Series Forecasting with Python

Lesson 05: Autoregressive Forecast Model

Autoregression means developing a linear model that uses observations at previous time steps to predict observations at future time steps (“auto” means self in ancient Greek).

Autoregression is a quick and powerful time series forecasting method.
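
To make "a linear model of previous time steps" concrete, the sketch below fits a two-lag autoregression by ordinary least squares with NumPy. This is a hand-rolled illustration on simulated data, not how statsmodels implements its model; the coefficients 5.0, 0.6 and -0.2 are arbitrary choices for the simulation.

```python
import numpy as np

# Simulate a series from a known AR(2) process (coefficients are arbitrary).
rng = np.random.default_rng(1)
n = 500
y = np.zeros(n)
for t in range(2, n):
    y[t] = 5.0 + 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal(scale=1.0)

# Design matrix: a constant, the lag-1 value, and the lag-2 value.
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]

# Ordinary least squares gives the intercept and the two lag coefficients.
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
intercept, phi1, phi2 = coef

# One-step forecast for the next, unseen time step.
next_value = intercept + phi1 * y[-1] + phi2 * y[-2]
print(coef, next_value)
```

The estimated lag coefficients should land near the values used in the simulation, which is a quick sanity check that the lagged columns are lined up with the right targets.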

The statsmodels Python library provides the autoregression model in the AutoReg class.

In this lesson, you will develop an autoregressive forecast model for a standard time series dataset.

Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv“.

Download the dataset

You can fit an AR model as follows:

from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(dataset, lags=2)
model_fit = model.fit()

You can predict the next out-of-sample observation with a fitted AR model as follows:

prediction = model_fit.predict(start=len(dataset), end=len(dataset))

You may want to experiment by fitting the model on the first half of the dataset and predicting one or more observations in the second half of the series, then comparing the predictions to the actual observations.

For more help, see the post:

Autoregression Models for Time Series Forecasting With Python

Lesson 06: ARIMA Forecast Model

The ARIMA is a classical linear model for time series forecasting.

It combines three elements: the autoregressive model (AR); differencing to remove trends and seasonality, called integrated (I); and the moving average model (MA), an old name for a model that forecasts from past forecast errors and uses them to correct predictions.
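
The integrated part is the easiest to see by hand. The sketch below differences a made-up series once, forecasts the next change, and then inverts the differencing to get back to the original scale. Using the mean of past changes as the forecast is only a stand-in for whatever the AR and MA terms would predict.

```python
import numpy as np

# Made-up series with a steady upward trend.
series = np.array([10.0, 13.0, 15.0, 20.0, 24.0, 27.0])

# First-order differencing (d=1): the change between consecutive steps.
diff = np.diff(series)

# Suppose a model forecasts the next change as the mean of past changes
# (a placeholder for whatever the AR and MA terms would predict).
next_change = diff.mean()

# Invert the differencing by adding the forecast change to the last level.
next_level = series[-1] + next_change
print(diff, next_level)
```

The differenced values no longer climb over time, which is why differencing helps methods that assume a stable mean.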

The statsmodels Python library provides the ARIMA class.

In this lesson, you will develop an ARIMA model for a standard time series dataset.

Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv“.

Download the dataset

The ARIMA class requires an order argument, a tuple (p, d, q) giving the number of AR lags, the number of differences, and the number of MA lags.

You can fit an ARIMA model as follows:

from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(dataset, order=(0,1,0))
model_fit = model.fit()

You can make a one-step out-of-sample forecast with a fitted ARIMA model as follows:

outcome = model_fit.forecast()[0]

The shampoo dataset has a trend so I’d recommend a d value of 1. Experiment with different p and q values and evaluate the predictions from resulting models.

For more help, see the post:

How to Create an ARIMA Model for Time Series Forecasting with Python

Lesson 07: Hello World End-to-End Project

You now have the tools to work through a time series problem and develop a simple forecast model.

In this lesson, you will use the skills learned from all of the prior lessons to work through a new time series forecasting problem.

Download the monthly car sales dataset in CSV format and save it with the filename “monthly-car-sales.csv“.

Download the dataset

Split the data, perhaps extracting the last 1 or 2 years to a separate file. Work through the problem and develop forecasts for the held-back data, including:

Load and explore the dataset.
Visualize the dataset.
Develop a persistence model.
Develop an autoregressive model.
Develop an ARIMA model.
Visualize forecasts and summarize forecast error.
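
The initial split into a development set and a held-back set might look like the sketch below. It builds a made-up monthly series instead of reading the car sales file, and the column name and the output file names dataset.csv and validation.csv are hypothetical.

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Nine years of made-up monthly values; the column name is a placeholder.
index = pd.date_range("1960-01-01", periods=108, freq="MS")
data = pd.DataFrame({"Sales": np.arange(108, dtype=float)}, index=index)
data.index.name = "Month"

# Hold back the final 12 months; the model never sees them during development.
dataset, validation = data.iloc[:-12], data.iloc[-12:]

# Write both parts out (to a temporary directory here; file names are hypothetical).
out_dir = tempfile.mkdtemp()
dataset.to_csv(os.path.join(out_dir, "dataset.csv"))
validation.to_csv(os.path.join(out_dir, "validation.csv"))
print(len(dataset), len(validation))
```

Keeping the held-back year in a separate file makes it harder to leak those observations into model selection by accident.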

For an example of working through a project, see the post:

Time Series Forecast Study with Python: Monthly Sales of French Champagne

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

How to frame a time series forecasting problem as supervised learning.
How to load and explore time series data with Pandas.
How to plot and visualize time series data in a number of different ways.
How to develop a naive forecast called the persistence model as a baseline.
How to develop an autoregressive forecast model using lagged observations.
How to develop an ARIMA model including autoregression, integration and moving average elements.
How to pull all of these elements together into an end-to-end project.

Don’t make light of this. You have come a long way in a short amount of time.

This is just the beginning of your time series forecasting journey with Python. Keep practicing and developing your skills.

Summary

How Did You Go With The Mini-Course?
Did you enjoy this mini-course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

51 Responses to Time Series Forecasting with Python 7-Day Mini-Course

Luca May 5, 2017 at 2:29 am #

Hi

Thanks for so many articles in your blog. Really appreciate.

I have a question: sometimes we use a fixed-parameter model (e.g. the parameters of the ARIMA model stay fixed), while other times we re-estimate the model parameters at each iteration over the test data. Are there any differences or reasons behind that? When is a fixed model useful, and when should an iterative approach be used?

My understanding from the examples is that the iterative way of modeling ARIMA seems more appropriate for seasonal and trending datasets, right?

Thanks a lot

- Jason Brownlee May 5, 2017 at 7:32 am #
  
  In general, I would suggest evaluating a suite of different models for a problem and see what works best.
  
Gururaj August 13, 2017 at 12:21 pm #

Thanks Jason for these helpful articles. I have a general question. When we have a very high number of sensors, are there simpler methods to model them over building a model for each sensor using its own time series data?

If we have some intuition that we may find groups of sensors which may exhibit similar behaviour, is there a method to cluster them and validate, given the individual time series data?

- Jason Brownlee August 14, 2017 at 6:23 am #
  
  Good question Gururaj, sorry I have not worked on consolidating sensor data, I can’t give you expert advice.
  
- Aiswariya August 28, 2019 at 5:26 am #
  
  These articles are very helpful… I like to know whether it is possible to forecast the time(dependent) variable with date as independent variable. For example with the past data of my arrival at the office can I predict at what time I will be able to mark the attendance tomorrow ?
  
  - Jason Brownlee August 28, 2019 at 6:43 am #
    
    Date is not needed, we work from the variable directly with univariate forecasting.
    
joseph September 10, 2017 at 4:02 pm #

Thanks for the course.
I intend on doing the course
I would like to know:
do you have anomaly detection course?
are hidden markov models and recurrent nn fit this area(time series)?
thanks
joseph

- Jason Brownlee September 11, 2017 at 12:05 pm #
  
  Not at this stage, perhaps in the future.
  
  - Prashant Gupta June 5, 2018 at 4:16 pm #
    
    Yes, LSTMs (a type of RNN) have been used for a while for time series problems.
    
Johnny Castro September 25, 2018 at 8:20 pm #

Hi Jason, I am really enjoying the course. You make it easy to learn ML faster than via other curricula.
I wanted to ask where/if you have the answers to these lessons (using the same datasets).
Specifically I am having trouble with “Lesson 03” (box plots and on) in grouping by year and quarter the data from “shampoo sales”.
The function “TimeGrouper” has been deprecated. I am using “Grouper” but have been unable to replicate results.

- Jason Brownlee September 26, 2018 at 6:15 am #
  
  I have blog posts on each, you can use the search or start here:
  https://machinelearningmastery.com/start-here/#timeseries
  
Mia Cloe September 27, 2018 at 9:04 pm #

Hello. Thank you for creating this time series mini-course, I am learning a lot of things.
One thing I’m wondering is how hard it would be to code the ARIMA computation without using APIs such as statsmodels. Do people usually use APIs in time series forecasting?

- Jason Brownlee September 28, 2018 at 6:14 am #
  
  Yes, people usually use APIs. Coding from scratch is only a good idea if you want to learn how it works in more detail or you have special operational requirements.
  
Ismaello January 24, 2020 at 3:28 am #

These lessons are wonderful, thanks. I have a question about a problem.

I have a 4 time series dataset.
1 dataset with data for hour. For last month
1 dataset with data for last 12 months. Data for day.
1 dataset with data per month. 3 last years.

How can i use this dataset?

- Jason Brownlee January 24, 2020 at 7:55 am #
  
  Thanks.
  
  What problem are you having exactly?
  
  - ismael January 24, 2020 at 6:52 pm #
    
    In this example, our dataset is dayN, obs.
    
    In a real case, I have a few datasets like:
    
    diary_dataset_last30days
    day1, obs
    day2, obs
    
    monthly_dataset_actual_year
    month1, obs
    month2, obs
    month3, obs
    
    years_dataset_last5_years
    year1, obs
    year2, obs
    year3, obs
    
    I have data from days, hour too, week, years. What can i do to use this data for forecasting?
    
    What do you think?
    
    - Jason Brownlee January 25, 2020 at 8:34 am #
      
      Perhaps prototype some models and see?
      
MF February 3, 2020 at 6:56 pm #

Hi Jason,
Thanks for so many articles in your blog. Deeply appreciate. Really.
In this Lesson 03: Data Visualization, I am trying to practice plotting BoxPlot.
——–
But got this error : TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of ‘Index’
——–
While running the below code :
from pandas import DataFrame
from pandas import Grouper
series = pd.read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
groups = series.groupby(Grouper(freq='M'))
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
years.boxplot()
plt.show()
——–
Hope you can help. Thanks a lot.

- Jason Brownlee February 4, 2020 at 7:49 am #
  
  Perhaps confirm that the dates/index was parsed correctly.
  
Winda April 8, 2020 at 4:34 pm #

Thanks for your sample

- Jason Brownlee April 9, 2020 at 7:54 am #
  
  You’re welcome!
  

Dominique April 23, 2020 at 10:27 pm #

Hi Jason,

For lesson #04: Persistence Forecast Model.

I adapted your code below.

I got an MSE of 83, which I have difficulty interpreting: do you have a hint?

Thanks,
Dominique

from pandas import read_csv
from datetime import datetime
from matplotlib import pyplot
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_squared_error

def parser(x): return datetime.strptime('190'+x, '%Y-%m')

import pandas as pd
series = pd.read_csv('daily-total-female-births.csv', delimiter=';')
#series = read_csv('daily-total-female-births.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.plot()
pyplot.show()

# Step 1: Defined the supervised learning problem
# Create lagged dataset

values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = [ 'DATE', 'AGE', 'DATE+1', 'AGE+1']
print(dataframe.head(5))

# Step 2: Train and tests set
# split into train and test sets
X = dataframe.values
train_size = int(len(X) * 0.66)
train, test = X[1:train_size], X[train_size:]
train_X, train_y = train[:,1], train[:,3]
test_X, test_y = test[:,1], test[:,3]

# Step 3: persistence algorithm
# persistence model
def model_persistence(x):
    return x

# Step 4: make and evaluate forecast
# walk-forward validation
predictions = list()
for x in test_X:
    yhat = model_persistence(x)
    predictions.append(yhat)
test_score = mean_squared_error(test_y, predictions)
print('Test MSE: %.3f' % test_score)

# plot predictions and expected results
pyplot.plot(train_y)
pyplot.plot([None for i in train_y] + [x for x in test_y])
pyplot.plot([None for i in train_y] + [x for x in predictions])
pyplot.show()

- Jason Brownlee April 24, 2020 at 5:41 am #
  
  Well done!

Dominique April 25, 2020 at 3:47 am #

Hi Jason,

Lesson #3 Data Visualization.

For this one I had to update the shampoo dataset as no year was present: I was obliged to reformat the shampoo dataset by adding a year in the first column so that I can plot by year and quarter.

Your post proved of great help.

The code below:

import pandas as pd
from pandas import read_csv
from matplotlib import pyplot
from pandas import Grouper
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import lag_plot
from pandas.plotting import autocorrelation_plot

# Python version
import sys
print('Python: {}'.format(sys.version))

# Load birth data using read_csv
import pandas as pd
sr = pd.read_csv('daily-total-female-births.csv', delimiter=';')
print(type(sr))


# Print the first few rows using the head() function.
print(f'\nThe type is: {type(sr)}')

print(sr.head(10))

# Print the dimensions of the dataset using the size attribute.
print(f'\nSize is:{sr.Date.size}')


# Query the dataset using a date-time string.
print(f'\nAll dates before March:')
print(sr[sr.Date <= '1959-01-03'])

# Print summary statistics of the observations.
print(f'\nSummary:')
print(sr.describe())

# Data Visualization

sr.plot()
pyplot.show()

# 1 - Printing line plots
# I corrected the original file by adding a year
series = pd.read_csv('shampoocorrected.csv', delimiter=';')

series.plot()
pyplot.show()

# 1 - bis printing dots instead of connected lines
series.plot(style='k.')
pyplot.show()

# 2 - Printing an histogram
series.hist()
pyplot.show()

# 2 bis - Printing using a density plot
series.plot(kind='kde')
pyplot.show()

# 3 - Box and Whisker plots by year or quarter.

df = pd.DataFrame(series)
print(df)

df['date'] = pd.to_datetime(df['Month'], errors='ignore')
groups = df.groupby(Grouper(key='date', freq='Y'))

# by year
df=df.set_index('date')
df['Year']=df.index.year
sns.boxplot(data=df, x='Year', y='Sales')
pyplot.show()

# by quarter
df['Quarter']=df.index.quarter
sns.boxplot(data=df, x='Quarter', y='Sales')
pyplot.show()

# 4 - Heatmaps
series = read_csv('shampoocorrected.csv', delimiter=';', header=0, index_col=0)
# Comparing the months of year =1971
pyplot.matshow(series, interpolation=None, aspect='auto')
pyplot.show()

# 5 - Scatter plot
lag_plot(series)
pyplot.show()


# 6 - Autocorrelation plots
autocorrelation_plot(series)
pyplot.show()

- Jason Brownlee April 25, 2020 at 7:02 am #
  
  Well done!

Dominique April 26, 2020 at 12:27 am #

Hi Jason,

Thank you very much for all the knowledge you put online.

I am a bit surprised that for this lesson #5, no-one has put his/her results. So I am not able to compare.

Here in this case I see that the predictions are diverging from real measures.

Is it due to the reduced number of observations (38)?

Does autoregression work better with large number of observations?

Thanks,

Lesson #5 Autoregressive Forecast Model

I get the following results:
predicted=9.336638, expected=226.000000
predicted=326.840564, expected=303.600000
predicted=204.167853, expected=289.900000
predicted=300.091369, expected=421.600000
predicted=107.833831, expected=264.500000
predicted=142.341172, expected=342.300000
predicted=300.400743, expected=339.700000
predicted=186.799132, expected=440.400000
predicted=355.622045, expected=315.900000
predicted=-22.204158, expected=439.300000
predicted=296.017093, expected=401.300000
predicted=151.606498, expected=437.400000
predicted=383.542211, expected=575.500000
predicted=152.813375, expected=407.600000
predicted=79.334973, expected=682.000000
predicted=268.380054, expected=475.300000
predicted=160.012569, expected=581.300000
predicted=466.048968, expected=646.900000
Test RMSE: 260.905
La moyenne des ventes est de: 312.6
L erreur moyenne absolue des ventes (MAE) est de: 213.747

The code is below:

from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import lag_plot
from pandas import DataFrame
from pandas import concat
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.ar_model import AutoReg
from math import sqrt
import statistics


series = read_csv('shampoo.csv', header=0, index_col=0)
print(series.head())
series.plot()
pyplot.show()

# Quick check for autocorrelation
# scatter: go in all directions
lag_plot(series)
pyplot.show()

# Pearson correlation coefficient (linear correlation)
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']
result = dataframe.corr()
print(result)

# Autocorrelation plot
autocorrelation_plot(series)
pyplot.show()

# line plot of autocorrelation
plot_acf(series, lags=30)
pyplot.show()

# Retrieving the coefficient from the model
# create and evaluate an updated autoregressive model
# split dataset
X = series.values
train, test = X[1:int(len(X)/2)], X[int(len(X)/2):]
# train autoregression
model_fit = AutoReg(train, lags=7).fit()
print('Coefficients: %s' % model_fit.params)
#coef = model_fit.params
# walk forward over time steps in test
#history = train[len(train)-window:]
#history = [history[i] for i in range(len(history))]
#predictions = list()
#for t in range(len(test)):
#	length = len(history)
#	lag = [history[i] for i in range(length-window,length)]
#	yhat = coef[0]
#	for d in range(window):
#		yhat += coef[d+1] * lag[window-d-1]
#	obs = test[t]
#	predictions.append(yhat)
#	history.append(obs)
#	print('predicted=%f, expected=%f' % (yhat, obs))

predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
	print('predicted=%f, expected=%f' % (predictions[i], test[i]))


rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
moyenne_des_ventes = statistics.mean(series.Sales)
print(f'La moyenne des ventes est de: {moyenne_des_ventes}')
MAE_des_ventes = mean_absolute_error(test, predictions)
print('L erreur moyenne absolue des ventes (MAE) est de: %.3f' %MAE_des_ventes)

Jason Brownlee April 26, 2020 at 6:14 am #

Great work!

Results really depend on the specifics of the data and the chosen model/model configuration. It is hard to generalize that “data like this always gets results like that”.

Reply

Dominique April 28, 2020 at 4:03 am #

Hi Jason,

Lesson #6 ARIMA forecast model

I ran the ARIMA model based on the code you provided here: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/.

I got an RMSE of 83 with an ARIMA fitted with p=4, d=1 and q=0.

Compared with the RMSE of 260 obtained in Lesson #5 (Autoregressive forecast), this is far better, and this ARIMA test completes the picture.

I can also see, both graphically and in the predicted numbers, that this ARIMA model greatly improves on the simple autoregression model. Hence the importance of removing the trends in the data.

Thank you very much for this perfect learning curve of the different models.

Kind regards,
Dominique

Reply
- Jason Brownlee April 28, 2020 at 6:52 am #
  
  Well done on your progress!!!
  
  Reply
Dominique May 1, 2020 at 7:20 pm #

Hi Jason,

Lesson #7 Hello world project results:

I ran the project monthly_car based on your code provided here: https://machinelearningmastery.com/time-series-forecast-study-python-monthly-sales-french-champagne/

The code was perfect except for stationary.index: I had to convert the differenced list to a DataFrame, as a list has no index attribute that can be set. I did this:

stationary = pd.DataFrame(difference(X, months_in_year))
stationary.index = series.index[months_in_year:]

RMSE for PERSISTENCE: 3618.284

AUGMENTED DICKEY FULLER TEST
ADF Statistic: -3.900907
p-value: 0.002028
Critical Values:
1%: -3.513
5%: -2.897
10%: -2.586

RMSE for ARIMA (1,0,1): 1796.986

The final validation of the model gave the following results:

>Predicted=11779.040, Expected=13210
>Predicted=11610.133, Expected=14251
>Predicted=21729.333, Expected=20139
>Predicted=19901.654, Expected=21725
>Predicted=24749.498, Expected=26099
>Predicted=23080.763, Expected=21084
>Predicted=14541.186, Expected=18024
>Predicted=14607.661, Expected=16722
>Predicted=15245.562, Expected=14385
>Predicted=18423.721, Expected=21342
>Predicted=18042.117, Expected=17180
>Predicted=15214.474, Expected=14577
RMSE: 1993.547

The final curve of predictions aligns fairly closely with the expected values.

Thank you very much for all the explanations and the code.
Dominique

Reply
- Jason Brownlee May 2, 2020 at 5:42 am #
  
  Great work!
  
  Reply
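The index fix described in the comment above can be sketched on a small made-up series. This is only an illustration: the difference() helper below is an assumed stand-in for the one in the linked tutorial, the values are invented, and a pandas Series is used here where the comment used a DataFrame.

```python
import pandas as pd

def difference(dataset, interval=1):
    # assumed stand-in for the tutorial's helper:
    # seasonal differencing that returns a plain Python list
    return [dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))]

months_in_year = 12

# two years of invented monthly values; the second year is the first year plus 12
index = pd.date_range('1964-01-01', periods=24, freq='MS')
values = [100.0 + 10.0 * (i % 12) + 12.0 * (i // 12) for i in range(24)]
series = pd.Series(values, index=index)

# a list has no index, so wrap the differenced values in a pandas object
# and attach the dates that remain after dropping the first year
X = series.values
stationary = pd.Series(difference(X, months_in_year), index=series.index[months_in_year:])
print(stationary.head())
```

Passing the index at construction time avoids assigning to an attribute that a plain list does not have.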
Ivan June 2, 2020 at 9:53 pm #

Very informative. Persistence is a great baseline for time series. It is always a good idea to have a baseline performance for any given problem.

Reply
- Jason Brownlee June 3, 2020 at 7:59 am #
  
  It sure is!
  
  Reply
Huleji Abraham Tukura June 7, 2020 at 12:44 pm #

Hello,

I tried fitting an AR Model and got the following error:

import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

ImportError Traceback (most recent call last)
in
1 import statsmodels.api as sm
----> 2 from statsmodels.tsa.ar_model import AutoReg

ImportError: cannot import name 'AutoReg' from 'statsmodels.tsa.ar_model' (C:\Users\Huleji\Anaconda3\lib\site-packages\statsmodels\tsa\ar_model.py)

Reply
- Jason Brownlee June 7, 2020 at 1:14 pm #
  
  Thanks I’ll investigate.
  
  Reply
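One likely cause of this ImportError is an older statsmodels installation, since AutoReg was added in statsmodels 0.11. A quick way to check is sketched below; the has_autoreg() helper is a hypothetical name introduced for illustration, and the version threshold is the only thing it tests.

```python
import re
from importlib import metadata

def has_autoreg(version_string):
    """Return True if the version is 0.11 or newer (hypothetical helper)."""
    match = re.match(r'(\d+)\.(\d+)', version_string)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 11)

try:
    installed = metadata.version('statsmodels')
    print('statsmodels %s, AutoReg expected: %s' % (installed, has_autoreg(installed)))
except metadata.PackageNotFoundError:
    print('statsmodels is not installed in this environment')
```

If the check reports an older version, upgrading statsmodels in the same environment that runs the notebook would be the first thing to try.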
Rasheed July 23, 2020 at 5:24 pm #

Thank you, Mr Jason, I really appreciate your work.
We are working on a time series prediction problem, using an ANN to predict wind speed. The performance of the model we built with the ANN was not that different from the persistence method. Going through the literature, we found many authors applying a wavelet decomposition before training their models to improve performance, but we couldn’t find details on how to apply it and use it with an ANN. Can you help us with that, please? Thank you.

Reply
- Jason Brownlee July 24, 2020 at 6:24 am #
  
  Thanks for the suggestion, I may write about the topic in the future.
  
  Reply
Gopal February 4, 2021 at 1:51 am #

Hello Jason, good morning! We are running ARIMA (pmdarima: auto_arima) with seasonality set to true on weekly data covering the last 5 years, and it looks like ARIMA is taking a lot of time. I killed the session after about 15 to 20 minutes. Are there any alternatives that don’t lose seasonality?

Reply
- Jason Brownlee February 4, 2021 at 6:25 am #
  
  Perhaps try running on a faster machine?
  Perhaps try running with less data?
  Perhaps try a simpler configuration set?
  Perhaps try ETS?
  
  Reply
Eduardo Oreamuno Aparicio April 24, 2021 at 12:13 pm #

Greetings

Could you have a time series dataset as follows?

Time, Observation1, Observation2, Observation3
day1, obs11, obs21, obs31
day2, obs12, obs22, obs32
day3, obs13, obs23, obs33

Could you work with the 3 columns of observations, to predict each of the columns individually?

Reply
- Jason Brownlee April 25, 2021 at 5:15 am #
  
  Good question, perhaps start here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
  Reply
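As a starting point for the question above, each observation column can be treated as its own univariate series and evaluated separately. The sketch below uses a persistence baseline per column; the column names follow the question, while the values and the train/test split are made up.

```python
import pandas as pd

# invented data in the layout described in the question
data = pd.DataFrame({
    'Observation1': [10.0, 12.0, 11.0, 13.0, 14.0, 15.0],
    'Observation2': [5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
    'Observation3': [100.0, 98.0, 97.0, 95.0, 94.0, 92.0],
}, index=pd.date_range('2021-01-01', periods=6, freq='D'))

train_size = 4  # the last two rows are held back for testing
scores = {}
for name in data.columns:
    column = data[name]
    test = column.iloc[train_size:]
    # persistence: the prediction for time t is the observation at t-1
    predictions = column.shift(1).iloc[train_size:]
    scores[name] = ((test - predictions) ** 2).mean() ** 0.5

for name, rmse in scores.items():
    print('%s RMSE: %.3f' % (name, rmse))
```

The same loop could fit any univariate model per column in place of the persistence step.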
Abderrahmane May 22, 2021 at 4:26 pm #

Hello Jason.
Thanks for all the work done so far to make ML understandable and easy to apply for newbies.

I would like to know what “parse_dates” does when loading a time series dataset using Pandas’ read_csv method.

Reply
- Jason Brownlee May 23, 2021 at 5:23 am #
  
  I believe it does its best to interpret the dates.
  
  Reply
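To make the effect of parse_dates concrete, the small sketch below loads the same inline CSV text twice, once without and once with parse_dates=True. The rows are a short made-up sample in the style of the births dataset.

```python
from io import StringIO
import pandas as pd

# a tiny made-up sample with a quoted date column
csv_text = '"Date","Births"\n"1959-01-01",35\n"1959-01-02",32\n"1959-01-03",30\n'

# without parse_dates the index holds plain strings
raw = pd.read_csv(StringIO(csv_text), header=0, index_col=0)
# with parse_dates=True pandas tries to convert the index to datetimes
parsed = pd.read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)

print(type(raw.index).__name__, raw.index.dtype)
print(type(parsed.index).__name__, parsed.index.dtype)
```

With a datetime index in place, date-based operations such as resampling and date-string queries become available.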
Liliana May 24, 2021 at 9:41 am #

Hi Jason:

I want to know, can I do a persistence forecast model, for a multivariate time series of the type multiple parallel input and multi-step output, or for a series of the same type but with one-step output?

Thanks for your attention.

Reply
- Jason Brownlee May 25, 2021 at 6:04 am #
  
  Sure.
  
  Reply
  - Liliana May 25, 2021 at 9:15 am #
    
    Ok thanks.
    
    Reply
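One simple way to extend persistence to parallel input series with a multi-step output is to repeat the last observed row for every future step. The sketch below assumes that interpretation; persistence_forecast() is a hypothetical helper name and the numbers are invented.

```python
def persistence_forecast(history, n_steps):
    """Repeat the last observed row of a multivariate series n_steps times."""
    last_row = history[-1]
    return [list(last_row) for _ in range(n_steps)]

# two parallel series, three time steps observed (invented values)
history = [[10, 200], [12, 210], [11, 205]]

# multi-step output: three steps ahead for both series
forecast = persistence_forecast(history, n_steps=3)
print(forecast)

# one-step output is the same call with n_steps=1
print(persistence_forecast(history, n_steps=1))
```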
Chandrakant November 21, 2021 at 12:24 am #

1.
"Date","Births"
0 "1959-01-01",35
1 "1959-01-02",32
2 "1959-01-03",30
3 "1959-01-04",31
4 "1959-01-05",44

2.

RangeIndex: 365 entries, 0 to 364
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 "Date","Births" 365 non-null object
dtypes: object(1)
memory usage: 3.0+ KB
None

3.
Don’t know how to query the dataset using a date-time string.

4.
"Date","Births"
count 365
unique 365
top "1959-07-06",42
freq 1

Reply
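For step 3 above, querying by a date-time string generally requires the dates to be loaded as a datetime index first. A minimal sketch follows, assuming the file parses into a date index and a single Births column; the rows here are made up.

```python
from io import StringIO
import pandas as pd

# made-up rows spanning two months
csv_text = (
    '"Date","Births"\n'
    '"1959-01-30",31\n'
    '"1959-01-31",44\n'
    '"1959-02-01",23\n'
    '"1959-02-02",31\n'
)

# load with the first column as a parsed datetime index
series = pd.read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)

# select every row in January 1959 using a partial date string
january = series.loc['1959-01']
print(january)
```

If the loaded frame still shows a single column named "Date","Births", as in the output above, the file was not split into two columns, and the query will not work until that is resolved.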
monireh December 29, 2022 at 7:09 am #

Hi, I cannot use Python. Why should I learn it?

Reply
- James Carmichael December 29, 2022 at 8:39 am #
  
  Hi monireh…You may benefit from the following course:
  
  https://machinelearningmastery.com/python-for-machine-learning-7-day-mini-course/
  
  Reply
Myles March 11, 2025 at 5:45 am #

You publish a lot of useful articles that help us learn python and data analysis.

Unfortunately, the code in most of those articles raises error messages and deprecation notices.

If we knew how to fix these errors, we wouldn’t need to read your articles.

For example, in lesson two of the time series course, read_csv raises this error message:
TypeError: read_csv() got an unexpected keyword argument 'squeeze'

This is typical of the half-dozen articles I have tried to re-create in Jupyter; every block of code has errors in it.

I think you are doing yourself a disservice by not checking the code before publishing.

Reply
- James Carmichael March 11, 2025 at 5:51 am #
  
  Hi Myles…I completely understand your frustration. Code that doesn’t work as expected can be discouraging, especially when you’re trying to learn.
  
  The issue with read_csv() and the squeeze parameter is due to changes in Pandas. The squeeze argument was deprecated in Pandas 1.4.0 and removed in Pandas 2.0, which is why you’re seeing the TypeError.
  
  Fix for the squeeze error in read_csv()
  Instead of:
  df = pd.read_csv('data.csv', squeeze=True)
  Use:
  df = pd.read_csv('data.csv')
  if df.shape[1] == 1:  # check if there's only one column
      df = df.iloc[:, 0]  # convert to a Series
  This ensures compatibility with the latest versions of Pandas.
  
  You’re right that checking code before publishing is important. Would you be willing to share other errors you’ve encountered? I can help troubleshoot them and provide updated fixes.
  
  Also, if you’re running into deprecation warnings, you can check which version of Pandas, NumPy, or other libraries you have by running:
  import pandas as pd
  print(pd.__version__)
  This can help confirm if an update is required.
  
  Reply
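Another option that tends to work on recent Pandas versions is to call squeeze on the loaded DataFrame. The sketch below uses a small inline sample in place of a real file, so the text and column names are assumptions.

```python
from io import StringIO
import pandas as pd

# small inline sample standing in for a one-column time series file
csv_text = 'Month,Sales\n1-01,266.0\n1-02,145.9\n1-03,183.1\n'

# load as a DataFrame, then squeeze the single data column into a Series
df = pd.read_csv(StringIO(csv_text), header=0, index_col=0)
series = df.squeeze('columns')

print(type(series).__name__)
print(series.head())
```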
