From Developer to Time Series Forecaster in 7 Days.
Python is one of the fastest-growing platforms for applied machine learning.
In this mini-course, you will discover how to get started, build accurate models, and confidently complete time series forecasting projects using Python in 7 days.
This is a big and important post. You might want to bookmark it.
Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Apr/2019: Updated the links to datasets.
- Updated Aug/2019: Updated data loading to use new API.
- Updated Apr/2020: Changed AR to AutoReg due to API change.
- Updated Dec/2020: Updated ARIMA API to the latest version of statsmodels.
Who Is This Mini-Course For?
Before we get started, let’s make sure you are in the right place.
The list below provides some general guidelines as to who this course was designed for.
Don’t panic if you don’t match these points exactly, you might just need to brush up in one area or another to keep up.
- You’re a Developer: This is a course for developers. You are a developer of some sort. You know how to read and write code. You know how to develop and debug a program.
- You know Python: This is a course for Python people. You know the Python programming language, or you’re a skilled enough developer that you can pick it up as you go along.
- You know some Machine Learning: This is a course for novice machine learning practitioners. You know some basic practical machine learning, or you can figure it out quickly.
This mini-course is neither a textbook on Python nor a textbook on time series forecasting.
It will take you from a developer that knows a little machine learning to a developer who can get time series forecasting results using the Python ecosystem, the rising platform for professional machine learning.
Note: This mini-course assumes you have a working Python 2 or 3 SciPy environment with at least NumPy, Pandas, scikit-learn and statsmodels installed.
Mini-Course Overview
This mini-course is broken down into 7 lessons.
You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.
Below are the 7 lessons that will get you started and productive with time series forecasting in Python:
- Lesson 01: Time Series as Supervised Learning.
- Lesson 02: Load Time Series Data.
- Lesson 03: Data Visualization.
- Lesson 04: Persistence Forecast Model.
- Lesson 05: Autoregressive Forecast Model.
- Lesson 06: ARIMA Forecast Model.
- Lesson 07: Hello World End-to-End Project.
Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.
The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the Python platform for time series (hint, I have all of the answers directly on this blog, use the search feature).
I do provide more help in the early lessons because I want you to build up some confidence and inertia.
Post your results in the comments, I’ll cheer you on!
Hang in there, don’t give up.
Lesson 01: Time Series as Supervised Learning
Time series problems are different to traditional prediction problems.
Time imposes an order on the observations. That order must be preserved, and it can also provide additional information to learning algorithms.
A time series dataset may look like the following:
    Time, Observation
    day1, obs1
    day2, obs2
    day3, obs3
We can reframe this data as a supervised learning problem with inputs and outputs to be predicted. For example:
    Input, Output
    ?, obs1
    obs1, obs2
    obs2, obs3
    obs3, ?
You can see that the reframing means we have to discard some rows with missing data.
Once it is reframed, we can then apply all of our favorite learning algorithms like k-Nearest Neighbors and Random Forest.
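As a rough illustration of the reframing above, here is a minimal sketch that builds the input/output pairs with the Pandas shift() function. The toy observations and the column names "input" and "output" are assumptions made for this example only.

```python
from pandas import DataFrame

# Hypothetical toy series standing in for obs1, obs2, obs3, ...
observations = [10, 20, 30, 40, 50]

frame = DataFrame({'output': observations})
# The input for each row is the observation from the previous time step.
frame['input'] = frame['output'].shift(1)
# The first row has no previous observation, so it is discarded.
supervised = frame.dropna()[['input', 'output']]

X = supervised[['input']].values  # 2D array of inputs, one column
y = supervised['output'].values   # 1D array of outputs
print(supervised)
```

The last observation (50 here) has no known output yet. It is the input you would pass to a fitted model to forecast the next time step.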
For more help, see the post:
Lesson 02: Load Time Series Data
Before you can develop forecast models, you must load and work with your time series data.
Pandas provides tools to load data in CSV format.
In this lesson, you will download a standard time series dataset, load it in Pandas and explore it.
Download the daily female births dataset in CSV format and save it with the filename “daily-births.csv”.
You can load a time series dataset as a Pandas Series and specify the header row at line zero, as follows:
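    from pandas import read_csv
    # header row at line zero, dates in the first column; squeeze the single data column into a Series
    series = read_csv('daily-births.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')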
Get used to exploring loaded time series data in Python:
- Print the first few rows using the head() function.
- Print the dimensions of the dataset using the size attribute.
- Query the dataset using a date-time string.
- Print summary statistics of the observations.
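The exploration steps above might look something like the sketch below. To keep it self-contained it reads a few made-up rows from an in-memory string rather than the downloaded file; the values are invented and only the two-column layout (date, value) is assumed to match.

```python
from io import StringIO
from pandas import read_csv

# Made-up rows in a two-column (date, value) layout; not the real dataset.
csv_text = StringIO(
    "Date,Births\n"
    "1959-01-01,10\n"
    "1959-01-02,12\n"
    "1959-01-03,11\n"
    "1959-02-01,15\n"
    "1959-02-02,14\n"
)

# Parse the first column as dates and squeeze the single data column to a Series.
series = read_csv(csv_text, header=0, index_col=0, parse_dates=True).squeeze('columns')

print(series.head(3))          # first few rows
print(series.size)             # number of observations
print(series.loc['1959-02'])   # every observation in February 1959
print(series.describe())       # count, mean, std, min, quartiles, max
```

Querying with a date-time string only works when the index was parsed as dates, which is why parse_dates=True is passed here.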
For more help, see the post:
Lesson 03: Data Visualization
Data visualization is a big part of time series forecasting.
Line plots of observations over time are popular, but there is a suite of other plots that you can use to learn more about your problem.
In this lesson, you must download a standard time series dataset and create 6 different types of plots.
Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv”.
Now create the following 6 types of plots:
- Line Plots.
- Histograms and Density Plots.
- Box and Whisker Plots by year or quarter.
- Heat Maps.
- Lag Plots or Scatter Plots.
- Autocorrelation Plots.
Below is an example of a simple line plot to get you started:
    from pandas import read_csv
    from matplotlib import pyplot
    series = read_csv('shampoo-sales.csv', header=0, index_col=0)
    series.plot()
    pyplot.show()
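For some of the other plot types, a sketch along the following lines may help. It uses a synthetic monthly series (a trend plus noise) instead of the shampoo data, groups by the year of the index rather than using a Grouper, and writes the figure to a hypothetical file name "plots.png"; all of these are assumptions for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: render to a file so no display is needed
from matplotlib import pyplot
from numpy import arange
from numpy.random import default_rng
from pandas import DataFrame, Series, date_range
from pandas.plotting import autocorrelation_plot, lag_plot

# Synthetic monthly series, three years long, with a date-time index.
rng = default_rng(1)
index = date_range('2001-01-01', periods=36, freq='MS')
series = Series(100 + 5 * arange(36) + rng.normal(0, 10, 36), index=index)

# One column per year so that boxplot() draws one box per year.
groups = series.groupby(series.index.year)
years = DataFrame({year: group.values for year, group in groups})

fig, axes = pyplot.subplots(1, 3, figsize=(15, 4))
years.boxplot(ax=axes[0])                 # box and whisker plot by year
lag_plot(series, ax=axes[1])              # observation at t vs. t+1
autocorrelation_plot(series, ax=axes[2])  # correlation at each lag
fig.savefig('plots.png')
```

Grouping by year needs a real date-time index. If the dates in your file are not parsed as dates, the grouping step is the first place that will fail.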
For more help, see the post:
Lesson 04: Persistence Forecast Model
It is important to establish a baseline forecast.
The simplest forecast you can make is to use the current observation (t) to predict the observation at the next time step (t+1).
This is called the naive forecast or the persistence forecast and may be the best possible model on some time series forecast problems.
In this lesson, you will make a persistence forecast for a standard time series forecast problem.
Download the daily female births dataset in CSV format and save it with the filename “daily-births.csv”.
You can implement the persistence forecast as a single line function, as follows:
    # persistence model: the forecast for t+1 is the observation at t
    def model_persistence(x):
        return x
Write code to load the dataset and use the persistence forecast to make a prediction for each time step in the dataset. Note, that you will not be able to make a forecast for the first time step in the dataset as there is no previous observation to use.
Store all of the predictions in a list. You can calculate a Root Mean Squared Error (RMSE) for the predictions compared to the actual observations as follows:
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    # fill this list with one persistence forecast per time step before scoring
    predictions = []
    actual = series.values[1:]
    rmse = sqrt(mean_squared_error(actual, predictions))
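Putting the two snippets together, one possible way to complete the exercise is sketched below. The short list of values is invented and stands in for the loaded observations.

```python
from math import sqrt
from sklearn.metrics import mean_squared_error

# Hypothetical observations standing in for the loaded series values.
values = [10, 12, 11, 15, 14, 13]

# persistence model: the forecast for t+1 is the observation at t
def model_persistence(x):
    return x

# Forecast every time step except the first, which has no prior observation.
predictions = [model_persistence(x) for x in values[:-1]]
actual = values[1:]

rmse = sqrt(mean_squared_error(actual, predictions))
print('RMSE: %.3f' % rmse)
```

The RMSE is in the same units as the observations, so it can be read as the typical size of the forecast error.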
For more help, see the post:
Lesson 05: Autoregressive Forecast Model
Autoregression means developing a linear model that uses observations at previous time steps to predict observations at future time steps (“auto” means self in ancient Greek).
Autoregression is a quick and powerful time series forecasting method.
The statsmodels Python library provides the autoregression model in the AutoReg class.
In this lesson, you will develop an autoregressive forecast model for a standard time series dataset.
Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv”.
You can fit an AR model as follows:
    from statsmodels.tsa.ar_model import AutoReg
    # fit an autoregression on the two previous observations
    model = AutoReg(dataset, lags=2)
    model_fit = model.fit()
You can predict the next out-of-sample observation with a fitted AR model as follows:
    prediction = model_fit.predict(start=len(dataset), end=len(dataset))
You may want to experiment by fitting the model on half of the dataset and predicting one or more of the second half of the series, then compare the predictions to the actual observations.
For more help, see the post:
Lesson 06: ARIMA Forecast Model
The ARIMA is a classical linear model for time series forecasting.
It combines the autoregressive model (AR), differencing to remove trends and seasonality, called integrated (I) and the moving average model (MA) which is an old name given to a model that forecasts the error, used to correct predictions.
The statsmodels Python library provides the ARIMA class.
In this lesson, you will develop an ARIMA model for a standard time series dataset.
Download the monthly shampoo sales dataset in CSV format and save it with the filename “shampoo-sales.csv”.
The ARIMA class requires an order argument, a tuple (p, d, q) that gives the number of AR lags (p), the number of differences (d) and the number of MA lags (q).
You can fit an ARIMA model as follows:
    from statsmodels.tsa.arima.model import ARIMA
    # order=(p, d, q): no AR lags, one difference, no MA lags
    model = ARIMA(dataset, order=(0,1,0))
    model_fit = model.fit()
You can make a one-step out-of-sample forecast for a fitted ARIMA model as follows:
    outcome = model_fit.forecast()[0]
The shampoo dataset has a trend so I’d recommend a d value of 1. Experiment with different p and q values and evaluate the predictions from resulting models.
For more help, see the post:
Lesson 07: Hello World End-to-End Project
You now have the tools to work through a time series problem and develop a simple forecast model.
In this lesson, you will use the skills learned from all of the prior lessons to work through a new time series forecasting problem.
Download the monthly car sales dataset in CSV format and save it with the filename “monthly-car-sales.csv”.
Split the data, perhaps extract the last 1 or 2 years to a separate file. Work through the problem and develop forecasts for the missing data, including:
- Load and explore the dataset.
- Visualize the dataset.
- Develop a persistence model.
- Develop an autoregressive model.
- Develop an ARIMA model.
- Visualize forecasts and summarize forecast error.
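The first step, holding back the final year in a separate file, might be sketched like this. The series is synthetic, and the file names "dataset.csv" and "validation.csv" as well as the temporary output directory are assumptions for the example.

```python
import os
import tempfile
from pandas import Series, date_range, read_csv

# Synthetic monthly series standing in for the car sales data (9 years).
index = date_range('1960-01-01', periods=108, freq='MS')
series = Series(range(108), index=index, name='Sales')
series.index.name = 'Month'

# Hold back the final 12 months; develop models only on the rest.
dataset, validation = series.iloc[:-12], series.iloc[-12:]

out_dir = tempfile.mkdtemp()
dataset_path = os.path.join(out_dir, 'dataset.csv')
validation_path = os.path.join(out_dir, 'validation.csv')
dataset.to_csv(dataset_path, header=True)
validation.to_csv(validation_path, header=True)

# Load the held-back year only when it is time to score the final model.
reloaded = read_csv(validation_path, header=0, index_col=0, parse_dates=True).squeeze('columns')
print(len(dataset), len(validation))
```

Keeping the validation year in its own file makes it harder to accidentally tune a model on the data it will later be judged on.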
For an example of working through a project, see the post:
The End!
(Look How Far You Have Come)
You made it. Well done!
Take a moment and look back at how far you have come.
You discovered:
- How to frame a time series forecasting problem as supervised learning.
- How to load and explore time series data with Pandas.
- How to plot and visualize time series data a number of different ways.
- How to develop a naive forecast called the persistence model as a baseline.
- How to develop an autoregressive forecast model using lagged observations.
- How to develop an ARIMA model including autoregression, integration and moving average elements.
- How to pull all of these elements together into an end-to-end project.
Don’t make light of this, you have come a long way in a short amount of time.
This is just the beginning of your time series forecasting journey with Python. Keep practicing and developing your skills.
Summary
How Did You Go With The Mini-Course?
Did you enjoy this mini-course?
Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.
Hi
Thanks for so many articles in your blog. Really appreciate.
I have a question that I see sometimes we use a fixed-parameter model (e.g. parameters in ARIMA model is always fixed), while other times use an iterative way to determine the model parameters in each iteration of a test data sample. Are there any differences or reasons behind that? and when fixed model is useful and when to use an iterative way?
my understanding from the examples are: iterative way of modeling ARIMA seems more appropriate to seasonal and trending dataset, right?
Thanks a lot
In general, I would suggest evaluating a suite of different models for a problem and see what works best.
Thanks Jason for these helpful articles. I have a general question. When we have a very high number of sensors, are there simpler methods to model them over building a model for each sensor using its own time series data?
If we have some intuition that we may find groups of sensors which may exhibit similar behaviour, is there a method to cluster them and validate, given the individual time series data?
Good question Gururaj, sorry I have not worked on consolidating sensor data, I can’t give you expert advice.
These articles are very helpful… I like to know whether it is possible to forecast the time(dependent) variable with date as independent variable. For example with the past data of my arrival at the office can I predict at what time I will be able to mark the attendance tomorrow ?
Date is not needed, we work from the variable directly with univariate forecasting.
Thanks for the course.
I intend on doing the course
I would like to know:
do you have anomaly detection course?
are hidden markov models and recurrent nn fit this area(time series)?
thanks
joseph
Not at this stage, perhaps in the future.
Yes, LSTM (a type of RNN) has been used for a while for time series problems.
Hi Jason, I am really enjoying the course. You make it easy to learn ML faster than via other curricula.
I wanted to ask where/if you have the answers to these lessons (using the same datasets).
Specifically I am having trouble with “Lesson 03” (box plots and on) in grouping by year and quarter the data from “shampoo sales”.
The function “TimeGrouper” has been deprecated. I am using “Grouper” but have been unable to replicate results.
I have blog posts on each, you can use the search or start here:
https://machinelearningmastery.com/start-here/#timeseries
Hello. Thank you for creating this time series mini-course, I am learning a lot of things.
One thing I’m wondering is how hard it would be to code the ARIMA computation without using APIs such as statsmodels. Do people usually use APIs in time series forecasting?
Yes, people usually use APIs. Coding from scratch is only a good idea if you want to learn how it works in more detail or you have special operational requirements.
These lessons are wonderful, thanks. I have a question about a problem.
I have a 4 time series dataset.
1 dataset with data for hour. For last month
1 dataset with data for last 12 months. Data for day.
1 dataset with data per month. 3 last years.
How can i use this dataset?
Thanks.
What problem are you having exactly?
In this example, our dataset is dayN, obs.
In a real case, I have a few datasets like:
diary_dataset_last30days
day1, obs
day2, obs
monthly_dataset_actual_year
month1, obs
month2, obs
month3, obs
years_dataset_last5_years
year1, obs
year2, obs
year3, obs
I have data from days, hour too, week, years. What can i do to use this data for forecasting?
What do you think?
Perhaps prototype some models and see?
Hi Jason,
Thanks for so many articles in your blog. Deeply appreciate. Really.
In this Lesson 03: Data Visualization, I am trying to practice plotting BoxPlot.
——–
But got this error : TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of ‘Index’
——–
While running the below code :
    from pandas import DataFrame
    from pandas import Grouper
    series = pd.read_csv('shampoo-sales.csv', header=0, index_col=0, parse_dates=True, squeeze=True)
    groups = series.groupby(Grouper(freq='M'))
    years = DataFrame()
    for name, group in groups:
        years[name.year] = group.values
    years.boxplot()
    plt.show()
——–
Hope you can help. Thanks a lot.
Perhaps confirm that the dates/index was parsed correctly.
Thanks for your sample
You’re welcome!
Hi Jason,
For lesson #04: Persistence Forecast Model.
I adapted your code below.
I got an MSE of 83, which I have difficulty interpreting: do you have a hint?
Thanks,
Dominique
Well done!
Hi Jason,
Lesson #3 Data Visualization.
For this one I had to update the shampoo dataset as no year was present: I was obliged to reformat the shampoo dataset by adding a year in the first column so that I can plot by year and quarter.
Your post proved of great help.
The code below:
Well done!
Hi Jason,
Thank you very much for all the knowledge you put online.
I am a bit surprised that for this lesson #5, no-one has put his/her results. So I am not able to compare.
Here in this case I see that the predictions are diverging from real measures.
Is it due to the reduced number of observations (38)?
Does autoregression work better with large number of observations?
Thanks,
Lesson #5 Autoregressive Forecast Model
I get the following results:
predicted=9.336638, expected=226.000000
predicted=326.840564, expected=303.600000
predicted=204.167853, expected=289.900000
predicted=300.091369, expected=421.600000
predicted=107.833831, expected=264.500000
predicted=142.341172, expected=342.300000
predicted=300.400743, expected=339.700000
predicted=186.799132, expected=440.400000
predicted=355.622045, expected=315.900000
predicted=-22.204158, expected=439.300000
predicted=296.017093, expected=401.300000
predicted=151.606498, expected=437.400000
predicted=383.542211, expected=575.500000
predicted=152.813375, expected=407.600000
predicted=79.334973, expected=682.000000
predicted=268.380054, expected=475.300000
predicted=160.012569, expected=581.300000
predicted=466.048968, expected=646.900000
Test RMSE: 260.905
Mean sales: 312.6
Mean absolute error (MAE) of sales: 213.747
The code is below:
Great work!
Results really depend on the specifics of the data and the chosen model/model configuration. It is hard to generalize that “data like this always gets results like that”.
Hi Jason,
Lesson #6 ARIMA forecast model
I run the ARIMA model based on the code you provided here: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/.
I got an RMSE of 83 with an ARIMA fitted with p=4, d=1 and q=0.
When I compare with the RMSE = 260 obtained in lesson #5 (Autoregressive forecast), it is far better and now I completed the picture with this ARIMA test.
I see also graphically and also with the predicted numbers that this ARIMA model greatly improve the simple Autoregression model. Hence the importance of removing the trends in data.
Thank you very much for this perfect learning curve of the different models.
Kind regards,
Dominique
Well done on your progress!!!
Hi Jason,
Lesson #7 Hello world project results:
I run the project monthly_car based on your code provided here: https://machinelearningmastery.com/time-series-forecast-study-python-monthly-sales-french-champagne/
The code was perfect except for the stationary.index which I had to change to a data frame as an index of a list is not modifiable. I did this :
RMSE for PERSISTENCE: 3618.284
AUGMENTED DICKEY FULLER TEST
ADF Statistic: -3.900907
p-value: 0.002028
Critical Values:
1%: -3.513
5%: -2.897
10%: -2.586
RMSE for ARIMA (1,0,1): 1796.986
The final validation of the model gave the following results:
>Predicted=11779.040, Expected=13210
>Predicted=11610.133, Expected=14251
>Predicted=21729.333, Expected=20139
>Predicted=19901.654, Expected=21725
>Predicted=24749.498, Expected=26099
>Predicted=23080.763, Expected=21084
>Predicted=14541.186, Expected=18024
>Predicted=14607.661, Expected=16722
>Predicted=15245.562, Expected=14385
>Predicted=18423.721, Expected=21342
>Predicted=18042.117, Expected=17180
>Predicted=15214.474, Expected=14577
RMSE: 1993.547
The final curve of predictions is not so different in alignment with the expected values.
Thank you very much for all the explanations and the code.
Dominique
Great work!
Very informative. Persistence is a great baseline for time series. It is always a good idea to have a baseline performance for any given problem.
It sure is!
Hello,
I tried fitting an AR Model and got the following error:
1 import statsmodels.api as sm
2 from statsmodels.tsa.ar_model import AutoReg
ImportError Traceback (most recent call last)
in
1 import statsmodels.api as sm
—-> 2 from statsmodels.tsa.ar_model import AutoReg
ImportError: cannot import name ‘AutoReg’ from ‘statsmodels.tsa.ar_model’ (C:\Users\Huleji\Anaconda3\lib\site-packages\statsmodels\tsa\ar_model.py)
Thanks I’ll investigate.
Thank you Mr Jason, really appreciate your work.
we are working on a time series prediction using ANN to predict wind speed. the performance of the model we built using ANN was not that different from the Persistence method. going through the literature we found many using a Wavelet Decomposition before training the models to improve performance, but we couldn’t find the details on how to apply it and use it with ANN, can you help us with that pls ? Thank you.
Thanks for the suggestion, I may write about the topic in the future.
Hello Jason, good morning! We are running ARIMA (pmdarima: auto_arima) with seasonality set to true on weekly data from the last 5 years, and it looks like ARIMA is taking a lot of time. I killed the session after 15 to 20 minutes. Are there any alternatives that do not lose seasonality?
Perhaps try running on a faster machine?
Perhaps try running with less data?
Perhaps try a simpler configuration set?
Perhaps try ETS?
Greetings
could you have a time series dataset as follows?
Time, Observation1, Observation2, Observation3
day1, obs11, obs21, obs31
day2, obs12, obs22, obs32
day3, obs13, obs23, obs33
Could you work with the 3 columns of observations, to predict each of the columns individually?
Good question, perhaps start here:
https://machinelearningmastery.com/start-here/#deep_learning_time_series
Hello Jason.
Thanks for all the work done so far to make ML understandable and easy to apply for newbies.
I would to like to know what does “parse_dates” do when loading a time series dataset using Pandas’ read_csv method?
I believe it does its best to interpret the dates.
Hi Jason:
I want to know, can I do a persistence forecast model, for a multivariate time series of the type multiple parallel input and multi-step output, or for a series of the same type but with one-step output?
Thanks for your attention.
Sure.
Ok thanks.
1.
“Date”,”Births”
0 “1959-01-01”,35
1 “1959-01-02”,32
2 “1959-01-03”,30
3 “1959-01-04”,31
4 “1959-01-05”,44
2.
RangeIndex: 365 entries, 0 to 364
Data columns (total 1 columns):
# Column Non-Null Count Dtype
— —— ————– —–
0 “Date”,”Births” 365 non-null object
dtypes: object(1)
memory usage: 3.0+ KB
None
3.
Don’t know how to —Query the dataset using a date-time string
4.
“Date”,”Births”
count 365
unique 365
top “1959-07-06”,42
freq 1
hi, I can not use python- why do I learn it?
Hi monireh…You may benefit from the following course:
https://machinelearningmastery.com/python-for-machine-learning-7-day-mini-course/