The post Random Forest for Time Series Forecasting appeared first on Machine Learning Mastery.

Random Forest is widely used for classification and regression predictive modeling problems with structured (tabular) datasets, e.g. data as it looks in a spreadsheet or database table.

Random Forest can also be used for **time series forecasting**, although it requires that the time series dataset be transformed into a supervised learning problem first. It also requires the use of a specialized technique for evaluating the model called walk-forward validation, as evaluating the model using k-fold cross validation would result in optimistically biased results.

In this tutorial, you will discover how to develop a Random Forest model for time series forecasting.

After completing this tutorial, you will know:

- Random Forest is an ensemble of decision tree algorithms that can be used for classification and regression predictive modeling.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with a Random Forest regression model for time series forecasting.

Let’s get started.

This tutorial is divided into three parts; they are:

- Random Forest Ensemble
- Time Series Data Preparation
- Random Forest for Time Series

Random forest is an ensemble of decision tree algorithms.

It is an extension of bootstrap aggregation (bagging) of decision trees and can be used for classification and regression problems.

In bagging, a number of decision trees are made where each tree is created from a different bootstrap sample of the training dataset. A bootstrap sample is a sample of the training dataset where an example may appear more than once in the sample. This is referred to as “*sampling with replacement*”.
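To make sampling with replacement concrete, here is a minimal sketch using only the Python standard library. The helper name `bootstrap_sample` and the toy rows are illustrative assumptions, not part of any library or of the tutorial's own code.

```python
# minimal sketch of drawing a bootstrap sample (sampling with replacement)
from random import Random

def bootstrap_sample(rows, seed=None):
    # draw len(rows) items with replacement, so the same row may appear more than once
    rng = Random(seed)
    return rng.choices(rows, k=len(rows))

# toy "training dataset" of ten rows (illustrative values only)
rows = list(range(10))
sample = bootstrap_sample(rows, seed=1)
# rows that were never drawn are left out of this particular sample
left_out = sorted(set(rows) - set(sample))
print(sample)
print(left_out)
```

Each tree in a bagged ensemble would be fit on a different sample like this one, which is why the trees end up slightly different from one another.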

Bagging is an effective ensemble algorithm as each decision tree is fit on a slightly different training dataset, and in turn, has a slightly different performance. Unlike normal decision tree models, such as classification and regression trees (CART), trees used in the ensemble are unpruned, making them slightly overfit to the training dataset. This is desirable as it helps to make each tree more different and have less correlated predictions or prediction errors.

Predictions from the trees are averaged across all decision trees, resulting in better performance than any single tree in the model.

How the predictions are combined depends on the type of problem:

- **Regression**: Prediction is the average prediction across the decision trees.
- **Classification**: Prediction is the majority vote class label predicted across the decision trees.
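The two ways of combining tree predictions can be sketched in a few lines. The function names and the toy per-tree predictions below are hypothetical and chosen only for illustration.

```python
# minimal sketch of combining the predictions made by the trees in an ensemble
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    # regression: average the numeric predictions of the individual trees
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    # classification: return the most common class label (majority vote)
    return Counter(tree_predictions).most_common(1)[0][0]

# toy predictions from three trees (illustrative values only)
reg = aggregate_regression([41.0, 44.0, 47.0])
cls = aggregate_classification(['a', 'b', 'a'])
print(reg)  # 44.0
print(cls)  # a
```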

Random forest involves constructing a large number of decision trees from bootstrap samples from the training dataset, like bagging.

Unlike bagging, random forest also involves selecting a subset of input features (columns or variables) at each split point in the construction of the trees. Typically, constructing a decision tree involves evaluating the value for each input variable in the data in order to select a split point. By reducing the features to a random subset that may be considered at each split point, it forces each decision tree in the ensemble to be more different.

The effect is that the predictions, and in turn, prediction errors, made by each tree in the ensemble are more different or less correlated. When the predictions from these less correlated trees are averaged to make a prediction, it often results in better performance than bagged decision trees.
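The random feature selection at each split point can be sketched as follows. This is a simplified illustration, not how any particular library implements it: the helper `candidate_features` is hypothetical, and using one third of the features as candidates is an assumption based on a common heuristic for regression. In scikit-learn, the comparable setting on *RandomForestRegressor* is the *max_features* argument.

```python
# minimal sketch of choosing a random subset of features to consider at one split point
from random import Random

def candidate_features(n_features, n_candidates, rng):
    # sample distinct column indexes without replacement
    return sorted(rng.sample(range(n_features), n_candidates))

rng = Random(7)
n_features = 6
# assumption: consider roughly one third of the features at each split
n_candidates = max(1, n_features // 3)
# each split point draws its own subset, so different splits see different columns
splits = [candidate_features(n_features, n_candidates, rng) for _ in range(3)]
print(splits)
```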

For more on the Random Forest algorithm, see the tutorial How to Develop a Random Forest Ensemble in Python.

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and the next time step as the output variable.

Let’s make this concrete with an example. Imagine we have a time series as follows:

    time, measure
    1, 100
    2, 110
    3, 108
    4, 115
    5, 120

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step.

Reorganizing the time series dataset this way, the data would look as follows:

    X, y
    ?, 100
    100, 110
    110, 108
    108, 115
    115, 120
    120, ?

Note that the time column is dropped and some rows of data are unusable for training a model, such as the first and the last.

This representation is called a sliding window, as the window of inputs and expected outputs is shifted forward through time to create new “*samples*” for a supervised learning model.
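The toy series above can be framed this way with a single call to *shift()*. This is a minimal sketch using the five values from the example; the column names `X` and `y` simply mirror the table and are otherwise arbitrary.

```python
# minimal sketch: frame the toy series as a one-step supervised learning problem
from pandas import DataFrame

# the five observations from the example above (the time column is dropped)
series = DataFrame({'y': [100, 110, 108, 115, 120]})
# the input is the previous time step, the output is the current time step
framed = DataFrame({'X': series['y'].shift(1), 'y': series['y']})
# the first row has no previous value (NaN) and cannot be used for training
usable = framed.dropna()
print(framed)
print(usable)
```

Note that this framing does not produce the final `120, ?` row from the table, because that row only exists once we want to predict beyond the end of the series.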

For more on the sliding window approach to preparing time series forecasting data, see the tutorial Time Series Forecasting as Supervised Learning.

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better-performing models.

The function below will take a time series as a NumPy array with one or more columns and transform it into a supervised learning problem with the specified number of inputs and outputs.

    # transform a time series dataset into a supervised learning dataset
    from pandas import DataFrame
    from pandas import concat

    def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
        n_vars = 1 if type(data) is list else data.shape[1]
        df = DataFrame(data)
        cols = list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
        # put it all together
        agg = concat(cols, axis=1)
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg.values

We can use this function to prepare a time series dataset for Random Forest.

For more on the step-by-step development of this function, see the tutorial How to Convert a Time Series to a Supervised Learning Problem in Python.

Once the dataset is prepared, we must be careful in how it is used to fit and evaluate a model.

For example, it would not be valid to fit the model on data from the future and have it predict the past. The model must be trained on the past and predict the future.

This means that methods that randomize the dataset during evaluation, like k-fold cross-validation, cannot be used. Instead, we must use a technique called walk-forward validation.

In walk-forward validation, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in making a one-step forecast, e.g. one month, then we can evaluate the model by training on the training dataset and predicting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, then have the model predict the second step in the test dataset.

Repeating this process for the entire test dataset will give a one-step prediction for the entire test dataset from which an error measure can be calculated to evaluate the skill of the model.
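The procedure can be sketched independently of any particular model. In the sketch below, the function name `walk_forward`, the placeholder forecaster `mean_of_last_three`, and the toy series are all assumptions made for illustration; the tutorial's own version refits a Random Forest at each step instead.

```python
# minimal sketch of walk-forward validation with a placeholder forecaster
def walk_forward(series, n_test, forecast):
    # n_test must be at least 1; the last n_test observations are held back
    history = list(series[:-n_test])
    test = list(series[-n_test:])
    predictions = []
    for actual in test:
        # forecast the next step using only observations seen so far
        predictions.append(forecast(history))
        # reveal the real observation before forecasting the following step
        history.append(actual)
    return test, predictions

def mean_of_last_three(history):
    # stand-in "model": the average of the three most recent observations
    recent = history[-3:]
    return sum(recent) / len(recent)

series = [10, 12, 14, 13, 15, 17, 16, 18]
expected, predicted = walk_forward(series, n_test=3, forecast=mean_of_last_three)
print(expected)   # [17, 16, 18]
print(predicted)  # [14.0, 15.0, 16.0]
```

Because the history only ever grows with observations that precede the step being predicted, the forecaster never sees the future.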

For more on walk-forward validation, see the tutorial How To Backtest Machine Learning Models for Time Series Forecasting.

The function below performs walk-forward validation.

It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments.

It then steps through the test set, calling the *random_forest_forecast()* function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.

    # walk-forward validation for univariate data
    def walk_forward_validation(data, n_test):
        predictions = list()
        # split dataset
        train, test = train_test_split(data, n_test)
        # seed history with training dataset
        history = [x for x in train]
        # step over each time-step in the test set
        for i in range(len(test)):
            # split test row into input and output columns
            testX, testy = test[i, :-1], test[i, -1]
            # fit model on history and make a prediction
            yhat = random_forest_forecast(history, testX)
            # store forecast in list of predictions
            predictions.append(yhat)
            # add actual observation to history for the next loop
            history.append(test[i])
            # summarize progress
            print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
        # estimate prediction error
        error = mean_absolute_error(test[:, -1], predictions)
        # return the error, the expected values (last column), and the predictions
        return error, test[:, -1], predictions

The *train_test_split()* function is called to split the dataset into train and test sets.

We can define this function below.

    # split a univariate dataset into train/test sets
    def train_test_split(data, n_test):
        return data[:-n_test, :], data[-n_test:, :]

We can use the RandomForestRegressor class to make a one-step forecast.

The *random_forest_forecast()* function below implements this, taking the training dataset and test input row as input, fitting a model and making a one-step prediction.

    # fit a random forest model and make a one-step prediction
    def random_forest_forecast(train, testX):
        # transform list into array
        train = asarray(train)
        # split into input and output columns
        trainX, trainy = train[:, :-1], train[:, -1]
        # fit model
        model = RandomForestRegressor(n_estimators=1000)
        model.fit(trainX, trainy)
        # make a one-step prediction
        yhat = model.predict([testX])
        return yhat[0]

Now that we know how to prepare time series data for forecasting and evaluate a Random Forest model, next we can look at using Random Forest on a real dataset.

In this section, we will explore how to use the Random Forest regressor for time series forecasting.

We will use a standard univariate time series dataset with the intent of using the model to make a one-step forecast.

You can use the code in this section as the starting point in your own project and easily adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts.

We will use the daily female births dataset, that is, the number of female births recorded each day across one year (1959).

Download the dataset and place it in your current working directory with the filename “*daily-total-female-births.csv*”.

The first few lines of the dataset look as follows:

    "Date","Births"
    "1959-01-01",35
    "1959-01-02",32
    "1959-01-03",30
    "1959-01-04",31
    "1959-01-05",44
    ...

First, let’s load and plot the dataset.

The complete example is listed below.

    # load and plot the time series dataset
    from pandas import read_csv
    from matplotlib import pyplot
    # load dataset
    series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
    values = series.values
    # plot dataset
    pyplot.plot(values)
    pyplot.show()

Running the example creates a line plot of the dataset.

We can see there is no obvious trend or seasonality.

A persistence model can achieve a MAE of about 6.7 births when predicting the last 12 days. This provides a performance baseline: a model that achieves a lower error may be considered skillful.
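A persistence forecast simply predicts that the next value will equal the most recent observation. A minimal sketch of scoring such a baseline is shown below; the helper name `persistence_mae` is hypothetical, and the toy series uses only the first five values of the dataset, so it does not reproduce the 6.7 figure quoted above.

```python
# minimal sketch of a persistence (naive) baseline scored with mean absolute error
def persistence_mae(series, n_test):
    # the last n_test observations are the values we pretend not to know
    test = series[-n_test:]
    # the persistence forecast for each test step is the observation just before it
    previous = series[-n_test - 1:-1]
    errors = [abs(actual - predicted) for actual, predicted in zip(test, previous)]
    return sum(errors) / len(errors)

# first five values of the daily births dataset (too few rows for a meaningful score)
toy = [35, 32, 30, 31, 44]
baseline = persistence_mae(toy, n_test=3)
print('MAE: %.3f' % baseline)
```

The same function could be applied to the full column of births with `n_test=12` to obtain a baseline comparable to the model evaluation that follows.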

Next, we can evaluate the Random Forest model on the dataset when making one-step forecasts for the last 12 days of data.

We will use only the previous six time steps as input to the model and default model hyperparameters, except we will use 1,000 trees in the ensemble (to avoid underfitting).

The complete example is listed below.

    # forecast daily births with random forest
    from numpy import asarray
    from pandas import read_csv
    from pandas import DataFrame
    from pandas import concat
    from sklearn.metrics import mean_absolute_error
    from sklearn.ensemble import RandomForestRegressor
    from matplotlib import pyplot

    # transform a time series dataset into a supervised learning dataset
    def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
        n_vars = 1 if type(data) is list else data.shape[1]
        df = DataFrame(data)
        cols = list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
        # put it all together
        agg = concat(cols, axis=1)
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg.values

    # split a univariate dataset into train/test sets
    def train_test_split(data, n_test):
        return data[:-n_test, :], data[-n_test:, :]

    # fit a random forest model and make a one-step prediction
    def random_forest_forecast(train, testX):
        # transform list into array
        train = asarray(train)
        # split into input and output columns
        trainX, trainy = train[:, :-1], train[:, -1]
        # fit model
        model = RandomForestRegressor(n_estimators=1000)
        model.fit(trainX, trainy)
        # make a one-step prediction
        yhat = model.predict([testX])
        return yhat[0]

    # walk-forward validation for univariate data
    def walk_forward_validation(data, n_test):
        predictions = list()
        # split dataset
        train, test = train_test_split(data, n_test)
        # seed history with training dataset
        history = [x for x in train]
        # step over each time-step in the test set
        for i in range(len(test)):
            # split test row into input and output columns
            testX, testy = test[i, :-1], test[i, -1]
            # fit model on history and make a prediction
            yhat = random_forest_forecast(history, testX)
            # store forecast in list of predictions
            predictions.append(yhat)
            # add actual observation to history for the next loop
            history.append(test[i])
            # summarize progress
            print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
        # estimate prediction error
        error = mean_absolute_error(test[:, -1], predictions)
        return error, test[:, -1], predictions

    # load the dataset
    series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
    values = series.values
    # transform the time series data into supervised learning
    data = series_to_supervised(values, n_in=6)
    # evaluate
    mae, y, yhat = walk_forward_validation(data, 12)
    print('MAE: %.3f' % mae)
    # plot expected vs predicted
    pyplot.plot(y, label='Expected')
    pyplot.plot(yhat, label='Predicted')
    pyplot.legend()
    pyplot.show()

Running the example reports the expected and predicted values for each step in the test set, then the MAE for all predicted values.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the model performs better than a persistence model, achieving a MAE of about 5.9 births, compared to 6.7 births.

**Can you do better?**

You can test different Random Forest hyperparameters and numbers of time steps as input to see if you can achieve better performance. Share your results in the comments below.

    >expected=42.0, predicted=45.0
    >expected=53.0, predicted=43.7
    >expected=39.0, predicted=41.4
    >expected=40.0, predicted=38.1
    >expected=38.0, predicted=42.6
    >expected=44.0, predicted=48.7
    >expected=34.0, predicted=42.7
    >expected=37.0, predicted=37.0
    >expected=52.0, predicted=38.4
    >expected=48.0, predicted=41.4
    >expected=55.0, predicted=43.7
    >expected=50.0, predicted=45.3
    MAE: 5.905

A line plot is created comparing the series of expected values and predicted values for the last 12 days of the dataset.

This gives a visual impression of how well the model performed on the test set.

Once a final Random Forest model configuration is chosen, a model can be finalized and used to make a prediction on new data.

This is called an out-of-sample forecast, e.g. predicting beyond the training dataset. This is identical to making a prediction during the evaluation of the model, as we always want to evaluate a model using the same procedure that we expect to use when the model is used to make predictions on new data.

The example below demonstrates fitting a final Random Forest model on all available data and making a one-step prediction beyond the end of the dataset.

    # finalize model and make a prediction for daily births with random forest
    from numpy import asarray
    from pandas import read_csv
    from pandas import DataFrame
    from pandas import concat
    from sklearn.ensemble import RandomForestRegressor

    # transform a time series dataset into a supervised learning dataset
    def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
        n_vars = 1 if type(data) is list else data.shape[1]
        df = DataFrame(data)
        cols = list()
        # input sequence (t-n, ... t-1)
        for i in range(n_in, 0, -1):
            cols.append(df.shift(i))
        # forecast sequence (t, t+1, ... t+n)
        for i in range(0, n_out):
            cols.append(df.shift(-i))
        # put it all together
        agg = concat(cols, axis=1)
        # drop rows with NaN values
        if dropnan:
            agg.dropna(inplace=True)
        return agg.values

    # load the dataset
    series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
    values = series.values
    # transform the time series data into supervised learning
    train = series_to_supervised(values, n_in=6)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = RandomForestRegressor(n_estimators=1000)
    model.fit(trainX, trainy)
    # construct an input for a new prediction
    row = values[-6:].flatten()
    # make a one-step prediction
    yhat = model.predict(asarray([row]))
    print('Input: %s, Predicted: %.3f' % (row, yhat[0]))

Running the example fits a Random Forest model on all available data.

A new row of input is prepared using the last six days of known data, and the next day beyond the end of the dataset is predicted.

Input: [34 37 52 48 55 50], Predicted: 43.053

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Random Forest Ensemble in Python
- Time Series Forecasting as Supervised Learning
- How to Convert a Time Series to a Supervised Learning Problem in Python
- How To Backtest Machine Learning Models for Time Series Forecasting

In this tutorial, you discovered how to develop a Random Forest model for time series forecasting.

Specifically, you learned:

- Random Forest is an ensemble of decision tree algorithms that can be used for classification and regression predictive modeling.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with a Random Forest regression model for time series forecasting.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Time Series Forecasting With Prophet in Python appeared first on Machine Learning Mastery.

The Prophet library is an open-source library designed for making forecasts for univariate time series datasets. It is easy to use and designed to automatically find a good set of hyperparameters for the model in an effort to make skillful forecasts for data with trends and seasonal structure by default.

In this tutorial, you will discover how to use the Facebook Prophet library for time series forecasting.

After completing this tutorial, you will know:

- Prophet is an open-source library developed by Facebook and designed for automatic forecasting of univariate time series data.
- How to fit Prophet models and use them to make in-sample and out-of-sample forecasts.
- How to evaluate a Prophet model on a hold-out dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Prophet Forecasting Library
- Car Sales Dataset
    - Load and Summarize Dataset
    - Load and Plot Dataset
- Forecast Car Sales With Prophet
    - Fit Prophet Model
    - Make an In-Sample Forecast
    - Make an Out-of-Sample Forecast
    - Manually Evaluate Forecast Model

Prophet, or “*Facebook Prophet*,” is an open-source library for univariate (one variable) time series forecasting developed by Facebook.

Prophet implements what they refer to as an additive time series forecasting model, and the implementation supports trends, seasonality, and holidays.

Implements a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects

— Package ‘prophet’, 2019.

It is designed to be easy and completely automatic, e.g. point it at a time series and get a forecast. As such, it is intended for internal company use, such as forecasting sales, capacity, etc.

For a great overview of Prophet and its capabilities, see the post:

The library provides two interfaces: R and Python. We will focus on the Python interface in this tutorial.

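The first step is to install the Prophet library using pip, as follows:

    sudo pip install fbprophet

Note that from version 1.0 onward the library is published under the package name *prophet* rather than *fbprophet*. The examples in this tutorial use the older *fbprophet* name.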

Next, we can confirm that the library was installed correctly.

To do this, we can import the library and print the version number in Python. The complete example is listed below.

    # check prophet version
    import fbprophet
    # print version number
    print('Prophet %s' % fbprophet.__version__)

Running the example prints the installed version of Prophet.

You should have the same version or higher.

Prophet 0.5

Now that we have Prophet installed, let’s select a dataset we can use to explore using the library.

We will use the monthly car sales dataset.

It is a standard univariate time series dataset that contains both a trend and seasonality. The dataset has 108 months of data and a naive persistence forecast can achieve a mean absolute error of about 3,235 sales, providing a baseline error that a skillful model should improve upon.

No need to download the dataset as we will download it automatically as part of each example.

First, let’s load and summarize the dataset.

Prophet requires data to be in Pandas DataFrames. Therefore, we will load and summarize the data using Pandas.

We can load the data directly from the URL by calling the read_csv() Pandas function, then summarize the shape (number of rows and columns) of the data and view the first few rows of data.

The complete example is listed below.

    # load the car sales dataset
    from pandas import read_csv
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # summarize shape
    print(df.shape)
    # show first few rows
    print(df.head())

Running the example first reports the number of rows and columns, then lists the first five rows of data.

We can see that, as expected, there are 108 months' worth of data and two columns. The first column is the date and the second is the number of sales.

Note that the first column in the output is a row index and is not a part of the dataset, just a helpful tool that Pandas uses to order rows.

    (108, 2)
         Month  Sales
    0  1960-01   6550
    1  1960-02   8728
    2  1960-03  12026
    3  1960-04  14395
    4  1960-05  14587

A time-series dataset does not make sense to us until we plot it.

Plotting a time series helps us actually see if there is a trend, a seasonal cycle, outliers, and more. It gives us a feel for the data.

We can plot the data easily in Pandas by calling the *plot()* function on the DataFrame.

The complete example is listed below.

    # load and plot the car sales dataset
    from pandas import read_csv
    from matplotlib import pyplot
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # plot the time series
    df.plot()
    pyplot.show()

Running the example creates a plot of the time series.

We can clearly see the trend in sales over time and a monthly seasonal pattern to the sales. These are patterns we expect the forecast model to take into account.

Now that we are familiar with the dataset, let’s explore how we can use the Prophet library to make forecasts.

In this section, we will explore using Prophet to forecast the car sales dataset.

Let’s start by fitting a model on the dataset.

To use Prophet for forecasting, first, a *Prophet()* object is defined and configured, then it is fit on the dataset by calling the *fit()* function and passing the data.

The *Prophet()* object takes arguments to configure the type of model you want, such as the type of growth, the type of seasonality, and more. By default, the model will work hard to figure out almost everything automatically.

The *fit()* function takes a *DataFrame* of time series data. The *DataFrame* must have a specific format. The first column must have the name ‘*ds*’ and contain the date-times. The second column must have the name ‘*y*’ and contain the observations.

This means we must change the column names in the dataset. It also requires that the first column be converted to date-time objects, if they are not already (e.g. this can be done as part of loading the dataset with the right arguments to *read_csv*).

For example, we can modify our loaded car sales dataset to have this expected structure, as follows:

    ...
    # prepare expected column names
    df.columns = ['ds', 'y']
    df['ds'] = to_datetime(df['ds'])
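As an alternative, the renaming and date conversion can be requested while loading the file. The sketch below is one possible way to do this with the *names* and *parse_dates* arguments of *read_csv*; it reads a small inline sample (the first three rows of the dataset) rather than the real URL so that it runs without a network connection.

```python
# minimal sketch: rename columns and parse dates while loading the data
from io import StringIO
from pandas import read_csv

# inline stand-in for the CSV file (first three rows of the car sales dataset)
csv_text = "Month,Sales\n1960-01,6550\n1960-02,8728\n1960-03,12026\n"
# header=0 discards the original header row, names supplies the replacements,
# and parse_dates converts the 'ds' column to date-time objects
df = read_csv(StringIO(csv_text), header=0, names=['ds', 'y'], parse_dates=['ds'])
print(df.dtypes)
print(df.head())
```

To use this with the real dataset, the `StringIO(csv_text)` argument would be replaced with the file path or URL.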

The complete example of fitting a Prophet model on the car sales dataset is listed below.

    # fit prophet model on the car sales dataset
    from pandas import read_csv
    from pandas import to_datetime
    from fbprophet import Prophet
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # prepare expected column names
    df.columns = ['ds', 'y']
    df['ds'] = to_datetime(df['ds'])
    # define the model
    model = Prophet()
    # fit the model
    model.fit(df)

Running the example loads the dataset, prepares the DataFrame in the expected format, and fits a Prophet model.

By default, the library provides a lot of verbose output during the fit process. I think it’s a bad idea in general as it trains developers to ignore output.

Nevertheless, the output summarizes what happened during the model fitting process, specifically the optimization processes that ran.

    INFO:fbprophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
    INFO:fbprophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
    Initial log joint probability = -4.39613
        Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
          99       270.121    0.00413718       75.7289           1           1      120
        Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
         179       270.265    0.00019681       84.1622   2.169e-06       0.001      273  LS failed, Hessian reset
         199       270.283   1.38947e-05       87.8642      0.3402           1      299
        Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
         240       270.296    1.6343e-05       89.9117   1.953e-07       0.001      381  LS failed, Hessian reset
         299         270.3   4.73573e-08       74.9719      0.3914           1      455
        Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
         300         270.3   8.25604e-09       74.4478      0.3522      0.3522      456
    Optimization terminated normally:
      Convergence detected: absolute parameter change was below tolerance

I will not reproduce this output in subsequent sections when we fit the model.

Next, let’s make a forecast.

It can be useful to make a forecast on historical data.

That is, we can make a forecast on data used as input to train the model. Ideally, the model has seen the data before and would make a perfect prediction.

Nevertheless, this is not the case as the model tries to generalize across all cases in the data.

This is called making an in-sample (in training set sample) forecast and reviewing the results can give insight into how good the model is. That is, how well it learned the training data.

A forecast is made by calling the *predict()* function and passing a *DataFrame* that contains one column named ‘*ds*’ and rows with date-times for all the intervals to be predicted.

There are many ways to create this “*forecast*” *DataFrame*. In this case, we will loop over one year of dates, e.g. the last 12 months in the dataset, and create a string for each month. We will then convert the list of dates into a *DataFrame* and convert the string values into date-time objects.

    ...
    # define the period for which we want a prediction
    future = list()
    for i in range(1, 13):
        date = '1968-%02d' % i
        future.append([date])
    future = DataFrame(future)
    future.columns = ['ds']
    future['ds'] = to_datetime(future['ds'])
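Since there are many ways to build this *DataFrame*, here is one alternative sketch that uses the Pandas *date_range()* function with a month-start frequency instead of a loop over strings. It is an illustration rather than the approach used in the rest of the tutorial.

```python
# minimal sketch: build the forecast period with date_range instead of a loop
from pandas import DataFrame
from pandas import date_range

# twelve month-start dates beginning January 1968 ('MS' means month start)
future = DataFrame({'ds': date_range(start='1968-01-01', periods=12, freq='MS')})
print(future.head())
print(future.tail())
```

Prophet models also offer a *make_future_dataframe()* helper for extending the training dates, although it is not used in this tutorial.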

This *DataFrame* can then be provided to the *predict()* function to calculate a forecast.

The result of the *predict()* function is a *DataFrame* that contains many columns. Perhaps the most important columns are the forecast date time (‘*ds*’), the forecasted value (‘*yhat*’), and the lower and upper bounds on the predicted value (‘*yhat_lower*’ and ‘*yhat_upper*’) that provide uncertainty of the forecast.

For example, we can print the first few predictions as follows:

    ...
    # summarize the forecast
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())

Prophet also provides a built-in tool for visualizing the prediction in the context of the training dataset.

This can be achieved by calling the *plot()* function on the model and passing it a result DataFrame. It will create a plot of the training dataset and overlay the prediction with the upper and lower bounds for the forecast dates.

    ...
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())
    # plot forecast
    model.plot(forecast)
    pyplot.show()

Tying this all together, a complete example of making an in-sample forecast is listed below.

    # make an in-sample forecast
    from pandas import read_csv
    from pandas import to_datetime
    from pandas import DataFrame
    from fbprophet import Prophet
    from matplotlib import pyplot
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # prepare expected column names
    df.columns = ['ds', 'y']
    df['ds'] = to_datetime(df['ds'])
    # define the model
    model = Prophet()
    # fit the model
    model.fit(df)
    # define the period for which we want a prediction
    future = list()
    for i in range(1, 13):
        date = '1968-%02d' % i
        future.append([date])
    future = DataFrame(future)
    future.columns = ['ds']
    future['ds'] = to_datetime(future['ds'])
    # use the model to make a forecast
    forecast = model.predict(future)
    # summarize the forecast
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())
    # plot forecast
    model.plot(forecast)
    pyplot.show()

Running the example forecasts the last 12 months of the dataset.

The first five months of the prediction are reported and we can see that values are not too different from the actual sales values in the dataset.

              ds          yhat    yhat_lower    yhat_upper
    0 1968-01-01  14364.866157  12816.266184  15956.555409
    1 1968-02-01  14940.687225  13299.473640  16463.811658
    2 1968-03-01  20858.282598  19439.403787  22345.747821
    3 1968-04-01  22893.610396  21417.399440  24454.642588
    4 1968-05-01  24212.079727  22667.146433  25816.191457

Next, a plot is created. We can see the training data are represented as black dots and the forecast is a blue line with upper and lower bounds in a blue shaded area.

We can see that the forecasted 12 months is a good match for the real observations, especially when the bounds are taken into account.

In practice, we really want a forecast model to make a prediction beyond the training data.

This is called an out-of-sample forecast.

We can achieve this in the same way as an in-sample forecast and simply specify a different forecast period.

In this case, a period beyond the end of the training dataset, starting 1969-01.

    ...
    # define the period for which we want a prediction
    future = list()
    for i in range(1, 13):
        date = '1969-%02d' % i
        future.append([date])
    future = DataFrame(future)
    future.columns = ['ds']
    future['ds'] = to_datetime(future['ds'])

Tying this together, the complete example is listed below.

    # make an out-of-sample forecast
    from pandas import read_csv
    from pandas import to_datetime
    from pandas import DataFrame
    from fbprophet import Prophet
    from matplotlib import pyplot
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # prepare expected column names
    df.columns = ['ds', 'y']
    df['ds'] = to_datetime(df['ds'])
    # define the model
    model = Prophet()
    # fit the model
    model.fit(df)
    # define the period for which we want a prediction
    future = list()
    for i in range(1, 13):
        date = '1969-%02d' % i
        future.append([date])
    future = DataFrame(future)
    future.columns = ['ds']
    future['ds'] = to_datetime(future['ds'])
    # use the model to make a forecast
    forecast = model.predict(future)
    # summarize the forecast
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].head())
    # plot forecast
    model.plot(forecast)
    pyplot.show()

Running the example makes an out-of-sample forecast for the car sales data.

The first five rows of the forecast are printed, although it is hard to get an idea of whether they are sensible or not.

               ds          yhat    yhat_lower    yhat_upper
    0  1969-01-01  15406.401318  13751.534121  16789.969780
    1  1969-02-01  16165.737458  14486.887740  17634.953132
    2  1969-03-01  21384.120631  19738.950363  22926.857539
    3  1969-04-01  23512.464086  21939.204670  25105.341478
    4  1969-05-01  25026.039276  23544.081762  26718.820580

A plot is created to help us evaluate the prediction in the context of the training data.

The new one-year forecast does look sensible, at least by eye.

It is critical to develop an objective estimate of a forecast model’s performance.

This can be achieved by holding some data back from the model, such as the last 12 months. The model is then fit on the first portion of the data, used to make predictions on the held-out portion, and scored with an error measure, such as the mean absolute error across the forecasts. In effect, this is a simulated out-of-sample forecast.

The score gives an estimate of how well we might expect the model to perform on average when making an out-of-sample forecast.

We can do this with the sample data by creating a new *DataFrame* for training with the last 12 months removed.

    ...
    # create training dataset with the last 12 months removed
    train = df.drop(df.index[-12:])
    print(train.tail())

A forecast can then be made on the last 12 months of date-times.

We can then retrieve the forecast values and the expected values from the original dataset and calculate a mean absolute error metric using the scikit-learn library.

    ...
    # calculate MAE between expected and predicted values for the hold-out period
    y_true = df['y'][-12:].values
    y_pred = forecast['yhat'].values
    mae = mean_absolute_error(y_true, y_pred)
    print('MAE: %.3f' % mae)

It can also be helpful to plot the expected vs. predicted values to see how well the out-of-sample prediction matches the known values.

    ...
    # plot actual vs predicted
    pyplot.plot(y_true, label='Actual')
    pyplot.plot(y_pred, label='Predicted')
    pyplot.legend()
    pyplot.show()

Tying this together, the example below demonstrates how to evaluate a Prophet model on a hold-out dataset.

    # evaluate prophet time series forecasting model on hold out dataset
    from pandas import read_csv
    from pandas import to_datetime
    from pandas import DataFrame
    from fbprophet import Prophet
    from sklearn.metrics import mean_absolute_error
    from matplotlib import pyplot
    # load data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/monthly-car-sales.csv'
    df = read_csv(path, header=0)
    # prepare expected column names
    df.columns = ['ds', 'y']
    df['ds'] = to_datetime(df['ds'])
    # create training dataset with the last 12 months removed
    train = df.drop(df.index[-12:])
    print(train.tail())
    # define the model
    model = Prophet()
    # fit the model
    model.fit(train)
    # define the period for which we want a prediction
    future = list()
    for i in range(1, 13):
        date = '1968-%02d' % i
        future.append([date])
    future = DataFrame(future)
    future.columns = ['ds']
    future['ds'] = to_datetime(future['ds'])
    # use the model to make a forecast
    forecast = model.predict(future)
    # calculate MAE between expected and predicted values for the hold-out period
    y_true = df['y'][-12:].values
    y_pred = forecast['yhat'].values
    mae = mean_absolute_error(y_true, y_pred)
    print('MAE: %.3f' % mae)
    # plot actual vs predicted
    pyplot.plot(y_true, label='Actual')
    pyplot.plot(y_pred, label='Predicted')
    pyplot.legend()
    pyplot.show()

Running the example first reports the last few rows of the training dataset.

It confirms that the training data ends in the last month of 1967 and that 1968 will be used as the hold-out dataset.

               ds      y
    91 1967-08-01  13434
    92 1967-09-01  13598
    93 1967-10-01  17187
    94 1967-11-01  16119
    95 1967-12-01  13713

Next, a mean absolute error is calculated for the forecast period.

In this case we can see that the error is approximately 1,336 sales, which is much lower (better) than a naive persistence model that achieves an error of 3,235 sales over the same period.

MAE: 1336.814

Finally, a plot is created comparing the actual vs. predicted values. In this case, we can see that the forecast is a good fit. The model has skill and produces a forecast that looks sensible.
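To make the comparison against a naive baseline concrete, the sketch below computes a mean absolute error for a persistence forecast in plain Python. It rests on assumptions: the tutorial does not say how its persistence model was defined, so here the last training observation is simply repeated for every future step, and the numbers are hypothetical toy values rather than the car sales data. Other definitions, such as repeating the value from the same month one year earlier, are equally common.

```python
# Sketch: mean absolute error and a naive persistence baseline (toy data).

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between paired values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def persistence_forecast(train, n_steps):
    """Assumed baseline: repeat the last training observation for every step."""
    return [train[-1]] * n_steps

# hypothetical monthly values, not the car sales dataset
series = [10, 12, 14, 13, 15, 17, 16, 18]
n_test = 3
train, test = series[:-n_test], series[-n_test:]

baseline = persistence_forecast(train, n_test)
baseline_mae = mean_absolute_error(test, baseline)
print('Persistence MAE: %.3f' % baseline_mae)
```

A model is usually only considered to have skill if its error on the same hold-out period is lower than the baseline's.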

The Prophet library also provides tools to automatically evaluate models and plot results, although those tools don’t appear to work well with data above one day in resolution.

This section provides more resources on the topic if you are looking to go deeper.

- Prophet Homepage.
- Prophet GitHub Project.
- Prophet API Documentation.
- Prophet: forecasting at scale, 2017.
- Forecasting at scale, 2017.
- Car Sales Dataset.
- Package ‘prophet’, R Documentation.

In this tutorial, you discovered how to use the Facebook Prophet library for time series forecasting.

Specifically, you learned:

- Prophet is an open-source library developed by Facebook and designed for automatic forecasting of univariate time series data.
- How to fit Prophet models and use them to make in-sample and out-of-sample forecasts.
- How to evaluate a Prophet model on a hold-out dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Time Series Forecasting With Prophet in Python appeared first on Machine Learning Mastery.

The post How to Model Volatility with ARCH and GARCH for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

The ARCH or Autoregressive Conditional Heteroskedasticity method provides a way to model a change in variance in a time series that is time dependent, such as increasing or decreasing volatility. An extension of this approach named GARCH or Generalized Autoregressive Conditional Heteroskedasticity allows the method to support changes in the time dependent volatility, such as increasing and decreasing volatility in the same series.

In this tutorial, you will discover the ARCH and GARCH models for predicting the variance of a time series.

After completing this tutorial, you will know:

- The problem with variance in a time series and the need for ARCH and GARCH models.
- How to configure ARCH and GARCH models.
- How to implement ARCH and GARCH models in Python.

**Kick-start your project** with my new book Time Series Forecasting With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem with Variance
- What Is an ARCH Model?
- What Is a GARCH Model?
- How to Configure ARCH and GARCH Models
- ARCH and GARCH Models in Python

Autoregressive models can be developed for univariate time series data that is stationary (AR), has a trend (ARIMA), and has a seasonal component (SARIMA).

One aspect of a univariate time series that these autoregressive models do not model is a change in the variance over time.

Classically, a time series with modest changes in variance can sometimes be adjusted using a power transform, such as by taking the Log or using a Box-Cox transform.

There are some time series where the variance changes consistently over time. In the context of a time series in the financial domain, this would be called increasing and decreasing volatility.

In time series where the variance is increasing in a systematic way, such as an increasing trend, this property of the series is called heteroskedasticity. It’s a fancy word from statistics that means changing or unequal variance across the series.

If the change in variance can be correlated over time, then it can be modeled using an autoregressive process, such as ARCH.

Autoregressive Conditional Heteroskedasticity, or ARCH, is a method that explicitly models the change in variance over time in a time series.

Specifically, an ARCH method models the variance at a time step as a function of the residual errors from a mean process (e.g. a zero mean).

The ARCH process introduced by Engle (1982) explicitly recognizes the difference between the unconditional and the conditional variance allowing the latter to change over time as a function of past errors.

— Generalized autoregressive conditional heteroskedasticity, 1986.

A lag parameter must be specified to define the number of prior residual errors to include in the model. Using the notation of the GARCH model (discussed later), we can refer to this parameter as “*q*”. Originally, this parameter was called “*p*”, and it is also called “*p*” in the arch Python package used later in this tutorial.

**q**: The number of lag squared residual errors to include in the ARCH model.

A generally accepted notation for an ARCH model is to specify the ARCH() function with the q parameter ARCH(q); for example, ARCH(1) would be a first order ARCH model.
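Written out, an ARCH(q) model is commonly stated as below. The symbols are assumptions introduced here for illustration rather than notation from this tutorial: the residual error at time t is written as \epsilon_t, its conditional variance as \sigma_t^2, and the model coefficients as \alpha_0, ..., \alpha_q.

```latex
\sigma_t^2 = \alpha_0 + \sum_{i=1}^{q} \alpha_i \, \epsilon_{t-i}^2,
\qquad \alpha_0 > 0, \quad \alpha_i \ge 0
```

For ARCH(1), this reduces to a constant plus a single coefficient multiplied by the previous squared residual error.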

The approach expects the series is stationary, other than the change in variance, meaning it does not have a trend or seasonal component. An ARCH model is used to predict the variance at future time steps.

[ARCH] are mean zero, serially uncorrelated processes with nonconstant variances conditional on the past, but constant unconditional variances. For such processes, the recent past gives information about the one-period forecast variance.

In practice, this can be used to model the expected variance on the residuals after another autoregressive model has been used, such as an ARMA or similar.

The model should only be applied to a prewhitened residual series {e_t} that is uncorrelated and contains no trends or seasonal changes, such as might be obtained after fitting a satisfactory SARIMA model.

— Page 148, Introductory Time Series with R, 2009.

Generalized Autoregressive Conditional Heteroskedasticity, or GARCH, is an extension of the ARCH model that incorporates a moving average component together with the autoregressive component.

Specifically, the model includes lag variance terms (e.g. the observations if modeling the white noise residual errors of another process), together with lag residual errors from a mean process.

The introduction of a moving average component allows the model to both model the conditional change in variance over time as well as changes in the time-dependent variance. Examples include conditional increases and decreases in variance.

As such, the model introduces a new parameter “p” that describes the number of lag variance terms:

- **p**: The number of lag variances to include in the GARCH model.
- **q**: The number of lag residual errors to include in the GARCH model.

A generally accepted notation for a GARCH model is to specify the GARCH() function with the *p* and *q* parameters GARCH(p, q); for example GARCH(1, 1) would be a first order GARCH model.

A GARCH model subsumes ARCH models, where a GARCH(0, q) is equivalent to an ARCH(q) model.

For p = 0 the process reduces to the ARCH(q) process, and for p = q = 0 E(t) is simply white noise. In the ARCH(q) process the conditional variance is specified as a linear function of past sample variances only, whereas the GARCH(p, q) process allows lagged conditional variances to enter as well. This corresponds to some sort of adaptive learning mechanism.

— Generalized autoregressive conditional heteroskedasticity, 1986.

As with ARCH, GARCH predicts the future variance and expects that the series is stationary, other than the change in variance, meaning it does not have a trend or seasonal component.
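The relationship between the two models can be sketched with the conditional variance recursion for GARCH(1, 1) in plain Python. This is a minimal illustration under stated assumptions: the coefficient names (omega, alpha, beta), their values, the starting variance, and the residuals are all hypothetical, and nothing here is estimated from data.

```python
# Sketch: conditional variance recursion for GARCH(1, 1).
# omega, alpha and beta are illustrative values, not fitted estimates.

def garch_variance(residuals, omega, alpha, beta, initial_variance):
    """Return the conditional variance at each step, given past residuals."""
    variances = [initial_variance]
    for t in range(1, len(residuals)):
        prev_sq_error = residuals[t - 1] ** 2  # lag residual error term (q = 1)
        prev_variance = variances[t - 1]       # lag variance term (p = 1)
        variances.append(omega + alpha * prev_sq_error + beta * prev_variance)
    return variances

# hypothetical zero-mean residuals
residuals = [0.1, -0.2, 0.4, -0.1, 0.3]
garch = garch_variance(residuals, omega=0.01, alpha=0.2, beta=0.7, initial_variance=0.05)
# with beta = 0 the lag variance term vanishes and the recursion is ARCH(1)
arch = garch_variance(residuals, omega=0.01, alpha=0.2, beta=0.0, initial_variance=0.05)
print(garch)
print(arch)
```

Setting beta to zero removes the lag variance term, which is the sense in which GARCH(0, q) is equivalent to ARCH(q).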

The configuration for an ARCH model is best understood in the context of ACF and PACF plots of the variance of the time series.

This can be achieved by subtracting the mean from each observation in the series and squaring the result, or just squaring the observation if you’re already working with white noise residuals from another model.

If a correlogram appears to be white noise […], then volatility can be detected by looking at the correlogram of the squared values since the squared values are equivalent to the variance (provided the series is adjusted to have a mean of zero).

— Pages 146-147, Introductory Time Series with R, 2009.

The ACF and PACF plots can then be interpreted to estimate values for p and q, in a similar way as is done for the ARMA model.
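The transform described above, followed by an autocorrelation calculation, can be sketched in plain Python. The series is hypothetical, and the helper uses the standard sample autocorrelation estimator; it is a stand-in for a plotting routine, not the method used by any particular library.

```python
# Sketch: autocorrelation of the squared, mean-adjusted observations.

def autocorrelation(values, lag):
    """Sample autocorrelation of a list of values at the given lag."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    numer = sum((values[t] - mean) * (values[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# hypothetical series whose swings grow over time
series = [0.1, -0.1, 0.2, -0.3, 0.5, -0.4, 0.8, -0.9, 1.2, -1.1]

# subtract the mean from each observation and square the result
mean = sum(series) / len(series)
squared = [(x - mean) ** 2 for x in series]

# autocorrelation of the squared values at lags 0 to 3
acf = [autocorrelation(squared, lag) for lag in range(0, 4)]
print(acf)
```

Lags at which the autocorrelation of the squared values stays clearly away from zero are candidates for the lag parameter.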

For more information on how to do this, see the post:

In this section, we will look at how we can develop ARCH and GARCH models in Python using the arch library.

First, let’s prepare a dataset we can use for these examples.

We can create a dataset with a controlled model of variance.

The simplest case would be a series of random noise where the mean is zero and the variance starts at 0.0 and steadily increases.

We can achieve this in Python using the gauss() function that generates a Gaussian random number with the specified mean and standard deviation.

    # create dataset
    data = [gauss(0, i*0.01) for i in range(0,100)]

We can plot the dataset to get an idea of how the linear change in variance looks. The complete example is listed below.

    # create a simple white noise with increasing variance
    from random import gauss
    from random import seed
    from matplotlib import pyplot
    # seed pseudorandom number generator
    seed(1)
    # create dataset
    data = [gauss(0, i*0.01) for i in range(0,100)]
    # plot
    pyplot.plot(data)
    pyplot.show()

Running the example creates and plots the dataset. We can see the clear change in variance over the course of the series.

We know there is an autocorrelation in the variance of the contrived dataset.

Nevertheless, we can look at an autocorrelation plot to confirm this expectation. The complete example is listed below.

    # check correlations of squared observations
    from random import gauss
    from random import seed
    from matplotlib import pyplot
    from statsmodels.graphics.tsaplots import plot_acf
    # seed pseudorandom number generator
    seed(1)
    # create dataset
    data = [gauss(0, i*0.01) for i in range(0,100)]
    # square the dataset
    squared_data = [x**2 for x in data]
    # create acf plot
    plot_acf(squared_data)
    pyplot.show()

Running the example creates an autocorrelation plot of the squared observations. We see significant positive correlation in variance out to perhaps 15 lag time steps.

This might make a reasonable value for the parameter in the ARCH model.

Developing an ARCH model involves three steps:

- Define the model
- Fit the model
- Make a forecast.

Before fitting and forecasting, we can split the dataset into a train and test set so that we can fit the model on the train and evaluate its performance on the test set.

    # split into train/test
    n_test = 10
    train, test = data[:-n_test], data[-n_test:]

A model can be defined by calling the arch_model() function. We can specify a model for the mean of the series: in this case *mean='Zero'* is an appropriate model. We can then specify the model for the variance: in this case *vol='ARCH'*. We can also specify the lag parameter for the ARCH model: in this case *p=15*.

Note that the arch library swaps the meaning of *p* and *q* relative to the notation used above: its *p* argument sets the number of lag squared residual errors and its *q* argument sets the number of lag variances.

    # define model
    model = arch_model(train, mean='Zero', vol='ARCH', p=15)

The model can be fit on the data by calling the fit() function. There are many options on this function, although the defaults are good enough for getting started. This will return a fit model.

    # fit model
    model_fit = model.fit()

Finally, we can make a prediction by calling the forecast() function on the fit model. We can specify the horizon for the forecast.

In this case, we will predict the variance for the last 10 time steps of the dataset, and withhold them from the training of the model.

    # forecast the test set
    yhat = model_fit.forecast(horizon=n_test)

We can tie all of this together; the complete example is listed below.

    # example of ARCH model
    from random import gauss
    from random import seed
    from matplotlib import pyplot
    from arch import arch_model
    # seed pseudorandom number generator
    seed(1)
    # create dataset
    data = [gauss(0, i*0.01) for i in range(0,100)]
    # split into train/test
    n_test = 10
    train, test = data[:-n_test], data[-n_test:]
    # define model
    model = arch_model(train, mean='Zero', vol='ARCH', p=15)
    # fit model
    model_fit = model.fit()
    # forecast the test set
    yhat = model_fit.forecast(horizon=n_test)
    # plot the actual variance (gauss() takes a standard deviation, so square it)
    var = [(i*0.01)**2 for i in range(0,100)]
    pyplot.plot(var[-n_test:])
    # plot forecast variance
    pyplot.plot(yhat.variance.values[-1, :])
    pyplot.show()

Running the example defines and fits the model then predicts the variance for the last 10 time steps of the dataset.

A line plot is created comparing the series of expected variance to the predicted variance. Although the model was not tuned, the predicted variance looks reasonable.

We can fit a GARCH model just as easily using the arch library.

The *arch_model()* function can specify a GARCH model instead of an ARCH model by setting *vol='GARCH'*, along with the lag parameters for both components.

    # define model
    model = arch_model(train, mean='Zero', vol='GARCH', p=15, q=15)

The dataset may not be a good fit for a GARCH model given the steadily increasing variance; nevertheless, the complete example is listed below.

    # example of GARCH model
    from random import gauss
    from random import seed
    from matplotlib import pyplot
    from arch import arch_model
    # seed pseudorandom number generator
    seed(1)
    # create dataset
    data = [gauss(0, i*0.01) for i in range(0,100)]
    # split into train/test
    n_test = 10
    train, test = data[:-n_test], data[-n_test:]
    # define model
    model = arch_model(train, mean='Zero', vol='GARCH', p=15, q=15)
    # fit model
    model_fit = model.fit()
    # forecast the test set
    yhat = model_fit.forecast(horizon=n_test)
    # plot the actual variance (gauss() takes a standard deviation, so square it)
    var = [(i*0.01)**2 for i in range(0,100)]
    pyplot.plot(var[-n_test:])
    # plot forecast variance
    pyplot.plot(yhat.variance.values[-1, :])
    pyplot.show()

A plot of the expected and predicted variance is listed below.

This section provides more resources on the topic if you are looking to go deeper.

- Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation, 1982.
- Generalized autoregressive conditional heteroskedasticity, 1986.
- Chapter 7, Non-stationary Models, Introductory Time Series with R, 2009.

- Autoregressive conditional heteroskedasticity on Wikipedia
- Heteroscedasticity on Wikipedia
- What is the difference between GARCH and ARCH?

In this tutorial, you discovered the ARCH and GARCH models for predicting the variance of a time series.

Specifically, you learned:

- The problem with variance in a time series and the need for ARCH and GARCH models.
- How to configure ARCH and GARCH models.
- How to implement ARCH and GARCH models in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Model Volatility with ARCH and GARCH for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Exponential Smoothing for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

It is a powerful forecasting method that may be used as an alternative to the popular Box-Jenkins ARIMA family of methods.

In this tutorial, you will discover the exponential smoothing method for univariate time series forecasting.

After completing this tutorial, you will know:

- What exponential smoothing is and how it is different from other forecasting methods.
- The three main types of exponential smoothing and how to configure them.
- How to implement exponential smoothing in Python.

Let’s get started.

This tutorial is divided into 4 parts; they are:

- What Is Exponential Smoothing?
- Types of Exponential Smoothing
- How to Configure Exponential Smoothing
- Exponential Smoothing in Python

Exponential smoothing is a time series forecasting method for univariate data.

Time series methods like the Box-Jenkins ARIMA family of methods develop a model where the prediction is a weighted linear sum of recent past observations or lags.

Exponential smoothing forecasting methods are similar in that a prediction is a weighted sum of past observations, but the model explicitly uses an exponentially decreasing weight for past observations.

Specifically, past observations are weighted with a geometrically decreasing ratio.

Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation the higher the associated weight.

— Page 171, Forecasting: principles and practice, 2013.
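The geometric decay of the weights can be shown in a few lines of Python. This is a small sketch for the simplest form of the method, assuming a single hypothetical smoothing factor (called alpha here) of 0.5; the weight on the observation k steps back is then alpha * (1 - alpha)^k.

```python
# Sketch: exponentially decaying weights on past observations.
alpha = 0.5   # illustrative smoothing factor
n_lags = 5    # number of past observations to show

# weight on the observation k steps back: alpha * (1 - alpha)^k
weights = [alpha * (1 - alpha) ** k for k in range(n_lags)]
print(weights)
```

Each weight is a constant fraction (1 - alpha) of the one before it, so the most recent observation always carries the largest weight.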

Exponential smoothing methods may be considered as peers and an alternative to the popular Box-Jenkins ARIMA class of methods for time series forecasting.

Collectively, the methods are sometimes referred to as ETS models, referring to the explicit modeling of Error, Trend and Seasonality.

There are three main types of exponential smoothing time series forecasting methods.

These are a simple method that assumes no systematic structure, an extension that explicitly handles trends, and the most advanced approach, which adds support for seasonality.

Single Exponential Smoothing, SES for short, also called Simple Exponential Smoothing, is a time series forecasting method for univariate data without a trend or seasonality.

It requires a single parameter, called *alpha* (*a*), also called the smoothing factor or smoothing coefficient.

This parameter controls the rate at which the influence of the observations at prior time steps decays exponentially. Alpha is often set to a value between 0 and 1. Large values mean that the model pays attention mainly to the most recent past observations, whereas smaller values mean more of the history is taken into account when making a prediction.

A value close to 1 indicates fast learning (that is, only the most recent values influence the forecasts), whereas a value close to 0 indicates slow learning (past observations have a large influence on forecasts).

— Page 89, Practical Time Series Forecasting with R, 2016.

Hyperparameters:

**Alpha**: Smoothing factor for the level.
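The effect of alpha can be seen in a minimal pure-Python sketch of the smoothing recursion. It assumes the level is initialised with the first observation, which is one of several common choices, and the data values are hypothetical.

```python
# Sketch: single exponential smoothing as a simple recursion.

def simple_exp_smoothing(series, alpha):
    """Return the one-step-ahead forecast after the last observation."""
    level = series[0]  # assumption: initialise the level with the first observation
    for value in series[1:]:
        # blend the new observation with the previous level
        level = alpha * value + (1 - alpha) * level
    return level

data = [10.0, 12.0, 11.0, 13.0]  # hypothetical observations
fast = simple_exp_smoothing(data, alpha=1.0)  # only the latest value matters
slow = simple_exp_smoothing(data, alpha=0.0)  # the level never updates
mid = simple_exp_smoothing(data, alpha=0.5)
print(fast, slow, mid)
```

With alpha set to 1.0 the forecast is simply the last observation, and with alpha set to 0.0 it never moves from its starting value; values in between trade off the two.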

Double Exponential Smoothing is an extension to Exponential Smoothing that explicitly adds support for trends in the univariate time series.

In addition to the *alpha* parameter for controlling smoothing factor for the level, an additional smoothing factor is added to control the decay of the influence of the change in trend called *beta* (*b*).

The method supports trends that change in different ways: an additive and a multiplicative, depending on whether the trend is linear or exponential respectively.

Double Exponential Smoothing with an additive trend is classically referred to as Holt’s linear trend model, named for the developer of the method Charles Holt.

- **Additive Trend**: Double Exponential Smoothing with a linear trend.
- **Multiplicative Trend**: Double Exponential Smoothing with an exponential trend.

For longer range (multi-step) forecasts, the trend may continue on unrealistically. As such, it can be useful to dampen the trend over time.

Dampening means reducing the size of the trend over future time steps down to a straight line (no trend).

The forecasts generated by Holt’s linear method display a constant trend (increasing or decreasing) indefinitely into the future. Even more extreme are the forecasts generated by the exponential trend method […] Motivated by this observation […] introduced a parameter that “dampens” the trend to a flat line some time in the future.

— Page 183, Forecasting: principles and practice, 2013.

As with modeling the trend itself, we can use the same principles in dampening the trend, specifically additively or multiplicatively for a linear or exponential dampening effect. A damping coefficient *Phi* (*p*) is used to control the rate of dampening.

- **Additive Dampening**: Dampen a trend linearly.
- **Multiplicative Dampening**: Dampen the trend exponentially.

Hyperparameters:

- **Alpha**: Smoothing factor for the level.
- **Beta**: Smoothing factor for the trend.
- **Trend Type**: Additive or multiplicative.
- **Dampen Type**: Additive or multiplicative.
- **Phi**: Damping coefficient.
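How alpha, beta, and phi interact can be sketched for the additive trend case in plain Python. This is an illustration under stated assumptions rather than a reference implementation: it uses the commonly cited damped additive trend recursions, initialises the level and trend from the first two observations, and runs on hypothetical data. Setting phi to 1.0 disables the damping.

```python
# Sketch: double exponential smoothing with an additive, optionally damped trend.

def holt_forecast(series, alpha, beta, phi, horizon):
    """Return forecasts for the next `horizon` steps."""
    level = series[0]
    trend = series[1] - series[0]  # assumption: simple initial values
    for value in series[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    forecasts = []
    damp_sum = 0.0
    for h in range(1, horizon + 1):
        damp_sum += phi ** h  # phi + phi^2 + ... + phi^h
        forecasts.append(level + damp_sum * trend)
    return forecasts

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical series with a linear trend
linear = holt_forecast(data, alpha=0.5, beta=0.5, phi=1.0, horizon=3)
damped = holt_forecast(data, alpha=0.5, beta=0.5, phi=0.8, horizon=3)
print(linear)
print(damped)
```

With phi below 1.0, each additional step ahead adds a smaller increment than the last, so the forecast flattens out instead of continuing the trend at a constant rate.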

Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds support for seasonality to the univariate time series.

This method is sometimes called Holt-Winters Exponential Smoothing, named for two contributors to the method: Charles Holt and Peter Winters.

In addition to the alpha and beta smoothing factors, a new parameter is added called *gamma* (*g*) that controls the influence on the seasonal component.

As with the trend, the seasonality may be modeled as either an additive or multiplicative process for a linear or exponential change in the seasonality.

- **Additive Seasonality**: Triple Exponential Smoothing with a linear seasonality.
- **Multiplicative Seasonality**: Triple Exponential Smoothing with an exponential seasonality.

Triple exponential smoothing is the most advanced variation of exponential smoothing and through configuration, it can also develop double and single exponential smoothing models.

Being an adaptive method, Holt-Winter’s exponential smoothing allows the level, trend and seasonality patterns to change over time.

— Page 95, Practical Time Series Forecasting with R, 2016.

Additionally, to ensure that the seasonality is modeled correctly, the number of time steps in a seasonal period (*Period*) must be specified. For example, if the series was monthly data and the seasonal period repeated each year, then the Period=12.

Hyperparameters:

- **Alpha**: Smoothing factor for the level.
- **Beta**: Smoothing factor for the trend.
- **Gamma**: Smoothing factor for the seasonality.
- **Trend Type**: Additive or multiplicative.
- **Dampen Type**: Additive or multiplicative.
- **Phi**: Damping coefficient.
- **Seasonality Type**: Additive or multiplicative.
- **Period**: Time steps in seasonal period.

All of the model hyperparameters can be specified explicitly.

This can be challenging for experts and beginners alike.

Instead, it is common to use numerical optimization to search for and find the smoothing coefficients (*alpha*, *beta*, *gamma*, and *phi*) for the model that result in the lowest error.

[…] a more robust and objective way to obtain values for the unknown parameters included in any exponential smoothing method is to estimate them from the observed data. […] the unknown parameters and the initial values for any exponential smoothing method can be estimated by minimizing the SSE [sum of the squared errors].

— Page 177, Forecasting: principles and practice, 2013.
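The idea of estimating a smoothing coefficient by minimizing the SSE can be sketched for single exponential smoothing. This is a simplified illustration: a coarse grid search stands in for a proper numerical optimizer, the level is assumed to start at the first observation, and the data is hypothetical.

```python
# Sketch: choose alpha by minimising the sum of squared one-step-ahead errors.

def sse_for_alpha(series, alpha):
    """Sum of squared one-step-ahead errors for single exponential smoothing."""
    level = series[0]  # assumption: initialise the level with the first observation
    sse = 0.0
    for value in series[1:]:
        sse += (value - level) ** 2  # the current level is the forecast for this step
        level = alpha * value + (1 - alpha) * level
    return sse

series = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0]  # hypothetical observations
candidates = [i / 100 for i in range(1, 100)]       # alpha values 0.01 .. 0.99
best_alpha = min(candidates, key=lambda a: sse_for_alpha(series, a))
print('best alpha: %.2f, SSE: %.3f' % (best_alpha, sse_for_alpha(series, best_alpha)))
```

The same approach extends to beta, gamma, and phi, although the search space grows quickly, which is one reason a numerical optimizer is normally preferred over a grid.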

The parameters that specify the type of change in the trend and seasonality, such as whether they are additive or multiplicative and whether they should be dampened, must be specified explicitly.

This section looks at how to implement exponential smoothing in Python.

The implementations of Exponential Smoothing in Python are provided in the Statsmodels Python library.

The implementations are based on the description of the method in Rob Hyndman and George Athanasopoulos’ excellent book “Forecasting: Principles and Practice,” 2013 and their R implementations in their “forecast” package.

Single Exponential Smoothing or simple smoothing can be implemented in Python via the SimpleExpSmoothing Statsmodels class.

First, an instance of the *SimpleExpSmoothing* class must be instantiated and passed the training data. The *fit()* function is then called providing the fit configuration, specifically the *alpha* value called *smoothing_level*. If this is not provided or set to *None*, the model will automatically optimize the value.

This *fit()* function returns an instance of the *HoltWintersResults* class that contains the learned coefficients. The *forecast()* or the *predict()* function on the result object can be called to make a forecast.

For example:

    # single exponential smoothing
    ...
    from statsmodels.tsa.holtwinters import SimpleExpSmoothing
    # prepare data
    data = ...
    # create class
    model = SimpleExpSmoothing(data)
    # fit model
    model_fit = model.fit(...)
    # make prediction
    yhat = model_fit.predict(...)

Single, Double and Triple Exponential Smoothing can be implemented in Python using the ExponentialSmoothing Statsmodels class.

First, an instance of the ExponentialSmoothing class must be instantiated, specifying both the training data and some configuration for the model.

Specifically, you must specify the following configuration parameters:

- **trend**: The type of trend component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the trend can be disabled by setting it to *None*.
- **damped**: Whether or not the trend component should be damped, either *True* or *False*.
- **seasonal**: The type of seasonal component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the seasonal component can be disabled by setting it to *None*.
- **seasonal_periods**: The number of time steps in a seasonal period, e.g. 12 for 12 months in a yearly seasonal structure.

The model can then be fit on the training data by calling the *fit()* function.

This function allows you to either specify the smoothing coefficients of the exponential smoothing model or have them optimized. By default, they are optimized (e.g. *optimized=True*). These coefficients include:

- **smoothing_level** (*alpha*): the smoothing coefficient for the level.
- **smoothing_slope** (*beta*): the smoothing coefficient for the trend.
- **smoothing_seasonal** (*gamma*): the smoothing coefficient for the seasonal component.
- **damping_slope** (*phi*): the coefficient for the damped trend.

Additionally, the fit function can perform basic data preparation prior to modeling; specifically:

**use_boxcox**: Whether or not to perform a power transform of the series (True/False) or specify the lambda for the transform.

The *fit()* function will return an instance of the *HoltWintersResults* class that contains the learned coefficients. The *forecast()* or the *predict()* function on the result object can be called to make a forecast.

    # double or triple exponential smoothing
    ...
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    # prepare data
    data = ...
    # create class
    model = ExponentialSmoothing(data, ...)
    # fit model
    model_fit = model.fit(...)
    # make prediction
    yhat = model_fit.predict(...)

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Exponential smoothing, Forecasting: principles and practice, 2013.
- Section 6.4. Introduction to Time Series Analysis, Engineering Statistics Handbook, 2012.
- Practical Time Series Forecasting with R, 2016.

- Statsmodels Time Series analysis tsa
- statsmodels.tsa.holtwinters.SimpleExpSmoothing API
- statsmodels.tsa.holtwinters.ExponentialSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- forecast: Forecasting Functions for Time Series and Linear Models R package

In this tutorial, you discovered the exponential smoothing method for univariate time series forecasting.

Specifically, you learned:

- What exponential smoothing is and how it is different from other forecast methods.
- The three main types of exponential smoothing and how to configure them.
- How to implement exponential smoothing in Python.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Exponential Smoothing for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

The post A Gentle Introduction to SARIMA for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

Although the method can handle data with a trend, it does not support time series with a seasonal component.

An extension to ARIMA that supports the direct modeling of the seasonal component of the series is called SARIMA.

In this tutorial, you will discover the Seasonal Autoregressive Integrated Moving Average, or SARIMA, method for time series forecasting with univariate data containing trends and seasonality.

After completing this tutorial, you will know:

- The limitations of ARIMA when it comes to seasonal data.
- The SARIMA extension of ARIMA that explicitly models the seasonal element in univariate data.
- How to implement the SARIMA method in Python using the Statsmodels library.

**Kick-start your project** with my new book Time Series Forecasting With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update**: For help using and grid searching SARIMA hyperparameters, see the post How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting.

This tutorial is divided into four parts; they are:

- What’s Wrong with ARIMA
- What Is SARIMA?
- How to Configure SARIMA
- How to use SARIMA in Python

Autoregressive Integrated Moving Average, or ARIMA, is a forecasting method for univariate time series data.

As its name suggests, it supports both autoregressive and moving average elements. The integrated element refers to differencing, which allows the method to support time series data with a trend.

A problem with ARIMA is that it does not support seasonal data, that is, a time series with a repeating cycle.

ARIMA expects data that is either not seasonal or has the seasonal component removed, e.g. seasonally adjusted via methods such as seasonal differencing.
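As a rough illustration of what seasonal differencing does, the sketch below subtracts from each observation the value one seasonal period earlier. The helper name `seasonal_difference` and the toy series are made up for this example and are not part of Statsmodels.

```python
# Seasonal differencing sketch (hypothetical helper, not a Statsmodels function)
def seasonal_difference(series, m):
    """Return series[t] - series[t - m] for every t that has a value m steps back."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

# toy series: a repeating 4-step pattern that rises by 1 on each cycle
data = [10, 20, 30, 40, 11, 21, 31, 41]
adjusted = seasonal_difference(data, m=4)
print(adjusted)
```

With the repeating pattern subtracted out, only the change from one cycle to the next remains, which is the kind of series ARIMA expects.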

For more on ARIMA, see the post How to Create an ARIMA Model for Time Series Forecasting with Python.

An alternative is to use SARIMA.

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.

It adds three new hyperparameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.

A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA […] The seasonal part of the model consists of terms that are very similar to the non-seasonal components of the model, but they involve backshifts of the seasonal period.

— Page 242, Forecasting: principles and practice, 2013.

Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

There are three trend elements that require configuration.

They are the same as the ARIMA model; specifically:

- **p**: Trend autoregression order.
- **d**: Trend difference order.
- **q**: Trend moving average order.

There are four seasonal elements that are not part of ARIMA that must be configured; they are:

- **P**: Seasonal autoregressive order.
- **D**: Seasonal difference order.
- **Q**: Seasonal moving average order.
- **m**: The number of time steps for a single seasonal period.

Together, the notation for a SARIMA model is specified as:

SARIMA(p,d,q)(P,D,Q)m

Where the specifically chosen hyperparameters for a model are specified; for example:

SARIMA(3,1,0)(1,1,0)12

Importantly, the *m* parameter influences the *P*, *D*, and *Q* parameters. For example, an m of 12 for monthly data suggests a yearly seasonal cycle.

A *P*=1 would make use of the first seasonally offset observation in the model, e.g. t-(m*1) or t-12. A *P*=2 would use the last two seasonally offset observations, t-(m * 1) and t-(m * 2).

Similarly, a *D* of 1 would calculate a first-order seasonal difference and a *Q*=1 would use a first-order seasonal error term in the model (e.g. moving average).
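To make the offsets concrete, here is a small sketch that picks out the observations a seasonal autoregressive term of order *P* would look at when predicting the next time step. The helper `seasonal_ar_inputs` is hypothetical and only illustrates the indexing; it does not fit a model.

```python
# Which past observations a seasonal AR term of order P refers to (illustrative only)
def seasonal_ar_inputs(series, P, m):
    """Return the observations at t-(m*1), ..., t-(m*P), where t is the next time step."""
    return [series[-m * k] for k in range(1, P + 1)]

# toy monthly series where each value equals its own time index (0..35)
data = list(range(36))
one_season = seasonal_ar_inputs(data, P=1, m=12)
two_seasons = seasonal_ar_inputs(data, P=2, m=12)
print(one_season)
print(two_seasons)
```

Because each toy value equals its time index, the printed numbers are the time steps being referenced: t-12 for *P*=1, and t-12 and t-24 for *P*=2.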

A seasonal ARIMA model uses differencing at a lag equal to the number of seasons (s) to remove additive seasonal effects. As with lag 1 differencing to remove a trend, the lag s differencing introduces a moving average term. The seasonal ARIMA model includes autoregressive and moving average terms at lag s.

— Page 142, Introductory Time Series with R, 2009.

The trend elements can be chosen through careful analysis of ACF and PACF plots looking at the correlations of recent time steps (e.g. 1, 2, 3).

Similarly, ACF and PACF plots can be analyzed to specify values for the seasonal model by looking at correlation at seasonal lag time steps.
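The idea of looking at correlation at seasonal lags can be sketched without any plotting. The function below computes the sample autocorrelation at a single lag using only the standard library; it is a simplified stand-in for what an ACF plot displays, and the function name and toy series are assumptions made for this example.

```python
# Sample autocorrelation at one lag (simplified, standard library only)
def autocorrelation(series, lag):
    n = len(series)
    mean = sum(series) / n
    # total variation around the mean
    denominator = sum((value - mean) ** 2 for value in series)
    # co-variation between the series and itself shifted by `lag` steps
    numerator = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numerator / denominator

# toy monthly series with a pattern that repeats every 12 steps
data = [step % 12 for step in range(48)]
acf_half_season = autocorrelation(data, 6)
acf_full_season = autocorrelation(data, 12)
print(acf_half_season, acf_full_season)
```

On this contrived series the correlation is strongly positive at lag 12 and negative at lag 6, which is the kind of pattern that would point to a seasonal period of m=12.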

For more on interpreting ACF/PACF plots, see the post A Gentle Introduction to Autocorrelation and Partial Autocorrelation.

Seasonal ARIMA models can potentially have a large number of parameters and combinations of terms. Therefore, it is appropriate to try out a wide range of models when fitting to data and choose a best fitting model using an appropriate criterion …

— Pages 143-144, Introductory Time Series with R, 2009.

Alternately, a grid search can be used across the trend and seasonal hyperparameters.
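A grid search starts by enumerating candidate configurations. The sketch below builds every combination of trend and seasonal hyperparameters as a pair of tuples shaped like the (p, d, q) and (P, D, Q, m) notation above. The function name and the candidate values are assumptions for illustration; fitting and scoring each candidate is left out.

```python
# Enumerate candidate SARIMA configurations (scoring each one is not shown)
from itertools import product

def sarima_configs(p_values, d_values, q_values, P_values, D_values, Q_values, m_values):
    configs = []
    for p, d, q, P, D, Q, m in product(p_values, d_values, q_values,
                                       P_values, D_values, Q_values, m_values):
        # one (trend order, seasonal order) pair per candidate model
        configs.append(((p, d, q), (P, D, Q, m)))
    return configs

# small example grid: each order is either 0 or 1, with a fixed period of 12
configs = sarima_configs([0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [12])
print(len(configs))
print(configs[0])
```

Even this small grid yields 64 candidates, so the search space grows quickly as more values are added for each hyperparameter.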

For more on grid searching SARIMA parameters, see the post How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting.

The SARIMA time series forecasting method is supported in Python via the Statsmodels library.

There are three steps to using SARIMA; they are:

- Define the model.
- Fit the defined model.
- Make a prediction with the fit model.

Let’s look at each step in turn.

An instance of the SARIMAX class can be created by providing the training data and a host of model configuration parameters.

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # define model
    model = SARIMAX(data, ...)

The implementation is called SARIMAX instead of SARIMA because the “X” addition to the method name means that the implementation also supports exogenous variables.

These are parallel time series variates that are not modeled directly via AR, I, or MA processes, but are made available as a weighted input to the model.

Exogenous variables are optional and can be specified via the “*exog*” argument.

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # specify additional data
    other_data = ...
    # define model
    model = SARIMAX(data, exog=other_data, ...)

The trend and seasonal hyperparameters are specified as 3 and 4 element tuples respectively to the “*order*” and “*seasonal_order*” arguments.

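Both arguments should be set explicitly. If they are omitted, Statsmodels falls back to its defaults of order=(1, 0, 0) and seasonal_order=(0, 0, 0, 0), which describe a non-seasonal AR(1) model.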

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # define model configuration
    my_order = (1, 1, 1)
    my_seasonal_order = (1, 1, 1, 12)
    # define model
    model = SARIMAX(data, order=my_order, seasonal_order=my_seasonal_order, ...)

These are the main configuration elements.

There are other fine-tuning parameters you may want to configure. Learn more in the statsmodels.tsa.statespace.sarimax.SARIMAX API documentation.

Once the model is created, it can be fit on the training data.

The model is fit by calling the fit() function.

Fitting the model returns an instance of the *SARIMAXResults* class. This object contains the details of the fit, such as the data and coefficients, as well as functions that can be used to make use of the model.

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # define model
    model = SARIMAX(data, order=..., seasonal_order=...)
    # fit model
    model_fit = model.fit()

Many elements of the fitting process can be configured, and it is worth reading the API to review these options once you are comfortable with the implementation.

Once fit, the model can be used to make a forecast.

A forecast can be made by calling the *forecast()* or the *predict()* functions on the *SARIMAXResults* object returned from calling fit.

The forecast() function takes a single parameter that specifies the number of out of sample time steps to forecast, or assumes a one step forecast if no arguments are provided.

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # define model
    model = SARIMAX(data, order=..., seasonal_order=...)
    # fit model
    model_fit = model.fit()
    # one step forecast
    yhat = model_fit.forecast()

The *predict()* function requires a start and end date or index to be specified.

Additionally, if exogenous variables were provided when defining the model, they too must be provided for the forecast period to the *predict()* function.

    from statsmodels.tsa.statespace.sarimax import SARIMAX
    # specify training data
    data = ...
    # define model
    model = SARIMAX(data, order=..., seasonal_order=...)
    # fit model
    model_fit = model.fit()
    # one step forecast
    yhat = model_fit.predict(start=len(data), end=len(data))

This section provides more resources on the topic if you are looking to go deeper.

- How to Grid Search SARIMA Model Hyperparameters for Time Series Forecasting
- How to Create an ARIMA Model for Time Series Forecasting with Python
- How to Grid Search ARIMA Model Hyperparameters with Python
- A Gentle Introduction to Autocorrelation and Partial Autocorrelation

- Chapter 8 ARIMA models, Forecasting: principles and practice, 2013.
- Chapter 7, Non-stationary Models, Introductory Time Series with R, 2009.

- Statsmodels Time Series Analysis by State Space Methods
- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Statsmodels SARIMAX Notebook

In this tutorial, you discovered the Seasonal Autoregressive Integrated Moving Average, or SARIMA, method for time series forecasting with univariate data containing trends and seasonality.

Specifically, you learned:

- The limitations of ARIMA when it comes to seasonal data.
- The SARIMA extension of ARIMA that explicitly models the seasonal element in univariate data.
- How to implement the SARIMA method in Python using the Statsmodels library.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 11 Classical Time Series Forecasting Methods in Python (Cheat Sheet) appeared first on Machine Learning Mastery.

Before exploring machine learning methods for time series, it is a good idea to ensure you have exhausted classical linear time series forecasting methods. Classical time series forecasting methods may be focused on linear relationships; nevertheless, they are sophisticated and perform well on a wide range of problems, assuming that your data is suitably prepared and the method is well configured.

In this post, you will discover a suite of **classical methods for time series forecasting** that you can test on your forecasting problem prior to exploring machine learning methods.

The post is structured as a cheat sheet to give you just enough information on each method to get started with a working code example and where to look to get more information on the method.

All code examples are in Python and use the Statsmodels library. The APIs for this library can be tricky for beginners (trust me!), so having a working code example as a starting point will greatly accelerate your progress.

This is a large post; you may want to bookmark it.

**Kick-start your project** with my new book Time Series Forecasting With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

- **Updated Apr/2020**: Changed AR to AutoReg due to API change.
- **Updated Dec/2020**: Updated ARIMA API to the latest version of statsmodels.

This cheat sheet demonstrates 11 different classical time series forecasting methods; they are:

- Autoregression (AR)
- Moving Average (MA)
- Autoregressive Moving Average (ARMA)
- Autoregressive Integrated Moving Average (ARIMA)
- Seasonal Autoregressive Integrated Moving-Average (SARIMA)
- Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)
- Vector Autoregression (VAR)
- Vector Autoregression Moving-Average (VARMA)
- Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)
- Simple Exponential Smoothing (SES)
- Holt-Winters Exponential Smoothing (HWES)

*Did I miss your favorite classical time series forecasting method?*

Let me know in the comments below.

Each method is presented in a consistent manner.

This includes:

- **Description**. A short and precise description of the technique.
- **Python Code**. A short working example of fitting the model and making a prediction in Python.
- **More Information**. References for the API and the algorithm.

Each code example is demonstrated on a simple contrived dataset that may or may not be appropriate for the method. Replace the contrived dataset with your data in order to test the method.

Remember: each method will require tuning to your specific problem. In many cases, I have examples of how to configure and even grid search parameters on the blog already; try the search function.

If you find this cheat sheet useful, please let me know in the comments below.

The autoregression (AR) method models the next step in the sequence as a linear function of the observations at prior time steps.

The notation for the model involves specifying the order of the model p as a parameter to the AR function, e.g. AR(p). For example, AR(1) is a first-order autoregression model.

The method is suitable for univariate time series without trend and seasonal components.

    # AR example
    from statsmodels.tsa.ar_model import AutoReg
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = AutoReg(data, lags=1)
    model_fit = model.fit()
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)
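To show what "a linear function of the observations at prior time steps" means in practice, the sketch below fits an AR(1) model by ordinary least squares using only the standard library. It is a hand-rolled illustration with hypothetical function names and is not how AutoReg is implemented.

```python
# Hand-rolled AR(1): regress each observation on the one before it
def fit_ar1(series):
    x = series[:-1]   # observations at t-1
    y = series[1:]    # observations at t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

def forecast_ar1(series, intercept, slope):
    # next value = intercept + slope * most recent observation
    return intercept + slope * series[-1]

# toy series where each value is exactly twice the previous one
data = [1.0, 2.0, 4.0, 8.0, 16.0]
intercept, slope = fit_ar1(data)
yhat = forecast_ar1(data, intercept, slope)
print(intercept, slope, yhat)
```

On this toy series the fitted slope is 2 and the intercept is 0, so the one-step forecast simply doubles the last observation.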

- statsmodels.tsa.ar_model.AutoReg API
- statsmodels.tsa.ar_model.AutoRegResults API
- Autoregressive model on Wikipedia

The moving average (MA) method models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.

A moving average model is different from calculating the moving average of the time series.
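The difference can be sketched in a few lines. A rolling mean averages recent observations, whereas an MA(1) model forecasts from the previous residual error. The parameter values `mu` and `theta` below are assumed rather than estimated, the starting error is assumed to be zero, and both function names are hypothetical.

```python
# Rolling mean of the series: a smoothing calculation on the observations
def rolling_mean(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# MA(1) model: y[t] = mu + e[t] + theta * e[t-1], forecast built from residual errors
def ma1_forecast(series, mu, theta):
    error = 0.0  # assumed starting error
    for y in series:
        # residual = observation minus what the model expected at this step
        error = y - (mu + theta * error)
    # one-step forecast uses only the mean and the latest residual
    return mu + theta * error

data = [1.0, 2.0, 3.0]
smoothed = rolling_mean(data, window=2)
yhat = ma1_forecast(data, mu=2.0, theta=0.5)
print(smoothed, yhat)
```

The rolling mean returns a smoothed copy of the series, while the MA(1) sketch returns a single forecast driven by how far the last observation was from the model's expectation.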

The notation for the model involves specifying the order of the model q as a parameter to the MA function, e.g. MA(q). For example, MA(1) is a first-order moving average model.

The method is suitable for univariate time series without trend and seasonal components.

We can use the ARIMA class to create an MA model by setting a zeroth-order AR component. We must specify the order of the MA model in the order argument.

    # MA example
    from statsmodels.tsa.arima.model import ARIMA
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = ARIMA(data, order=(0, 0, 1))
    model_fit = model.fit()
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)

The Autoregressive Moving Average (ARMA) method models the next step in the sequence as a linear function of the observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to an ARMA function, e.g. ARMA(p, q). An ARMA model can be used to develop AR or MA models.

The method is suitable for univariate time series without trend and seasonal components.

    # ARMA example
    from statsmodels.tsa.arima.model import ARIMA
    from random import random
    # contrived dataset
    data = [random() for x in range(1, 100)]
    # fit model
    model = ARIMA(data, order=(2, 0, 1))
    model_fit = model.fit()
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)

The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

It combines both Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence to make the sequence stationary, called integration (I).
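The differencing step can be sketched by hand. The helpers below apply a difference of order *d* and reverse a first-order difference to bring a predicted change back to the original scale. They are illustrative stand-ins with made-up names; the ARIMA class performs the equivalent steps internally.

```python
# Differencing of order d, plus inverting a first-order difference
def difference(series, d=1):
    for _ in range(d):
        # each pass replaces the series with its step-to-step changes
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
    return series

def invert_first_difference(last_observation, predicted_change):
    # a forecast on the differenced scale is added back to the last known value
    return last_observation + predicted_change

# toy series with a quadratic trend: 0, 1, 4, 9, 16, 25
data = [x ** 2 for x in range(6)]
first = difference(data, d=1)
second = difference(data, d=2)
restored = invert_first_difference(data[-1], 11)
print(first, second, restored)
```

One pass leaves a linear trend in the differences and a second pass leaves a constant, which is why *d* is chosen as the number of passes needed to make the series look stationary.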

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function, e.g. ARIMA(p, d, q). An ARIMA model can also be used to develop AR, MA, and ARMA models.

The method is suitable for univariate time series with trend and without seasonal components.

    # ARIMA example
    from statsmodels.tsa.arima.model import ARIMA
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = ARIMA(data, order=(1, 1, 1))
    model_fit = model.fit()
    # make prediction (this ARIMA class predicts on the original scale, so no typ argument is needed)
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)

The Seasonal Autoregressive Integrated Moving Average (SARIMA) method models the next step in the sequence as a linear function of the differenced observations, errors, differenced seasonal observations, and seasonal errors at prior time steps.

It combines the ARIMA model with the ability to perform the same autoregression, differencing, and moving average modeling at the seasonal level.

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function and AR(P), I(D), MA(Q) and m parameters at the seasonal level, e.g. SARIMA(p, d, q)(P, D, Q)m where “m” is the number of time steps in each season (the seasonal period). A SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.

The method is suitable for univariate time series with trend and/or seasonal components.

    # SARIMA example
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = SARIMAX(data, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
    model_fit = model.fit(disp=False)
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)

- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Autoregressive integrated moving average on Wikipedia

The Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of the SARIMA model that also includes the modeling of exogenous variables.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series may be referred to as endogenous data to contrast it with the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The SARIMAX method can also be used to model the subsumed models with exogenous variables, such as ARX, MAX, ARMAX, and ARIMAX.

The method is suitable for univariate time series with trend and/or seasonal components and exogenous variables.

    # SARIMAX example
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    from random import random
    # contrived dataset
    data1 = [x + random() for x in range(1, 100)]
    data2 = [x + random() for x in range(101, 200)]
    # fit model
    model = SARIMAX(data1, exog=data2, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0))
    model_fit = model.fit(disp=False)
    # make prediction
    exog2 = [200 + random()]
    yhat = model_fit.predict(len(data1), len(data1), exog=[exog2])
    print(yhat)

- statsmodels.tsa.statespace.sarimax.SARIMAX API
- statsmodels.tsa.statespace.sarimax.SARIMAXResults API
- Autoregressive integrated moving average on Wikipedia

The Vector Autoregression (VAR) method models the next step in each time series using an AR model. It is the generalization of AR to multiple parallel time series, e.g. multivariate time series.

The notation for the model involves specifying the order for the AR(p) model as parameters to a VAR function, e.g. VAR(p).

The method is suitable for multivariate time series without trend and seasonal components.

    # VAR example
    from statsmodels.tsa.vector_ar.var_model import VAR
    from random import random
    # contrived dataset with dependency
    data = list()
    for i in range(100):
        v1 = i + random()
        v2 = v1 + random()
        row = [v1, v2]
        data.append(row)
    # fit model
    model = VAR(data)
    model_fit = model.fit()
    # make prediction (model_fit.endog holds the training data; model_fit.y is an older alias)
    yhat = model_fit.forecast(model_fit.endog, steps=1)
    print(yhat)

- statsmodels.tsa.vector_ar.var_model.VAR API
- statsmodels.tsa.vector_ar.var_model.VARResults API
- Vector autoregression on Wikipedia

The Vector Autoregression Moving-Average (VARMA) method models the next step in each time series using an ARMA model. It is the generalization of ARMA to multiple parallel time series, e.g. multivariate time series.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to a VARMA function, e.g. VARMA(p, q). A VARMA model can also be used to develop VAR or VMA models.

The method is suitable for multivariate time series without trend and seasonal components.

    # VARMA example
    from statsmodels.tsa.statespace.varmax import VARMAX
    from random import random
    # contrived dataset with dependency
    data = list()
    for i in range(100):
        v1 = random()
        v2 = v1 + random()
        row = [v1, v2]
        data.append(row)
    # fit model
    model = VARMAX(data, order=(1, 1))
    model_fit = model.fit(disp=False)
    # make prediction
    yhat = model_fit.forecast()
    print(yhat)

- statsmodels.tsa.statespace.varmax.VARMAX API
- statsmodels.tsa.statespace.varmax.VARMAXResults
- Vector autoregression on Wikipedia

The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) is an extension of the VARMA model that also includes the modeling of exogenous variables. It is a multivariate version of the ARMAX method.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series are referred to as endogenous data to contrast them with the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The VARMAX method can also be used to model the subsumed models with exogenous variables, such as VARX and VMAX.

The method is suitable for multivariate time series without trend and seasonal components with exogenous variables.

    # VARMAX example
    from statsmodels.tsa.statespace.varmax import VARMAX
    from random import random
    # contrived dataset with dependency
    data = list()
    for i in range(100):
        v1 = random()
        v2 = v1 + random()
        row = [v1, v2]
        data.append(row)
    data_exog = [x + random() for x in range(100)]
    # fit model
    model = VARMAX(data, exog=data_exog, order=(1, 1))
    model_fit = model.fit(disp=False)
    # make prediction
    data_exog2 = [[100]]
    yhat = model_fit.forecast(exog=data_exog2)
    print(yhat)

- statsmodels.tsa.statespace.varmax.VARMAX API
- statsmodels.tsa.statespace.varmax.VARMAXResults
- Vector autoregression on Wikipedia

The Simple Exponential Smoothing (SES) method models the next time step as an exponentially weighted linear function of observations at prior time steps.

The method is suitable for univariate time series without trend and seasonal components.

    # SES example
    from statsmodels.tsa.holtwinters import SimpleExpSmoothing
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = SimpleExpSmoothing(data)
    model_fit = model.fit()
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)
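The "exponentially weighted" part of the method can be seen in a hand-written version of the recursion. In this sketch the smoothing factor `alpha` is supplied by the caller and the level is initialized to the first observation; both are simplifying assumptions, and the function name is hypothetical.

```python
# Simple exponential smoothing recursion (alpha supplied, not estimated)
def simple_exp_smoothing(series, alpha):
    level = series[0]  # assumed initialization: start from the first observation
    for y in series[1:]:
        # blend the newest observation with the running level
        level = alpha * y + (1 - alpha) * level
    # the final level is the one-step-ahead forecast
    return level

data = [10.0, 12.0, 14.0]
yhat = simple_exp_smoothing(data, alpha=0.5)
print(yhat)
```

Each pass through the loop shrinks the influence of older observations by a factor of (1 - alpha), so recent values dominate the forecast.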

- statsmodels.tsa.holtwinters.SimpleExpSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- Exponential smoothing on Wikipedia

The Holt-Winters Exponential Smoothing (HWES) method, also called the Triple Exponential Smoothing method, models the next time step as an exponentially weighted linear function of observations at prior time steps, taking trends and seasonality into account.

The method is suitable for univariate time series with trend and/or seasonal components.

    # HWES example
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from random import random
    # contrived dataset
    data = [x + random() for x in range(1, 100)]
    # fit model
    model = ExponentialSmoothing(data)
    model_fit = model.fit()
    # make prediction
    yhat = model_fit.predict(len(data), len(data))
    print(yhat)

- statsmodels.tsa.holtwinters.ExponentialSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- Exponential smoothing on Wikipedia

This section provides more resources on the topic if you are looking to go deeper.

In this post, you discovered a suite of classical time series forecasting methods that you can test and tune on your time series dataset.

Did I miss your favorite classical time series forecasting method?

Let me know in the comments below.

Did you try any of these methods on your dataset?

Let me know about your findings in the comments.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem appeared first on Machine Learning Mastery.

In this post, you will discover a standardized yet complex time series forecasting problem that is multivariate, multi-step, and multi-site, but is small and sufficiently well understood that it can be used to explore and better understand methods for developing forecasting models on challenging datasets.

After reading this post, you will know:

- The competition and motivation for addressing the air-quality dataset.
- An overview of the defined prediction problem and the data challenges it covers.
- A description of the free data files that you can download and start working with immediately.

**Kick-start your project** with my new book Time Series Forecasting With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

The dataset was used as the center of a Kaggle competition.

Specifically, it was a 24-hour hackathon hosted by Data Science London and Data Science Global as part of a Big Data Week event. Neither organization seems to exist now, 6 years later.

The competition involved a multi-thousand-dollar cash prize, and the dataset was provided by the Cook County, Illinois local government, suggesting all locations mentioned in the dataset are in that locality.

The motivation for the challenge is to develop a better model for predicting air quality, taken from the competition description:

The EPA’s Air Quality Index is used daily by people suffering from asthma and other respiratory diseases to avoid dangerous levels of outdoor air pollutants, which can trigger attacks. According to the World Health Organisation there are now estimated to be 235 million people suffering from asthma. Globally, it is now the most common chronic disease among children, with incidence in the US doubling since 1980.

The competition description suggests that winning models could be used as the basis for a new air-quality prediction system, although it is not clear if any models were ever transitioned for this purpose.

The competition was won by a Kaggle employee, Ben Hamner, who presumably did not collect the prize given the conflict of interest. Ben described his winning approach in the blog post titled “Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon” and provided his code on GitHub.

There is also a good discussion of solutions and related code in this forum post titled “General approaches to partitioning the models?”.

The data describes a multi-step forecasting problem given a multivariate time series across multiple sites or physical locations.

Given multiple weather measurements over time, predict a sequence of air quality measurements at specific future time intervals across multiple physical locations.

It is a challenging time series forecasting problem that has a lot of the qualities of real-world forecasting:

- **Incomplete data**. Not all weather and air quality measures are available for all locations.
- **Missing data**. Not all available measures have a complete history.
- **Multivariate inputs**. The model inputs for each forecast are comprised of multiple weather observations.
- **Multi-step outputs**. The model outputs are a discontiguous sequence of forecasted air quality measures.
- **Multi-site outputs**. The model must output a multi-step forecast for multiple physical sites.

The dataset is available for free from the Kaggle website.

You must create an account and sign-in with Kaggle before you can get access to download the dataset.

The dataset can be downloaded from here:

There are 4 files of interest that you must download separately; they are:

This file contains a list of site locations marked by unique identifiers and their precise location on Earth measured by longitude and latitude.

All coordinates appear to be relatively close in the North-Western Hemisphere, e.g. America.

Below is a sample of the file.

    "SITE_ID","LATITUDE","LONGITUDE"
    1,41.6709918952829,-87.7324568962847
    32,41.755832412403,-87.545349670582
    50,41.7075695897648,-87.5685738570845
    57,41.9128621248178,-87.7227234452095
    64,41.7907868783739,-87.6016464917605
    ...
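A file in this format can be read with the standard library. The sketch below parses the first two rows of the sample above from an in-memory string rather than from disk, so no file path is assumed; the function name `load_sites` is made up for this example.

```python
# Parse site locations into a dict keyed by site identifier
import csv
import io

SAMPLE = '''"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
32,41.755832412403,-87.545349670582
'''

def load_sites(handle):
    """Map each SITE_ID to a (latitude, longitude) tuple."""
    reader = csv.DictReader(handle)
    return {int(row["SITE_ID"]): (float(row["LATITUDE"]), float(row["LONGITUDE"]))
            for row in reader}

# io.StringIO stands in for an open file handle
sites = load_sites(io.StringIO(SAMPLE))
print(sites)
```

The same function should work on a real file by passing an open file handle in place of the `io.StringIO` object.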

This file has the same format as *SiteLocations.csv* and appears to list all of the same locations as that file with some additional locations.

As the filename suggests, it is just an updated version of the list of sites.

Below is a sample of the file.

    "SITE_ID","LATITUDE","LONGITUDE"
    1,41.6709918952829,-87.7324568962847
    14,41.834243,-87.6238
    22,41.6871654376343,-87.5393154841479
    32,41.755832412403,-87.545349670582
    50,41.7075695897648,-87.5685738570845
    ...

This file contains the training data for modeling.

The data is presented in an unnormalized manner. Each row of data contains one set of meteorological measurements for one hour across multiple locations as well as the targets or outcomes for each location for that hour.

The measures include:

- Time information, including the block of time, the index within the contiguous block of time, the average month, day of the week, and hour of the day.
- Wind measurements such as direction and speed.
- Temperature measurements such as minimum and maximum ambient temperature.
- Pressure measurements such as minimum and maximum barometric pressure.

The target variables are a collection of different air quality or pollution measures at different physical locations.

Not all locations have all weather measurements and not all locations are concerned with all target measures. Further, for those recorded variables, there are missing values marked as NA.
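One way to get a feel for the missing values is to count the NA markers per column. The sketch below does this for a truncated extract of the training file: only the first twelve columns of its first two rows are included, to keep the example short. The helper names are hypothetical.

```python
# Count NA markers per column in a truncated extract of the training data
import csv
import io

# first twelve columns of the first two rows only (truncated for illustration)
SAMPLE = '''"rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14"
1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA
2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA
'''

def parse_value(text):
    """Convert a raw field to None (missing), a float, or leave it as text."""
    if text == "NA":
        return None
    try:
        return float(text)
    except ValueError:
        return text  # e.g. the weekday name

def missing_counts(handle):
    counts = {}
    for row in csv.DictReader(handle):
        for name, text in row.items():
            is_missing = parse_value(text) is None
            counts[name] = counts.get(name, 0) + (1 if is_missing else 0)
    return counts

counts = missing_counts(io.StringIO(SAMPLE))
print(counts)
```

Run over the full file, a count like this would show which measures are absent for a given site and which have only occasional gaps.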

Below is a sample of the file.

"rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14","Ambient.Max.Temperature_22","Ambient.Max.Temperature_50","Ambient.Max.Temperature_52","Ambient.Max.Temperature_57","Ambient.Max.Temperature_76","Ambient.Max.Temperature_2001","Ambient.Max.Temperature_3301","Ambient.Max.Temperature_6005","Ambient.Min.Temperature_14","Ambient.Min.Temperature_22","Ambient.Min.Temperature_50","Ambient.Min.Temperature_52","Ambient.Min.Temperature_57","Ambient.Min.Temperature_76","Ambient.Min.Temperature_2001","Ambient.Min.Temperature_3301","Ambient.Min.Temperature_6005","Sample.Baro.Pressure_14","Sample.Baro.Pressure_22","Sample.Baro.Pressure_50","Sample.Baro.Pressure_52","Sample.Baro.Pressure_57","Sample.Baro.Pressure_76","Sample.Baro.Pressure_2001","Sample.Baro.Pressure_3301","Sample.Baro.Pressure_6005","Sample.Max.Baro.Pressure_14","Sample.Max.Baro.Pressure_22","Sample.Max.Baro.Pressure_50","Sample.Max.Baro.Pressure_52","Sample.Max.Baro.Pressure_57","Sample.Max.Baro.Pressure_76","Sample.Max.Baro.Pressure_2001","Sample.Max.Baro.Pressure_3301","Sample.Max.Baro.Pressure_6005","Sample.Min.Baro.Pressure_14","Sample.Min.Baro.Pressure_22","Sample.Min.Baro.Pressure_50","Sample.Min.Baro.Pressure_52","Sample.Min.Baro.Pressure_57","Sample.Min.Baro.Pressure_76","Sample.Min.Baro.Pressure_2001","Sample.Min.Baro.Pressure_3301","Sample.Min.Baro.Pressure_6005","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_60
06","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003" 1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,6.1816228132982,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.56815355612325,0.690015329704154,NA,NA,NA,NA,NA,NA,2.84349016287551,0.0920223353681394,1.69321097077376,0.368089341472558,0.184044670736279,0.368089341472558,0.276067006104418,0.892616653070952,1.74842437199465,NA,NA,5.1306307034019,1.34160578423204,2.13879182993514,3.01375212399952,NA,5.67928016629218,NA 2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.47583334194495,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.99138023331659,NA,5.56815355612325,0.923259948195698,NA,NA,NA,NA,NA,NA,3.1011527019063,0.0920223353681394,1.94167127626774,0.368089341472558,0.184044670736279,0.368089341472558,0.368089341472558,1.73922213845783,2.14412041407765,NA,NA,5.1306307034019,1.19577906855465,2.72209869264472,3.88871241806389,NA,7.42675098668978,NA 
3,1,3,10,"Saturday",23,0.01,247,227,0.5,1.5,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.92192983362627,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.7524146053186,NA,5.56815355612325,0.680296803933673,NA,NA,NA,NA,NA,NA,3.06434376775904,0.0920223353681394,2.52141198908702,0.460111676840697,0.184044670736279,0.368089341472558,0.368089341472558,1.7852333061419,1.93246904273093,NA,NA,5.13639545700122,1.40965825154816,3.11096993445111,3.88871241806389,NA,7.68373198968942,NA 4,1,4,10,"Sunday",0,0.01,219,218,0.2,1.2,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,5.09824561921501,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.6776192223642,0.612267123540305,NA,NA,NA,NA,NA,NA,3.21157950434806,0.184044670736279,2.374176252498,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.86805340797323,2.08890701285676,NA,NA,5.21710200739181,1.47771071886428,2.04157401948354,3.20818774490271,NA,4.83124285639335,NA 
5,1,5,10,"Sunday",1,0.01,2,216,0.2,0.3,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,4.87519737337435,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.31000107064725,NA,5.6776192223642,0.694874592589394,NA,NA,NA,NA,NA,NA,3.67169118118876,0.184044670736279,2.46619858786614,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.70241320431058,2.60423209091834,NA,NA,5.21710200739181,1.45826715677396,2.13879182993514,3.4998411762575,NA,4.62565805399363,NA ...

This file contains a sample of the submission for the prediction problem.

Each row specifies the prediction for each target measure across all target locations for a given hour in a chunk of contiguous time.

Below is a sample of the file.

"rowID","chunkID","position_within_chunk","hour","month_most_common","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003" 193,1,193,21,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06 194,1,194,22,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06 195,1,195,23,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06 196,1,196,0,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06 197,1,197,1,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06 ...

A large part of the challenge of this prediction problem is the vast number of ways that the problem can be framed for modeling.

This is challenging because it is not clear which framing may be the best for this specific modeling problem.

For example, below are some questions to provoke thought about how the problem could be framed.

- Is it better to impute or ignore missing observations?
- Is it better to feed in a time series of weather observations or only the observations for the current hour?
- Is it better to use weather observations from one or multiple source locations for a forecast?
- Is it better to have one model for each location or one model for all locations?
- Is it better to have one model for each forecast time or one for all forecast times?
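
As a minimal sketch of the first question, the snippet below contrasts ignoring missing observations with imputing them. The toy frame, the single `target` column, and the choice of forward-filling within each chunk are all assumptions made for illustration; they are not taken from the competition data or from any winning solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two chunks with gaps in a single target column.
df = pd.DataFrame({
    "chunkID": [1, 1, 1, 2, 2, 2],
    "target": [2.0, np.nan, 3.0, np.nan, 5.0, np.nan],
})

# Option 1: ignore missing observations by dropping those rows.
ignored = df.dropna(subset=["target"])

# Option 2: impute by carrying the last observation forward within each chunk,
# so that values never leak across chunk boundaries.
imputed = df.copy()
imputed["target"] = df.groupby("chunkID")["target"].ffill()

print(len(ignored))
print(imputed["target"].tolist())
```

Note that the first value of the second chunk stays missing, because there is no earlier observation in that chunk to carry forward. Whether that is acceptable depends on the framing chosen.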

This section provides more resources on the topic if you are looking to go deeper.

- EMC Data Science Global Hackathon (Air Quality Prediction)
- Download Dataset
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?

In this post, you discovered the Kaggle air-quality dataset that provides a standard dataset for complex time series forecasting.

Specifically, you learned:

- The competition and motivation for addressing the air-quality dataset.
- An overview of the defined prediction problem and the data challenges it covers.
- A description of the free data files that you can download and start working with immediately.

Have you worked on this dataset, or do you intend to?

Share your experiences in the comments below.

The post A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem appeared first on Machine Learning Mastery.

The post How to Convert a Time Series to a Supervised Learning Problem in Python appeared first on Machine Learning Mastery.

Before machine learning can be used, time series forecasting problems must be re-framed as supervised learning problems: from a single sequence to pairs of input and output sequences.

In this tutorial, you will discover how to transform univariate and multivariate time series forecasting problems into supervised learning problems for use with machine learning algorithms.

After completing this tutorial, you will know:

- How to develop a function to transform a time series dataset into a supervised learning dataset.
- How to transform univariate time series data for machine learning.
- How to transform multivariate time series data for machine learning.

**Kick-start your project** with my new book Time Series Forecasting With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Before we get started, let’s take a moment to better understand the form of time series and supervised learning data.

A time series is a sequence of numbers that are ordered by a time index. This can be thought of as a list or column of ordered values.

For example:

0 1 2 3 4 5 6 7 8 9

A supervised learning problem is comprised of input patterns (*X*) and output patterns (*y*), such that an algorithm can learn how to predict the output patterns from the input patterns.

For example:

X, y
1, 2
2, 3
3, 4
4, 5
5, 6
6, 7
7, 8
8, 9

For more on this topic, see the post:

A key function to help transform time series data into a supervised learning problem is the Pandas shift() function.

Given a DataFrame, the *shift()* function can be used to create copies of columns that are pushed forward (rows of NaN values added to the front) or pulled back (rows of NaN values added to the end).

This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.

Let’s look at some examples of the shift function in action.

We can define a mock time series dataset as a sequence of 10 numbers, in this case a single column in a DataFrame as follows:

from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
print(df)

Running the example prints the time series data with the row indices for each observation.

   t
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9

We can shift all the observations down by one time step by inserting one new row at the top. Because the new row has no data, we can use NaN to represent “no data”.

The shift function can do this for us and we can insert this shifted column next to our original series.

from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t-1'] = df['t'].shift(1)
print(df)

Running the example gives us two columns in the dataset. The first with the original observations and a new shifted column.

We can see that shifting the series forward one time step gives us a primitive supervised learning problem, although with *X* and *y* in the wrong order. Ignore the column of row labels. The first row would have to be discarded because of the NaN value. The second row shows the input value of 0.0 in the second column (input or *X*) and the value of 1 in the first column (output or *y*).

   t  t-1
0  0  NaN
1  1  0.0
2  2  1.0
3  3  2.0
4  4  3.0
5  5  4.0
6  6  5.0
7  7  6.0
8  8  7.0
9  9  8.0

We can see that by repeating this process with shifts of 2, 3, and more, we could create long input sequences (*X*) that can be used to forecast an output value (*y*).
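
A minimal sketch of that idea is given below: it repeats the shift for lags 3, 2, and 1 on the same mock series. The variable names `n_lags` and `framed` are chosen here for illustration only.

```python
from pandas import DataFrame

df = DataFrame({"t": list(range(10))})

# Add one lag column per shift, oldest lag first.
n_lags = 3
for lag in range(n_lags, 0, -1):
    df["t-%d" % lag] = df["t"].shift(lag)

# Rows without a full input window contain NaN and are dropped.
framed = df.dropna()

# Order the columns as inputs (oldest first) followed by the output.
columns = ["t-%d" % lag for lag in range(n_lags, 0, -1)] + ["t"]
framed = framed[columns]
print(framed)
```

The first three rows are lost because they do not have three prior observations to use as input.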

The shift operator can also accept a negative integer value. This has the effect of pulling the observations up by inserting new rows at the end. Below is an example:

from pandas import DataFrame
df = DataFrame()
df['t'] = [x for x in range(10)]
df['t+1'] = df['t'].shift(-1)
print(df)

Running the example shows a new column with a NaN value as the last value.

We can see that the original column (*t*) can be taken as an input (*X*) and the new shifted column (*t+1*) as an output value (*y*). That is, the input value of 0 can be used to forecast the output value of 1.

   t  t+1
0  0  1.0
1  1  2.0
2  2  3.0
3  3  4.0
4  4  5.0
5  5  6.0
6  6  7.0
7  7  8.0
8  8  9.0
9  9  NaN

Technically, in time series forecasting terminology the current time (*t*) and future times (*t+1*, *t+n*) are forecast times and past observations (*t-1*, *t-n*) are used to make forecasts.

We can see how positive and negative shifts can be used to create a new DataFrame from a time series with sequences of input and output patterns for a supervised learning problem.

This permits not only classical *X -> y* prediction, but also *X -> Y* where both input and output can be sequences.

Further, the shift function also works on so-called multivariate time series problems. That is where instead of having one set of observations for a time series, we have multiple (e.g. temperature and pressure). All variates in the time series can be shifted forward or backward to create multivariate input and output sequences. We will explore this more later in the tutorial.
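
As a small illustration of the multivariate case, the sketch below shifts a whole DataFrame at once. The temperature and pressure readings are made-up values, and the column suffixes are a naming choice for this example only.

```python
from pandas import DataFrame, concat

# Hypothetical readings for two variables observed at the same time steps.
raw = DataFrame({
    "temperature": [20.0, 21.0, 22.0, 23.0],
    "pressure": [1010.0, 1011.0, 1012.0, 1013.0],
})

# Shifting the whole frame moves every variable by one time step.
lagged = raw.shift(1).add_suffix("(t-1)")
current = raw.add_suffix("(t)")

# Place the lagged inputs next to the current values and drop the NaN row.
framed = concat([lagged, current], axis=1).dropna()
print(framed)
```

Each remaining row pairs the previous observation of both variables with the current observation of both variables.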

We can use the *shift()* function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better performing models.

In this section, we will define a new Python function named *series_to_supervised()* that takes a univariate or multivariate time series and frames it as a supervised learning dataset.

The function takes four arguments:

- **data**: Sequence of observations as a list or 2D NumPy array. Required.
- **n_in**: Number of lag observations as input (*X*). Values may be between [1..len(data)]. Optional. Defaults to 1.
- **n_out**: Number of observations as output (*y*). Values may be between [0..len(data)-1]. Optional. Defaults to 1.
- **dropnan**: Boolean whether or not to drop rows with NaN values. Optional. Defaults to True.

The function returns a single value:

**return**: Pandas DataFrame of series framed for supervised learning.

The new dataset is constructed as a DataFrame, with each column suitably named both by variable number and time step. This allows you to design a variety of different time step sequence type forecasting problems from a given univariate or multivariate time series.

Once the DataFrame is returned, you can decide how to split the rows of the returned DataFrame into *X* and *y* components for supervised learning any way you wish.
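
One common way to make that split, sketched here under the assumption that only the last column is the output, is to slice the underlying array. The frame is built by hand with *shift()* so that the snippet is self-contained; it mirrors the column layout described above for two lag inputs and one output.

```python
from pandas import DataFrame

# A small frame with two lag inputs and one output, built by hand.
series = DataFrame({"var1(t)": list(range(10))})
framed = DataFrame({
    "var1(t-2)": series["var1(t)"].shift(2),
    "var1(t-1)": series["var1(t)"].shift(1),
    "var1(t)": series["var1(t)"],
}).dropna()

# Every column except the last is input (X); the last column is output (y).
values = framed.values
X, y = values[:, :-1], values[:, -1]
print(X.shape, y.shape)
```

For a multi-step or multivariate framing, the slice boundaries would change to match however many trailing columns are treated as outputs.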

The function is defined with default parameters so that if you call it with just your data, it will construct a DataFrame with *t-1* as *X* and *t* as *y*.

The function is confirmed to be compatible with Python 2 and Python 3.

The complete function is listed below, including function comments.

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

Can you see obvious ways to make the function more robust or more readable?

Please let me know in the comments below.

Now that we have the whole function, we can explore how it may be used.

It is standard practice in time series forecasting to use lagged observations (e.g. t-1) as input variables to forecast the current time step (t).

This is called one-step forecasting.

The example below demonstrates a one lag time step (t-1) to predict the current time step (t).

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values)
print(data)

Running the example prints the output of the reframed time series.

   var1(t-1)  var1(t)
1        0.0        1
2        1.0        2
3        2.0        3
4        3.0        4
5        4.0        5
6        5.0        6
7        6.0        7
8        7.0        8
9        8.0        9

We can see that the observations are named “*var1*” and that the input observation is suitably named (t-1) and the output time step is named (t).

We can also see that rows with NaN values have been automatically removed from the DataFrame.

We can repeat this example with an input sequence of arbitrary length, such as 3. This can be done by specifying the length of the input sequence as an argument; for example:

data = series_to_supervised(values, 3)

The complete example is listed below.

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values, 3)
print(data)

Again, running the example prints the reframed series. We can see that the input sequence is in the correct left-to-right order with the output variable to be predicted on the far right.

   var1(t-3)  var1(t-2)  var1(t-1)  var1(t)
3        0.0        1.0        2.0        3
4        1.0        2.0        3.0        4
5        2.0        3.0        4.0        5
6        3.0        4.0        5.0        6
7        4.0        5.0        6.0        7
8        5.0        6.0        7.0        8
9        6.0        7.0        8.0        9

A different type of forecasting problem is using past observations to forecast a sequence of future observations.

This may be called sequence forecasting or multi-step forecasting.

We can frame a time series for sequence forecasting by specifying another argument. For example, we could frame a forecast problem with an input sequence of 2 past observations to forecast 2 future observations as follows:

data = series_to_supervised(values, 2, 2)

The complete example is listed below:

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

values = [x for x in range(10)]
data = series_to_supervised(values, 2, 2)
print(data)

Running the example shows the differentiation of input (t-n) and output (t+n) variables with the current observation (t) considered an output.

   var1(t-2)  var1(t-1)  var1(t)  var1(t+1)
2        0.0        1.0        2        3.0
3        1.0        2.0        3        4.0
4        2.0        3.0        4        5.0
5        3.0        4.0        5        6.0
6        4.0        5.0        6        7.0
7        5.0        6.0        7        8.0
8        6.0        7.0        8        9.0

Another important type of time series is called multivariate time series.

This is where we may have observations of multiple different measures and an interest in forecasting one or more of them.

For example, we may have two sets of time series observations obs1 and obs2 and we wish to forecast one or both of these.

We can call *series_to_supervised()* in exactly the same way.

For example:

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values)
print(data)

Running the example prints the new framing of the data, showing an input pattern with one time step for both variables and an output pattern of one time step for both variables.

Again, depending on the specifics of the problem, the division of columns into *X* and *Y* components can be chosen arbitrarily, such as if the current observation of *var1* was also provided as input and only *var2* was to be predicted.

   var1(t-1)  var2(t-1)  var1(t)  var2(t)
1        0.0       50.0        1       51
2        1.0       51.0        2       52
3        2.0       52.0        3       53
4        3.0       53.0        4       54
5        4.0       54.0        5       55
6        5.0       55.0        6       56
7        6.0       56.0        7       57
8        7.0       57.0        8       58
9        8.0       58.0        9       59

You can see how this may be easily used for sequence forecasting with multivariate time series by specifying the length of the input and output sequences as above.

For example, below is an example of a reframing with 1 time step as input and 2 time steps as forecast sequence.

from pandas import DataFrame
from pandas import concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or NumPy array.
        n_in: Number of lag observations as input (X).
        n_out: Number of observations as output (y).
        dropnan: Boolean whether or not to drop rows with NaN values.
    Returns:
        Pandas DataFrame of series framed for supervised learning.
    """
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

raw = DataFrame()
raw['ob1'] = [x for x in range(10)]
raw['ob2'] = [x for x in range(50, 60)]
values = raw.values
data = series_to_supervised(values, 1, 2)
print(data)

Running the example shows the large reframed DataFrame.

   var1(t-1)  var2(t-1)  var1(t)  var2(t)  var1(t+1)  var2(t+1)
1        0.0       50.0        1       51        2.0       52.0
2        1.0       51.0        2       52        3.0       53.0
3        2.0       52.0        3       53        4.0       54.0
4        3.0       53.0        4       54        5.0       55.0
5        4.0       54.0        5       55        6.0       56.0
6        5.0       55.0        6       56        7.0       57.0
7        6.0       56.0        7       57        8.0       58.0
8        7.0       57.0        8       58        9.0       59.0

Experiment with your own dataset and try multiple different framings to see what works best.

In this tutorial, you discovered how to reframe time series datasets as supervised learning problems with Python.

Specifically, you learned:

- About the Pandas *shift()* function and how it can be used to automatically define supervised learning datasets from time series data.
- How to reframe a univariate time series into one-step and multi-step supervised learning problems.
- How to reframe multivariate time series into one-step and multi-step supervised learning problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Seasonal Persistence Forecasting With Python appeared first on Machine Learning Mastery.

A better first-cut forecast on time series data with a seasonal component is to persist the observation for the same time in the previous season. This is called seasonal persistence.

In this tutorial, you will discover how to implement seasonal persistence for time series forecasting in Python.

After completing this tutorial, you will know:

- How to use point observations from prior seasons for a persistence forecast.
- How to use mean observations across a sliding window of prior seasons for a persistence forecast.
- How to apply and evaluate seasonal persistence on monthly and daily time series data.

Let’s get started.

- **Updated Apr/2019**: Updated the links to datasets.
- **Updated Aug/2019**: Updated data loading to use new API.

It is critical to have a useful first-cut forecast on time series problems to provide a lower-bound on skill before moving on to more sophisticated methods.

This is to ensure we are not wasting time on models or datasets that are not predictive.

It is common to use a persistence or a naive forecast as a first-cut forecast model when time series forecasting.

This does not make sense with time series data that has an obvious seasonal component. A better first cut model for seasonal data is to use the observation at the same time in the previous seasonal cycle as the prediction.

We can call this “*seasonal persistence*” and it is a simple model that can result in an effective first cut model.

One step better is to use a simple function of the last few observations at the same time in previous seasonal cycles. For example, the mean of the observations. This can often provide a small additional benefit.

In this tutorial, we will demonstrate this simple seasonal persistence forecasting method for providing a lower bound on forecast skill on three different real-world time series datasets.

In this tutorial, we will use a sliding window seasonal persistence model to make forecasts.

Within a sliding window, observations at the same time in previous one-year seasons will be collected and the mean of those observations can be used as the persisted forecast.

Different window sizes can be evaluated to find a combination that minimizes error.

As an example, if the data is monthly and the month to be predicted is February, then with a window of size 1 (*w=1*) the observation last February will be used to make the forecast.

A window of size 2 (*w=2*) would involve taking observations for the last two Februaries to be averaged and used as a forecast.

An alternate interpretation might seek to use point observations from prior years (e.g. t-12, t-24, etc. for monthly data) rather than taking the mean of the cumulative point observations. Perhaps try both methods on your dataset and see what works best as a good starting point model.
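
The two interpretations can be sketched as follows. The function name `seasonal_persistence`, its parameters, and the toy series (each value encodes its year and month as 100 * year + month) are assumptions made for illustration.

```python
def seasonal_persistence(history, season_length, window):
    """Mean of the observations at the same point in the last `window` seasons."""
    obs = [history[-(k * season_length)] for k in range(1, window + 1)]
    return sum(obs) / len(obs)

# Toy monthly series covering 3 years: 101..112, 201..212, 301..312.
history = [100 * year + month for year in range(1, 4) for month in range(1, 13)]

# The next step to forecast is the first month of year 4.
print(seasonal_persistence(history, 12, 1))  # same month last year
print(seasonal_persistence(history, 12, 2))  # mean of the last two years
# Alternate interpretation: a single point observation from two seasons back.
print(history[-24])
```

With a window of 1, the mean and the point observation are the same thing; the two interpretations only differ for larger windows.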

It is important to evaluate time series forecasting models consistently.

In this section, we will define how we will evaluate forecast models in this tutorial.

First, we will hold the last two years of data back and evaluate forecasts on this data. This works for both monthly and daily data we will look at.

We will use a walk-forward validation to evaluate model performance. This means that each time step in the test dataset will be enumerated, a model constructed on historical data, and the forecast compared to the expected value. The observation will then be added to the training dataset and the process repeated.

Walk-forward validation is a realistic way to evaluate time series forecast models as one would expect models to be updated as new observations are made available.
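
The procedure can be sketched in a few lines. The helper names `walk_forward_validation` and `persistence` are hypothetical, and a plain persistence forecast stands in for the model so that the example stays self-contained.

```python
def walk_forward_validation(train, test, forecast):
    """Forecast each test step from all data seen so far, then reveal the truth."""
    history = list(train)
    predictions = []
    for observed in test:
        # "Fit" and forecast using only the history available at this point.
        predictions.append(forecast(history))
        # The real observation becomes available for the next step.
        history.append(observed)
    return predictions

def persistence(history):
    # Naive model: predict the most recent observation.
    return history[-1]

train, test = [1, 2, 3, 4], [5, 6, 7]
predictions = walk_forward_validation(train, test, persistence)
print(predictions)
```

Passing the forecast function as an argument makes it easy to swap in a different model without changing the evaluation loop.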

Finally, forecasts will be evaluated using root mean squared error, or RMSE. The benefit of RMSE is that it penalizes large errors and the scores are in the same units as the forecast values (e.g. car sales per month).
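
To illustrate how RMSE penalizes large errors, the sketch below computes it by hand for two made-up sets of forecasts that have the same mean absolute error. The helper name `rmse` is chosen here for illustration only.

```python
from math import sqrt

def rmse(expected, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

expected = [10, 10, 10, 10]
# Four small errors of size 1 each.
small_errors = rmse(expected, [9, 11, 9, 11])
# One large error of size 4 and three perfect forecasts.
one_large_error = rmse(expected, [10, 10, 10, 6])
print(small_errors, one_large_error)
```

Both sets of forecasts are off by 1 on average, but the single large miss doubles the RMSE.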

In summary, the test harness involves:

- The last 2 years of data used as a test set.
- Walk-forward validation for model evaluation.
- Root mean squared error used to report model skill.

The Monthly Car Sales dataset describes the number of car sales in Quebec, Canada between 1960 and 1968.

The units are a count of the number of sales and there are 108 observations. The source of the data is credited to Abraham and Ledolter (1983).

Download the dataset and save it into your current working directory with the filename “*car-sales.csv*“. Note, you may need to delete the footer information from the file.

The code below loads the dataset as a Pandas Series object.

# line plot of time series
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('car-sales.csv', header=0, index_col=0)
# display first few rows
print(series.head(5))
# line plot of dataset
series.plot()
pyplot.show()

Running the example prints the first 5 rows of data.

Month
1960-01-01     6550
1960-02-01     8728
1960-03-01    12026
1960-04-01    14395
1960-05-01    14587
Name: Sales, dtype: int64

A line plot of the data is also provided. We can see both a yearly seasonal component and an increasing trend.

The final 24 months of data will be held back as test data. We will investigate seasonal persistence with a sliding window from 1 to 5 years.

The complete example is listed below.

from pandas import read_csv
from sklearn.metrics import mean_squared_error
from math import sqrt
from numpy import mean
from matplotlib import pyplot
# load data
series = read_csv('car-sales.csv', header=0, index_col=0)
# prepare data
X = series.values
train, test = X[0:-24], X[-24:]
# evaluate mean of different number of years
years = [1, 2, 3, 4, 5]
scores = list()
for year in years:
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # collect obs
        obs = list()
        for y in range(1, year+1):
            obs.append(history[-(y*12)])
        # make prediction
        yhat = mean(obs)
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    scores.append(rmse)
    print('Years=%d, RMSE: %.3f' % (year, rmse))
pyplot.plot(years, scores)
pyplot.show()

Running the example prints the year number and the RMSE for the mean observation from the sliding window of observations at the same month in prior years.

The results suggest that taking the average from the last three years is a good starting model with an RMSE of 1803.630 car sales.

    Years=1, RMSE: 1997.732
    Years=2, RMSE: 1914.911
    Years=3, RMSE: 1803.630
    Years=4, RMSE: 2099.481
    Years=5, RMSE: 2522.235

A plot of the relationship of sliding window size to model error is created.

The plot nicely shows the improvement with the sliding window size to 3 years, then the rapid increase in error from that point.
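
The window logic used in this example can be isolated as a small helper. The sketch below is only an illustration on toy data: the function name `seasonal_mean_forecast` and the `period` parameter are assumptions made here, not part of the original listing.

```python
from statistics import mean

def seasonal_mean_forecast(history, n_seasons, period=12):
    """Average the observations at the same seasonal position
    in the last n_seasons cycles (hypothetical helper)."""
    obs = [history[-(s * period)] for s in range(1, n_seasons + 1)]
    return mean(obs)

# toy monthly history: three years with a constant level per year
history = [100] * 12 + [200] * 12 + [300] * 12

# a 1-year window uses only the value 12 months back
print(seasonal_mean_forecast(history, 1))
# a 3-year window averages the values 12, 24 and 36 months back
print(seasonal_mean_forecast(history, 3))
```

Calling the helper inside the walk-forward loop with a different `period` (for example 365 for daily data) would give the same behavior as the hard-coded index arithmetic.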

The Monthly Writing Paper Sales dataset describes the number of specialty writing paper sales.

The units are a type of count of the number of sales and there are 147 months of observations (just over 12 years). The counts are fractional, suggesting the data may in fact be in the units of hundreds of thousands of sales. The source of the data is credited to Makridakis and Wheelwright (1989).

Download the dataset and save it into your current working directory with the filename “*writing-paper-sales.csv*“. Note, you may need to delete the footer information from the file.

The date-time stamps only contain the year number and month. Therefore, a custom date-time parsing function is required to load the data and base the data in an arbitrary year. The year 1900 was chosen as the starting point, which should not affect this case study.
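
As a rough sketch of what such a parsing function does, the snippet below applies the same prefixing rule to two sample strings. The sample stamps `'1-01'` and `'12-03'` are assumptions about the raw file format, and `parse_year_month` is a name chosen here for illustration.

```python
from datetime import datetime

def parse_year_month(text):
    # single-digit years such as '1-01' have 4 characters and need the '190' prefix;
    # two-digit years such as '12-03' only need '19'
    prefix = '190' if len(text) == 4 else '19'
    return datetime.strptime(prefix + text, '%Y-%m')

# assumed examples of raw stamps: year number, then month
print(parse_year_month('1-01'))
print(parse_year_month('12-03'))
```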

The example below loads the Monthly Writing Paper Sales dataset as a Pandas Series.

    # load and plot dataset
    from datetime import datetime
    from pandas import read_csv
    from matplotlib import pyplot
    # base the year-month stamps in the 1900s
    def parser(x):
        if len(x) == 4:
            return datetime.strptime('190'+x, '%Y-%m')
        return datetime.strptime('19'+x, '%Y-%m')
    # load dataset, then parse the index
    # (the squeeze argument was removed in pandas 2.0 and date_parser is deprecated)
    series = read_csv('writing-paper-sales.csv', header=0, index_col=0)
    series.index = series.index.map(parser)
    series = series.squeeze('columns')
    # summarize first few rows
    print(series.head())
    # line plot
    series.plot()
    pyplot.show()

Running the example prints the first 5 rows of the loaded dataset.

    Month
    1901-01-01    1359.795
    1901-02-01    1278.564
    1901-03-01    1508.327
    1901-04-01    1419.710
    1901-05-01    1440.510

A line plot of the loaded dataset is then created. We can see the yearly seasonal component and an increasing trend.

As in the previous example, we can hold back the last 24 months of observations as a test dataset. Because we have much more data, we will try sliding window sizes from 1 year to 10 years.

The complete example is listed below.

    from datetime import datetime
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from numpy import mean
    from matplotlib import pyplot
    # base the year-month stamps in the 1900s
    def parser(x):
        if len(x) == 4:
            return datetime.strptime('190'+x, '%Y-%m')
        return datetime.strptime('19'+x, '%Y-%m')
    # load dataset, then parse the index
    # (the squeeze argument was removed in pandas 2.0 and date_parser is deprecated)
    series = read_csv('writing-paper-sales.csv', header=0, index_col=0)
    series.index = series.index.map(parser)
    series = series.squeeze('columns')
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # evaluate mean of different number of years
    years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    scores = list()
    for year in years:
        # walk-forward validation
        history = [x for x in train]
        predictions = list()
        for i in range(len(test)):
            # collect obs
            obs = list()
            for y in range(1, year+1):
                obs.append(history[-(y*12)])
            # make prediction
            yhat = mean(obs)
            predictions.append(yhat)
            # observation
            history.append(test[i])
        # report performance
        rmse = sqrt(mean_squared_error(test, predictions))
        scores.append(rmse)
        print('Years=%d, RMSE: %.3f' % (year, rmse))
    pyplot.plot(years, scores)
    pyplot.show()

Running the example prints the size of the sliding window and the resulting seasonal persistence model error.

The results suggest that a window size of 5 years is optimal, with an RMSE of 540.317 monthly writing paper sales.

    Years=1, RMSE: 606.089
    Years=2, RMSE: 557.653
    Years=3, RMSE: 555.777
    Years=4, RMSE: 544.251
    Years=5, RMSE: 540.317
    Years=6, RMSE: 554.660
    Years=7, RMSE: 569.032
    Years=8, RMSE: 581.405
    Years=9, RMSE: 602.279
    Years=10, RMSE: 624.756

The relationship between window size and error is graphed on a line plot showing a similar trend in error to the previous scenario. Error drops to a minimum (in this case at 5 years) before increasing again.

The Daily Maximum Melbourne Temperatures dataset describes the daily temperatures in the city of Melbourne, Australia from 1981 to 1990.

The units are in degrees Celsius and there are 3,650 observations, or 10 years of data. The source of the data is credited to the Australian Bureau of Meteorology.

Download the dataset and save it into your current working directory with the filename “*max-daily-temps.csv*“. Note, you may need to delete the footer information from the file.

The example below demonstrates loading the dataset as a Pandas Series.

    # line plot of time series
    from pandas import read_csv
    from matplotlib import pyplot
    # load dataset
    series = read_csv('max-daily-temps.csv', header=0, index_col=0)
    # display first few rows
    print(series.head(5))
    # line plot of dataset
    series.plot()
    pyplot.show()

Running the example prints the first 5 rows of data.

    Date
    1981-01-01    38.1
    1981-01-02    32.4
    1981-01-03    34.5
    1981-01-04    20.7
    1981-01-05    21.5

A line plot is also created. We can see we have a lot more observations than the previous two scenarios and that there is a clear seasonal trend in the data.

Because the data is daily, we need to specify the years in the test data as a function of 365 days rather than 12 months.

This ignores leap years, which is a complication that could, or even should, be addressed in your own project.
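
One possible way to handle this, sketched below on synthetic dates rather than the temperature data, is to drop 29 February so that every year contributes exactly 365 observations and an offset of 365 always lands on the same calendar day. This is an assumption about how you might address the issue, not something the examples in this tutorial do.

```python
from datetime import date, timedelta

# synthetic daily records from 1983-01-01 to 1985-12-31 (1984 is a leap year)
start = date(1983, 1, 1)
days = [start + timedelta(days=i) for i in range(1096)]
values = list(range(len(days)))

# keep every (day, value) pair except those that fall on 29 February
kept = [(d, v) for d, v in zip(days, values)
        if not (d.month == 2 and d.day == 29)]

# after filtering, stepping back 365 positions stays on the same calendar day
print(len(days), len(kept))
print(kept[-1][0], kept[-366][0], kept[-731][0])
```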

The complete example of seasonal persistence is listed below.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from numpy import mean
    from matplotlib import pyplot
    # load data
    series = read_csv('max-daily-temps.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-(2*365)], X[-(2*365):]
    # evaluate mean of different number of years
    years = [1, 2, 3, 4, 5, 6, 7, 8]
    scores = list()
    for year in years:
        # walk-forward validation
        history = [x for x in train]
        predictions = list()
        for i in range(len(test)):
            # collect obs
            obs = list()
            for y in range(1, year+1):
                obs.append(history[-(y*365)])
            # make prediction
            yhat = mean(obs)
            predictions.append(yhat)
            # observation
            history.append(test[i])
        # report performance
        rmse = sqrt(mean_squared_error(test, predictions))
        scores.append(rmse)
        print('Years=%d, RMSE: %.3f' % (year, rmse))
    pyplot.plot(years, scores)
    pyplot.show()

Running the example prints the size of the sliding window and the corresponding model error.

Unlike the previous two cases, we can see a trend where the skill continues to improve as the window size is increased.

The best result is a sliding window of all 8 years of historical data with an RMSE of 4.271.

    Years=1, RMSE: 5.950
    Years=2, RMSE: 5.083
    Years=3, RMSE: 4.664
    Years=4, RMSE: 4.539
    Years=5, RMSE: 4.448
    Years=6, RMSE: 4.358
    Years=7, RMSE: 4.371
    Years=8, RMSE: 4.271

The plot of sliding window size to model error makes this trend apparent.

It suggests that getting more historical data for this problem might be useful if an optimal model turns out to be a function of the observations on the same day in prior years.

We might do just as well if the observations were averaged from the same week or month in previous seasons, and this might prove a fruitful experiment.
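
A minimal sketch of that experiment is shown below. It averages a small window of days centred on the same day in each prior season. The function name, the `half_width` parameter and the toy data are all assumptions made for illustration.

```python
from statistics import mean

def seasonal_window_forecast(history, n_seasons, period=365, half_width=3):
    """Average the observations within half_width steps of the same
    seasonal position in each of the last n_seasons cycles (hypothetical helper)."""
    obs = []
    for s in range(1, n_seasons + 1):
        # index of the same day s seasons before the step being forecast
        centre = len(history) - s * period
        if centre < 0:
            raise ValueError('not enough history for %d seasons' % s)
        lo = max(0, centre - half_width)
        hi = min(len(history), centre + half_width + 1)
        obs.extend(history[lo:hi])
    return mean(obs)

# toy daily history covering two 365-day years
history = list(range(730))

# one-week window centred on the same day last year
print(seasonal_window_forecast(history, 1))
# half_width=0 reduces to the point observation used in the examples above
print(seasonal_window_forecast(history, 1, half_width=0))
```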

In this tutorial, you discovered seasonal persistence for time series forecasting.

You learned:

- How to use point observations from prior seasons for a persistence forecast.
- How to use a mean of a sliding window across multiple prior seasons for a persistence forecast.
- How to apply seasonal persistence to daily and monthly time series data.

Do you have any questions about persistence with seasonal data?

Ask your questions in the comments and I will do my best to answer.

The post Seasonal Persistence Forecasting With Python appeared first on Machine Learning Mastery.

]]>The post Simple Time Series Forecasting Models to Test So That You Don’t Fool Yourself appeared first on Machine Learning Mastery.

]]>This requires that you evaluate a suite of standard naive, or simple, time series forecasting models to get an idea of the worst acceptable performance on the problem for more sophisticated models to beat.

Applying these simple models can also uncover new ideas about more advanced methods that may result in better performance.

In this tutorial, you will discover how to implement and automate three standard baseline time series forecasting methods on a real world dataset.

Specifically, you will learn:

- How to automate the persistence model and test a suite of persisted values.
- How to automate the expanding window model.
- How to automate the rolling window forecast model and test a suite of window sizes.

This is an important topic and highly recommended for any time series forecasting project.


Let’s get started.

- **Updated Apr/2019**: Updated the link to dataset.
- **Updated Aug/2019**: Updated data loading to use new API.

This tutorial is broken down into the following 5 parts:

- **Monthly Car Sales Dataset**: An overview of the standard time series dataset we will use.
- **Test Setup**: How we will evaluate forecast models in this tutorial.
- **Persistence Forecast**: The persistence forecast and how to automate it.
- **Expanding Window Forecast**: The expanding window forecast and how to automate it.
- **Rolling Window Forecast**: The rolling window forecast and how to automate it.

An up-to-date Python SciPy environment is used, including Python 2 or 3, Pandas, NumPy, scikit-learn, and Matplotlib.

In this tutorial, we will use the Monthly Car Sales dataset.

This dataset describes the number of car sales in Quebec, Canada between 1960 and 1968.

The units are a count of the number of sales and there are 108 observations. The source data is credited to Abraham and Ledolter (1983).

Download the dataset and save it into your current working directory with the filename “*car-sales.csv*“. Note, you may need to delete the footer information from the file.

The code below loads the dataset as a Pandas Series object.

    # line plot of time series
    from pandas import read_csv
    from matplotlib import pyplot
    # load dataset
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # display first few rows
    print(series.head(5))
    # line plot of dataset
    series.plot()
    pyplot.show()

Running the example prints the first 5 rows of data.

    Month
    1960-01-01     6550
    1960-02-01     8728
    1960-03-01    12026
    1960-04-01    14395
    1960-05-01    14587
    Name: Sales, dtype: int64

A line plot of the data is also provided.

It is important to evaluate time series forecasting models consistently.

In this section, we will define how we will evaluate the three forecast models in this tutorial.

First, we will hold the last two years of data back and evaluate forecasts on this data. Given the data is monthly, this means that the last 24 observations will be used as test data.

We will use a walk-forward validation method to evaluate model performance. This means that each time step in the test dataset will be enumerated, a model constructed on history data, and the forecast compared to the expected value. The observation will then be added to the training dataset and the process repeated.

Walk-forward validation is a realistic way to evaluate time series forecast models as one would expect models to be updated as new observations are made available.

Finally, forecasts will be evaluated using root mean squared error or RMSE. The benefit of RMSE is that it penalizes large errors and the scores are in the same units as the forecast values (car sales per month).

In summary, the test harness involves:

- The last 2 years of data used as a test set.
- Walk-forward validation for model evaluation.
- Root mean squared error used to report model skill.
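
The harness summarized above can be sketched as a single function. This is a simplified illustration on toy numbers: `walk_forward_rmse` and its `forecast` callback are names invented here, and RMSE is computed by hand rather than with scikit-learn.

```python
from math import sqrt

def walk_forward_rmse(train, test, forecast):
    """Walk forward over test, forecasting each step from the history so far
    (hypothetical helper; forecast is any function of the history list)."""
    history = list(train)
    squared_errors = []
    for actual in test:
        yhat = forecast(history)
        squared_errors.append((actual - yhat) ** 2)
        # the real observation becomes available for the next step
        history.append(actual)
    return sqrt(sum(squared_errors) / len(squared_errors))

# toy series: hold back the last 3 observations as the test set
series = [10, 12, 14, 16, 18, 20]
train, test = series[:-3], series[-3:]

# t-1 persistence: forecast the most recent observation
rmse = walk_forward_rmse(train, test, lambda h: h[-1])
print('RMSE: %.3f' % rmse)
```

Passing a different `forecast` function lets the same loop evaluate each of the models in this tutorial.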

The persistence forecast involves using the previous observation to predict the next time step.

For this reason, the approach is often called the naive forecast.

Why stop with using the previous observation? In this section, we will look at automating the persistence forecast and evaluate the use of any arbitrary prior time step to predict the next time step.

We will explore using each of the prior 24 months of point observations in a persistence model. Each configuration will be evaluated using the test harness and RMSE scores collected. We will then display the scores and graph the relationship between the persisted time step and the model skill.

The complete example is listed below.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from matplotlib import pyplot
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    persistence_values = range(1, 25)
    scores = list()
    for p in persistence_values:
        # walk-forward validation
        history = [x for x in train]
        predictions = list()
        for i in range(len(test)):
            # make prediction
            yhat = history[-p]
            predictions.append(yhat)
            # observation
            history.append(test[i])
        # report performance
        rmse = sqrt(mean_squared_error(test, predictions))
        scores.append(rmse)
        print('p=%d RMSE:%.3f' % (p, rmse))
    # plot scores over persistence values
    pyplot.plot(persistence_values, scores)
    pyplot.show()

Running the example prints the RMSE for each persisted point observation.

    p=1 RMSE:3947.200
    p=2 RMSE:5485.353
    p=3 RMSE:6346.176
    p=4 RMSE:6474.553
    p=5 RMSE:5756.543
    p=6 RMSE:5756.076
    p=7 RMSE:5958.665
    p=8 RMSE:6543.266
    p=9 RMSE:6450.839
    p=10 RMSE:5595.971
    p=11 RMSE:3806.482
    p=12 RMSE:1997.732
    p=13 RMSE:3968.987
    p=14 RMSE:5210.866
    p=15 RMSE:6299.040
    p=16 RMSE:6144.881
    p=17 RMSE:5349.691
    p=18 RMSE:5534.784
    p=19 RMSE:5655.016
    p=20 RMSE:6746.872
    p=21 RMSE:6784.611
    p=22 RMSE:5642.737
    p=23 RMSE:3692.062
    p=24 RMSE:2119.103

A plot of the persisted value (t-n) to model skill (RMSE) is also created.

From the results, it is clear that persisting the observation from 12 months ago or 24 months ago is a great starting point on this dataset.

The best result achieved involved persisting the result from t-12 with an RMSE of 1997.732 car sales.

This is an obvious result, but also very useful.

We would expect that a forecast model that is some weighted combination of the observations at t-12, t-24, t-36 and so on would be a powerful starting point.

It also points out that the naive t-1 persistence would have been a less desirable starting point on this dataset.
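
To make the idea of a weighted combination concrete, here is a minimal sketch on toy data. The weights `[0.5, 0.3, 0.2]` are arbitrary values chosen for illustration, not fitted or recommended values, and `weighted_seasonal_forecast` is a hypothetical helper.

```python
def weighted_seasonal_forecast(history, weights, period=12):
    """Weighted average of the observations at t-period, t-2*period, ...
    with one weight per prior season (hypothetical helper)."""
    total = sum(weights)
    weighted = sum(w * history[-(k * period)]
                   for k, w in enumerate(weights, start=1))
    return weighted / total

# toy monthly history: three years with a constant level per year
history = [100] * 12 + [200] * 12 + [300] * 12

# weight the most recent season most heavily (arbitrary illustrative weights)
forecast = weighted_seasonal_forecast(history, [0.5, 0.3, 0.2])
print(forecast)
```

With equal weights this reduces to a plain mean of the same month in prior years.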

We can use the t-12 model to make a prediction and plot it against the test data.

The complete example is listed below.

    from pandas import read_csv
    from matplotlib import pyplot
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = history[-12]
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # plot predictions vs observations
    pyplot.plot(test)
    pyplot.plot(predictions)
    pyplot.show()

Running the example plots the test dataset (blue) against the predicted values (orange).

You can learn more about the persistence model for time series forecasting in the post:

An expanding window refers to a model that calculates a statistic on all available historic data and uses that to make a forecast.

It is an expanding window because it grows as more real observations are collected.

Two good starting point statistics to calculate are the mean and the median historical observation.
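
Before applying this to the dataset, the mechanics can be seen on a handful of numbers. The sketch below is an illustration only: `expanding_forecasts` is a name chosen here, and it uses the standard library `statistics` module instead of NumPy.

```python
from statistics import mean, median

def expanding_forecasts(train, test, statistic):
    """Forecast each test step with a statistic of all history so far
    (hypothetical helper)."""
    history = list(train)
    forecasts = []
    for actual in test:
        forecasts.append(statistic(history))
        # the window expands by one real observation per step
        history.append(actual)
    return forecasts

train, test = [3, 5, 4], [10, 6]
mean_forecasts = expanding_forecasts(train, test, mean)
median_forecasts = expanding_forecasts(train, test, median)
print(mean_forecasts)
print(median_forecasts)
```

Note how the large observation (10) pulls the second mean forecast up more than the second median forecast.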

The example below uses the expanding window mean as the forecast.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from numpy import mean
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = mean(history)
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    print('RMSE: %.3f' % rmse)

Running the example prints the RMSE evaluation of the approach.

    RMSE: 5113.067

We can also repeat the same experiment with the median of the historical observations. The complete example is listed below.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from numpy import median
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = median(history)
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # report performance
    rmse = sqrt(mean_squared_error(test, predictions))
    print('RMSE: %.3f' % rmse)

Again, running the example prints the skill of the model.

We can see that on this problem the historical mean produced a better result than the median, but both were worse models than using the optimized persistence values.

    RMSE: 5527.408

We can plot the mean expanding window predictions against the test dataset to get a feeling for how the forecast actually looks in context.

The complete example is listed below.

    from pandas import read_csv
    from matplotlib import pyplot
    from numpy import mean
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = mean(history)
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # plot predictions vs observations
    pyplot.plot(test)
    pyplot.plot(predictions)
    pyplot.show()

The plot shows what a poor forecast looks like and how it does not follow the movements of the data at all, other than a slight rising trend.

You can see more examples of expanding window statistics in the post:

A rolling window model involves calculating a statistic on a fixed contiguous block of prior observations and using it as a forecast.

It is much like the expanding window, but the window size remains fixed and counts backwards from the most recent observation.

It may be more useful on time series problems where recent lag values are more predictive than older lag values.
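
The difference from the expanding window is just the slice of history that feeds the statistic. The sketch below shows this on a few numbers. `rolling_forecasts` is a hypothetical helper written for this illustration.

```python
from statistics import mean

def rolling_forecasts(train, test, window, statistic):
    """Forecast each test step with a statistic of the last `window`
    observations (hypothetical helper)."""
    history = list(train)
    forecasts = []
    for actual in test:
        # only the most recent `window` observations are used
        forecasts.append(statistic(history[-window:]))
        history.append(actual)
    return forecasts

train, test = [1, 2, 3, 4], [5, 6]
window_2 = rolling_forecasts(train, test, 2, mean)
window_1 = rolling_forecasts(train, test, 1, mean)
print(window_2)
# a window of 1 is the same as t-1 persistence
print(window_1)
```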

We will automatically check different rolling window sizes from 1 to 24 months (2 years) and start by calculating the mean observation and using that as a forecast. The complete example is listed below.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from matplotlib import pyplot
    from numpy import mean
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    window_sizes = range(1, 25)
    scores = list()
    for w in window_sizes:
        # walk-forward validation
        history = [x for x in train]
        predictions = list()
        for i in range(len(test)):
            # make prediction
            yhat = mean(history[-w:])
            predictions.append(yhat)
            # observation
            history.append(test[i])
        # report performance
        rmse = sqrt(mean_squared_error(test, predictions))
        scores.append(rmse)
        print('w=%d RMSE:%.3f' % (w, rmse))
    # plot scores over window sizes values
    pyplot.plot(window_sizes, scores)
    pyplot.show()

Running the example prints the rolling window size and RMSE for each configuration.

    w=1 RMSE:3947.200
    w=2 RMSE:4350.413
    w=3 RMSE:4701.446
    w=4 RMSE:4810.510
    w=5 RMSE:4649.667
    w=6 RMSE:4549.172
    w=7 RMSE:4515.684
    w=8 RMSE:4614.551
    w=9 RMSE:4653.493
    w=10 RMSE:4563.802
    w=11 RMSE:4321.599
    w=12 RMSE:4023.968
    w=13 RMSE:3901.634
    w=14 RMSE:3907.671
    w=15 RMSE:4017.276
    w=16 RMSE:4084.080
    w=17 RMSE:4076.399
    w=18 RMSE:4085.376
    w=19 RMSE:4101.505
    w=20 RMSE:4195.617
    w=21 RMSE:4269.784
    w=22 RMSE:4258.226
    w=23 RMSE:4158.029
    w=24 RMSE:4021.885

A line plot of window size to error is also created.

The results suggest that a rolling window of w=13 was best with an RMSE of 3,901 monthly car sales.

We can repeat this experiment with the median statistic.

The complete example is listed below.

    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    from matplotlib import pyplot
    from numpy import median
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    window_sizes = range(1, 25)
    scores = list()
    for w in window_sizes:
        # walk-forward validation
        history = [x for x in train]
        predictions = list()
        for i in range(len(test)):
            # make prediction
            yhat = median(history[-w:])
            predictions.append(yhat)
            # observation
            history.append(test[i])
        # report performance
        rmse = sqrt(mean_squared_error(test, predictions))
        scores.append(rmse)
        print('w=%d RMSE:%.3f' % (w, rmse))
    # plot scores over window sizes values
    pyplot.plot(window_sizes, scores)
    pyplot.show()

Running the example again prints the window size and RMSE for each configuration.

    w=1 RMSE:3947.200
    w=2 RMSE:4350.413
    w=3 RMSE:4818.406
    w=4 RMSE:4993.473
    w=5 RMSE:5212.887
    w=6 RMSE:5002.830
    w=7 RMSE:4958.621
    w=8 RMSE:4817.664
    w=9 RMSE:4932.317
    w=10 RMSE:4928.661
    w=11 RMSE:4885.574
    w=12 RMSE:4414.139
    w=13 RMSE:4204.665
    w=14 RMSE:4172.579
    w=15 RMSE:4382.037
    w=16 RMSE:4522.304
    w=17 RMSE:4494.803
    w=18 RMSE:4360.445
    w=19 RMSE:4232.285
    w=20 RMSE:4346.389
    w=21 RMSE:4465.536
    w=22 RMSE:4514.596
    w=23 RMSE:4428.739
    w=24 RMSE:4236.126

A plot of the window size and RMSE is again created.

Here, we can see that best results were achieved with a window size of w=1 with an RMSE of 3947.200 monthly car sales, which was essentially a t-1 persistence model.

The results were generally worse than optimized persistence, but better than the expanding window model. We could imagine better results with a weighted combination of window observations. This idea leads to using linear models such as AR and ARIMA.

Again, we can plot the predictions from the better model (mean rolling window with w=13) against the actual observations to get a feeling for how the forecast looks in context.

The complete example is listed below.

    from pandas import read_csv
    from matplotlib import pyplot
    from numpy import mean
    # load data
    series = read_csv('car-sales.csv', header=0, index_col=0)
    # prepare data
    X = series.values
    train, test = X[0:-24], X[-24:]
    # walk-forward validation
    history = [x for x in train]
    predictions = list()
    for i in range(len(test)):
        # make prediction
        yhat = mean(history[-13:])
        predictions.append(yhat)
        # observation
        history.append(test[i])
    # plot predictions vs observations
    pyplot.plot(test)
    pyplot.plot(predictions)
    pyplot.show()

Running the code creates the line plot of observations (blue) compared to the predicted values (orange).

We can see that the model better follows the level of the data, but again does not follow the actual up and down movements.

You can see more examples of rolling window statistics in the post:

In this tutorial, you discovered the importance of calculating the worst acceptable performance on a time series forecasting problem and methods that you can use to ensure you are not fooling yourself with more sophisticated methods.

Specifically, you learned:

- How to automatically test a suite of persistence configurations.
- How to evaluate an expanding window model.
- How to automatically test a suite of rolling window configurations.

Do you have any questions about baseline forecasting methods, or about this post?

Ask your questions in the comments and I will do my best to answer.


]]>