The post How to Grid Search Triple Exponential Smoothing for Time Series Forecasting in Python appeared first on Machine Learning Mastery.

It is common practice to use an optimization process to find the model hyperparameters that result in the exponential smoothing model with the best performance for a given time series dataset. This practice applies only to the coefficients used by the model to describe the exponential structure of the level, trend, and seasonality.

It is also possible to automatically optimize other hyperparameters of an exponential smoothing model, such as whether or not to model the trend and seasonal component and if so, whether to model them using an additive or multiplicative method.

In this tutorial, you will discover how to develop a framework for grid searching all of the exponential smoothing model hyperparameters for univariate time series forecasting.

After completing this tutorial, you will know:

- How to develop a framework for grid searching ETS models from scratch using walk-forward validation.
- How to grid search ETS model hyperparameters for daily time series data for female births.
- How to grid search ETS model hyperparameters for monthly time series data for shampoo sales, car sales, and temperature.

Let’s get started.

This tutorial is divided into six parts; they are:

- Exponential Smoothing for Time Series Forecasting
- Develop a Grid Search Framework
- Case Study 1: No Trend or Seasonality
- Case Study 2: Trend
- Case Study 3: Seasonality
- Case Study 4: Trend and Seasonality

Exponential smoothing is a time series forecasting method for univariate data.

Time series methods like the Box-Jenkins ARIMA family of methods develop a model where the prediction is a weighted linear sum of recent past observations or lags.

Exponential smoothing forecasting methods are similar in that a prediction is a weighted sum of past observations, but the model explicitly uses an exponentially decreasing weight for past observations.

Specifically, past observations are weighted with a geometrically decreasing ratio.

Forecasts produced using exponential smoothing methods are weighted averages of past observations, with the weights decaying exponentially as the observations get older. In other words, the more recent the observation, the higher the associated weight.

— Page 171, Forecasting: principles and practice, 2013.
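To make the geometric decay concrete, here is a minimal sketch of the weights that simple exponential smoothing places on past observations. It assumes the standard form in which the observation *k* steps back receives weight alpha * (1 - alpha)^k. The helper name `ses_weights` is hypothetical and is not part of any library.

```python
# weights placed on past observations by simple exponential smoothing
# (ses_weights is a hypothetical helper used only for illustration)
def ses_weights(alpha, n):
    # k = 0 is the most recent observation, k = n - 1 the oldest
    return [alpha * (1 - alpha) ** k for k in range(n)]

weights = ses_weights(0.5, 5)
print(weights)  # [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Each weight is the previous one multiplied by (1 - alpha), which is the geometrically decreasing ratio described above.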

Exponential smoothing methods may be considered as peers and an alternative to the popular Box-Jenkins ARIMA class of methods for time series forecasting.

Collectively, the methods are sometimes referred to as ETS models, referring to the explicit modeling of *Error*, *Trend*, and *Seasonality*.

There are three types of exponential smoothing; they are:

- **Single Exponential Smoothing**, or SES, for univariate data without trend or seasonality.
- **Double Exponential Smoothing** for univariate data with support for trends.
- **Triple Exponential Smoothing**, or Holt-Winters Exponential Smoothing, with support for both trends and seasonality.

A triple exponential smoothing model subsumes single and double exponential smoothing by the configuration of the nature of the trend (additive, multiplicative, or none) and the nature of the seasonality (additive, multiplicative, or none), as well as any dampening of the trend.


In this section, we will develop a framework for grid searching exponential smoothing model hyperparameters for a given univariate time series forecasting problem.

We will use the implementation of Holt-Winters Exponential Smoothing provided by the statsmodels library.

This model has hyperparameters that control the nature of the exponential smoothing performed for the level, trend, and seasonality, specifically:

- **smoothing_level** (*alpha*): the smoothing coefficient for the level.
- **smoothing_slope** (*beta*): the smoothing coefficient for the trend.
- **smoothing_seasonal** (*gamma*): the smoothing coefficient for the seasonal component.
- **damping_slope** (*phi*): the coefficient for the damped trend.

All four of these hyperparameters can be specified when defining the model. If they are not specified, the library will automatically tune the model and find the optimal values for these hyperparameters (e.g. *optimized=True*).

There are other hyperparameters that the model will not automatically tune that you may want to specify; they are:

- **trend**: The type of trend component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the trend can be disabled by setting it to None.
- **damped**: Whether or not the trend component should be damped, either True or False.
- **seasonal**: The type of seasonal component, as either “*add*” for additive or “*mul*” for multiplicative. Modeling the seasonal component can be disabled by setting it to None.
- **seasonal_periods**: The number of time steps in a seasonal period, e.g. 12 for 12 months in a yearly seasonal structure.
- **use_boxcox**: Whether or not to perform a power transform of the series (True/False) or specify the lambda for the transform.

If you know enough about your problem to specify one or more of these parameters, then you should specify them. If not, you can try grid searching these parameters.

We can start off by defining a function that will fit a model with a given configuration and make a one-step forecast.

The *exp_smoothing_forecast()* below implements this behavior.

The function takes an array or list of contiguous prior observations and a list of configuration parameters used to configure the model.

The configuration parameters in order are: the trend type, whether or not the trend is damped, the seasonality type, the seasonal period, whether or not to use a Box-Cox transform, and whether or not to remove the bias when fitting the model.

```python
# one-step Holt Winter's Exponential Smoothing forecast
def exp_smoothing_forecast(history, config):
    t,d,s,p,b,r = config
    # define model
    model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
    # fit model
    model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]
```
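Because the configuration is a plain positional list, it is easy to mix up the order of the six values. One optional way to keep the positions readable is a named tuple. The sketch below is an assumption for illustration only: the name `ETSConfig` and its field names are hypothetical and do not appear in the tutorial code, although a named tuple unpacks in the same way as the list the function expects.

```python
from collections import namedtuple

# hypothetical container that names each position of the config list
ETSConfig = namedtuple('ETSConfig', [
    'trend', 'damped', 'seasonal', 'seasonal_periods', 'use_boxcox', 'remove_bias'])

cfg = ETSConfig(trend='add', damped=False, seasonal=None,
                seasonal_periods=None, use_boxcox=True, remove_bias=True)

# unpacks exactly like the plain list used by exp_smoothing_forecast()
t, d, s, p, b, r = cfg
print(cfg.trend, cfg.use_boxcox)
```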

Next, we need to build up some functions for fitting and evaluating a model repeatedly via walk-forward validation, including splitting a dataset into train and test sets and evaluating one-step forecasts.

We can split a list or NumPy array of data using a slice given a specified size of the split, e.g. the number of time steps to use from the data in the test set.

The *train_test_split()* function below implements this for a provided dataset and a specified number of time steps to use in the test set.

```python
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]
```
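As a quick check of the slicing, the sketch below applies the same split to a small made-up list. It also shows one edge case worth knowing about: with `n_test=0` the negative slice collapses, so the training set comes back empty and the "test" set is the whole series.

```python
# same slicing as the function above, repeated so the example is self-contained
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]

# hold back the last four observations for testing
train, test = train_test_split(data, 4)
print(train)  # [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(test)   # [70.0, 80.0, 90.0, 100.0]

# edge case: n_test=0 gives an empty train set, because data[:-0] is data[:0]
empty_train, everything = train_test_split(data, 0)
print(len(empty_train), len(everything))  # 0 10
```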

After forecasts have been made for each step in the test dataset, they need to be compared to the test set in order to calculate an error score.

There are many popular error scores for time series forecasting. In this case, we will use root mean squared error (RMSE), but you can change this to your preferred measure, e.g. MAPE, MAE, etc.

The *measure_rmse()* function below will calculate the RMSE given a list of actual (the test set) and predicted values.

```python
# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))
```
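For intuition about what this score measures, here is the same calculation written out with only the standard library on a few made-up values. The function name `rmse_by_hand` is hypothetical; it is a sketch of the arithmetic, not a replacement for the scikit-learn call used in the tutorial.

```python
from math import sqrt

# RMSE written out step by step (hypothetical helper for illustration)
def rmse_by_hand(actual, predicted):
    # squared difference for each pair of actual and predicted values
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    # mean of the squared errors, then the square root
    return sqrt(sum(squared_errors) / len(squared_errors))

actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
score = rmse_by_hand(actual, predicted)
print('%.3f' % score)  # squared errors are 4, 4 and 9, so this is sqrt(17/3)
```

Because the errors are squared before averaging, the score is in the same units as the series and penalizes large misses more heavily than small ones.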

We can now implement the walk-forward validation scheme. This is a standard approach to evaluating a time series forecasting model that respects the temporal ordering of observations.

First, a provided univariate time series dataset is split into train and test sets using the *train_test_split()* function. Then the observations in the test set are enumerated. For each, we fit a model on all of the history and make a one-step forecast. The true observation for the time step is then added to the history, and the process is repeated. The *exp_smoothing_forecast()* function is called in order to fit a model and make a prediction. Finally, an error score is calculated by comparing all one-step forecasts to the actual test set by calling the *measure_rmse()* function.

The *walk_forward_validation()* function below implements this, taking a univariate time series, a number of time steps to use in the test set, and a model configuration.

```python
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = exp_smoothing_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error
```
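To see the mechanics of the loop without fitting a statsmodels model, the sketch below runs the same walk-forward procedure with a naive persistence forecast swapped in for the ETS model. The names `persistence_forecast` and `walk_forward_persistence` are hypothetical stand-ins and are not part of the tutorial framework.

```python
from math import sqrt

# naive stand-in model: predict that the next value equals the last one seen
def persistence_forecast(history):
    return history[-1]

# same walk-forward structure as above, with the stand-in forecast
def walk_forward_persistence(data, n_test):
    train, test = data[:-n_test], data[-n_test:]
    history = list(train)
    predictions = []
    for observed in test:
        # forecast one step ahead from everything seen so far
        predictions.append(persistence_forecast(history))
        # reveal the true value before the next step
        history.append(observed)
    mse = sum((a - p) ** 2 for a, p in zip(test, predictions)) / len(test)
    return sqrt(mse)

data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
error = walk_forward_persistence(data, 4)
print(error)  # 10.0, since every forecast lags the series by one step of 10
```

The key property is that each forecast only ever sees observations that come before the time step being predicted.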

If you are interested in making multi-step predictions, you can change the call to *predict()* in the *exp_smoothing_forecast()* function and also change the calculation of error in the *measure_rmse()* function.

We can call *walk_forward_validation()* repeatedly with different lists of model configurations.

One possible issue is that some combinations of model configurations may not be valid for the model and will throw an exception, e.g. specifying some but not all aspects of the seasonal structure in the data.

Further, some models may also raise warnings on some data, e.g. from the linear algebra libraries called by the statsmodels library.

We can trap exceptions and ignore warnings during the grid search by wrapping all calls to *walk_forward_validation()* with a try-except and a block to ignore warnings. We can also add debugging support to disable these protections in case we want to see what is really going on. Finally, if an error does occur, we can return a *None* result; otherwise, we can print some information about the skill of each model evaluated. This is helpful when a large number of models are evaluated.

The *score_model()* function below implements this and returns a (key, result) tuple, where the key is a string version of the tested model configuration.

```python
# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)
```
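The trap-and-ignore pattern can be exercised in isolation. In the sketch below, `flaky_evaluation` is a hypothetical stand-in for *walk_forward_validation()* that always emits a warning and fails for one made-up configuration, and `safe_score` is a hypothetical reduced version of the wrapper above.

```python
from warnings import catch_warnings, filterwarnings, warn

# hypothetical stand-in for walk_forward_validation()
def flaky_evaluation(cfg):
    warn('noisy numerical warning')
    if cfg == 'bad':
        raise ValueError('invalid configuration')
    return 1.5

# reduced version of the wrapper: silence warnings, map failures to None
def safe_score(cfg):
    result = None
    try:
        with catch_warnings():
            filterwarnings('ignore')
            result = flaky_evaluation(cfg)
    except Exception:
        result = None
    return (str(cfg), result)

good = safe_score('good')
bad = safe_score('bad')
print(good)  # ('good', 1.5)
print(bad)   # ('bad', None)
```

The warning filter is restored when the `with` block exits, so warnings raised elsewhere in the program are unaffected.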

Next, we need a loop to test a list of different model configurations.

This is the main function that drives the grid search process and will call the *score_model()* function for each model configuration.

We can dramatically speed up the grid search process by evaluating model configurations in parallel. One way to do that is to use the Joblib library.

We can define a *Parallel* object with the number of cores to use and set it to the number of CPU cores detected in your hardware.

```python
executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
```

We can then create a list of tasks to execute in parallel, which will be one call to the *score_model()* function for each model configuration we have.

```python
tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
```

Finally, we can use the *Parallel* object to execute the list of tasks in parallel.

```python
scores = executor(tasks)
```

That’s it.

We can also provide a non-parallel version of evaluating all model configurations in case we want to debug something.

```python
scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
```

The result of evaluating a list of configurations will be a list of tuples, each with a name that summarizes a specific model configuration and the error of the model evaluated with that configuration as either the RMSE or None if there was an error.

We can filter out all scores with a None.

```python
scores = [r for r in scores if r[1] is not None]
```

We can then sort all tuples in the list by the score in ascending order (best are first), then return this list of scores for review.

The *grid_search()* function below implements this behavior given a univariate time series dataset, a list of model configurations (list of lists), and the number of time steps to use in the test set. An optional parallel argument allows the evaluation of models across all cores to be turned on or off, and is on by default.

```python
# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores
```
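The filter-and-sort step at the end of the function can be checked on its own. The sketch below uses made-up keys and scores (they are not results from any dataset) to show how failed configurations drop out and the lowest error ends up first.

```python
# made-up (key, score) tuples; None marks a configuration that failed
raw_scores = [('cfg_a', 7.2), ('cfg_b', None), ('cfg_c', 7.1), ('cfg_d', 9.4)]

# remove failed configurations
kept = [r for r in raw_scores if r[1] is not None]
# sort by error, ascending, so the best configuration is first
kept.sort(key=lambda tup: tup[1])

print(kept)  # [('cfg_c', 7.1), ('cfg_a', 7.2), ('cfg_d', 9.4)]
```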

We’re nearly done.

The only thing left to do is to define a list of model configurations to try for a dataset.

We can define this generically. The only parameter we may want to specify is the periodicity of the seasonal component in the series, if one exists. By default, we will assume no seasonal component.

The *exp_smoothing_configs()* function below will create a list of model configurations to evaluate.

An optional list of seasonal periods can be specified, and you could even change the function to specify other elements that you may know about your time series.

In theory, there are 72 possible model configurations to evaluate, but in practice, many will not be valid and will result in an error that we will trap and ignore.

```python
# create a set of exponential smoothing configs to try
def exp_smoothing_configs(seasonal=[None]):
    models = list()
    # define config lists
    t_params = ['add', 'mul', None]
    d_params = [True, False]
    s_params = ['add', 'mul', None]
    p_params = seasonal
    b_params = [True, False]
    r_params = [True, False]
    # create config instances
    for t in t_params:
        for d in d_params:
            for s in s_params:
                for p in p_params:
                    for b in b_params:
                        for r in r_params:
                            cfg = [t,d,s,p,b,r]
                            models.append(cfg)
    return models
```
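The six nested loops enumerate a Cartesian product, so the same list can be built with `itertools.product` from the standard library. The sketch below is an alternative formulation under that assumption, not the version used in the rest of the tutorial; the function name `exp_smoothing_configs_product` is hypothetical. It also confirms the count of 72 configurations for the default of a single seasonal period.

```python
from itertools import product

# alternative to the nested loops (hypothetical name, same ordering of values)
def exp_smoothing_configs_product(seasonal=(None,)):
    t_params = ['add', 'mul', None]
    d_params = [True, False]
    s_params = ['add', 'mul', None]
    b_params = [True, False]
    r_params = [True, False]
    # product() varies the last argument fastest, like the innermost loop
    grid = product(t_params, d_params, s_params, seasonal, b_params, r_params)
    return [list(cfg) for cfg in grid]

configs = exp_smoothing_configs_product()
print(len(configs))  # 72 = 3 * 2 * 3 * 1 * 2 * 2
print(configs[0])    # ['add', True, 'add', None, True, True]
```

Passing two seasonal periods, for example `seasonal=[0, 12]`, doubles the grid to 144 configurations.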

We now have a framework for grid searching triple exponential smoothing model hyperparameters via one-step walk-forward validation.

It is generic and will work for any in-memory univariate time series provided as a list or NumPy array.

We can make sure all the pieces work together by testing it on a contrived 10-step dataset.

The complete example is listed below.

```python
# grid search holt winter's exponential smoothing
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error

# one-step Holt Winter's Exponential Smoothing forecast
def exp_smoothing_forecast(history, config):
    t,d,s,p,b,r = config
    # define model
    model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
    # fit model
    model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = exp_smoothing_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of exponential smoothing configs to try
def exp_smoothing_configs(seasonal=[None]):
    models = list()
    # define config lists
    t_params = ['add', 'mul', None]
    d_params = [True, False]
    s_params = ['add', 'mul', None]
    p_params = seasonal
    b_params = [True, False]
    r_params = [True, False]
    # create config instances
    for t in t_params:
        for d in d_params:
            for s in s_params:
                for p in p_params:
                    for b in b_params:
                        for r in r_params:
                            cfg = [t,d,s,p,b,r]
                            models.append(cfg)
    return models

if __name__ == '__main__':
    # define dataset
    data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
    print(data)
    # data split
    n_test = 4
    # model configs
    cfg_list = exp_smoothing_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example first prints the contrived time series dataset.

Next, the model configurations and their errors are reported as they are evaluated.

Finally, the configurations and the error for the top three configurations are reported.

```
[10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
 > Model[[None, False, None, None, True, True]] 1.380
 > Model[[None, False, None, None, True, False]] 10.000
 > Model[[None, False, None, None, False, True]] 2.563
 > Model[[None, False, None, None, False, False]] 10.000
done
[None, False, None, None, True, True] 1.379824445857423
[None, False, None, None, False, True] 2.5628662672606612
[None, False, None, None, False, False] 10.0
```

We do not report the model parameters optimized by the model itself. It is assumed that you can achieve the same result again by specifying the broader hyperparameters and allowing the library to find the same internal parameters.

You can access these internal parameters by refitting a standalone model with the same configuration and printing the contents of the ‘*params*‘ attribute on the model fit; for example:

```python
print(model_fit.params)
```

Now that we have a robust framework for grid searching ETS model hyperparameters, let’s test it out on a suite of standard univariate time series datasets.

The datasets were chosen for demonstration purposes; I am not suggesting that an ETS model is the best approach for each dataset, and perhaps a SARIMA model or something else would be more appropriate in some cases.

The ‘daily female births’ dataset summarizes the daily total female births in California, USA in 1959.

The dataset has no obvious trend or seasonal component.

You can learn more about the dataset from DataMarket.

Download the dataset directly from here:

Save the file with the filename ‘*daily-total-female-births.csv*‘ in your current working directory.

We can load this dataset as a Pandas series using the function *read_csv()*.

```python
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
```

The dataset has one year, or 365 observations. We will use the first 200 for training and the remaining 165 as the test set.

The complete example grid searching the daily female births univariate time series forecasting problem is listed below.

```python
# grid search ets models for daily female births
from math import sqrt
from multiprocessing import cpu_count
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step Holt Winter's Exponential Smoothing forecast
def exp_smoothing_forecast(history, config):
    t,d,s,p,b,r = config
    # define model
    model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
    # fit model
    model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = exp_smoothing_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of exponential smoothing configs to try
def exp_smoothing_configs(seasonal=[None]):
    models = list()
    # define config lists
    t_params = ['add', 'mul', None]
    d_params = [True, False]
    s_params = ['add', 'mul', None]
    p_params = seasonal
    b_params = [True, False]
    r_params = [True, False]
    # create config instances
    for t in t_params:
        for d in d_params:
            for s in s_params:
                for p in p_params:
                    for b in b_params:
                        for r in r_params:
                            cfg = [t,d,s,p,b,r]
                            models.append(cfg)
    return models

if __name__ == '__main__':
    # load dataset
    series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
    data = series.values
    print(data.shape)
    # data split
    n_test = 165
    # model configs
    cfg_list = exp_smoothing_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example may take a few minutes as fitting each ETS model can take about a minute on modern hardware.

Model configurations and the RMSE are printed as the models are evaluated. The top three model configurations and their error are reported at the end of the run.

We can see that the best result was an RMSE of about 7.08 births with the following configuration:

- **Trend**: Additive
- **Damped**: False
- **Seasonal**: None
- **Seasonal Periods**: None
- **Box-Cox Transform**: True
- **Remove Bias**: True

What is surprising is that a model that assumed an additive trend performed better than one that didn’t.

We would not know that this is the case unless we threw out assumptions and grid searched models.

```
(365, 1)
 > Model[['add', False, None, None, True, True]] 7.081
 > Model[['add', False, None, None, True, False]] 7.113
 > Model[['add', False, None, None, False, True]] 7.112
 > Model[['add', False, None, None, False, False]] 7.115
 > Model[[None, False, None, None, True, True]] 7.169
 > Model[[None, False, None, None, True, False]] 7.212
 > Model[[None, False, None, None, False, True]] 7.117
 > Model[[None, False, None, None, False, False]] 7.126
 > Model[['add', True, None, None, True, False]] 7.170
 > Model[['add', True, None, None, True, True]] 7.118
 > Model[['add', True, None, None, False, True]] 7.113
 > Model[['add', True, None, None, False, False]] 7.126
done
['add', False, None, None, True, True] 7.081359856193836
['add', False, None, None, False, True] 7.111893396203345
['add', True, None, None, False, True] 7.112743603181863
```

The ‘shampoo’ dataset summarizes the monthly sales of shampoo over a three-year period.

The dataset contains an obvious trend but no obvious seasonal component.

You can learn more about the dataset from DataMarket.

Download the dataset directly from here:

Save the file with the filename ‘shampoo.csv’ in your current working directory.

We can load this dataset as a Pandas series using the function *read_csv()*.

```python
# parse dates
def custom_parser(x):
    return datetime.strptime('195'+x, '%Y-%m')

# load dataset
series = read_csv('shampoo.csv', header=0, index_col=0, date_parser=custom_parser)
```

The dataset has three years, or 36 observations. We will use the first 24 for training and the remaining 12 as the test set.

The complete example grid searching the shampoo sales univariate time series forecasting problem is listed below.

```python
# grid search ets models for monthly shampoo sales
from math import sqrt
from multiprocessing import cpu_count
from datetime import datetime
from joblib import Parallel
from joblib import delayed
from warnings import catch_warnings
from warnings import filterwarnings
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error
from pandas import read_csv

# one-step Holt Winter's Exponential Smoothing forecast
def exp_smoothing_forecast(history, config):
    t,d,s,p,b,r = config
    # define model
    model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
    # fit model
    model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
    # make one step forecast
    yhat = model_fit.predict(len(history), len(history))
    return yhat[0]

# root mean squared error or rmse
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # fit model and make forecast for history
        yhat = exp_smoothing_forecast(history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    return error

# score a model, return None on failure
def score_model(data, n_test, cfg, debug=False):
    result = None
    # convert config to a key
    key = str(cfg)
    # show all warnings and fail on exception if debugging
    if debug:
        result = walk_forward_validation(data, n_test, cfg)
    else:
        # one failure during model validation suggests an unstable config
        try:
            # never show warnings when grid searching, too noisy
            with catch_warnings():
                filterwarnings("ignore")
                result = walk_forward_validation(data, n_test, cfg)
        except Exception:
            result = None
    # check for an interesting result
    if result is not None:
        print(' > Model[%s] %.3f' % (key, result))
    return (key, result)

# grid search configs
def grid_search(data, cfg_list, n_test, parallel=True):
    scores = None
    if parallel:
        # execute configs in parallel
        executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
        tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
        scores = executor(tasks)
    else:
        scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
    # remove empty results
    scores = [r for r in scores if r[1] is not None]
    # sort configs by error, asc
    scores.sort(key=lambda tup: tup[1])
    return scores

# create a set of exponential smoothing configs to try
def exp_smoothing_configs(seasonal=[None]):
    models = list()
    # define config lists
    t_params = ['add', 'mul', None]
    d_params = [True, False]
    s_params = ['add', 'mul', None]
    p_params = seasonal
    b_params = [True, False]
    r_params = [True, False]
    # create config instances
    for t in t_params:
        for d in d_params:
            for s in s_params:
                for p in p_params:
                    for b in b_params:
                        for r in r_params:
                            cfg = [t,d,s,p,b,r]
                            models.append(cfg)
    return models

# parse dates
def custom_parser(x):
    return datetime.strptime('195'+x, '%Y-%m')

if __name__ == '__main__':
    # load dataset
    series = read_csv('shampoo.csv', header=0, index_col=0, date_parser=custom_parser)
    data = series.values
    print(data.shape)
    # data split
    n_test = 12
    # model configs
    cfg_list = exp_smoothing_configs()
    # grid search
    scores = grid_search(data, cfg_list, n_test)
    print('done')
    # list top 3 configs
    for cfg, error in scores[:3]:
        print(cfg, error)
```

Running the example is fast given there are a small number of observations.

Model configurations and the RMSE are printed as the models are evaluated. The top three model configurations and their error are reported at the end of the run.

We can see that the best result was an RMSE of about 97.92 sales with the following configuration:

- **Trend**: Additive
- **Damped**: True
- **Seasonal**: None
- **Seasonal Periods**: None
- **Box-Cox Transform**: False
- **Remove Bias**: True

```
(36, 1)
 > Model[['add', False, None, None, False, True]] 106.431
 > Model[['add', False, None, None, False, False]] 104.874
 > Model[[None, False, None, None, False, True]] 99.416
 > Model[[None, False, None, None, False, False]] 108.031
 > Model[['add', True, None, None, False, True]] 97.918
 > Model[['add', True, None, None, False, False]] 103.069
done
['add', True, None, None, False, True] 97.91815887268478
[None, False, None, None, False, True] 99.41551161747742
['add', True, None, None, False, False] 103.06878140174923
```

The ‘monthly mean temperatures’ dataset summarizes the monthly average air temperatures in Nottingham Castle, England from 1920 to 1939 in degrees Fahrenheit.

The dataset has an obvious seasonal component and no obvious trend.

You can learn more about the dataset from DataMarket.

Download the dataset directly from here:

Save the file with the filename ‘monthly-mean-temp.csv’ in your current working directory.

We can load this dataset as a Pandas series using the function *read_csv()*.

series = read_csv('monthly-mean-temp.csv', header=0, index_col=0)

The dataset has 20 years, or 240 observations.

We will trim the dataset to the last five years of data (60 observations) in order to speed up the model evaluation process and use the last year, or 12 observations, for the test set.

    # trim dataset to 5 years
    data = data[-(5*12):]

The period of the seasonal component is about one year, or 12 observations.

We will use this as the seasonal period in the call to the *exp_smoothing_configs()* function when preparing the model configurations.

    # model configs
    cfg_list = exp_smoothing_configs(seasonal=[12])
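To get a feel for the size of the search, the nested loops in *exp_smoothing_configs()* can be sketched with `itertools.product`. This is a minimal sketch rather than the tutorial's own function: the name `enumerate_configs` is hypothetical, and it assumes the single seasonal period of 12 used in the complete example.

```python
from itertools import product

# hypothetical helper mirroring the option lists in exp_smoothing_configs()
def enumerate_configs(seasonal=(None,)):
    t_params = ['add', 'mul', None]   # trend type
    d_params = [True, False]          # damped trend
    s_params = ['add', 'mul', None]   # seasonal type
    p_params = list(seasonal)         # seasonal periods
    b_params = [True, False]          # Box-Cox transform
    r_params = [True, False]          # remove bias
    grid = product(t_params, d_params, s_params, p_params, b_params, r_params)
    return [list(cfg) for cfg in grid]

configs = enumerate_configs(seasonal=[12])
# 3 * 2 * 3 * 1 * 2 * 2 combinations
print(len(configs))
```

Configurations that raise an error during fitting return None from *score_model()* and are filtered out by *grid_search()*, so a run usually reports fewer results than the size of the grid.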

The complete example grid searching the monthly mean temperature time series forecasting problem is listed below.

    # grid search ets hyperparameters for monthly mean temp dataset
    from math import sqrt
    from multiprocessing import cpu_count
    from joblib import Parallel
    from joblib import delayed
    from warnings import catch_warnings
    from warnings import filterwarnings
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from sklearn.metrics import mean_squared_error
    from pandas import read_csv

    # one-step Holt Winter’s Exponential Smoothing forecast
    def exp_smoothing_forecast(history, config):
        t,d,s,p,b,r = config
        # define model
        model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
        # fit model
        model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
        # make one step forecast
        yhat = model_fit.predict(len(history), len(history))
        return yhat[0]

    # root mean squared error or rmse
    def measure_rmse(actual, predicted):
        return sqrt(mean_squared_error(actual, predicted))

    # split a univariate dataset into train/test sets
    def train_test_split(data, n_test):
        return data[:-n_test], data[-n_test:]

    # walk-forward validation for univariate data
    def walk_forward_validation(data, n_test, cfg):
        predictions = list()
        # split dataset
        train, test = train_test_split(data, n_test)
        # seed history with training dataset
        history = [x for x in train]
        # step over each time-step in the test set
        for i in range(len(test)):
            # fit model and make forecast for history
            yhat = exp_smoothing_forecast(history, cfg)
            # store forecast in list of predictions
            predictions.append(yhat)
            # add actual observation to history for the next loop
            history.append(test[i])
        # estimate prediction error
        error = measure_rmse(test, predictions)
        return error

    # score a model, return None on failure
    def score_model(data, n_test, cfg, debug=False):
        result = None
        # convert config to a key
        key = str(cfg)
        # show all warnings and fail on exception if debugging
        if debug:
            result = walk_forward_validation(data, n_test, cfg)
        else:
            # one failure during model validation suggests an unstable config
            try:
                # never show warnings when grid searching, too noisy
                with catch_warnings():
                    filterwarnings("ignore")
                    result = walk_forward_validation(data, n_test, cfg)
            except:
                error = None
        # check for an interesting result
        if result is not None:
            print(' > Model[%s] %.3f' % (key, result))
        return (key, result)

    # grid search configs
    def grid_search(data, cfg_list, n_test, parallel=True):
        scores = None
        if parallel:
            # execute configs in parallel
            executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
            tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
            scores = executor(tasks)
        else:
            scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
        # remove empty results
        scores = [r for r in scores if r[1] != None]
        # sort configs by error, asc
        scores.sort(key=lambda tup: tup[1])
        return scores

    # create a set of exponential smoothing configs to try
    def exp_smoothing_configs(seasonal=[None]):
        models = list()
        # define config lists
        t_params = ['add', 'mul', None]
        d_params = [True, False]
        s_params = ['add', 'mul', None]
        p_params = seasonal
        b_params = [True, False]
        r_params = [True, False]
        # create config instances
        for t in t_params:
            for d in d_params:
                for s in s_params:
                    for p in p_params:
                        for b in b_params:
                            for r in r_params:
                                cfg = [t,d,s,p,b,r]
                                models.append(cfg)
        return models

    if __name__ == '__main__':
        # load dataset
        series = read_csv('monthly-mean-temp.csv', header=0, index_col=0)
        data = series.values
        # trim dataset to 5 years
        data = data[-(5*12):]
        print(data.shape)
        # data split
        n_test = 12
        # model configs
        cfg_list = exp_smoothing_configs(seasonal=[12])
        # grid search
        scores = grid_search(data, cfg_list, n_test)
        print('done')
        # list top 3 configs
        for cfg, error in scores[:3]:
            print(cfg, error)

Running the example is relatively slow given the large amount of data.

Model configurations and the RMSE are printed as the models are evaluated. The top three model configurations and their error are reported at the end of the run.

We can see that the best result was an RMSE of about 1.50 degrees with the following configuration:

- **Trend**: None
- **Damped**: False
- **Seasonal**: Additive
- **Seasonal Periods**: 12
- **Box-Cox Transform**: False
- **Remove Bias**: False

    (60, 1)
    > Model[['add', True, None, 12, True, True]] 4.654
    > Model[['add', True, None, 12, True, False]] 4.597
    > Model[['add', True, None, 12, False, True]] 4.800
    > Model[['add', True, None, 12, False, False]] 4.760
    > Model[['add', False, 'add', 12, True, False]] 1.826
    > Model[['add', False, 'add', 12, True, True]] 1.860
    > Model[['add', False, None, 12, True, True]] 4.980
    > Model[['add', False, 'add', 12, False, True]] 1.707
    > Model[['add', False, 'add', 12, False, False]] 1.683
    > Model[['add', False, None, 12, True, False]] 4.900
    > Model[['add', False, None, 12, False, True]] 5.203
    > Model[['add', False, None, 12, False, False]] 5.151
    > Model[[None, False, 'add', 12, True, True]] 1.508
    > Model[[None, False, 'add', 12, True, False]] 1.507
    > Model[[None, False, 'add', 12, False, True]] 1.502
    > Model[[None, False, 'add', 12, False, False]] 1.502
    > Model[[None, False, None, 12, True, True]] 5.188
    > Model[[None, False, None, 12, True, False]] 5.143
    > Model[[None, False, None, 12, False, True]] 5.187
    > Model[[None, False, None, 12, False, False]] 5.143
    > Model[['add', True, 'add', 12, True, False]] 1.638
    > Model[['add', True, 'add', 12, False, True]] 1.568
    > Model[['add', True, 'add', 12, False, False]] 1.555
    > Model[['add', True, 'add', 12, True, True]] 1.646
    done
    [None, False, 'add', 12, False, False] 1.5015471290238562
    [None, False, 'add', 12, False, True] 1.5015526638829775
    [None, False, 'add', 12, True, False] 1.5072252073652794

The ‘monthly car sales’ dataset summarizes the monthly car sales in Quebec, Canada between 1960 and 1968.

The dataset has an obvious trend and seasonal component.

You can learn more about the dataset from DataMarket.

Download the dataset directly from here:

Save the file with the filename ‘monthly-car-sales.csv’ in your current working directory.

We can load this dataset as a Pandas series using the function *read_csv()*.

series = read_csv('monthly-car-sales.csv', header=0, index_col=0)

The dataset has nine years, or 108 observations. We will use the last year, or 12 observations, as the test set.

The period of the seasonal component could be six months or 12 months. We will try both as the seasonal period in the call to the *exp_smoothing_configs()* function when preparing the model configurations.

    # model configs
    cfg_list = exp_smoothing_configs(seasonal=[6,12])
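Adding a second seasonal period doubles the grid, and a number of the combinations are redundant: damping only applies to a trend component, and the seasonal period has no effect when no seasonal component is modeled (in the run reported for this dataset, the non-seasonal configurations score identically for periods 6 and 12). One possible refinement, which is not part of the tutorial's framework, is to prune such configurations before the search. The helper name `prune_configs` below is hypothetical.

```python
from itertools import product

# hypothetical helper: drop configs that are invalid or duplicate another model
def prune_configs(configs):
    kept, seen = [], set()
    for t, d, s, p, b, r in configs:
        # damping only applies to a trend component
        if d and t is None:
            continue
        # the seasonal period is irrelevant without a seasonal component
        if s is None:
            p = None
        key = (t, d, s, p, b, r)
        if key not in seen:
            seen.add(key)
            kept.append(list(key))
    return kept

# the same grid that exp_smoothing_configs(seasonal=[6,12]) would enumerate
full = [list(c) for c in product(['add', 'mul', None], [True, False],
                                 ['add', 'mul', None], [6, 12],
                                 [True, False], [True, False])]
pruned = prune_configs(full)
print(len(full), len(pruned))
```

Because each configuration is refit at every step of walk-forward validation, trimming the grid in this way may noticeably shorten the run on larger datasets.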

The complete example grid searching the monthly car sales time series forecasting problem is listed below.

    # grid search ets models for monthly car sales
    from math import sqrt
    from multiprocessing import cpu_count
    from joblib import Parallel
    from joblib import delayed
    from warnings import catch_warnings
    from warnings import filterwarnings
    from statsmodels.tsa.holtwinters import ExponentialSmoothing
    from sklearn.metrics import mean_squared_error
    from pandas import read_csv

    # one-step Holt Winter’s Exponential Smoothing forecast
    def exp_smoothing_forecast(history, config):
        t,d,s,p,b,r = config
        # define model
        model = ExponentialSmoothing(history, trend=t, damped=d, seasonal=s, seasonal_periods=p)
        # fit model
        model_fit = model.fit(optimized=True, use_boxcox=b, remove_bias=r)
        # make one step forecast
        yhat = model_fit.predict(len(history), len(history))
        return yhat[0]

    # root mean squared error or rmse
    def measure_rmse(actual, predicted):
        return sqrt(mean_squared_error(actual, predicted))

    # split a univariate dataset into train/test sets
    def train_test_split(data, n_test):
        return data[:-n_test], data[-n_test:]

    # walk-forward validation for univariate data
    def walk_forward_validation(data, n_test, cfg):
        predictions = list()
        # split dataset
        train, test = train_test_split(data, n_test)
        # seed history with training dataset
        history = [x for x in train]
        # step over each time-step in the test set
        for i in range(len(test)):
            # fit model and make forecast for history
            yhat = exp_smoothing_forecast(history, cfg)
            # store forecast in list of predictions
            predictions.append(yhat)
            # add actual observation to history for the next loop
            history.append(test[i])
        # estimate prediction error
        error = measure_rmse(test, predictions)
        return error

    # score a model, return None on failure
    def score_model(data, n_test, cfg, debug=False):
        result = None
        # convert config to a key
        key = str(cfg)
        # show all warnings and fail on exception if debugging
        if debug:
            result = walk_forward_validation(data, n_test, cfg)
        else:
            # one failure during model validation suggests an unstable config
            try:
                # never show warnings when grid searching, too noisy
                with catch_warnings():
                    filterwarnings("ignore")
                    result = walk_forward_validation(data, n_test, cfg)
            except:
                error = None
        # check for an interesting result
        if result is not None:
            print(' > Model[%s] %.3f' % (key, result))
        return (key, result)

    # grid search configs
    def grid_search(data, cfg_list, n_test, parallel=True):
        scores = None
        if parallel:
            # execute configs in parallel
            executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
            tasks = (delayed(score_model)(data, n_test, cfg) for cfg in cfg_list)
            scores = executor(tasks)
        else:
            scores = [score_model(data, n_test, cfg) for cfg in cfg_list]
        # remove empty results
        scores = [r for r in scores if r[1] != None]
        # sort configs by error, asc
        scores.sort(key=lambda tup: tup[1])
        return scores

    # create a set of exponential smoothing configs to try
    def exp_smoothing_configs(seasonal=[None]):
        models = list()
        # define config lists
        t_params = ['add', 'mul', None]
        d_params = [True, False]
        s_params = ['add', 'mul', None]
        p_params = seasonal
        b_params = [True, False]
        r_params = [True, False]
        # create config instances
        for t in t_params:
            for d in d_params:
                for s in s_params:
                    for p in p_params:
                        for b in b_params:
                            for r in r_params:
                                cfg = [t,d,s,p,b,r]
                                models.append(cfg)
        return models

    if __name__ == '__main__':
        # load dataset
        series = read_csv('monthly-car-sales.csv', header=0, index_col=0)
        data = series.values
        print(data.shape)
        # data split
        n_test = 12
        # model configs
        cfg_list = exp_smoothing_configs(seasonal=[6,12])
        # grid search
        scores = grid_search(data, cfg_list, n_test)
        print('done')
        # list top 3 configs
        for cfg, error in scores[:3]:
            print(cfg, error)

Running the example is slow given the large amount of data.

Model configurations and the RMSE are printed as the models are evaluated. The top three model configurations and their error are reported at the end of the run.

We can see that the best result was an RMSE of about 1,658 sales with the following configuration:

- **Trend**: Additive
- **Damped**: True
- **Seasonal**: Additive
- **Seasonal Periods**: 12
- **Box-Cox Transform**: False
- **Remove Bias**: True

This is a little surprising as I would have guessed that a six-month seasonal model would be the preferred approach.

    (108, 1)
    > Model[['add', True, 'add', 6, False, True]] 3240.433
    > Model[['add', True, 'add', 6, False, False]] 3226.384
    > Model[['add', True, 'add', 6, True, False]] 2823.588
    > Model[['add', True, 'add', 6, True, True]] 2810.103
    > Model[['add', True, 'add', 12, False, False]] 1684.424
    > Model[['add', True, 'add', 12, False, True]] 1658.925
    > Model[['add', True, None, 6, True, True]] 3933.443
    > Model[['add', True, None, 6, True, False]] 3915.510
    > Model[['add', True, None, 6, False, True]] 3924.489
    > Model[['add', True, None, 6, False, False]] 3905.487
    > Model[['add', True, None, 12, True, True]] 3935.659
    > Model[['add', True, None, 12, True, False]] 3915.499
    > Model[['add', True, 'add', 12, True, False]] 2067.595
    > Model[['add', True, 'add', 12, True, True]] 2095.083
    > Model[['add', False, 'add', 6, True, True]] 3220.532
    > Model[['add', False, 'add', 6, True, False]] 3199.766
    > Model[['add', True, None, 12, False, True]] 3934.066
    > Model[['add', True, None, 12, False, False]] 3905.367
    > Model[['add', False, None, 6, True, True]] 3815.765
    > Model[['add', False, None, 6, True, False]] 3813.234
    > Model[['add', False, None, 6, False, True]] 3805.651
    > Model[['add', False, 'add', 6, False, True]] 3243.478
    > Model[['add', False, None, 6, False, False]] 3813.920
    > Model[['add', False, None, 12, True, True]] 3815.765
    > Model[['add', False, 'add', 6, False, False]] 3226.955
    > Model[['add', False, None, 12, True, False]] 3813.234
    > Model[['add', False, None, 12, False, True]] 3805.700
    > Model[['add', False, None, 12, False, False]] 3809.819
    > Model[['add', False, 'add', 12, False, True]] 1675.977
    > Model[[None, False, 'add', 6, True, True]] 3204.875
    > Model[['add', False, 'add', 12, False, False]] 1733.999
    > Model[['add', False, 'add', 12, True, True]] 1833.418
    > Model[['add', False, 'add', 12, True, False]] 1833.608
    > Model[[None, False, 'add', 6, True, False]] 3190.972
    > Model[[None, False, 'add', 6, False, True]] 3147.644
    > Model[[None, False, 'add', 6, False, False]] 3049.932
    > Model[[None, False, 'add', 12, True, True]] 1834.905
    > Model[[None, False, 'add', 12, True, False]] 1872.182
    > Model[[None, False, 'add', 12, False, True]] 1751.538
    > Model[[None, False, 'add', 12, False, False]] 1800.867
    > Model[[None, False, None, 6, True, True]] 3801.741
    > Model[[None, False, None, 6, True, False]] 3783.966
    > Model[[None, False, None, 6, False, True]] 3801.560
    > Model[[None, False, None, 6, False, False]] 3783.966
    > Model[[None, False, None, 12, True, True]] 3801.741
    > Model[[None, False, None, 12, True, False]] 3783.966
    > Model[[None, False, None, 12, False, True]] 3801.560
    > Model[[None, False, None, 12, False, False]] 3783.966
    done
    ['add', True, 'add', 12, False, True] 1658.9253551827699
    ['add', False, 'add', 12, False, True] 1675.9766454275066
    ['add', True, 'add', 12, False, False] 1684.4244861443897
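The gap between the six-month and 12-month seasonal models can be checked by grouping the scores by seasonal period. The sketch below assumes the *(key, result)* tuples returned by *grid_search()*, where the key is the string form of the configuration. Only four of the reported results are copied in, and the function name `best_by_period` is hypothetical.

```python
from ast import literal_eval

# a few (key, rmse) pairs copied from the run; keys are str(cfg) as built in score_model()
scores = [
    ("['add', True, 'add', 12, False, True]", 1658.925),
    ("['add', False, 'add', 12, False, True]", 1675.977),
    ("['add', True, 'add', 6, True, True]", 2810.103),
    ("[None, False, 'add', 6, False, False]", 3049.932),
]

# hypothetical helper: best (config, rmse) for each seasonal period
def best_by_period(scores):
    best = dict()
    for key, rmse in scores:
        # turn the string key back into a list
        cfg = literal_eval(key)
        period = cfg[3]
        if period not in best or rmse < best[period][1]:
            best[period] = (cfg, rmse)
    return best

best = best_by_period(scores)
for period in sorted(best):
    print(period, best[period])
```

On these results, the best six-month model has an RMSE of about 2,810 compared to about 1,659 for the best 12-month model.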

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Data Transforms**. Update the framework to support configurable data transforms such as normalization and standardization.
- **Plot Forecast**. Update the framework to re-fit a model with the best configuration and forecast the entire test dataset, then plot the forecast compared to the actual observations in the test set.
- **Tune Amount of History**. Update the framework to tune the amount of historical data used to fit the model (e.g. in the case of the 20 years of monthly mean temperature data).

If you explore any of these extensions, I’d love to know.
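As a starting point for the data transforms extension, the sketch below standardizes the history before a forecast is made and inverts the transform on the prediction. It is only one possible design and makes several assumptions: the function names are hypothetical, the scaling parameters are estimated from the history alone so that no test information leaks into the model, and a simple persistence forecast stands in for the exponential smoothing model.

```python
from statistics import mean, pstdev

# scale the history and return the parameters needed to invert the transform
def standardize(history):
    mu, sigma = mean(history), pstdev(history)
    # guard against a constant series
    sigma = sigma if sigma > 0 else 1.0
    return [(x - mu) / sigma for x in history], (mu, sigma)

# map a forecast back to the original scale
def invert(yhat, params):
    mu, sigma = params
    return yhat * sigma + mu

# wrap any one-step forecast function with the transform
def forecast_with_transform(history, forecast_func):
    scaled, params = standardize(history)
    return invert(forecast_func(scaled), params)

# stand-in model: persistence forecast (last observed value)
def persistence(history):
    return history[-1]

yhat = forecast_with_transform([10.0, 12.0, 14.0, 16.0], persistence)
print(yhat)
```

In the framework, the wrapper would be called once per step of walk-forward validation, so the transform is re-estimated as the history grows.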

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Exponential smoothing, Forecasting: principles and practice, 2013.
- Section 6.4. Introduction to Time Series Analysis, Engineering Statistics Handbook, 2012.
- Practical Time Series Forecasting with R, 2016.

- statsmodels.tsa.holtwinters.ExponentialSmoothing API
- statsmodels.tsa.holtwinters.HoltWintersResults API
- Joblib: running Python functions as pipeline jobs

In this tutorial, you discovered how to develop a framework for grid searching all of the exponential smoothing model hyperparameters for univariate time series forecasting.

Specifically, you learned:

- How to develop a framework for grid searching ETS models from scratch using walk-forward validation.
- How to grid search ETS model hyperparameters for daily time series data for births.
- How to grid search ETS model hyperparameters for monthly time series data for shampoo sales, car sales and temperature.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop Machine Learning Models for Multivariate Multi-Step Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction’ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Machine learning algorithms can be applied to time series forecasting problems and offer benefits such as the ability to handle multiple input variables with noisy complex dependencies.

In this tutorial, you will discover how to develop machine learning models for multi-step time series forecasting of air pollution data.

After completing this tutorial, you will know:

- How to impute missing values and transform time series data so that it can be modeled by supervised learning algorithms.
- How to develop and evaluate a suite of linear algorithms for multi-step time series forecasting.
- How to develop and evaluate a suite of nonlinear algorithms for multi-step time series forecasting.

Let’s get started.

This tutorial is divided into eight parts; they are:

- Problem Description
- Model Evaluation
- Machine Learning Modeling
- Machine Learning Data Preparation
- Model Evaluation Test Harness
- Evaluate Linear Algorithms
- Evaluate Nonlinear Algorithms
- Tune Lag Size

The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next 3 days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72 hour forecast period. They are:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.

Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.

The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.

Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).

A winning entrant achieved a MAE of 0.21058 on the withheld test set (private leaderboard) using random forest on lagged observations. A writeup of this solution is available in the post:

- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon, 2012.

In this tutorial, we will explore how to develop naive forecasts for the problem that can be used as a baseline to determine whether a model has skill on the problem or not.


Before we can evaluate naive forecasting methods, we must develop a test harness.

This includes at least how the data will be prepared and how forecasts will be evaluated.

The first step is to download the dataset and load it into memory.

The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.

Download the entire dataset, e.g. “*Download All*” to your workstation and unzip the archive in your current working directory with the folder named ‘*AirQualityPrediction*‘.

Our focus will be the ‘*TrainingData.csv*‘ file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.

We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)

We can group data by the ‘chunkID’ variable (column index 1).

First, let’s get a list of the unique chunk identifiers.

chunk_ids = unique(values[:, 1])

We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.

    chunks = dict()
    # sort rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, 1] == chunk_id
        chunks[chunk_id] = values[selection, :]

The function below, named *to_chunks()*, takes a NumPy array of the loaded data and returns a dictionary that maps each *chunk_id* to the rows for that chunk.

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks
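To see what the function returns, it can be run on a small made-up array. The toy values below are an assumption for illustration only; the real dataset has many more columns, with the chunk id in column index 1.

```python
from numpy import array, unique

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
    chunks = dict()
    chunk_ids = unique(values[:, chunk_ix])
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]
    return chunks

# toy data: columns are [row id, chunk id, position within chunk]
values = array([
    [1, 1, 1],
    [2, 1, 2],
    [3, 2, 1],
    [4, 4, 1],
    [5, 4, 2],
    [6, 4, 3],
])
chunks = to_chunks(values)
# one entry per chunk id, each holding only the rows for that chunk
for chunk_id, rows in chunks.items():
    print(chunk_id, rows.shape)
```

Chunk ids do not have to be contiguous; in the toy data there is no chunk 3, and the dictionary simply has no entry for it.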

The complete example that loads the dataset and splits it into chunks is listed below.

    # load data and split into chunks
    from numpy import unique
    from pandas import read_csv

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    print('Total Chunks: %d' % len(chunks))

Running the example prints the number of chunks in the dataset.

Total Chunks: 208

Now that we know how to load the data and split it into chunks, we can separate into train and test datasets.

Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.

We can split each chunk into the first five days of observations for training and the last three for test.

Each observation has a column called ‘*position_within_chunk*‘ that varies from 1 to 192 (8 days * 24 hours). We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any rows with a value greater than 120 as test data.

Further, any chunks that don’t have any observations in the train or test split can be dropped as not viable.

When working with the naive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data and have the train and test data only comprised of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.

The *split_train_test()* function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.

    # split each chunk into train/test sets
    def split_train_test(chunks, row_in_chunk_ix=2):
        train, test = list(), list()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        # enumerate chunks
        for k,rows in chunks.items():
            # split chunk rows by 'position_within_chunk'
            train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
            test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
            if len(train_rows) == 0 or len(test_rows) == 0:
                print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
                continue
            # store with chunk id, position in chunk, hour and all targets
            indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
            train.append(train_rows[:, indices])
            test.append(test_rows[:, indices])
        return train, test
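The core of the split is a pair of boolean masks on the position column. The sketch below isolates that step on a made-up chunk with a complete set of 192 hourly rows; the two-column layout is an assumption for illustration and is not the layout of the real data.

```python
from numpy import arange, column_stack, zeros

# toy chunk: column 0 stands in for 'position_within_chunk', column 1 is a dummy target
positions = arange(1, 193)
rows = column_stack((positions, zeros(len(positions))))

# first 5 days of hourly observations for train, the remaining 3 days for test
cut_point = 5 * 24
train_rows = rows[rows[:, 0] <= cut_point, :]
test_rows = rows[rows[:, 0] > cut_point, :]
print(train_rows.shape, test_rows.shape)
```

With real chunks, either mask may select no rows at all, which is why *split_train_test()* checks for empty results and drops those chunks.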

We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three day period, specifically the lead times:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Each lead time is relative to the end of the training period.

First, we can put these lead times into a function for easy reference:

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

Next, we can reduce the test dataset down to just the data at the preferred lead times.

We can do that by looking at the ‘*position_within_chunk*‘ column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 + 2, etc.

If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.

The function *to_forecasts()* below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.

    # convert the rows in a test chunk to forecasts
    def to_forecasts(test_chunks, row_in_chunk_ix=1):
        # get lead times
        lead_times = get_lead_times()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        forecasts = list()
        # enumerate each chunk
        for rows in test_chunks:
            chunk_id = rows[0, 0]
            # enumerate each lead time
            for tau in lead_times:
                # determine the row in chunk we want for the lead time
                offset = cut_point + tau
                # retrieve data for the lead time using row number in chunk
                row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
                # check if we have data
                if len(row_for_tau) == 0:
                    # create a mock row [chunk, position, hour] + [nan...]
                    row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                    forecasts.append(row)
                else:
                    # store the forecast row
                    forecasts.append(row_for_tau[0])
        return array(forecasts)
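The offset arithmetic and the NaN fill can be traced on a small example. The sketch below is a simplified illustration on made-up data with a single target column instead of 39; it follows the same logic as *to_forecasts()* but is not a replacement for it.

```python
from numpy import array, isnan, nan

lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
cut_point = 5 * 24
# positions within the chunk that the forecasts refer to
offsets = [cut_point + tau for tau in lead_times]
print(offsets)

# toy test chunk with columns [chunk id, position, hour, one target]
# positions 122 and 125 onward are missing
rows = array([
    [7, 121, 0, 0.5],
    [7, 123, 2, 0.7],
    [7, 124, 3, 0.6],
])
selected = list()
for offset in offsets:
    match = rows[rows[:, 1] == offset, :]
    if len(match) == 0:
        # no observation at this lead time: keep the ids, fill the rest with NaN
        selected.append([rows[0, 0], offset, nan, nan])
    else:
        selected.append(match[0])
selected = array(selected)
print(selected.shape)
```

Every chunk therefore contributes exactly ten rows, one per lead time, whether or not an observation exists at that position.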

We can tie all of this together and split the dataset into train and test sets and save the results to new files.

The complete code example is listed below.

    # split data into train and test sets
    from numpy import unique
    from numpy import nan
    from numpy import array
    from numpy import savetxt
    from pandas import read_csv

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # split each chunk into train/test sets
    def split_train_test(chunks, row_in_chunk_ix=2):
        train, test = list(), list()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        # enumerate chunks
        for k,rows in chunks.items():
            # split chunk rows by 'position_within_chunk'
            train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
            test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
            if len(train_rows) == 0 or len(test_rows) == 0:
                print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
                continue
            # store with chunk id, position in chunk, hour and all targets
            indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
            train.append(train_rows[:, indices])
            test.append(test_rows[:, indices])
        return train, test

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

    # convert the rows in a test chunk to forecasts
    def to_forecasts(test_chunks, row_in_chunk_ix=1):
        # get lead times
        lead_times = get_lead_times()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        forecasts = list()
        # enumerate each chunk
        for rows in test_chunks:
            chunk_id = rows[0, 0]
            # enumerate each lead time
            for tau in lead_times:
                # determine the row in chunk we want for the lead time
                offset = cut_point + tau
                # retrieve data for the lead time using row number in chunk
                row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
                # check if we have data
                if len(row_for_tau) == 0:
                    # create a mock row [chunk, position, hour] + [nan...]
                    row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                    forecasts.append(row)
                else:
                    # store the forecast row
                    forecasts.append(row_for_tau[0])
        return array(forecasts)

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # split into train/test
    train, test = split_train_test(chunks)
    # flatten training chunks to rows
    train_rows = array([row for rows in train for row in rows])
    print('Train Rows: %s' % str(train_rows.shape))
    # reduce test to forecast lead times only
    test_rows = to_forecasts(test)
    print('Test Rows: %s' % str(test_rows.shape))
    # save datasets
    savetxt('AirQualityPrediction/naive_train.csv', train_rows, delimiter=',')
    savetxt('AirQualityPrediction/naive_test.csv', test_rows, delimiter=',')

Running the example first reports that chunk 69 is removed from the dataset for having insufficient data.

We can then see that we have 42 columns in each of the train and test sets: one each for the chunk id, position within chunk, and hour of day, plus the 39 target variables.

We can also see the dramatically smaller version of the test dataset with rows only at the forecast lead times.

The new train and test datasets are saved in the ‘*naive_train.csv*‘ and ‘*naive_test.csv*‘ files respectively.

    >dropping chunk=69: train=(0, 95), test=(28, 95)
    Train Rows: (23514, 42)
    Test Rows: (2070, 42)

Once forecasts have been made, they need to be evaluated.

It is helpful to have a simpler format when evaluating forecasts. For example, we will use the three-dimensional structure of *[chunks][variables][time]*, where variable is the target variable number from 0 to 38 and time is the lead time index from 0 to 9.

Models are expected to make predictions in this format.

We can also restructure the test dataset to have this same structure for comparison. The *prepare_test_forecasts()* function below implements this.

    # convert the test dataset in chunks to [chunk][variable][time] format
    def prepare_test_forecasts(test_chunks):
        predictions = list()
        # enumerate chunks to forecast
        for rows in test_chunks:
            # enumerate targets for chunk
            chunk_predictions = list()
            for j in range(3, rows.shape[1]):
                yhat = rows[:, j]
                chunk_predictions.append(yhat)
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)
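The restructuring amounts to dropping the three bookkeeping columns and transposing what remains. The sketch below checks the resulting shape on made-up chunks and should give the same layout as the loop in *prepare_test_forecasts()*; the toy values are placeholders, not real observations.

```python
from numpy import arange, array

n_chunks, n_leads, n_targets = 2, 10, 39
# each toy chunk has one row per lead time: [chunk id, position, hour] + 39 targets
n_cols = 3 + n_targets
test_chunks = [arange(n_leads * n_cols, dtype=float).reshape(n_leads, n_cols)
               for _ in range(n_chunks)]

# drop the first three columns and transpose each chunk to [variable][time]
actual = array([rows[:, 3:].T for rows in test_chunks])
print(actual.shape)
```

The first axis indexes chunks, the second the target variable (0 to 38), and the third the lead time index (0 to 9).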

We will evaluate a model using the mean absolute error, or MAE. This is the metric that was used in the competition and is a sensible choice given the non-Gaussian distribution of the target variables.

If a lead time contains no data in the test set (e.g. *NaN*), then no error will be calculated for that forecast. If the lead time does have data in the test set but no data in the forecast, then the full magnitude of the observation will be taken as error. Finally, if the test set has an observation and a forecast was made, then the absolute difference will be recorded as the error.

The *calculate_error()* function implements these rules and returns the error for a given forecast.

    # calculate the error between an actual and predicted value
    def calculate_error(actual, predicted):
        # give the full actual value if predicted is nan
        if isnan(predicted):
            return abs(actual)
        # calculate abs difference
        return abs(actual - predicted)

Errors are summed across all chunks and all lead times, then averaged.

The overall MAE will be calculated, but we will also calculate a MAE for each forecast lead time. This can help with model selection generally as some models may perform differently at different lead times.

The *evaluate_forecasts()* function below implements this, calculating the MAE and per-lead time MAE for the provided predictions and expected values in *[chunk][variable][time]* format.

```python
# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks (use the argument, not a global variable)
    for i in range(len(testset)):
        # convert to forecasts
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae
```
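The same scoring rules can also be written with NumPy array operations instead of nested loops. The sketch below is an alternative formulation rather than the tutorial's implementation; the function name *evaluate_forecasts_vectorized()* and the toy arrays are assumptions made for illustration.

```python
from numpy import array, nan, isnan, where, absolute

def evaluate_forecasts_vectorized(predictions, testset):
    # only positions with an actual observation are scored
    mask = ~isnan(testset)
    # a missing forecast counts as zero, so its error becomes abs(actual)
    filled = where(isnan(predictions), 0.0, predictions)
    errors = absolute(testset - filled)
    # zero out unscored positions so they do not contribute to the sums
    errors = where(mask, errors, 0.0)
    total_mae = errors.sum() / mask.sum()
    # sum over chunks and variables, leaving one value per lead time
    times_mae = errors.sum(axis=(0, 1)) / mask.sum(axis=(0, 1))
    return total_mae, times_mae

# toy data: 1 chunk, 2 variables, 2 lead times
actual = array([[[1.0, nan], [2.0, 4.0]]])
predicted = array([[[0.5, 3.0], [nan, 5.0]]])
total_mae, times_mae = evaluate_forecasts_vectorized(predicted, actual)
print(total_mae, times_mae)
```

In the toy data, one actual value is missing and is skipped, and one forecast is missing and is charged the full observation of 2.0. Unlike the loop version, a lead time with no observations at all would produce *NaN* here rather than raise a division error.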

Once we have the evaluation of a model, we can present it.

The *summarize_error()* function below first prints a one-line summary of a model’s performance then creates a plot of MAE per forecast lead time.

```python
# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()
```

We are now ready to start exploring the performance of machine learning forecasting methods.

Machine Learning Modeling

The problem can be modeled with machine learning.

Most machine learning models do not directly support the notion of observations over time. Instead, the lag observations must be treated as input features in order to make predictions.

This is a benefit of machine learning algorithms for time series forecasting: they are able to support large numbers of input features. These could be lag observations for one or multiple input time series.

Other general benefits of machine learning algorithms for time series forecasting over classical methods include:

- Ability to support noisy features and noise in the relationships between variables.
- Ability to handle irrelevant features.
- Ability to support complex relationships between variables.

A challenge with this dataset is the need to make multi-step forecasts. There are two main approaches by which machine learning methods can be used to make multi-step forecasts; they are:

- **Direct**. A separate model is developed to forecast each forecast lead time.
- **Recursive**. A single model is developed to make one-step forecasts, and the model is used recursively where prior forecasts are used as input to forecast the subsequent lead time.

The recursive approach can make sense when forecasting a short contiguous block of lead times, whereas the direct approach may make more sense when forecasting discontiguous lead times. The direct approach may be more appropriate for the air pollution forecast problem given that we are interested in forecasting a mixture of 10 contiguous and discontiguous lead times over a three-day period.
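To make the difference concrete, the sketch below forecasts the +3 lead time of a toy trend series both ways. It is a minimal illustration under stated assumptions: the series, the helper names *make_patterns()* and *fit_linear()*, and the use of a least-squares linear model without an intercept are all choices made for the example, not part of the tutorial.

```python
from numpy import array
from numpy.linalg import lstsq

def make_patterns(series, n_lag, lead_time):
    # rows of n_lag inputs paired with the value lead_time steps ahead
    X, y = list(), list()
    for i in range(n_lag, len(series) - lead_time + 1):
        X.append(series[i - n_lag:i])
        y.append(series[i + lead_time - 1])
    return array(X), array(y)

def fit_linear(X, y):
    # least-squares coefficients for a linear model without intercept
    coef, _, _, _ = lstsq(X, y, rcond=None)
    return coef

# toy series with a simple trend (an assumption, not the air quality data)
series = [float(t) for t in range(30)]
n_lag, lead_time = 2, 3
last_obs = array(series[-n_lag:])

# direct: one model trained specifically for the +3 lead time
X, y = make_patterns(series, n_lag, lead_time)
direct_forecast = last_obs.dot(fit_linear(X, y))

# recursive: a one-step model applied repeatedly, feeding forecasts back in
X, y = make_patterns(series, n_lag, 1)
one_step = fit_linear(X, y)
window = list(last_obs)
for _ in range(lead_time):
    yhat = array(window).dot(one_step)
    window = window[1:] + [yhat]
recursive_forecast = window[-1]

print(direct_forecast, recursive_forecast)
```

On this noise-free series both strategies arrive at the same value. On real data the recursive forecast accumulates the error of each intermediate step, and it must also forecast every hour up to the lead time of interest, which is wasteful when only a few discontiguous lead times are needed.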

The dataset has 39 target variables, and we develop one model per target variable, per forecast lead time. That means that we require (39 * 10) 390 machine learning models.

Key to the use of machine learning algorithms for time series forecasting is the choice of input data. We can think about three main sources of data that can be used as input and mapped to each forecast lead time for a target variable; they are:

- **Univariate data**, e.g. lag observations from the target variable that is being forecasted.
- **Multivariate data**, e.g. lag observations from other variables (weather and targets).
- **Metadata**, e.g. data about the date or time being forecast.

Data can be drawn from across all chunks, providing a rich dataset for learning a mapping from inputs to the target forecast lead time.

The 39 target variables are actually comprised of 12 variables across 14 sites.

Because of the way the data is provided, the default approach to modeling is to treat each variable-site as independent. It may be possible to collapse data by variable and use the same models for a variable across multiple sites.

Some variables have been purposely mislabeled (e.g. different data used for variables with the same identifier). Nevertheless, perhaps these mislabeled variables can be identified and excluded from multi-site models.

Before we can explore machine learning models of this dataset, we must prepare the data in such a way that we can fit models.

This requires two data preparation steps:

- Handling missing data.
- Preparing input-output patterns.

For now, we will focus on the 39 target variables and ignore the meteorological and metadata.

Chunks are comprised of five days or less of hourly observations for 39 target variables.

Many of the chunks do not have all five days of data, and none of the chunks have data for all 39 target variables.

In those cases where a chunk has no data for a target variable, a forecast is not required.

In those cases where a chunk does have some data for a target variable, but not all five days' worth, there will be gaps in the series. These gaps may range from a few hours to over a day of observations in length, sometimes even longer.

Three candidate strategies for dealing with these gaps are as follows:

- Ignore the gaps.
- Use data without gaps.
- Fill the gaps.

We could ignore the gaps. A problem with this would be that the data would not be contiguous when splitting data into inputs and outputs. When training a model, the inputs would not be consistent: they could mean the last *n* hours of data, or data spread across the last *n* days. This inconsistency will make learning a mapping from inputs to outputs very noisy and perhaps more difficult for the model than it needs to be.

We could use only the data without gaps. This is a good option. A risk is that we may not have much or enough data with which to fit a model.
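A minimal sketch of this second strategy is shown below: it splits a series into runs of consecutive observations and keeps only the runs that are long enough to be useful. The function name *contiguous_runs()*, the *min_length* parameter, and the toy series are assumptions for illustration.

```python
from math import isnan, nan

def contiguous_runs(series, min_length=1):
    # collect maximal runs of consecutive non-NaN observations
    runs, current = list(), list()
    for value in series:
        if isnan(value):
            # a gap closes the current run
            if len(current) >= min_length:
                runs.append(current)
            current = list()
        else:
            current.append(value)
    # keep a run that reaches the end of the series
    if len(current) >= min_length:
        runs.append(current)
    return runs

series = [1.0, 2.0, nan, 3.0, 4.0, 5.0, nan, nan, 6.0]
runs = contiguous_runs(series, min_length=2)
print(runs)
```

The trailing single observation is discarded because it is shorter than *min_length*. In practice a run must be at least as long as the number of lag inputs plus the forecast lead time to yield even one training pattern, which is why this strategy can leave little data for the longer lead times.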

Finally, we could fill the gaps. This is called data imputation and there are many strategies that could be used to fill the gaps. Three methods that may perform well include:

- Persisting the last observed value forward (linear).
- Use the median value for the hour of day within the chunk.
- Use the median value for the hour of day across chunks.

In this tutorial, we will use the last approach and fill the gaps by using the median for the time of day across chunks. After a little testing, this method seems to result in more training samples and better model performance.
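For comparison, the first two strategies in the list could be sketched as follows. The function names and the toy series are assumptions for illustration; these helpers are not used elsewhere in the tutorial.

```python
from math import isnan, nan
from statistics import median

def impute_persist(series):
    # carry the last observed value forward; leading gaps stay NaN
    imputed, last = list(), nan
    for value in series:
        if not isnan(value):
            last = value
        imputed.append(last)
    return imputed

def impute_hour_median_within_chunk(series, hours):
    # use the median of observed values that share the same hour of day
    imputed = list()
    for value, hour in zip(series, hours):
        if isnan(value):
            same_hour = [v for v, h in zip(series, hours)
                         if h == hour and not isnan(v)]
            value = median(same_hour) if same_hour else nan
        imputed.append(value)
    return imputed

# toy series sampled at hours 0 and 12 (assumed values)
series = [1.0, nan, 3.0, 8.0, nan, 5.0]
hours = [0, 12, 0, 12, 0, 12]
persisted = impute_persist(series)
by_hour = impute_hour_median_within_chunk(series, hours)
print(persisted)
print(by_hour)
```

Persistence ignores the daily cycle entirely, while the within-chunk median respects it but has at most five values per hour to work with in a five-day chunk. Pooling the same hour across all chunks gives the median far more observations to draw on.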

For a given variable, there may be missing observations defined by missing rows. Specifically, each observation has a ‘*position_within_chunk*’. We expect each chunk in the training dataset to have 120 observations, with ‘*position_within_chunk*’ values from 1 to 120 inclusive.

Therefore, we can create an array of 120 *NaN* values for each variable and mark all observations in the chunk using the ‘*position_within_chunk*’ values; anything left will remain *NaN*. We can then plot each variable and look for gaps.

The *variable_to_series()* function below will take the rows for a chunk and a given column index for the target variable and will return a series of 120 time steps for the variable with all available data marked with the value from the chunk.

```python
# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data
```

We need to calculate a parallel series of the hour of day for each chunk that we can use for imputing hour specific data for each variable in the chunk.

Given a series of partially filled hours of day, the *interpolate_hours()* function below will fill in the missing hours of day. It does this by finding the first marked hour, then counting forward, filling in the hour of day, then performing the same operation backwards.

```python
# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
```

We can call the same *variable_to_series()* function (above) to create the series of hours with missing values (column index 2), then call *interpolate_hours()* to fill in the gaps.

```python
# prepare sequence of hours for the chunk
hours = variable_to_series(rows, 2)
# interpolate hours
interpolate_hours(hours)
```
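A short worked example may help show what the interpolation does at the day boundary. The block below is a compact restatement of the same fill logic, so that it runs on its own, applied to a made-up series of six hours.

```python
from math import isnan, nan

def interpolate_hours(hours):
    # find the first observed hour
    ix = next(i for i, h in enumerate(hours) if not isnan(h))
    # fill forward, counting up from the first observed hour
    hour = hours[ix]
    for i in range(ix + 1, len(hours)):
        hour += 1
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill backward, counting down from the first observed hour
    hour = hours[ix]
    for i in range(ix - 1, -1, -1):
        hour -= 1
        if isnan(hours[i]):
            hours[i] = hour % 24

# toy series: hours are known at positions 2 and 4 only
hours = [nan, nan, 23, nan, 1, nan]
interpolate_hours(hours)
print(hours)
```

The gap after hour 23 wraps around to 0 because of the modulo, and the two leading gaps are filled by counting backwards from 23. Note that the count always runs from the first observed hour; later observed hours are left untouched but are not used to re-anchor the count.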

We can then pass the hours to any impute function that may make use of it.

We can now fill in missing values in a chunk using values observed at the same hour of day. Specifically, we will find all rows with the same hour across all of the training chunks and calculate the median value.

The *impute_missing()* function below takes all of the training chunks, the rows for the current chunk, the prepared sequence of hours of the day for the chunk, the series with missing values for a variable, and the column index for the variable.

It enumerates over the time steps of the series and, when it detects a time step with no data, it collects all rows across all chunks with data for the same hour and calculates the median value. If no value is available for that hour, the gap is filled with 0.0. The function does not check whether the entire series is missing; that check is left to the caller.

```python
# impute missing data
def impute_missing(train_chunks, rows, hours, series, col_ix):
    # impute missing using the median value for hour in all series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # collect all rows across all chunks for the hour
            all_rows = list()
            for chunk_rows in train_chunks:
                all_rows.extend(chunk_rows[chunk_rows[:, 2] == hours[i]])
            # calculate the central tendency for target
            all_rows = array(all_rows)
            # fill with median value
            value = nanmedian(all_rows[:, col_ix])
            if isnan(value):
                value = 0.0
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed
```

We need to transform the series for each target variable into rows with inputs and outputs so that we can fit supervised machine learning algorithms.

Specifically, we have a series, like:

[1, 2, 3, 4, 5, 6, 7, 8, 9]

When forecasting the lead time of +1 using 2 lag variables, we would split the series into input (*X*) and output (*y*) patterns as follows:

```
X,    y
1, 2, 3
2, 3, 4
3, 4, 5
4, 5, 6
5, 6, 7
6, 7, 8
7, 8, 9
```

This first requires that we choose a number of lag observations to use as input. There is no right answer; instead, it is a good idea to test different numbers and see what works.

We then must perform the splitting of the series into the supervised learning format for each of the 10 forecast lead times. For example, if the series continued past 9, forecasting +24 with 2 lag observations might look like:

```
X,    y
1, 2, 26
```

This process is then repeated for each of the 39 target variables.

The patterns prepared for each lead time for each target variable can then be aggregated across chunks to provide a training dataset for a model.
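The number of patterns a single series can contribute shrinks as the lead time grows, because the target must still fall inside the series. The sketch below counts the patterns per lead time for one full chunk, assuming a sliding window of lag inputs, a complete 120-hour series, and 12 lag observations. The helper name *n_patterns()* is hypothetical.

```python
# relative forecast lead times used in this tutorial
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

def n_patterns(n_steps, n_lag, lead_time):
    # number of windows of n_lag inputs whose target still lies in the series
    return max(0, n_steps - n_lag - lead_time + 1)

# assumed configuration: a full 120-hour chunk and 12 lag observations
n_steps, n_lag = 5 * 24, 12
counts = {lt: n_patterns(n_steps, n_lag, lt) for lt in get_lead_times()}
print(counts)
```

With these settings a chunk yields 108 patterns for the +1 lead time but only 37 for +72, so the training sets for different lead times have different numbers of rows even before aggregating across chunks.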

We must also prepare a test dataset. That is, input data (*X*) for each target variable for each chunk so that we can use it as input to forecast the lead times in the test dataset. If we chose a lag of 2, then the test dataset would be comprised of the last two observations for each target variable for each chunk. Pretty straightforward.

We can start off by defining a function that will create input-output patterns for a given complete (imputed) series.

The *supervised_for_lead_time()* function below will take a series, a number of lag observations to use as input, and a forecast lead time to predict, then will return a list of input/output rows drawn from the series.

```python
# create input/output patterns from a sequence
def supervised_for_lead_time(series, n_lag, lead_time):
    samples = list()
    # enumerate observations and create input/output patterns
    for i in range(n_lag, len(series)):
        end_ix = i + (lead_time - 1)
        # check if can create a pattern
        if end_ix >= len(series):
            break
        # retrieve input and output
        start_ix = i - n_lag
        row = series[start_ix:i] + [series[end_ix]]
        samples.append(row)
    return samples
```

It is important to understand this piece.

We can test this function and explore different numbers of lag variables and forecast lead times on a small test dataset.

Below is a complete example that generates a series of 20 integers and creates input/output patterns with two lag observations as input and the +6 lead time as output.

```python
# test supervised to input/output patterns
from numpy import array

# create input/output patterns from a sequence
def supervised_for_lead_time(series, n_lag, lead_time):
    data = list()
    # enumerate observations and create input/output patterns
    for i in range(n_lag, len(series)):
        end_ix = i + (lead_time - 1)
        # check if can create a pattern
        if end_ix >= len(series):
            break
        # retrieve input and output
        start_ix = i - n_lag
        row = series[start_ix:i] + [series[end_ix]]
        data.append(row)
    return array(data)

# define test dataset
data = [x for x in range(20)]
# convert to supervised format
result = supervised_for_lead_time(data, 2, 6)
# display result
print(result)
```

Running the example prints the resulting patterns showing lag observations and their associated forecast lead time.

Experiment with this example to get comfortable with this data transform as it is key to modeling time series using machine learning algorithms.

```
[[ 0  1  7]
 [ 1  2  8]
 [ 2  3  9]
 [ 3  4 10]
 [ 4  5 11]
 [ 5  6 12]
 [ 6  7 13]
 [ 7  8 14]
 [ 8  9 15]
 [ 9 10 16]
 [10 11 17]
 [11 12 18]
 [12 13 19]]
```

We can now call *supervised_for_lead_time()* for each forecast lead time for a given target variable series.

The *target_to_supervised()* function below implements this. First the target variable is converted into a series and imputed using the functions developed in the previous section. Then training samples are created for each target lead time. A test sample for the target variable is also created.

Both the training data for each forecast lead time and the test input data are then returned for this target variable.

```python
# create supervised learning data for each lead time for this target
def target_to_supervised(chunks, rows, hours, col_ix, n_lag):
    train_lead_times = list()
    # get series
    series = variable_to_series(rows, col_ix)
    if not has_data(series):
        return None, [nan for _ in range(n_lag)]
    # impute
    imputed = impute_missing(chunks, rows, hours, series, col_ix)
    # prepare test sample for chunk-variable
    test_sample = array(imputed[-n_lag:])
    # enumerate lead times
    lead_times = get_lead_times()
    for lead_time in lead_times:
        # make input/output data from series
        train_samples = supervised_for_lead_time(imputed, n_lag, lead_time)
        train_lead_times.append(train_samples)
    return train_lead_times, test_sample
```

We have the pieces; we now need to define the function to drive the data preparation process.

This function builds up the train and test datasets.

The approach is to enumerate each target variable and gather the training data for each lead time from across all of the chunks. At the same time, we collect the samples required as input when making a prediction for the test dataset.

The result is a training dataset that has the dimensions *[var][lead time][sample]*, where the final dimension is the rows of training samples for a forecast lead time for a target variable. The function also returns the test dataset with the dimensions *[chunk][var][sample]*, where the final dimension is the input data for making a prediction for a target variable for a chunk.

The *data_prep()* function below implements this behavior and takes the data in chunk format and a specified number of lag observations to use as input.

```python
# prepare training [var][lead time][sample] and test [chunk][var][sample]
def data_prep(chunks, n_lag, n_vars=39):
    lead_times = get_lead_times()
    train_data = [[list() for _ in range(len(lead_times))] for _ in range(n_vars)]
    test_data = [[list() for _ in range(n_vars)] for _ in range(len(chunks))]
    # enumerate targets for chunk
    for var in range(n_vars):
        # convert target number into column number
        col_ix = 3 + var
        # enumerate chunks to forecast
        for c_id in range(len(chunks)):
            rows = chunks[c_id]
            # prepare sequence of hours for the chunk
            hours = variable_to_series(rows, 2)
            # interpolate hours
            interpolate_hours(hours)
            # check for no data
            if not has_data(rows[:, col_ix]):
                continue
            # convert series into training data for each lead time
            train, test_sample = target_to_supervised(chunks, rows, hours, col_ix, n_lag)
            # store test sample for this var-chunk
            test_data[c_id][var] = test_sample
            if train is not None:
                # store samples per lead time
                for lead_time in range(len(lead_times)):
                    # add all rows to the existing list of rows
                    train_data[var][lead_time].extend(train[lead_time])
        # convert all rows for each var-lead time to a numpy array
        for lead_time in range(len(lead_times)):
            train_data[var][lead_time] = array(train_data[var][lead_time])
    # the nested entries differ in length, so store them as object arrays
    # (recent NumPy versions reject ragged input without dtype=object)
    return array(train_data, dtype=object), array(test_data, dtype=object)
```

We can tie everything together and prepare a train and test dataset with a supervised learning format for machine learning algorithms.

We will use the prior 12 hours of lag observations as input when predicting each forecast lead time.

The resulting train and test datasets are then saved as binary NumPy arrays.

The complete example is listed below.

```python
# prepare data
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import array
from numpy import nanmedian
from numpy import save

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# return true if the array has any non-nan values
def has_data(data):
    return count_nonzero(isnan(data)) < len(data)

# impute missing data
def impute_missing(train_chunks, rows, hours, series, col_ix):
    # impute missing using the median value for hour in all series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # collect all rows across all chunks for the hour
            all_rows = list()
            for chunk_rows in train_chunks:
                all_rows.extend(chunk_rows[chunk_rows[:, 2] == hours[i]])
            # calculate the central tendency for target
            all_rows = array(all_rows)
            # fill with median value
            value = nanmedian(all_rows[:, col_ix])
            if isnan(value):
                value = 0.0
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# create input/output patterns from a sequence
def supervised_for_lead_time(series, n_lag, lead_time):
    samples = list()
    # enumerate observations and create input/output patterns
    for i in range(n_lag, len(series)):
        end_ix = i + (lead_time - 1)
        # check if can create a pattern
        if end_ix >= len(series):
            break
        # retrieve input and output
        start_ix = i - n_lag
        row = series[start_ix:i] + [series[end_ix]]
        samples.append(row)
    return samples

# create supervised learning data for each lead time for this target
def target_to_supervised(chunks, rows, hours, col_ix, n_lag):
    train_lead_times = list()
    # get series
    series = variable_to_series(rows, col_ix)
    if not has_data(series):
        return None, [nan for _ in range(n_lag)]
    # impute
    imputed = impute_missing(chunks, rows, hours, series, col_ix)
    # prepare test sample for chunk-variable
    test_sample = array(imputed[-n_lag:])
    # enumerate lead times
    lead_times = get_lead_times()
    for lead_time in lead_times:
        # make input/output data from series
        train_samples = supervised_for_lead_time(imputed, n_lag, lead_time)
        train_lead_times.append(train_samples)
    return train_lead_times, test_sample

# prepare training [var][lead time][sample] and test [chunk][var][sample]
def data_prep(chunks, n_lag, n_vars=39):
    lead_times = get_lead_times()
    train_data = [[list() for _ in range(len(lead_times))] for _ in range(n_vars)]
    test_data = [[list() for _ in range(n_vars)] for _ in range(len(chunks))]
    # enumerate targets for chunk
    for var in range(n_vars):
        # convert target number into column number
        col_ix = 3 + var
        # enumerate chunks to forecast
        for c_id in range(len(chunks)):
            rows = chunks[c_id]
            # prepare sequence of hours for the chunk
            hours = variable_to_series(rows, 2)
            # interpolate hours
            interpolate_hours(hours)
            # check for no data
            if not has_data(rows[:, col_ix]):
                continue
            # convert series into training data for each lead time
            train, test_sample = target_to_supervised(chunks, rows, hours, col_ix, n_lag)
            # store test sample for this var-chunk
            test_data[c_id][var] = test_sample
            if train is not None:
                # store samples per lead time
                for lead_time in range(len(lead_times)):
                    # add all rows to the existing list of rows
                    train_data[var][lead_time].extend(train[lead_time])
        # convert all rows for each var-lead time to a numpy array
        for lead_time in range(len(lead_times)):
            train_data[var][lead_time] = array(train_data[var][lead_time])
    # the nested entries differ in length, so store them as object arrays
    # (recent NumPy versions reject ragged input without dtype=object)
    return array(train_data, dtype=object), array(test_data, dtype=object)

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# convert training data into supervised learning data
n_lag = 12
train_data, test_data = data_prep(train_chunks, n_lag)
print(train_data.shape, test_data.shape)
# save train and test sets to file
save('AirQualityPrediction/supervised_train.npy', train_data)
save('AirQualityPrediction/supervised_test.npy', test_data)
```

Running the example may take a minute.

The result is two binary files containing the train and test datasets that we can load in the following sections for training and evaluating machine learning algorithms on the problem.
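Because the number of training rows differs for each variable and lead time, the saved arrays hold Python objects rather than a regular numeric grid. The sketch below uses a toy array and a temporary file path (both placeholders, not the tutorial's files) to show one way such an array round-trips through *save()* and *load()*. Recent NumPy versions require *allow_pickle=True* when loading object arrays.

```python
import os
import tempfile
from numpy import array, empty, save, load

# toy ragged structure: 2 variables x 2 lead times, different sample counts
train_data = empty((2, 2), dtype=object)
train_data[0, 0] = array([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
train_data[0, 1] = array([[1.0, 2.0, 4.0]])
train_data[1, 0] = array([[5.0, 6.0, 7.0]])
train_data[1, 1] = array([[5.0, 6.0, 8.0]])

# write to a temporary file (a placeholder path, not the tutorial's)
path = os.path.join(tempfile.mkdtemp(), 'toy_supervised_train.npy')
save(path, train_data)

# object arrays are pickled, so loading must opt in explicitly
loaded = load(path, allow_pickle=True)
print(loaded.shape)
print(loaded[0, 0].shape, loaded[0, 1].shape)
```

The outer shape is *[var][lead time]*, and each element is its own two-dimensional array of samples, with the lag inputs in all but the last column and the target in the last column.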

Before we can start evaluating algorithms, we need some more elements of the test harness.

First, we need to be able to fit a scikit-learn model on training data. The *fit_model()* function below will make a clone of the model configuration and fit it on the provided training data. We will need to fit many (390) versions of each configured model, so this function will be called a lot.

```python
# fit a single model
def fit_model(model, X, y):
    # clone the model configuration
    local_model = clone(model)
    # fit the model
    local_model.fit(X, y)
    return local_model
```

Next, we need to fit a model for each variable and forecast lead time combination.

We can do this by enumerating the training dataset first by the variables and then by the lead times. We can then fit a model and store it in a list of lists with the same structure, specifically: *[var][time][model]*.

The *fit_models()* function below implements this.

```python
# fit one model for each variable and each forecast lead time [var][time][model]
def fit_models(model, train):
    # prepare structure for saving models
    models = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]
    # enumerate vars
    for i in range(train.shape[0]):
        # enumerate lead times
        for j in range(train.shape[1]):
            # get data
            data = train[i, j]
            X, y = data[:, :-1], data[:, -1]
            # fit model
            local_model = fit_model(model, X, y)
            models[i][j].append(local_model)
    return models
```

Fitting models is the slow part and could benefit from being parallelized, such as with the Joblib library. This is left as an extension.
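As a rough sketch of what that extension might look like, the example below fans the per-variable, per-lead-time fits out to a pool of workers. To keep it self-contained it uses Python's standard *concurrent.futures* module rather than Joblib, and a trivial stand-in model rather than a scikit-learn estimator; the names *MeanModel*, *fit_one()*, and *fit_models_parallel()* are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

class MeanModel:
    # stand-in for a scikit-learn estimator: remembers the mean of y
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

def fit_one(task):
    # fit one model for a (variable, lead time) pair
    i, j, X, y = task
    return i, j, MeanModel().fit(X, y)

def fit_models_parallel(train, n_workers=4):
    # train is a nested list [var][lead time] of (X, y) pairs
    tasks = [(i, j, X, y)
             for i, per_var in enumerate(train)
             for j, (X, y) in enumerate(per_var)]
    models = [[None for _ in per_var] for per_var in train]
    # each task is independent, so the fits can run concurrently
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for i, j, model in pool.map(fit_one, tasks):
            models[i][j] = model
    return models

# toy data: 2 variables x 2 lead times
train = [[([[1], [2]], [2.0, 4.0]), ([[1], [2]], [3.0, 5.0])],
         [([[1], [2]], [10.0, 20.0]), ([[1], [2]], [0.0, 1.0])]]
models = fit_models_parallel(train)
print(models[1][0].mean_)
```

The key point is that every fit is independent of the others, so the work divides cleanly. Threads only help when the underlying fitting routine releases the interpreter lock; for fitting that is bound by Python code, a process pool (or Joblib, as suggested above) would likely be the better choice.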

Once the models are fit, they can be used to make predictions for the test dataset.

The prepared test dataset is organized first by chunk, and then by target variable. Making predictions is fast and involves first checking that a prediction can be made (we have input data) and if so, using the appropriate models for the target variable. Each of the 10 forecast lead times for the variable will then be predicted with each of the direct models for those lead times.

The *make_predictions()* function below implements this, taking the list of lists of models and the loaded test dataset as arguments and returning an array of forecasts with the structure *[chunks][var][time]*.

```python
# return forecasts as [chunks][var][time]
def make_predictions(models, test):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks
    for i in range(test.shape[0]):
        # enumerate variables
        chunk_predictions = list()
        for j in range(test.shape[1]):
            # get the input pattern for this chunk and target
            pattern = test[i,j]
            # assume a nan forecast
            forecasts = array([nan for _ in range(len(lead_times))])
            # check we can make a forecast
            if has_data(pattern):
                pattern = pattern.reshape((1, len(pattern)))
                # forecast each lead time
                forecasts = list()
                for k in range(len(lead_times)):
                    yhat = models[j][k][0].predict(pattern)
                    forecasts.append(yhat[0])
                forecasts = array(forecasts)
            # save forecasts for each lead time for this variable
            chunk_predictions.append(forecasts)
        # save forecasts for this chunk
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)
```

We need a list of models to evaluate.

We can define a generic *get_models()* function that is responsible for defining a dictionary of model-names mapped to configured scikit-learn model objects.

```python
# prepare a list of ml models
def get_models(models=None):
    # start a new dictionary unless one was passed in
    if models is None:
        models = dict()
    # ...
    return models
```

Finally, we need a function to drive the model evaluation process.

Given the dictionary of models, this function enumerates the models; for each one it fits the matrix of models on the training data, makes predictions for the test dataset, evaluates the predictions, and summarizes the results.

The *evaluate_models()* function below implements this.

```python
# evaluate a suite of models
def evaluate_models(models, train, test, actual):
    for name, model in models.items():
        # fit models
        fits = fit_models(model, train)
        # make predictions
        predictions = make_predictions(fits, test)
        # evaluate forecast
        total_mae, _ = evaluate_forecasts(predictions, actual)
        # summarize forecast
        summarize_error(name, total_mae)
```

We now have everything we need to evaluate machine learning models.

In this section, we will spot check a suite of linear machine learning algorithms.

Linear algorithms are those that assume that the output is a linear function of the input variables. This is much like the assumptions of classical time series forecasting models like ARIMA.

Spot checking means evaluating a suite of models in order to get a rough idea of what works. We are interested in any models that outperform a simple autoregression model AR(2) that achieves a MAE of about 0.487.

We will test eight linear machine learning algorithms with their default configuration; specifically:

- Linear Regression
- Lasso Linear Regression
- Ridge Regression
- Elastic Net Regression
- Huber Regression
- Lasso Lars Linear Regression
- Passive Aggressive Regression
- Stochastic Gradient Descent Regression

We can define these models in the *get_models()* function.

```python
# prepare a list of ml models
def get_models(models=None):
    # start a new dictionary unless one was passed in
    if models is None:
        models = dict()
    # linear models
    models['lr'] = LinearRegression()
    models['lasso'] = Lasso()
    models['ridge'] = Ridge()
    models['en'] = ElasticNet()
    models['huber'] = HuberRegressor()
    models['llars'] = LassoLars()
    models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
    models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
    print('Defined %d models' % len(models))
    return models
```

The complete code example is listed below.

```python
# evaluate linear algorithms
from numpy import load
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import array
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import HuberRegressor
from sklearn.linear_model import LassoLars
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.linear_model import SGDRegressor

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return true if the array has any non-nan values
def has_data(data):
    return count_nonzero(isnan(data)) < len(data)

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# fit a single model
def fit_model(model, X, y):
    # clone the model configuration
    local_model = clone(model)
    # fit the model
    local_model.fit(X, y)
    return local_model

# fit one model for each variable and each forecast lead time [var][time][model]
def fit_models(model, train):
    # prepare structure for saving models
    models = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]
    # enumerate vars
    for i in range(train.shape[0]):
        # enumerate lead times
        for j in range(train.shape[1]):
            # get data
            data = train[i, j]
            X, y = data[:, :-1], data[:, -1]
            # fit model
            local_model = fit_model(model, X, y)
            models[i][j].append(local_model)
    return models

# return forecasts as [chunks][var][time]
def make_predictions(models, test):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks
    for i in range(test.shape[0]):
        # enumerate variables
        chunk_predictions = list()
        for j in range(test.shape[1]):
            # get the input pattern for this chunk and target
            pattern = test[i,j]
            # assume a nan forecast
            forecasts = array([nan for _ in range(len(lead_times))])
            # check we can make a forecast
            if has_data(pattern):
                pattern = pattern.reshape((1, len(pattern)))
                # forecast each lead time
                forecasts = list()
                for k in range(len(lead_times)):
                    yhat = models[j][k][0].predict(pattern)
                    forecasts.append(yhat[0])
                forecasts = array(forecasts)
            # save forecasts for each lead time for this variable
            chunk_predictions.append(forecasts)
        # save forecasts for this chunk
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks (use the argument, not a global variable)
    for i in range(len(testset)):
        # convert to forecasts
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae):
    print('%s: %.3f MAE' % (name, total_mae))

# prepare a list of ml models
def get_models(models=None):
    # start a new dictionary unless one was passed in
    if models is None:
        models = dict()
    # linear models
    models['lr'] = LinearRegression()
    models['lasso'] = Lasso()
    models['ridge'] = Ridge()
    models['en'] = ElasticNet()
    models['huber'] = HuberRegressor()
    models['llars'] = LassoLars()
    models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
    models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
    print('Defined %d models' % len(models))
    return models

# evaluate a suite of models
def evaluate_models(models, train, test, actual):
    for name, model in models.items():
        # fit models
        fits = fit_models(model, train)
        # make predictions
        predictions = make_predictions(fits, test)
        # evaluate forecast
        total_mae, _ = evaluate_forecasts(predictions, actual)
        # summarize forecast
        summarize_error(name, total_mae)

# load supervised datasets
# (the saved arrays hold objects, which recent NumPy only loads with allow_pickle=True)
train = load('AirQualityPrediction/supervised_train.npy', allow_pickle=True)
test = load('AirQualityPrediction/supervised_test.npy', allow_pickle=True)
print(train.shape, test.shape)
# load test chunks for validation
testset = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
test_chunks = to_chunks(testset)
actual = prepare_test_forecasts(test_chunks)
# prepare list of models
models = get_models()
# evaluate models
evaluate_models(models, train, test, actual)
```

Running the example prints the MAE for each of the evaluated algorithms.

We can see that many of the algorithms show skill compared to a simple AR model, achieving a MAE below 0.487.

Huber regression seems to perform the best (with default configuration), achieving a MAE of 0.434.

This is interesting as Huber regression, or robust regression with Huber loss, is a method that is designed to be robust to outliers in the training dataset. It may suggest that the other methods may perform better with a little more data preparation, such as standardization and/or outlier removal.

    lr: 0.454 MAE
    lasso: 0.624 MAE
    ridge: 0.454 MAE
    en: 0.595 MAE
    huber: 0.434 MAE
    llars: 0.631 MAE
    pa: 0.833 MAE
    sgd: 0.457 MAE
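To make the robustness argument a little more concrete, the sketch below compares the squared-error loss with the textbook form of the Huber loss for a few residual sizes. It is an illustration only: the helper names *squared_loss()* and *huber_loss()* are hypothetical, and the threshold of 1.35 is an assumption chosen to match the default *epsilon* of scikit-learn's HuberRegressor, which also rescales residuals internally (something this sketch leaves out).

```python
# compare squared-error and Huber loss on small and large residuals

# squared-error loss: grows quadratically with the residual
def squared_loss(residual):
    return 0.5 * residual ** 2

# Huber loss (textbook form): quadratic near zero, linear beyond delta
# delta=1.35 is an assumed threshold for illustration
def huber_loss(residual, delta=1.35):
    size = abs(residual)
    if size <= delta:
        return 0.5 * size ** 2
    return delta * (size - 0.5 * delta)

# small residuals are treated the same, large ones are penalized far less
for r in [0.5, 1.0, 5.0, 20.0]:
    print('residual=%.1f squared=%.3f huber=%.3f' % (r, squared_loss(r), huber_loss(r)))
```

Because the penalty grows only linearly for large residuals, a handful of extreme observations pulls the fit around much less than it would under squared error.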

We can use the same framework to evaluate the performance of a suite of nonlinear and ensemble machine learning algorithms.

Specifically:

**Nonlinear Algorithms**

- k-Nearest Neighbors
- Classification and Regression Trees
- Extra Tree
- Support Vector Regression

**Ensemble Algorithms**

- Adaboost
- Bagged Decision Trees
- Random Forest
- Extra Trees
- Gradient Boosting Machines

The *get_models()* function below defines these nine models.

    # prepare a list of ml models
    def get_models(models=dict()):
        # non-linear models
        models['knn'] = KNeighborsRegressor(n_neighbors=7)
        models['cart'] = DecisionTreeRegressor()
        models['extra'] = ExtraTreeRegressor()
        models['svmr'] = SVR()
        # ensemble models
        n_trees = 100
        models['ada'] = AdaBoostRegressor(n_estimators=n_trees)
        models['bag'] = BaggingRegressor(n_estimators=n_trees)
        models['rf'] = RandomForestRegressor(n_estimators=n_trees)
        models['et'] = ExtraTreesRegressor(n_estimators=n_trees)
        models['gbm'] = GradientBoostingRegressor(n_estimators=n_trees)
        print('Defined %d models' % len(models))
        return models

The complete code listing is provided below.

    # spot check nonlinear algorithms
    from numpy import load
    from numpy import loadtxt
    from numpy import nan
    from numpy import isnan
    from numpy import count_nonzero
    from numpy import unique
    from numpy import array
    from sklearn.base import clone
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.tree import ExtraTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.ensemble import BaggingRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.ensemble import GradientBoostingRegressor

    # split the dataset by 'chunkID', return a list of chunks
    def to_chunks(values, chunk_ix=0):
        chunks = list()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks.append(values[selection, :])
        return chunks

    # return true if the array has any non-nan values
    def has_data(data):
        return count_nonzero(isnan(data)) < len(data)

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

    # fit a single model
    def fit_model(model, X, y):
        # clone the model configuration
        local_model = clone(model)
        # fit the model
        local_model.fit(X, y)
        return local_model

    # fit one model for each variable and each forecast lead time [var][time][model]
    def fit_models(model, train):
        # prepare structure for saving models
        models = [[list() for _ in range(train.shape[1])] for _ in range(train.shape[0])]
        # enumerate vars
        for i in range(train.shape[0]):
            # enumerate lead times
            for j in range(train.shape[1]):
                # get data
                data = train[i, j]
                X, y = data[:, :-1], data[:, -1]
                # fit model
                local_model = fit_model(model, X, y)
                models[i][j].append(local_model)
        return models

    # return forecasts as [chunks][var][time]
    def make_predictions(models, test):
        lead_times = get_lead_times()
        predictions = list()
        # enumerate chunks
        for i in range(test.shape[0]):
            # enumerate variables
            chunk_predictions = list()
            for j in range(test.shape[1]):
                # get the input pattern for this chunk and target
                pattern = test[i,j]
                # assume a nan forecast
                forecasts = array([nan for _ in range(len(lead_times))])
                # check we can make a forecast
                if has_data(pattern):
                    pattern = pattern.reshape((1, len(pattern)))
                    # forecast each lead time
                    forecasts = list()
                    for k in range(len(lead_times)):
                        yhat = models[j][k][0].predict(pattern)
                        forecasts.append(yhat[0])
                    forecasts = array(forecasts)
                # save forecasts for each lead time for this variable
                chunk_predictions.append(forecasts)
            # save forecasts for this chunk
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)

    # convert the test dataset in chunks to [chunk][variable][time] format
    def prepare_test_forecasts(test_chunks):
        predictions = list()
        # enumerate chunks to forecast
        for rows in test_chunks:
            # enumerate targets for chunk
            chunk_predictions = list()
            for j in range(3, rows.shape[1]):
                yhat = rows[:, j]
                chunk_predictions.append(yhat)
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)

    # calculate the error between an actual and predicted value
    def calculate_error(actual, predicted):
        # give the full actual value if predicted is nan
        if isnan(predicted):
            return abs(actual)
        # calculate abs difference
        return abs(actual - predicted)

    # evaluate a forecast in the format [chunk][variable][time]
    def evaluate_forecasts(predictions, testset):
        lead_times = get_lead_times()
        total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
        total_c, times_c = 0, [0 for _ in range(len(lead_times))]
        # enumerate test chunks
        for i in range(len(testset)):
            # convert to forecasts
            actual = testset[i]
            predicted = predictions[i]
            # enumerate target variables
            for j in range(predicted.shape[0]):
                # enumerate lead times
                for k in range(len(lead_times)):
                    # skip if actual is nan
                    if isnan(actual[j, k]):
                        continue
                    # calculate error
                    error = calculate_error(actual[j, k], predicted[j, k])
                    # update statistics
                    total_mae += error
                    times_mae[k] += error
                    total_c += 1
                    times_c[k] += 1
        # normalize summed absolute errors
        total_mae /= total_c
        times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
        return total_mae, times_mae

    # summarize scores
    def summarize_error(name, total_mae):
        print('%s: %.3f MAE' % (name, total_mae))

    # prepare a list of ml models
    def get_models(models=dict()):
        # non-linear models
        models['knn'] = KNeighborsRegressor(n_neighbors=7)
        models['cart'] = DecisionTreeRegressor()
        models['extra'] = ExtraTreeRegressor()
        models['svmr'] = SVR()
        # ensemble models
        n_trees = 100
        models['ada'] = AdaBoostRegressor(n_estimators=n_trees)
        models['bag'] = BaggingRegressor(n_estimators=n_trees)
        models['rf'] = RandomForestRegressor(n_estimators=n_trees)
        models['et'] = ExtraTreesRegressor(n_estimators=n_trees)
        models['gbm'] = GradientBoostingRegressor(n_estimators=n_trees)
        print('Defined %d models' % len(models))
        return models

    # evaluate a suite of models
    def evaluate_models(models, train, test, actual):
        for name, model in models.items():
            # fit models
            fits = fit_models(model, train)
            # make predictions
            predictions = make_predictions(fits, test)
            # evaluate forecast
            total_mae, _ = evaluate_forecasts(predictions, actual)
            # summarize forecast
            summarize_error(name, total_mae)

    # load supervised datasets
    train = load('AirQualityPrediction/supervised_train.npy')
    test = load('AirQualityPrediction/supervised_test.npy')
    print(train.shape, test.shape)
    # load test chunks for validation
    testset = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
    test_chunks = to_chunks(testset)
    actual = prepare_test_forecasts(test_chunks)
    # prepare list of models
    models = get_models()
    # evaluate models
    evaluate_models(models, train, test, actual)

Running the example, we can see that many algorithms performed well compared to the baseline of an autoregression algorithm, although none performed as well as Huber regression in the previous section.

Support vector regression and perhaps gradient boosting machines may be worth further investigation, achieving MAEs of 0.437 and 0.450 respectively.

    knn: 0.484 MAE
    cart: 0.631 MAE
    extra: 0.630 MAE
    svmr: 0.437 MAE
    ada: 0.717 MAE
    bag: 0.471 MAE
    rf: 0.470 MAE
    et: 0.469 MAE
    gbm: 0.450 MAE

In the previous spot check experiments, the number of lag observations was arbitrarily fixed at 12.

We can vary the number of lag observations and evaluate the effect on MAE. Some algorithms may require more or fewer prior observations, but general trends may hold across algorithms.

Prepare the supervised learning dataset with a range of different numbers of lag observations and fit and evaluate the HuberRegressor on each.

I experimented with the following number of lag observations:

[1, 3, 6, 12, 24, 36, 48]

The results were as follows:

    1: 0.451
    3: 0.445
    6: 0.441
    12: 0.434
    24: 0.423
    36: 0.422
    48: 0.439

A plot of these results is provided below.

We can see a general trend of decreasing overall MAE with the increase in the number of lag observations, at least to a point after which error begins to rise again.

The results suggest that, at least for the HuberRegressor algorithm, 36 lag observations may be a good configuration, achieving a MAE of 0.422.
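As a small sketch of how such a comparison can be summarized, the snippet below takes the MAE values reported above, picks the lag configuration with the lowest error, and compares it to the 12-lag setting used in the spot checks. The dictionary simply restates the reported results; the variable names are chosen for illustration.

```python
# pick the lag configuration with the lowest reported MAE
# keys are the number of lag observations, values are the MAE reported above
results = {1: 0.451, 3: 0.445, 6: 0.441, 12: 0.434, 24: 0.423, 36: 0.422, 48: 0.439}

# the configuration with the smallest error
best_lag = min(results, key=results.get)
print('Best: n_lag=%d, MAE=%.3f' % (best_lag, results[best_lag]))

# improvement relative to the 12 lag observations used in the spot checks
gain = results[12] - results[best_lag]
print('Gain over 12 lags: %.3f' % gain)
```

Note that the gap between 24 and 36 lags is only 0.001, so either may be a reasonable choice once variance between runs is taken into account.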

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Data Preparation**. Explore whether simple data preparation such as standardization or statistical outlier removal can improve model performance.
- **Engineered Features**. Explore whether engineered features such as median value for forecasted hour of day can improve model performance.
- **Meteorological Variables**. Explore whether adding lag meteorological variables to the models can improve performance.
- **Cross-Site Models**. Explore whether combining target variables of the same type and re-using the models across sites results in a performance improvement.
- **Algorithm Tuning**. Explore whether tuning the hyperparameters of some of the better performing algorithms can result in performance improvements.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- EMC Data Science Global Hackathon (Air Quality Prediction)
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?

In this tutorial, you discovered how to develop machine learning models for multi-step time series forecasting of air pollution data.

Specifically, you learned:

- How to impute missing values and transform time series data so that it can be modeled by supervised learning algorithms.
- How to develop and evaluate a suite of linear algorithms for multi-step time series forecasting.
- How to develop and evaluate a suite of nonlinear algorithms for multi-step time series forecasting.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Machine Learning Models for Multivariate Multi-Step Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The post How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction’ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Before diving into sophisticated machine learning and deep learning methods for time series forecasting, it is important to find the limits of classical methods, such as developing autoregressive models using the AR or ARIMA method.

In this tutorial, you will discover how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.

After completing this tutorial, you will know:

- How to analyze and impute missing values for time series data.
- How to develop and evaluate an autoregressive model for multi-step time series forecasting.
- How to improve an autoregressive model using alternate data imputation methods.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem Description
- Model Evaluation
- Data Analysis
- Develop an Autoregressive Model
- Autoregressive Model with Global Impute Strategy

The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next 3 days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72 hour forecast period. They are:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.

Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.

The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.

Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).

A winning entrant achieved a MAE of 0.21058 on the withheld test set (private leaderboard) using random forest on lagged observations. A writeup of this solution is available in the post:

- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon, 2012.

In this tutorial, we will explore how to develop naive forecasts for the problem that can be used as a baseline to determine whether a model has skill on the problem or not.

Before we can evaluate naive forecasting methods, we must develop a test harness.

This includes at least how the data will be prepared and how forecasts will be evaluated.

The first step is to download the dataset and load it into memory.

The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.

Download the entire dataset, e.g. “*Download All*” to your workstation and unzip the archive in your current working directory with the folder named ‘*AirQualityPrediction*‘.

Our focus will be the ‘*TrainingData.csv*‘ file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.

We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)

We can group data by the ‘chunkID’ variable (column index 1).

First, let’s get a list of the unique chunk identifiers.

chunk_ids = unique(values[:, 1])

We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.

    chunks = dict()
    # sort rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]

The function below, named *to_chunks()*, takes a NumPy array of the loaded data and returns a dictionary that maps each *chunk_id* to the rows for that chunk.

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

The complete example that loads the dataset and splits it into chunks is listed below.

    # load data and split into chunks
    from numpy import unique
    from pandas import read_csv

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    print('Total Chunks: %d' % len(chunks))

Running the example prints the number of chunks in the dataset.

Total Chunks: 208
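To see the grouping logic in isolation, here is a minimal sketch that applies the same *to_chunks()* function to a tiny made-up array. The toy values and the three-column layout are assumptions for illustration only and are not taken from the dataset.

```python
# demonstrate grouping rows by chunk id on a toy array
from numpy import array
from numpy import unique

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
    chunks = dict()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]
    return chunks

# toy data (hypothetical): columns are [rowID, chunkID, value]
toy = array([
    [1, 7, 0.1],
    [2, 7, 0.2],
    [3, 9, 0.3]])
toy_chunks = to_chunks(toy)
# report the number of rows collected for each chunk id
for chunk_id, rows in toy_chunks.items():
    print('chunk=%d rows=%d' % (chunk_id, len(rows)))
```

The boolean mask *selection* is what does the work: it is true for every row whose chunk id matches, so indexing with it returns only the rows for that chunk.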

Now that we know how to load the data and split it into chunks, we can separate into train and test datasets.

Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.

We can split each chunk into the first five days of observations for training and the last three for test.

Each observation has a column called ‘*position_within_chunk*‘ that varies from 1 to 192 (8 days * 24 hours). We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and all rows with a value greater than 120 as test data.

Further, any chunks that don’t have any observations in the train or test split can be dropped as not viable.

When working with the naive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data and have the train and test data only comprised of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.

The *split_train_test()* function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.

    # split each chunk into train/test sets
    def split_train_test(chunks, row_in_chunk_ix=2):
        train, test = list(), list()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        # enumerate chunks
        for k,rows in chunks.items():
            # split chunk rows by 'position_within_chunk'
            train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
            test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
            if len(train_rows) == 0 or len(test_rows) == 0:
                print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
                continue
            # store with chunk id, position in chunk, hour and all targets
            indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
            train.append(train_rows[:, indices])
            test.append(test_rows[:, indices])
        return train, test
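The core of the split is a pair of boolean masks on the position column. The sketch below shows just that step on a made-up chunk; the toy rows and the three-column layout are assumptions for illustration, not the real column layout of the dataset.

```python
# demonstrate the train/test cut on a toy chunk
from numpy import array

# toy chunk (hypothetical): columns are [chunkID, position_within_chunk, value]
rows = array([
    [1, 119, 0.5],
    [1, 120, 0.6],
    [1, 121, 0.7],
    [1, 192, 0.8]])
# first 5 days of hourly observations for train
cut_point = 5 * 24
# positions up to and including 120 go to train, the rest to test
train_rows = rows[rows[:, 1] <= cut_point, :]
test_rows = rows[rows[:, 1] > cut_point, :]
print(train_rows.shape, test_rows.shape)
```

Position 120 lands in the training split because the comparison is inclusive, which matches the "less than or equal to 120" rule described above.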

We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three day period, specifically the lead times:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Here, each lead time is relative to the end of the training period.

First, we can put these lead times into a function for easy reference:

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

Next, we can reduce the test dataset down to just the data at the preferred lead times.

We can do that by looking at the ‘*position_within_chunk*‘ column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 +2, etc.

If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.

The function *to_forecasts()* below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.

    # convert the rows in a test chunk to forecasts
    def to_forecasts(test_chunks, row_in_chunk_ix=1):
        # get lead times
        lead_times = get_lead_times()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        forecasts = list()
        # enumerate each chunk
        for rows in test_chunks:
            chunk_id = rows[0, 0]
            # enumerate each lead time
            for tau in lead_times:
                # determine the row in chunk we want for the lead time
                offset = cut_point + tau
                # retrieve data for the lead time using row number in chunk
                row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
                # check if we have data
                if len(row_for_tau) == 0:
                    # create a mock row [chunk, position, hour] + [nan...]
                    row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                    forecasts.append(row)
                else:
                    # store the forecast row
                    forecasts.append(row_for_tau[0])
        return array(forecasts)
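As a quick check of the offsets used above, the sketch below converts each lead time into the ‘*position_within_chunk*‘ value it corresponds to. It also works out how many forecast rows to expect, assuming the 207 usable chunks that remain after the one unusable chunk is dropped.

```python
# map forecast lead times to positions within a chunk

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# first 5 days of hourly observations for train
cut_point = 5 * 24
# each lead time is an offset from the end of the training period
positions = [cut_point + tau for tau in get_lead_times()]
print(positions)

# one row per lead time per chunk, assuming 207 usable chunks
n_chunks = 207
print('Expected test rows: %d' % (n_chunks * len(positions)))
```

The final lead time of +72 maps to position 192, the last hour of the eight-day chunk.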

We can tie all of this together and split the dataset into train and test sets and save the results to new files.

The complete code example is listed below.

    # split data into train and test sets
    from numpy import unique
    from numpy import nan
    from numpy import array
    from numpy import savetxt
    from pandas import read_csv

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # split each chunk into train/test sets
    def split_train_test(chunks, row_in_chunk_ix=2):
        train, test = list(), list()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        # enumerate chunks
        for k,rows in chunks.items():
            # split chunk rows by 'position_within_chunk'
            train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
            test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
            if len(train_rows) == 0 or len(test_rows) == 0:
                print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
                continue
            # store with chunk id, position in chunk, hour and all targets
            indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
            train.append(train_rows[:, indices])
            test.append(test_rows[:, indices])
        return train, test

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

    # convert the rows in a test chunk to forecasts
    def to_forecasts(test_chunks, row_in_chunk_ix=1):
        # get lead times
        lead_times = get_lead_times()
        # first 5 days of hourly observations for train
        cut_point = 5 * 24
        forecasts = list()
        # enumerate each chunk
        for rows in test_chunks:
            chunk_id = rows[0, 0]
            # enumerate each lead time
            for tau in lead_times:
                # determine the row in chunk we want for the lead time
                offset = cut_point + tau
                # retrieve data for the lead time using row number in chunk
                row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
                # check if we have data
                if len(row_for_tau) == 0:
                    # create a mock row [chunk, position, hour] + [nan...]
                    row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                    forecasts.append(row)
                else:
                    # store the forecast row
                    forecasts.append(row_for_tau[0])
        return array(forecasts)

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # split into train/test
    train, test = split_train_test(chunks)
    # flatten training chunks to rows
    train_rows = array([row for rows in train for row in rows])
    print('Train Rows: %s' % str(train_rows.shape))
    # reduce test to forecast lead times only
    test_rows = to_forecasts(test)
    print('Test Rows: %s' % str(test_rows.shape))
    # save datasets
    savetxt('AirQualityPrediction/naive_train.csv', train_rows, delimiter=',')
    savetxt('AirQualityPrediction/naive_test.csv', test_rows, delimiter=',')

Running the example first reports that chunk 69 is removed from the dataset for having insufficient data.

We can then see that we have 42 columns in each of the train and test sets: one each for the chunk id, position within chunk, and hour of day, plus the 39 target variables.

We can also see the dramatically smaller version of the test dataset with rows only at the forecast lead times.

The new train and test datasets are saved in the ‘*naive_train.csv*‘ and ‘*naive_test.csv*‘ files respectively.

    >dropping chunk=69: train=(0, 95), test=(28, 95)
    Train Rows: (23514, 42)
    Test Rows: (2070, 42)

Once forecasts have been made, they need to be evaluated.

It is helpful to have a simpler format when evaluating forecasts. For example, we will use the three-dimensional structure of *[chunks][variables][time]*, where variable is the target variable number from 0 to 38 and time is the lead time index from 0 to 9.

Models are expected to make predictions in this format.

We can also restructure the test dataset into this format for comparison. The *prepare_test_forecasts()* function below implements this.

    # convert the test dataset in chunks to [chunk][variable][time] format
    def prepare_test_forecasts(test_chunks):
        predictions = list()
        # enumerate chunks to forecast
        for rows in test_chunks:
            # enumerate targets for chunk
            chunk_predictions = list()
            for j in range(3, rows.shape[1]):
                yhat = rows[:, j]
                chunk_predictions.append(yhat)
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)
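A short sketch can confirm the shape this restructuring produces. It runs the same function on two placeholder chunks filled with zeros, each with 10 lead-time rows and 42 columns; the placeholder data is an assumption for illustration only.

```python
# check the [chunk][variable][time] shape on placeholder data
from numpy import array
from numpy import zeros

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk, skipping the 3 leading id columns
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# two hypothetical chunks: 10 lead-time rows by 42 columns each
toy_chunks = [zeros((10, 42)), zeros((10, 42))]
actual = prepare_test_forecasts(toy_chunks)
print(actual.shape)
```

The three leading columns (chunk id, position within chunk, and hour) are skipped, which leaves 39 target variables by 10 lead times per chunk.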

We will evaluate a model using the mean absolute error, or MAE. This is the metric that was used in the competition and is a sensible choice given the non-Gaussian distribution of the target variables.

If a lead time contains no data in the test set (e.g. *NaN*), then no error will be calculated for that forecast. If the lead time does have data in the test set but no data in the forecast, then the full magnitude of the observation will be taken as error. Finally, if the test set has an observation and a forecast was made, then the absolute difference will be recorded as the error.

The *calculate_error()* function implements these rules and returns the error for a given forecast.

    # calculate the error between an actual and predicted value
    def calculate_error(actual, predicted):
        # give the full actual value if predicted is nan
        if isnan(predicted):
            return abs(actual)
        # calculate abs difference
        return abs(actual - predicted)
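The three rules can be checked with a few hand-picked values. In the sketch below, *score()* is a hypothetical wrapper added only to show the first rule (no observation, no error), which in the tutorial's code is handled by the caller rather than by *calculate_error()* itself.

```python
# walk through the three error rules with hand-picked values
from math import isnan
from math import nan

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# hypothetical wrapper: return None when there is no observation to score
def score(actual, predicted):
    if isnan(actual):
        return None
    return calculate_error(actual, predicted)

# rule 1: missing observation, so no error is recorded
print(score(nan, 1.0))
# rule 2: missing forecast, so the full observation counts as error
print(score(2.5, nan))
# rule 3: both present, so the absolute difference is the error
print(score(2.5, 2.0))
```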

Errors are summed across all chunks and all lead times, then averaged.

The overall MAE will be calculated, but we will also calculate a MAE for each forecast lead time. This can help with model selection generally as some models may perform differently at different lead times.

The evaluate_forecasts() function below implements this, calculating the MAE and per-lead time MAE for the provided predictions and expected values in *[chunk][variable][time]* format.

    # evaluate a forecast in the format [chunk][variable][time]
    def evaluate_forecasts(predictions, testset):
        lead_times = get_lead_times()
        total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
        total_c, times_c = 0, [0 for _ in range(len(lead_times))]
        # enumerate test chunks
        for i in range(len(testset)):
            # convert to forecasts
            actual = testset[i]
            predicted = predictions[i]
            # enumerate target variables
            for j in range(predicted.shape[0]):
                # enumerate lead times
                for k in range(len(lead_times)):
                    # skip if actual is nan
                    if isnan(actual[j, k]):
                        continue
                    # calculate error
                    error = calculate_error(actual[j, k], predicted[j, k])
                    # update statistics
                    total_mae += error
                    times_mae[k] += error
                    total_c += 1
                    times_c[k] += 1
        # normalize summed absolute errors
        total_mae /= total_c
        times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
        return total_mae, times_mae
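For readers who prefer array operations, the same scoring rules can be sketched in vectorized form. This is an alternative written for illustration, not the implementation used in this tutorial; the function name *evaluate_forecasts_vectorized()* and the toy arrays are assumptions.

```python
# vectorized sketch of overall and per-lead-time MAE with the NaN rules
import numpy as np

def evaluate_forecasts_vectorized(predictions, actual):
    predictions = np.asarray(predictions, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # a missing forecast is charged the full magnitude of the observation
    errors = np.where(np.isnan(predictions), np.abs(actual), np.abs(actual - predictions))
    # errors are NaN wherever the observation is NaN, so nanmean skips them
    total_mae = np.nanmean(errors)
    # average over chunks and variables, leaving one value per lead time
    times_mae = np.nanmean(errors, axis=(0, 1))
    return total_mae, times_mae

# toy data in [chunk][variable][time] format: 2 chunks, 1 variable, 2 lead times
toy_actual = [[[1.0, np.nan]], [[3.0, 4.0]]]
toy_predicted = [[[0.5, 2.0]], [[np.nan, 5.0]]]
total_mae, times_mae = evaluate_forecasts_vectorized(toy_predicted, toy_actual)
print(total_mae, times_mae)
```

In the toy data, one observation is missing and is skipped, and one forecast is missing and is charged its full observed value of 3.0.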

Once we have the evaluation of a model, we can present it.

The *summarize_error()* function below first prints a one-line summary of a model’s performance then creates a plot of MAE per forecast lead time.

    # summarize scores
    def summarize_error(name, total_mae, times_mae):
        # print summary
        lead_times = get_lead_times()
        formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
        s_scores = ', '.join(formatted)
        print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
        # plot summary
        pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
        pyplot.show()

We are now ready to start exploring the performance of naive forecasting methods.

The first step in fitting classical time series models to this data is to take a closer look at the data.

There are 208 (actually 207) usable chunks of data, and each chunk has 39 time series to fit; that is a total of 8,073 separate models that would need to be fit on the data. That is a lot of models, but the models are trained on a relatively small amount of data, at most (5 * 24) or 120 observations and the model is linear so it will find a fit quickly.

We have choices about how to configure models to the data; for example:

- One model configuration for all time series (simplest).
- One model configuration for all variables across chunks (reasonable).
- One model configuration per variable per chunk (most complex).

We will investigate the simplest approach of one model configuration for all series, but you may want to explore one or more of the other approaches.

This section is divided into three parts; they are:

- Missing Data
- Impute Missing Data
- Autocorrelation Plots

Classical time series methods require the time series to be complete, e.g. that there are no missing values.

Therefore the first step is to investigate how complete or incomplete the target variables are.

For a given variable, there may be missing observations defined by missing rows. Specifically, each observation has a ‘*position_within_chunk*‘. We expect each chunk in the training dataset to have 120 observations, with ‘*position_within_chunk*‘ values from 1 to 120 inclusive.

Therefore, we can create an array of 120 *NaN* values for each variable and fill in every observation in the chunk at the index given by its ‘*position_within_chunk*‘ value; any position without an observation remains *NaN*. We can then plot each variable and look for gaps.

The *variable_to_series()* function below will take the rows for a chunk and a given column index for the target variable and will return a series of 120 time steps for the variable with all available data marked with the value from the chunk.

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

We can then call this function for each target variable in one chunk and create a line plot.

The function below named *plot_variables()* will implement this and create a figure with 39 line plots stacked vertically, one row per variable.

# plot variables horizontally with gaps for missing data
def plot_variables(chunk_train, n_vars=39):
    pyplot.figure()
    for i in range(n_vars):
        # convert target number into column number
        col_ix = 3 + i
        # mark missing obs for variable
        series = variable_to_series(chunk_train, col_ix)
        # plot
        ax = pyplot.subplot(n_vars, 1, i+1)
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        pyplot.plot(series)
    # show plot
    pyplot.show()

Tying this together, the complete example is listed below. A plot of all variables in the first chunk is created.

# plot missing
from numpy import loadtxt
from numpy import nan
from numpy import unique
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# plot variables horizontally with gaps for missing data
def plot_variables(chunk_train, n_vars=39):
    pyplot.figure()
    for i in range(n_vars):
        # convert target number into column number
        col_ix = 3 + i
        # mark missing obs for variable
        series = variable_to_series(chunk_train, col_ix)
        # plot
        ax = pyplot.subplot(n_vars, 1, i+1)
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        pyplot.plot(series)
    # show plot
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
# pick one chunk
rows = train_chunks[0]
# plot variables
plot_variables(rows)

Running the example creates a figure with 39 line plots, one for each target variable in the first chunk.

We can see a seasonal structure in many of the variables. This suggests it may be beneficial to perform a 24-hour seasonal differencing of each series prior to modeling.

The plots are small, and you may need to increase the size of the figure to clearly see the data.

We can see that there are variables for which we have no data. These can be detected and ignored as we cannot model or forecast them.

We can see gaps in many of the series, but the gaps are short, lasting for a few hours at most. These could be imputed either with persistence of previous values or values at the same hours within the same series.
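Reading gap lengths off 39 small plots is error-prone, so it can help to measure them directly. The sketch below is not part of the original example: *gap_lengths()* is a hypothetical helper that takes a series laid out as a plain list with *NaN* marking missing time steps (the same layout returned by *variable_to_series()*) and returns the length of each run of missing values.

```python
from math import isnan, nan

# hypothetical helper: length of each run of consecutive NaN values
def gap_lengths(series):
    gaps, run = list(), 0
    for value in series:
        if isnan(value):
            # extend the current gap
            run += 1
        elif run > 0:
            # an observation closes the current gap
            gaps.append(run)
            run = 0
    # a gap may run to the end of the series
    if run > 0:
        gaps.append(run)
    return gaps

# mock series with gaps of 2, 1 and 3 hours
series = [1.0, nan, nan, 2.0, 3.0, nan, 4.0, nan, nan, nan]
print(gap_lengths(series))
```

The longest gap per variable, e.g. *max(gap_lengths(series))* when the list is not empty, gives a quick way to separate chunks with short gaps from those that need heavier repair.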

Looking at a few other chunks randomly, many result in plots with much the same observations.

This is not always the case though.

Update the example to plot the 4th chunk in the dataset (index 3).

# pick one chunk
rows = train_chunks[3]

The result is a figure that tells a very different story.

We see gaps in the data that last for many hours, perhaps up to a day or more.

These series will require dramatic repair before they can be used to fit a classical model.

Imputing the missing data using persistence or observations within the series with the same hour will likely not be sufficient. They may have to be filled with average values taken across the entire training dataset.

There are many ways to impute the missing data, and we cannot know which is best a priori.

One approach would be to prepare the data using multiple different imputation methods and use the skill of the models fit on the data to help guide the best approach.

Some imputation approaches already suggested include:

- Persist the last observation in the series, also called forward filling or last observation carried forward.
- Fill with values or average values within the series with the same hour of day.
- Fill with values or average values with the same hour of day across the training dataset.

It may also be useful to use combinations, e.g. persist or fill from the series for small gaps and draw from the whole dataset for large gaps.
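As a point of comparison for the hour-based strategy developed next, persistence can be sketched in a few lines. This is an illustrative sketch only and is not used in the rest of the tutorial. The function name *impute_persist()* is hypothetical, and the choice to fill a leading gap with the first observed value is an assumption, since there is no earlier observation to persist.

```python
from math import isnan, nan

# hypothetical helper: fill each gap with the most recent observation
def impute_persist(series):
    filled = list(series)
    # leading gaps have no prior value, so start from the first observation (assumption)
    last = next((v for v in filled if not isnan(v)), nan)
    for i, value in enumerate(filled):
        if isnan(value):
            # carry the last seen observation forward
            filled[i] = last
        else:
            last = value
    return filled

series = [nan, 1.0, nan, nan, 3.0, nan]
print(impute_persist(series))
```

A series with no observations at all is returned unchanged (all *NaN*), which leaves the decision about unusable variables to the caller.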

We can also investigate the effect of imputing methods by filling in the missing data and looking at plots to see if the series looks reasonable. It’s crude, effective, and fast.

First, we need to calculate a parallel series of the hour of day for each chunk that we can use for imputing hour-specific data for each variable in the chunk.

Given a series of partially filled hours of day, the *interpolate_hours()* function below will fill in the missing hours of day. It does this by finding the first marked hour, then counting forward, filling in the hour of day, then performing the same operation backwards.

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

I’m sure there is a more Pythonic way to write this function, but I wanted to lay it all out to make it obvious what was going on.

We can test this out on a mock list of hours with missing data. The complete example is listed below.

# interpolate hours
from numpy import nan
from numpy import isnan

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# define hours with missing data
data = [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0, nan, 2, nan, nan, nan, nan, nan, nan, 9, 10, 11, 12, 13, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
print(data)
# fill in missing hours
interpolate_hours(data)
print(data)

Running the example first prints the hour data with missing values, then the same sequence with all of the hours filled in correctly.

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, 0, nan, 2, nan, nan, nan, nan, nan, nan, 9, 10, 11, 12, 13, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 0, 1]
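For readers curious about the more compact version mentioned above, one possibility is to compute each missing hour directly from its offset to the first known hour, rather than counting forward and backward. This is a sketch of an alternative, not a replacement used elsewhere in the tutorial. The name *interpolate_hours_mod()* is hypothetical, and returning early when no hour is known is an added assumption about how that case should be handled.

```python
from math import isnan, nan

# hypothetical alternative: fill missing hours (in place) from the first known hour
def interpolate_hours_mod(hours):
    # locate the first observed hour, if any
    ix = next((i for i, h in enumerate(hours) if not isnan(h)), None)
    if ix is None:
        # nothing to anchor on, leave the series untouched (assumption)
        return
    anchor = hours[ix]
    for i, h in enumerate(hours):
        if isnan(h):
            # Python's % yields a value in 0..23 even for negative offsets
            hours[i] = (anchor + (i - ix)) % 24

# mock hours that wrap around midnight
hours = [nan, nan, 22, nan, nan, 1, nan]
interpolate_hours_mod(hours)
print(hours)
```

Like the original function, this only writes to positions that were missing, so any observed hours are left exactly as they were.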

We can use this function to prepare a series of hours for a chunk that can be used to fill in missing values for a chunk using hour-specific information.

We can call the same *variable_to_series()* function from the previous section to create the series of hours with missing values (column index 2), then call *interpolate_hours()* to fill in the gaps.

# prepare sequence of hours for the chunk
hours = variable_to_series(rows, 2)
# interpolate hours
interpolate_hours(hours)

We can then pass the hours to any impute function that may make use of it.

Let’s try filling in missing values in a chunk with values within the same series with the same hour. Specifically, we will find all rows with the same hour on the series and calculate the median value.

The *impute_missing()* function below takes all of the rows in a chunk, the prepared sequence of hours of the day for the chunk, the series with missing values for a variable, and the column index for that variable.

It first checks to see if the series is all missing data and returns immediately if this is the case as no impute can be performed. It then enumerates over the time steps of the series and when it detects a time step with no data, it collects all rows in the series with data for the same hour and calculates the median value.

# impute missing data
def impute_missing(rows, hours, series, col_ix):
    # count missing observations
    n_missing = count_nonzero(isnan(series))
    # calculate ratio of missing
    ratio = n_missing / float(len(series)) * 100
    # check for no data
    if ratio == 100.0:
        return series
    # impute missing using the median value for hour in the series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # get all rows with the same hour
            matches = rows[rows[:,2]==hours[i]]
            # fill with median value
            value = nanmedian(matches[:, col_ix])
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

To see the impact of this impute strategy, we can update the *plot_variables()* function from the previous section to first plot the imputed series then plot the original series with missing values.

This will allow the imputed values to shine through in the gaps of the original series and we can see if the results look reasonable.

The updated version of the *plot_variables()* function is listed below with this change, calling the *impute_missing()* function to create the imputed version of the series and taking the hours series as an argument.

# plot variables horizontally with gaps for missing data
def plot_variables(chunk_train, hours, n_vars=39):
    pyplot.figure()
    for i in range(n_vars):
        # convert target number into column number
        col_ix = 3 + i
        # mark missing obs for variable
        series = variable_to_series(chunk_train, col_ix)
        ax = pyplot.subplot(n_vars, 1, i+1)
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        # imputed
        imputed = impute_missing(chunk_train, hours, series, col_ix)
        # plot imputed
        pyplot.plot(imputed)
        # plot with missing
        pyplot.plot(series)
    # show plot
    pyplot.show()

Tying all of this together, the complete example is listed below.

# impute missing
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import nanmedian
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# impute missing data
def impute_missing(rows, hours, series, col_ix):
    # count missing observations
    n_missing = count_nonzero(isnan(series))
    # calculate ratio of missing
    ratio = n_missing / float(len(series)) * 100
    # check for no data
    if ratio == 100.0:
        return series
    # impute missing using the median value for hour in the series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # get all rows with the same hour
            matches = rows[rows[:,2]==hours[i]]
            # fill with median value
            value = nanmedian(matches[:, col_ix])
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# plot variables horizontally with gaps for missing data
def plot_variables(chunk_train, hours, n_vars=39):
    pyplot.figure()
    for i in range(n_vars):
        # convert target number into column number
        col_ix = 3 + i
        # mark missing obs for variable
        series = variable_to_series(chunk_train, col_ix)
        ax = pyplot.subplot(n_vars, 1, i+1)
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        # imputed
        imputed = impute_missing(chunk_train, hours, series, col_ix)
        # plot imputed
        pyplot.plot(imputed)
        # plot with missing
        pyplot.plot(series)
    # show plot
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
# pick one chunk
rows = train_chunks[0]
# prepare sequence of hours for the chunk
hours = variable_to_series(rows, 2)
# interpolate hours
interpolate_hours(hours)
# plot variables
plot_variables(rows, hours)

Running the example creates a single figure with 39 line plots: one for each target variable in the first chunk in the training dataset.

We can see the original data drawn in orange, with the imputed values showing through the gaps in blue.

The blue segments seem reasonable.

We can try the same approach on the 4th chunk in the dataset that has a lot more missing data.

# pick one chunk
rows = train_chunks[3]

Running the example creates the same kind of figure, but here we can see the large missing segments filled in with imputed values.

Again, the sequences seem reasonable, even showing daily seasonal cycle structure where appropriate.

This looks like a good start; you can explore other imputation strategies and see how they compare either in terms of line plots or on the resulting model skill.

Now that we know how to fill in the missing values, we can take a look at autocorrelation plots for the series data.

Autocorrelation plots summarize the relationship of each observation with observations at prior time steps. Together with partial autocorrelation plots, they can be used to determine the configuration for an ARMA model.
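To make the quantity behind an ACF plot concrete, the sketch below computes a common sample estimate of the autocorrelation at a single lag by hand. It is an illustration on a mock series, not the code used by statsmodels, and the function name *autocorrelation()* is hypothetical. The mock data is five days of a clean 24-hour cycle, chosen to mimic the daily structure seen in the plots above.

```python
from math import sin, pi

# hypothetical helper: sample autocorrelation of a complete series at a positive lag
def autocorrelation(series, lag):
    n = len(series)
    mean = sum(series) / n
    # total variation around the mean
    denom = sum((x - mean) ** 2 for x in series)
    # co-variation between each observation and the one 'lag' steps earlier
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

# mock hourly series: 5 days with a clean daily cycle
series = [sin(2 * pi * t / 24) for t in range(120)]
for lag in (1, 12, 24):
    print(lag, round(autocorrelation(series, lag), 3))
```

On this mock series the correlation is strongly positive at lag 1, strongly negative at lag 12 (half a day out of phase), and strongly positive again at lag 24, which is the kind of daily pattern to look for in the ACF plots.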

The statsmodels library provides the plot_acf() and plot_pacf() functions that can be used to plot ACF and PACF plots respectively.

We can update the *plot_variables()* function to create these plots, one of each type for each of the 39 series. That is a lot of plots.

We will stack all ACF plots on the left vertically and all PACF plots on the right vertically. That is two columns of 39 plots. We will limit the lags considered by the plot to 24 time steps (hours) and ignore the correlation of each variable with itself as it is redundant.

The updated *plot_variables()* function for plotting ACF and PACF plots is listed below.

# plot acf and pacf plots for each imputed variable series
def plot_variables(chunk_train, hours, n_vars=39):
    pyplot.figure()
    n_plots = n_vars * 2
    j = 0
    lags = 24
    for i in range(1, n_plots, 2):
        # convert target number into column number
        col_ix = 3 + j
        j += 1
        # get series
        series = variable_to_series(chunk_train, col_ix)
        imputed = impute_missing(chunk_train, hours, series, col_ix)
        # acf
        axis = pyplot.subplot(n_vars, 2, i)
        plot_acf(imputed, ax=axis, lags=lags, zero=False)
        axis.set_title('')
        axis.set_xticklabels([])
        axis.set_yticklabels([])
        # pacf
        axis = pyplot.subplot(n_vars, 2, i+1)
        plot_pacf(imputed, ax=axis, lags=lags, zero=False)
        axis.set_title('')
        axis.set_xticklabels([])
        axis.set_yticklabels([])
    # show plot
    pyplot.show()

The complete example is listed below.

# acf and pacf plots
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import nanmedian
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# impute missing data
def impute_missing(rows, hours, series, col_ix):
    # count missing observations
    n_missing = count_nonzero(isnan(series))
    # calculate ratio of missing
    ratio = n_missing / float(len(series)) * 100
    # check for no data
    if ratio == 100.0:
        return series
    # impute missing using the median value for hour in the series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # get all rows with the same hour
            matches = rows[rows[:,2]==hours[i]]
            # fill with median value
            value = nanmedian(matches[:, col_ix])
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# plot acf and pacf plots for each imputed variable series
def plot_variables(chunk_train, hours, n_vars=39):
    pyplot.figure()
    n_plots = n_vars * 2
    j = 0
    lags = 24
    for i in range(1, n_plots, 2):
        # convert target number into column number
        col_ix = 3 + j
        j += 1
        # get series
        series = variable_to_series(chunk_train, col_ix)
        imputed = impute_missing(chunk_train, hours, series, col_ix)
        # acf
        axis = pyplot.subplot(n_vars, 2, i)
        plot_acf(imputed, ax=axis, lags=lags, zero=False)
        axis.set_title('')
        axis.set_xticklabels([])
        axis.set_yticklabels([])
        # pacf
        axis = pyplot.subplot(n_vars, 2, i+1)
        plot_pacf(imputed, ax=axis, lags=lags, zero=False)
        axis.set_title('')
        axis.set_xticklabels([])
        axis.set_yticklabels([])
    # show plot
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
# pick one chunk
rows = train_chunks[0]
# prepare sequence of hours for the chunk
hours = variable_to_series(rows, 2)
# interpolate hours
interpolate_hours(hours)
# plot variables
plot_variables(rows, hours)

Running the example creates a figure with a lot of plots for the target variables in the first chunk of the training dataset.

You may need to increase the size of the plot window to better see the details of each plot.

We can see on the left that most ACF plots show significant correlations (dots above the significance region) at lags 1-2 steps, maybe lags 1-3 steps in some cases, with a slow, steady decrease over the lags.

Similarly, on the right, we can see significant lags in the PACF plot at 1-2 time steps with steep fall-off.

This strongly suggests an autoregressive process with an order of perhaps 1, 2, or 3, e.g. AR(3).

In the ACF plots on the left we can also see a daily cycle in the correlations. This may suggest some benefit in a seasonal differencing of the data prior to modeling or the use of an AR model capable of seasonal differencing.
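Seasonal differencing is not applied in the models that follow, but the idea is simple enough to sketch. The two functions below are hypothetical helpers, not part of the tutorial's code: one subtracts the observation from 24 hours earlier, and the other adds forecast differences back onto the history to return to the original scale. The mock series uses whole numbers so the effect is easy to check by eye.

```python
# hypothetical helper: subtract the observation one period earlier
def seasonal_difference(series, period=24):
    return [series[t] - series[t - period] for t in range(period, len(series))]

# hypothetical helper: map forecast differences back to the original scale
def invert_seasonal_difference(history, diff_forecast, period=24):
    restored = list(history)
    for d in diff_forecast:
        # each new value builds on the value one period before it
        restored.append(restored[-period] + d)
    return restored[len(history):]

# mock series: a repeating daily pattern that shifts up by 10 each day
series = [(t % 24) + 10 * (t // 24) for t in range(72)]
diff = seasonal_difference(series)
print(diff[:5])

# treat the last day's differences as a 'forecast' and invert them
restored = invert_seasonal_difference(series[:48], diff[24:])
print(restored == series[48:])
```

The differenced series is 24 observations shorter than the original, which matters here because each chunk has at most 120 observations to begin with.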

We can repeat this analysis of the target variables for other chunks and we see much the same picture.

It suggests we may be able to get away with a general AR model configuration for all series across all chunks.

In this section, we will develop an autoregressive model for the imputed target series data.

The first step is to implement a general function for making a forecast for each chunk.

The function takes the training dataset and the input columns (chunk id, position in chunk, and hour) for the test set and returns forecasts for all chunks with the expected 3D format of *[chunk][variable][time]*.

The function enumerates the chunks in the forecast, then enumerates the 39 target columns, calling another new function named *forecast_variable()* in order to make a prediction for each lead time for a given target variable.

The complete function is listed below.

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # prepare sequence of hours for the chunk
        hours = variable_to_series(train_chunks[i], 2)
        # interpolate hours
        interpolate_hours(hours)
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(hours, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

We can now implement a version of the *forecast_variable()*.

For each variable, we first check if there is no data (e.g. all NaNs) and if so, we return a forecast that is a NaN for each forecast lead time.

We then create a series from the variable using the *variable_to_series()* and then impute the missing values using the median within the series by calling *impute_missing()*, both of which were developed in the previous section.

Finally, we call a new function named *fit_and_forecast()* that fits a model and predicts the 10 forecast lead times.

# forecast all lead times for one variable
def forecast_variable(hours, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # check for no data
    if not has_data(chunk_train[:, col_ix]):
        forecast = [nan for _ in range(len(lead_times))]
        return forecast
    # get series
    series = variable_to_series(chunk_train, col_ix)
    # impute
    imputed = impute_missing(chunk_train, hours, series, col_ix)
    # fit AR model and forecast
    forecast = fit_and_forecast(imputed)
    return forecast

We will fit an AR model to a given imputed series. To do this, we will use the statsmodels ARIMA class. We will use ARIMA instead of AR to offer some flexibility if you would like to explore any of the family of ARIMA models.

First, we must define the model, including the order of the autoregressive process, such as AR(1).

# define the model
model = ARIMA(series, order=(1,0,0))

Next, the model is fit on the imputed series. We turn off the verbose information during the fit by setting *disp* to *False*.

# fit the model
model_fit = model.fit(disp=False)

The fit model is then used to forecast the next 72 hours beyond the end of the series.

# forecast 72 hours
yhat = model_fit.predict(len(series), len(series)+72)

We are only interested in specific lead times, so we prepare an array of those lead times, subtract 1 to turn them into array indices, then use them to select the values at the 10 forecast lead times in which we are interested.

# extract lead times
lead_times = array(get_lead_times())
indices = lead_times - 1
return yhat[indices]
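The off-by-one between lead times and array indices is easy to get wrong, so here is the same selection on a mock forecast using plain lists. This is only an illustration: *select_lead_times()* is a hypothetical helper, and the mock forecast holds 73 values on the assumption that both the start and end arguments to *predict()* above are inclusive.

```python
# hypothetical helper: pick 1-based lead times from a 0-based forecast sequence
def select_lead_times(yhat, lead_times):
    return [yhat[t - 1] for t in lead_times]

lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
# mock forecast: the value at index i stands in for the prediction i+1 hours ahead
yhat = [float(h) for h in range(1, 74)]
selected = select_lead_times(yhat, lead_times)
print(selected)
```

Because each mock value equals its own lead time in hours, the selected values should match the list of lead times exactly, which makes any indexing slip immediately visible.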

The statsmodels ARIMA models use linear algebra libraries to fit the model under the covers, and sometimes the fit process can be unstable on some data. As such, it can throw an exception or report a lot of warnings.

We will trap exceptions and return a *NaN* forecast, and ignore all warnings during the fit and evaluation.

The *fit_and_forecast()* function below ties all of this together.

# fit AR model and generate a forecast
def fit_and_forecast(series):
    # define the model
    model = ARIMA(series, order=(1,0,0))
    # return a nan forecast in case of exception
    try:
        # ignore statsmodels warnings
        with catch_warnings():
            filterwarnings("ignore")
            # fit the model
            model_fit = model.fit(disp=False)
            # forecast 72 hours
            yhat = model_fit.predict(len(series), len(series)+72)
            # extract lead times
            lead_times = array(get_lead_times())
            indices = lead_times - 1
            return yhat[indices]
    except:
        return [nan for _ in range(len(get_lead_times()))]

We are now ready to evaluate an autoregressive process for each of the 39 series in each of the 207 training chunks.

We will start off by testing an AR(1) process.

The complete code example is listed below.

# autoregression forecast
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import array
from numpy import nanmedian
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot
from warnings import catch_warnings
from warnings import filterwarnings

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# return true if the array has any non-nan values
def has_data(data):
    return count_nonzero(isnan(data)) < len(data)

# impute missing data
def impute_missing(rows, hours, series, col_ix):
    # impute missing using the median value for hour in the series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # get all rows with the same hour
            matches = rows[rows[:,2]==hours[i]]
            # fill with median value
            value = nanmedian(matches[:, col_ix])
            if isnan(value):
                value = 0.0
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# fit AR model and generate a forecast
def fit_and_forecast(series):
    # define the model
    model = ARIMA(series, order=(1,0,0))
    # return a nan forecast in case of exception
    try:
        # ignore statsmodels warnings
        with catch_warnings():
            filterwarnings("ignore")
            # fit the model
            model_fit = model.fit(disp=False)
            # forecast 72 hours
            yhat = model_fit.predict(len(series), len(series)+72)
            # extract lead times
            lead_times = array(get_lead_times())
            indices = lead_times - 1
            return yhat[indices]
    except:
        return [nan for _ in range(len(get_lead_times()))]

# forecast all lead times for one variable
def forecast_variable(hours, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # check for no data
    if not has_data(chunk_train[:, col_ix]):
        forecast = [nan for _ in range(len(lead_times))]
        return forecast
    # get series
    series = variable_to_series(chunk_train, col_ix)
    # impute
    imputed = impute_missing(chunk_train, hours, series, col_ix)
    # fit AR model and forecast
    forecast = fit_and_forecast(imputed)
    return forecast

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # prepare sequence of hours for the chunk
        hours = variable_to_series(train_chunks[i], 2)
        # interpolate hours
        interpolate_hours(hours)
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(hours, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks (use the argument, not the global test_chunks)
    for i in range(len(testset)):
        # convert to forecasts
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual in nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('AR', total_mae, times_mae)

Running the example first reports the overall MAE for the test set, followed by the MAE for each forecast lead time.

We can see that the model achieves an MAE of about 0.492, which is less than the MAE of 0.520 achieved by a naive persistence model. This shows that the approach does have some skill.

AR: [0.492 MAE] +1 0.225, +2 0.342, +3 0.410, +4 0.475, +5 0.512, +10 0.593, +17 0.586, +24 0.588, +48 0.588, +72 0.604

A line plot of MAE per forecast lead time is created, showing forecast error rising steeply over the first few lead times and then leveling off at roughly 0.59 to 0.60 from about +10 hours onward.
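The scoring rules are buried inside the complete example, so it may help to see them on a tiny mock case. The sketch below follows the same two rules as *calculate_error()* and *evaluate_forecasts()*: a missing actual value is skipped, and a *NaN* forecast is charged the full absolute actual value. It is a simplified illustration on 2D lists (series by lead time) rather than the 3D *[chunk][variable][time]* arrays used above, and *mae_by_lead_time()* is a hypothetical name.

```python
from math import isnan, nan

# error for one value; a NaN forecast is charged the full absolute actual value
def calculate_error(actual, predicted):
    if isnan(predicted):
        return abs(actual)
    return abs(actual - predicted)

# hypothetical helper: rows are series, columns are lead times
def mae_by_lead_time(actual, predicted):
    n_times = len(actual[0])
    sums, counts = [0.0] * n_times, [0] * n_times
    for a_row, p_row in zip(actual, predicted):
        for k in range(n_times):
            # nothing to score against when the actual value is missing
            if isnan(a_row[k]):
                continue
            sums[k] += calculate_error(a_row[k], p_row[k])
            counts[k] += 1
    # average per lead time, and over all scored values
    per_time = [s / c if c else nan for s, c in zip(sums, counts)]
    total = sum(sums) / sum(counts)
    return total, per_time

# mock data: two series, three lead times
actual = [[1.0, 2.0, nan], [0.5, nan, 3.0]]
predicted = [[1.5, nan, 9.0], [0.5, 1.0, 2.0]]
total, per_time = mae_by_lead_time(actual, predicted)
print(total, per_time)
```

Note that the overall MAE is an average over all scored values, not an average of the per-lead-time scores, so lead times with more available actual values carry more weight.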

We can change the code to test other AR models, specifically by changing the order of the ARIMA model in the *fit_and_forecast()* function.

An AR(2) model can be defined as:

model = ARIMA(series, order=(2,0,0))

Running the code with this update shows a further drop in error to an overall MAE of about 0.490.

AR: [0.490 MAE] +1 0.229, +2 0.342, +3 0.412, +4 0.470, +5 0.503, +10 0.563, +17 0.576, +24 0.605, +48 0.597, +72 0.608

We can also try an AR(3):

model = ARIMA(series, order=(3,0,0))

Re-running the example with the update shows an increase in the overall MAE compared to an AR(2).

An AR(2) might be a good global level configuration to use, although it is expected that models tailored to each variable or each series may perform better overall.

AR: [0.491 MAE] +1 0.232, +2 0.345, +3 0.412, +4 0.472, +5 0.504, +10 0.556, +17 0.575, +24 0.607, +48 0.599, +72 0.611

We can evaluate the AR(2) model with an alternate imputation strategy.

Instead of calculating the median value for the same hour across the series in the chunk, we can calculate the median value for the same hour across the variable in all chunks.

We can update the *impute_missing()* function to take all training chunks as an argument, then collect rows from all chunks for a given hour in order to calculate the median value used to impute. The updated version of the function is listed below.

# impute missing data
def impute_missing(train_chunks, rows, hours, series, col_ix):
    # impute missing using the median value for hour in all series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # collect all rows across all chunks for the hour
            all_rows = list()
            for chunk_rows in train_chunks:
                all_rows.extend(chunk_rows[chunk_rows[:, 2] == hours[i]])
            # calculate the central tendency for target
            all_rows = array(all_rows)
            # fill with median value
            value = nanmedian(all_rows[:, col_ix])
            if isnan(value):
                value = 0.0
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

In order to pass *train_chunks* to the *impute_missing()* function, we must update the *forecast_variable()* function to also take *train_chunks* as an argument and pass it along, and in turn update the *forecast_chunks()* function to pass *train_chunks*.

The complete example using a global imputation strategy is listed below.

# autoregression forecast with global impute strategy
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import count_nonzero
from numpy import unique
from numpy import array
from numpy import nanmedian
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot
from warnings import catch_warnings
from warnings import filterwarnings

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# interpolate series of hours (in place) in 24 hour time
def interpolate_hours(hours):
    # find the first hour
    ix = -1
    for i in range(len(hours)):
        if not isnan(hours[i]):
            ix = i
            break
    # fill-forward
    hour = hours[ix]
    for i in range(ix+1, len(hours)):
        # increment hour
        hour += 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24
    # fill-backward
    hour = hours[ix]
    for i in range(ix-1, -1, -1):
        # decrement hour
        hour -= 1
        # check for a fill
        if isnan(hours[i]):
            hours[i] = hour % 24

# return true if the array has any non-nan values
def has_data(data):
    return count_nonzero(isnan(data)) < len(data)

# impute missing data
def impute_missing(train_chunks, rows, hours, series, col_ix):
    # impute missing using the median value for hour in all series
    imputed = list()
    for i in range(len(series)):
        if isnan(series[i]):
            # collect all rows across all chunks for the hour
            all_rows = list()
            for chunk_rows in train_chunks:
                all_rows.extend(chunk_rows[chunk_rows[:, 2] == hours[i]])
            # calculate the central tendency for target
            all_rows = array(all_rows)
            # fill with median value
            value = nanmedian(all_rows[:, col_ix])
            if isnan(value):
                value = 0.0
            imputed.append(value)
        else:
            imputed.append(series[i])
    return imputed

# layout a variable with breaks in the data for missing positions
def variable_to_series(chunk_train, col_ix, n_steps=5*24):
    # lay out whole series
    data = [nan for _ in range(n_steps)]
    # mark all available data
    for i in range(len(chunk_train)):
        # get position in chunk
        position = int(chunk_train[i, 1] - 1)
        # store data
        data[position] = chunk_train[i, col_ix]
    return data

# fit AR model and generate a forecast
def fit_and_forecast(series):
    # define the model
    model = ARIMA(series, order=(2,0,0))
    # return a nan forecast in case of exception
    try:
        # ignore statsmodels warnings
        with catch_warnings():
            filterwarnings("ignore")
            # fit the model
            model_fit = model.fit(disp=False)
            # forecast 72 hours
            yhat = model_fit.predict(len(series), len(series)+72)
            # extract lead times
            lead_times = array(get_lead_times())
            indices = lead_times - 1
            return yhat[indices]
    except Exception:
        return [nan for _ in range(len(get_lead_times()))]

# forecast all lead times for one variable
def forecast_variable(hours, train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # check for no data
    if not has_data(chunk_train[:, col_ix]):
        forecast = [nan for _ in range(len(lead_times))]
        return forecast
    # get series
    series = variable_to_series(chunk_train, col_ix)
    # impute
    imputed = impute_missing(train_chunks, chunk_train, hours, series, col_ix)
    # fit AR model and forecast
    forecast = fit_and_forecast(imputed)
    return forecast

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # prepare sequence of hours for the chunk
        hours = variable_to_series(train_chunks[i], 2)
        # interpolate hours
        interpolate_hours(hours)
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(hours, train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # convert to forecasts
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('AR', total_mae, times_mae)

Running the example shows a further drop in the overall MAE to about 0.487.

It may be interesting to explore imputation strategies that alternate the method used to fill in missing values based on how much missing data a series has or the gap being filled.

AR: [0.487 MAE] +1 0.228, +2 0.339, +3 0.409, +4 0.469, +5 0.499, +10 0.560, +17 0.573, +24 0.600, +48 0.595, +72 0.606

A line plot of MAE vs. forecast lead time is also created.
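One way to act on the idea of switching imputation method by gap length is sketched below. It is only an illustration and not part of the tutorial's code: the function name *impute_by_gap()*, the *max_gap* threshold, and the use of linear interpolation for short gaps are all assumptions. Long gaps fall back to the median of the series.

```python
from numpy import arange, array, flatnonzero, interp, isnan, nan, nanmedian

# hypothetical helper: interpolate short gaps, use the series median for long gaps
def impute_by_gap(series, max_gap=3):
    series = array(series, dtype=float)
    missing = isnan(series)
    # nothing to fill, or nothing to learn from
    if not missing.any() or missing.all():
        return series
    known = flatnonzero(~missing)
    # linear interpolation between known points (edges take the nearest known value)
    linear = interp(arange(len(series)), known, series[known])
    median = nanmedian(series)
    imputed = series.copy()
    i = 0
    while i < len(series):
        if not missing[i]:
            i += 1
            continue
        # find the end of this run of missing values
        j = i
        while j < len(series) and missing[j]:
            j += 1
        # short gaps are interpolated, long gaps get the median
        imputed[i:j] = linear[i:j] if (j - i) <= max_gap else median
        i = j
    return imputed

# contrived series with a one-step gap and a four-step gap
data = [1.0, nan, 3.0, nan, nan, nan, nan, 5.0]
print(impute_by_gap(data, max_gap=2))
```

With *max_gap=2*, the single missing value is interpolated to 2.0 and the four-step gap is filled with the median of 3.0. A per-hour median, as used in the tutorial, could be swapped in for the long-gap case.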

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Imputation Strategies**. Develop and evaluate one additional alternate imputation strategy for the missing data in each series.
- **Data Preparation**. Explore whether data preparation techniques applied to each series can improve model skill, such as standardization, normalization, and power transforms.
- **Differencing**. Explore whether differencing, such as 1-step or 24-step (seasonal differencing), can make each series stationary, and in turn result in better forecasts.
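As a starting point for the differencing idea, the sketch below shows 24-step (seasonal) differencing and its inverse on a contrived hourly series. The helper names *difference()* and *invert_difference()* are illustrative assumptions, and the series is assumed to have no missing values.

```python
from numpy import arange, array, pi, sin

# hypothetical helper: subtract the observation from `interval` steps earlier
def difference(series, interval=1):
    series = array(series, dtype=float)
    return series[interval:] - series[:-interval]

# hypothetical helper: add back the observation from `interval` steps earlier
def invert_difference(history, diffed_value, interval=1):
    return diffed_value + history[-interval]

# contrived hourly series: a daily cycle plus a slow upward drift
hours = arange(72)
series = 10.0 + 0.05 * hours + sin(2 * pi * hours / 24)
# 24-step differencing removes the daily cycle, leaving the drift over 24 hours
seasonal = difference(series, interval=24)
print(seasonal[:3])
# a value on the differenced scale is converted back to the original scale
restored = invert_difference(series[:-1], seasonal[-1], interval=24)
print(restored, series[-1])
```

A model would be fit on the differenced series, and each forecast would be passed through the inverse step before computing MAE so that scores stay comparable with the results above.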

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem
- A Gentle Introduction to Autocorrelation and Partial Autocorrelation
- How to Grid Search ARIMA Model Hyperparameters with Python

- statsmodels.graphics.tsaplots.plot_acf API
- statsmodels.graphics.tsaplots.plot_pacf API
- statsmodels.tsa.arima_model.ARIMA API

- EMC Data Science Global Hackathon (Air Quality Prediction)
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?

In this tutorial, you discovered how to develop autoregressive models for multi-step time series forecasting for a multivariate air pollution time series.

Specifically, you learned:

- How to analyze and impute missing values for time series data.
- How to develop and evaluate an autoregressive model for multi-step time series forecasting.
- How to improve an autoregressive model using alternate data imputation methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Autoregressive Forecasting Models for Multi-Step Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The post How to Develop Baseline Forecasts for Multi-Site Multivariate Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The EMC Data Science Global Hackathon dataset, or the ‘*Air Quality Prediction*’ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

An important first step when working with a new time series forecasting dataset is to develop a baseline in model performance by which the skill of all other more sophisticated strategies can be compared. Baseline forecasting strategies are simple and fast. They are referred to as ‘naive’ strategies because they assume very little or nothing about the specific forecasting problem.

In this tutorial, you will discover how to develop naive forecasting methods for the multistep multivariate air pollution time series forecasting problem.

After completing this tutorial, you will know:

- How to develop a test harness for evaluating forecasting strategies for the air pollution dataset.
- How to develop global naive forecast strategies that use data from the entire training dataset.
- How to develop local naive forecast strategies that use data from the specific interval that is being forecasted.

Let’s get started.

This tutorial is divided into six parts; they are:

- Problem Description
- Naive Methods
- Model Evaluation
- Global Naive Methods
- Chunk Naive Methods
- Summary of Results

The Air Quality Prediction dataset describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next 3 days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72 hour forecast period. They are:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.

Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.

The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.

Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using Mean Absolute Error (MAE). Submissions required the value of -1,000,000 to be specified in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and required to be adopted for all submissions (what a pain).

A winning entrant achieved a MAE of 0.21058 on the withheld test set (private leaderboard) using random forest on lagged observations. A writeup of this solution is available in the post:

- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon, 2012.

In this tutorial, we will explore how to develop naive forecasts for the problem that can be used as a baseline to determine whether a model has skill on the problem or not.

A baseline in forecast performance provides a point of comparison.

It is a point of reference for all other modeling techniques on your problem. If a model achieves performance at or below the baseline, the technique should be fixed or abandoned.

The technique used to generate a forecast to calculate the baseline performance must be easy to implement and naive of problem-specific details. The principle is that if a sophisticated forecast method cannot outperform a model that uses little or no problem-specific information, then it does not have skill.

There are problem-agnostic forecast methods that can and should be used first, followed by naive methods that use a modicum of problem-specific information.

Two examples of problem-agnostic naive forecast methods that could be used include:

- Persist the last observed value for each series.
- Forecast the average of observed values for each series.
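Both methods can be sketched in a few lines for a single series that contains missing values. This is a minimal illustration on contrived data; the function names *persistence_forecast()* and *average_forecast()* are assumptions made for the example.

```python
from numpy import array, isnan, nan, nanmean

# persist the last observed (non-NaN) value for every lead time
def persistence_forecast(series, n_lead_times):
    observed = [x for x in series if not isnan(x)]
    # no observations means no forecast
    value = observed[-1] if observed else nan
    return [value for _ in range(n_lead_times)]

# forecast the mean of the observed values for every lead time
def average_forecast(series, n_lead_times):
    observed = [x for x in series if not isnan(x)]
    value = nanmean(observed) if observed else nan
    return [value for _ in range(n_lead_times)]

# contrived series with two missing observations
series = array([2.0, nan, 4.0, 6.0, nan])
print(persistence_forecast(series, 3))
print(average_forecast(series, 3))
```

Here the persistence forecast repeats 6.0 (the last value that was actually observed) and the average forecast repeats 4.0.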

The data is divided into chunks, or intervals, of time. Each chunk of time has multiple variables at multiple sites to forecast. The persistence forecast method makes sense at this chunk-level of organization of the data.

Other persistence methods could be explored; for example:

- Forecast observations from the previous day for the next three days for each series.
- Forecast observations from the previous three days for the next three days for each series.

These are desirable baseline methods to explore, but the large amount of missing data and discontiguous structure of most of the data chunks make them challenging to implement without non-trivial data preparation.
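If a series were complete and contiguous, persisting the previous day (or previous three days) could be sketched as follows. The helper name *persist_last_days()* is hypothetical, and the sketch assumes no missing observations, which is exactly the assumption that does not hold for most chunks in this dataset.

```python
from numpy import arange, tile

# repeat the last `n_hours` observations to cover the forecast horizon
def persist_last_days(series, n_hours=24, horizon=72):
    last = series[-n_hours:]
    # number of repeats needed to cover the horizon (ceiling division)
    repeats = -(-horizon // n_hours)
    return tile(last, repeats)[:horizon]

# contrived five days of hourly data where each value is its own index
series = arange(120)
forecast = persist_last_days(series, n_hours=24, horizon=72)
# pick out the required lead times (+1 is index 0)
lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
print([int(forecast[t - 1]) for t in lead_times])
```

Setting *n_hours=72* gives the variant that persists the previous three days.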

Forecasting the average observations for each series can be elaborated further; for example:

- Forecast the global (across-chunk) average value for each series.
- Forecast the local (within-chunk) average value for each series.

A three-day forecast is required for each series with different start-times, e.g. times of day. As such, the forecast lead times for each chunk will fall on different hours of the day.
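To make this concrete, the short sketch below works out the hour of day for each lead time, given the hour of the last training observation. It assumes hours are recorded as integers from 0 to 23; the function name *lead_time_hours()* is an assumption for the example.

```python
# return the hour of day (0-23) that each lead time falls on
def lead_time_hours(last_train_hour, lead_times):
    return [(last_train_hour + tau) % 24 for tau in lead_times]

lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
# a chunk whose training data ends at 21:00
print(lead_time_hours(21, lead_times))
# a chunk whose training data ends at 08:00
print(lead_time_hours(8, lead_times))
```

The +24, +48, and +72 lead times always land on the same hour as the last training observation, whereas the earlier lead times land on different hours for different chunks.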

A further elaboration of forecasting the average value is to incorporate the hour of day that is being forecasted; for example:

- Forecast the global (across-chunk) average value for the hour of day for each forecast lead time.
- Forecast the local (within-chunk) average value for the hour of day for each forecast lead time.
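The hour-of-day average can be sketched as a lookup table with one entry per hour. The example below uses contrived data and a hypothetical helper, *average_by_hour()*; passing observations from one chunk gives the local version, and passing observations from all chunks gives the global version.

```python
from numpy import array, isnan, nan, nanmean

# average of the observed values for each hour of day (0-23), NaN where unseen
def average_by_hour(hours, values):
    hours, values = array(hours), array(values, dtype=float)
    averages = list()
    for hour in range(24):
        subset = values[hours == hour]
        # guard against hours with no usable observations
        if len(subset) == 0 or isnan(subset).all():
            averages.append(nan)
        else:
            averages.append(nanmean(subset))
    return averages

# contrived observations with one missing value
hours = [22, 23, 0, 22, 23, 0]
values = [1.0, 2.0, 3.0, 3.0, nan, 5.0]
averages = average_by_hour(hours, values)
# forecast lead times +1, +2, +3 for training data that ends at 21:00
forecast = [averages[(21 + tau) % 24] for tau in [1, 2, 3]]
print(forecast)
```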

Many variables are measured at multiple sites; as such, it may be possible to use information across series, such as in the calculation of averages or averages per hour of day for forecast lead times. These are interesting, but may exceed the mandate of naive.

This is a good starting point, although there may be further elaborations of the naive methods that you may want to consider and explore as an exercise. Remember, the goal is to use very little problem specific information in order to develop a forecast baseline.

In summary, we will investigate five different naive forecasting methods for this problem, the best of which will provide a lower-bound on performance by which other models can be compared. They are:

- Global Average Value per Series
- Global Average Value for Forecast Lead Time per Series
- Local Persisted Value per Series
- Local Average Value per Series
- Local Average Value for Forecast Lead Time per Series

Before we can evaluate naive forecasting methods, we must develop a test harness.

This includes at least how the data will be prepared and how forecasts will be evaluated.

The first step is to download the dataset and load it into memory.

The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in, in order to be able to download the dataset.

Download the entire dataset, e.g. “*Download All*”, to your workstation and unzip the archive in your current working directory with the folder named ‘*AirQualityPrediction*’.

Our focus will be the ‘*TrainingData.csv*’ file that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.

We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)

We can group data by the ‘chunkID’ variable (column index 1).

First, let’s get a list of the unique chunk identifiers.

chunk_ids = unique(values[:, 1])

We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.

chunks = dict()
# sort rows by chunk id
for chunk_id in chunk_ids:
    selection = values[:, 1] == chunk_id
    chunks[chunk_id] = values[selection, :]

The function below, named *to_chunks()*, takes a NumPy array of the loaded data and returns a dictionary of *chunk_id* to rows for the chunk.

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
    chunks = dict()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]
    return chunks

The complete example that loads the dataset and splits it into chunks is listed below.

# load data and split into chunks
from numpy import unique
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
    chunks = dict()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]
    return chunks

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
print('Total Chunks: %d' % len(chunks))

Running the example prints the number of chunks in the dataset.

Total Chunks: 208

Now that we know how to load the data and split it into chunks, we can separate into train and test datasets.

Each chunk covers an interval of eight days of hourly observations, although the number of actual observations within each chunk may vary widely.

We can split each chunk into the first five days of observations for training and the last three for test.

Each observation has a column called ‘*position_within_chunk*’ that varies from 1 to 192 (8 days * 24 hours). We can therefore take all rows with a value in this column that is less than or equal to 120 (5 * 24) as training data and any rows with a value greater than 120 as test data.

Further, any chunks that don’t have any observations in the train or test split can be dropped as not viable.
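The split itself is just a pair of boolean masks on the position column. The toy example below uses a contrived chunk with three columns, an assumption made for brevity since the real rows have many more, to show the cut at position 120 and the viability check.

```python
from numpy import array

# contrived chunk: columns are [chunk id, position_within_chunk, value]
rows = array([
    [1, 1, 0.5],
    [1, 120, 0.7],
    [1, 121, 0.9],
    [1, 192, 1.1]])
# first five days (120 hours) for train, remainder for test
cut_point = 5 * 24
train_rows = rows[rows[:, 1] <= cut_point, :]
test_rows = rows[rows[:, 1] > cut_point, :]
print(train_rows.shape, test_rows.shape)
# a chunk is only viable if both splits contain at least one row
viable = len(train_rows) > 0 and len(test_rows) > 0
print(viable)
```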

When working with the naive models, we are only interested in the target variables, and none of the input meteorological variables. Therefore, we can remove the input data and have the train and test data only comprised of the 39 target variables for each chunk, as well as the position within chunk and hour of observation.

The *split_train_test()* function below implements this behavior; given a dictionary of chunks, it will split each into a list of train and test chunk data.

# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
    train, test = list(), list()
    # first 5 days of hourly observations for train
    cut_point = 5 * 24
    # enumerate chunks
    for k,rows in chunks.items():
        # split chunk rows by 'position_within_chunk'
        train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
        test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
        if len(train_rows) == 0 or len(test_rows) == 0:
            print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
            continue
        # store with chunk id, position in chunk, hour and all targets
        indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
        train.append(train_rows[:, indices])
        test.append(test_rows[:, indices])
    return train, test

We do not require the entire test dataset; instead, we only require the observations at specific lead times over the three day period, specifically the lead times:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72

Each lead time is relative to the end of the training period.

First, we can put these lead times into a function for easy reference:

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

Next, we can reduce the test dataset down to just the data at the preferred lead times.

We can do that by looking at the ‘*position_within_chunk*’ column and using the lead time as an offset from the end of the training dataset, e.g. 120 + 1, 120 + 2, etc.

If we find a matching row in the test set, it is saved, otherwise a row of NaN observations is generated.

The function *to_forecasts()* below implements this and returns a NumPy array with one row for each forecast lead time for each chunk.

# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
    # get lead times
    lead_times = get_lead_times()
    # first 5 days of hourly observations for train
    cut_point = 5 * 24
    forecasts = list()
    # enumerate each chunk
    for rows in test_chunks:
        chunk_id = rows[0, 0]
        # enumerate each lead time
        for tau in lead_times:
            # determine the row in chunk we want for the lead time
            offset = cut_point + tau
            # retrieve data for the lead time using row number in chunk
            row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
            # check if we have data
            if len(row_for_tau) == 0:
                # create a mock row [chunk, position, hour] + [nan...]
                row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                forecasts.append(row)
            else:
                # store the forecast row
                forecasts.append(row_for_tau[0])
    return array(forecasts)

We can tie all of this together and split the dataset into train and test sets and save the results to new files.

The complete code example is listed below.

# split data into train and test sets
from numpy import unique
from numpy import nan
from numpy import array
from numpy import savetxt
from pandas import read_csv

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
    chunks = dict()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks[chunk_id] = values[selection, :]
    return chunks

# split each chunk into train/test sets
def split_train_test(chunks, row_in_chunk_ix=2):
    train, test = list(), list()
    # first 5 days of hourly observations for train
    cut_point = 5 * 24
    # enumerate chunks
    for k,rows in chunks.items():
        # split chunk rows by 'position_within_chunk'
        train_rows = rows[rows[:,row_in_chunk_ix] <= cut_point, :]
        test_rows = rows[rows[:,row_in_chunk_ix] > cut_point, :]
        if len(train_rows) == 0 or len(test_rows) == 0:
            print('>dropping chunk=%d: train=%s, test=%s' % (k, train_rows.shape, test_rows.shape))
            continue
        # store with chunk id, position in chunk, hour and all targets
        indices = [1,2,5] + [x for x in range(56,train_rows.shape[1])]
        train.append(train_rows[:, indices])
        test.append(test_rows[:, indices])
    return train, test

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# convert the rows in a test chunk to forecasts
def to_forecasts(test_chunks, row_in_chunk_ix=1):
    # get lead times
    lead_times = get_lead_times()
    # first 5 days of hourly observations for train
    cut_point = 5 * 24
    forecasts = list()
    # enumerate each chunk
    for rows in test_chunks:
        chunk_id = rows[0, 0]
        # enumerate each lead time
        for tau in lead_times:
            # determine the row in chunk we want for the lead time
            offset = cut_point + tau
            # retrieve data for the lead time using row number in chunk
            row_for_tau = rows[rows[:,row_in_chunk_ix]==offset, :]
            # check if we have data
            if len(row_for_tau) == 0:
                # create a mock row [chunk, position, hour] + [nan...]
                row = [chunk_id, offset, nan] + [nan for _ in range(39)]
                forecasts.append(row)
            else:
                # store the forecast row
                forecasts.append(row_for_tau[0])
    return array(forecasts)

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# split into train/test
train, test = split_train_test(chunks)
# flatten training chunks to rows
train_rows = array([row for rows in train for row in rows])
print('Train Rows: %s' % str(train_rows.shape))
# reduce test to forecast lead times only
test_rows = to_forecasts(test)
print('Test Rows: %s' % str(test_rows.shape))
# save datasets
savetxt('AirQualityPrediction/naive_train.csv', train_rows, delimiter=',')
savetxt('AirQualityPrediction/naive_test.csv', test_rows, delimiter=',')

Running the example first comments that chunk 69 is removed from the dataset for having insufficient data.

We can then see that we have 42 columns in each of the train and test sets: one each for the chunk id, position within chunk, and hour of day, plus the 39 target variables.

We can also see the dramatically smaller version of the test dataset with rows only at the forecast lead times.

The new train and test datasets are saved in the ‘*naive_train.csv*’ and ‘*naive_test.csv*’ files respectively.

>dropping chunk=69: train=(0, 95), test=(28, 95) Train Rows: (23514, 42) Test Rows: (2070, 42)

Once forecasts have been made, they need to be evaluated.

It is helpful to have a simpler format when evaluating forecasts. For example, we will use the three-dimensional structure of *[chunks][variables][time]*, where variable is the target variable number from 0 to 38 and time is the lead time index from 0 to 9.

Models are expected to make predictions in this format.

We can also restructure the test dataset to have this structure for comparison. The *prepare_test_forecasts()* function below implements this.

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)
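Because each test chunk holds exactly one row per lead time, the same restructuring can also be written as a slice and a transpose. The sketch below uses contrived data and assumes every chunk has the same shape (10 rows by 42 columns), so that the result is a regular 3D array.

```python
from numpy import arange, array

# contrived test set: 2 chunks, each with 10 lead-time rows and 42 columns
test_chunks = [arange(420).reshape(10, 42), arange(420, 840).reshape(10, 42)]
# drop the 3 id columns and transpose each chunk to [variable][time]
actual = array([rows[:, 3:].T for rows in test_chunks])
print(actual.shape)
# variable 0 of chunk 0 across the 10 lead times (column 3 of the original rows)
print(actual[0, 0])
```

The resulting shape is (2, 39, 10), that is, *[chunks][variables][time]*.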

We will evaluate a model using the mean absolute error, or MAE. This is the metric that was used in the competition and is a sensible choice given the non-Gaussian distribution of the target variables.

If a lead time contains no data in the test set (e.g. *NaN*), then no error will be calculated for that forecast. If the lead time does have data in the test set but no data in the forecast, then the full magnitude of the observation will be taken as error. Finally, if the test set has an observation and a forecast was made, then the absolute difference will be recorded as the error.

The *calculate_error()* function implements these rules and returns the error for a given forecast.

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)
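The three rules can be checked one at a time on single values. In the tutorial, the first rule (a missing actual) is handled by the caller rather than inside *calculate_error()*; the hypothetical *score()* function below folds it in purely so that all three cases can be seen side by side.

```python
from math import isnan, nan

# same rules as calculate_error(), with the 'no actual' case made explicit
def score(actual, predicted):
    # rule 1: nothing to compare against, so no error is recorded
    if isnan(actual):
        return None
    # rule 2: no forecast, so the full observation counts as error
    if isnan(predicted):
        return abs(actual)
    # rule 3: ordinary absolute error
    return abs(actual - predicted)

print(score(nan, 0.4))
print(score(0.8, nan))
print(score(0.8, 0.5))
```

The first call records no error, the second is charged the full 0.8, and the third is charged the absolute difference of about 0.3.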

Errors are summed across all chunks and all lead times, then averaged.

The overall MAE will be calculated, but we will also calculate a MAE for each forecast lead time. This can help with model selection generally as some models may perform differently at different lead times.

The *evaluate_forecasts()* function below implements this, calculating the MAE and per-lead time MAE for the provided predictions and expected values in *[chunk][variable][time]* format.

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # convert to forecasts
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae
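For reference, the same scores can be computed without explicit loops when both inputs are regular 3D arrays. The function below is a sketch under that assumption; the name *evaluate_vectorized()* is hypothetical. One difference in behavior: where a lead time has no actual values at all, *nanmean()* returns NaN with a warning, whereas the loop version would divide by zero.

```python
from numpy import array, isnan, nan, nanmean, where

# vectorized MAE for arrays shaped [chunk][variable][time]
def evaluate_vectorized(predictions, testset):
    predictions = array(predictions, dtype=float)
    testset = array(testset, dtype=float)
    # a missing forecast is charged the full magnitude of the observation
    filled = where(isnan(predictions), 0.0, predictions)
    errors = abs(testset - filled)
    # errors are NaN wherever the actual is missing, and nanmean() skips them
    total_mae = nanmean(errors)
    times_mae = nanmean(errors, axis=(0, 1))
    return total_mae, times_mae

# contrived case: 1 chunk, 2 variables, 2 lead times
testset = [[[1.0, nan], [2.0, 4.0]]]
predictions = [[[0.5, 1.0], [nan, 3.0]]]
total_mae, times_mae = evaluate_vectorized(predictions, testset)
print(total_mae, times_mae)
```

In this toy case the three scored errors are 0.5, 2.0, and 1.0, so the overall MAE is 3.5 / 3 and the per-lead-time MAEs are 1.25 and 1.0.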

Once we have the evaluation of a model, we can present it.

The *summarize_error()* function below first prints a one-line summary of a model’s performance then creates a plot of MAE per forecast lead time.

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

We are now ready to start exploring the performance of naive forecasting methods.

In this section, we will explore naive forecast methods that use all data in the training dataset, not constrained to the chunk for which we are making a prediction.

We will look at two approaches:

- Forecast Average Value per Series
- Forecast Average Value for Hour-of-Day per Series

The first step is to implement a general function for making a forecast for each chunk.

The function takes the training dataset and the input columns (chunk id, position in chunk, and hour) for the test set and returns forecasts for all chunks with the expected 3D format of *[chunk][variable][time]*.

The function enumerates the chunks in the forecast, then enumerates the 39 target columns, calling another new function named *forecast_variable()* in order to make a prediction for each lead time for a given target variable.

The complete function is listed below.

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

We can now implement a version of the *forecast_variable()* that calculates the mean for a given series and forecasts that mean for each lead time.

First, we must collect all observations in the target column across all chunks, then calculate the average of the observations while also ignoring the NaN values. The *nanmean()* NumPy function will calculate the mean of an array and ignore *NaN* values.

The *forecast_variable()* function below implements this behavior.

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # collect obs from all chunks
    all_obs = list()
    for chunk in train_chunks:
        all_obs.extend(chunk[:, col_ix])
    # return the average, ignoring nan
    value = nanmean(all_obs)
    return [value for _ in lead_times]
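As a small illustration of the collect-then-average step, the sketch below builds two made-up chunks with a single target column and computes the global mean while ignoring the *NaN* entry. The toy values and the use of *concatenate()* are assumptions for illustration only.

```python
from numpy import array, nan, concatenate, nanmean

# two made-up chunks: chunk id, position in chunk, hour, one target column
chunk_a = array([[1, 0, 6, 0.2], [1, 1, 7, nan]])
chunk_b = array([[2, 0, 6, 0.4], [2, 1, 7, 0.6]])
toy_train_chunks = [chunk_a, chunk_b]

# the first target lives in column 3, after the three input columns
col_ix = 3
# stack the target column from every chunk into one 1D array
all_obs = concatenate([chunk[:, col_ix] for chunk in toy_train_chunks])
# average the three valid observations, skipping the NaN
global_mean = nanmean(all_obs)
print(global_mean)
```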

We now have everything we need.

The complete example of forecasting the global mean for each series across all forecast lead times is listed below.

# forecast global mean
from numpy import loadtxt
from numpy import isnan
from numpy import unique
from numpy import array
from numpy import nanmean
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # collect obs from all chunks
    all_obs = list()
    for chunk in train_chunks:
        all_obs.extend(chunk[:, col_ix])
    # return the average, ignoring nan
    value = nanmean(all_obs)
    return [value for _ in lead_times]

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # select the actual and predicted values for this chunk
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('Global Mean', total_mae, times_mae)

Running the example first prints the overall MAE of 0.634, followed by the MAE scores for each forecast lead time.

Global Mean: [0.634 MAE] +1 0.635, +2 0.629, +3 0.638, +4 0.650, +5 0.649, +10 0.635, +17 0.634, +24 0.641, +48 0.613, +72 0.618

A line plot is created showing the MAE scores for each forecast lead time from +1 hour to +72 hours.

We cannot see any obvious relationship between forecast lead time and forecast error, as we might expect with a more skillful model.

We can update the example to forecast the global median instead of the mean.

The median may make more sense than the mean as a measure of central tendency for this data, given the non-Gaussian distribution the data seems to show.

NumPy provides the *nanmedian()* function that we can use in place of *nanmean()* in the *forecast_variable()* function.
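To see why the choice matters on skewed data, the short sketch below compares the two functions on a made-up sample with one large outlier and one missing value. The numbers are illustrative only and are not taken from the air quality dataset.

```python
from numpy import array, nan, nanmean, nanmedian

# made-up, right-skewed sample with one missing value
obs = array([0.1, 0.2, 0.2, 0.3, nan, 5.0])
# the single large value pulls the mean well above most observations
mean_value = nanmean(obs)
# the median stays with the bulk of the data
median_value = nanmedian(obs)
print('mean=%.3f median=%.3f' % (mean_value, median_value))
```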

The complete updated example is listed below.

# forecast global median
from numpy import loadtxt
from numpy import isnan
from numpy import unique
from numpy import array
from numpy import nanmedian
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # collect obs from all chunks
    all_obs = list()
    for chunk in train_chunks:
        all_obs.extend(chunk[:, col_ix])
    # return the median, ignoring nan
    value = nanmedian(all_obs)
    return [value for _ in lead_times]

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # select the actual and predicted values for this chunk
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('Global Median', total_mae, times_mae)

Running the example shows a drop in MAE from 0.634 to 0.598, suggesting that using the median as the central tendency may indeed be a better baseline strategy.

Global Median: [0.598 MAE] +1 0.601, +2 0.594, +3 0.600, +4 0.611, +5 0.615, +10 0.594, +17 0.592, +24 0.602, +48 0.585, +72 0.580

A line plot of MAE per lead time is also created.

We can update the naive model for calculating a central tendency by series to only include rows that have the same hour of day as the forecast lead time.

For example, if the +1 lead time has the hour 6 (e.g. 0600 or 6AM), then we can find all other rows in the training dataset across all chunks for that hour and calculate the median value for a given target variable from those rows.

We record the hour of day on the test dataset and make it available to the model when making forecasts. One wrinkle is that in some cases the test dataset did not have a record for a given lead time and one had to be invented with *NaN* values, including a *NaN* value for the hour. In these cases, no forecast is required so we will skip them and forecast a *NaN* value.

The *forecast_variable()* function below implements this behavior, returning forecasts for each lead time for a given variable.

It is not very efficient, and it might be a lot more efficient to pre-calculate the median values for each hour for each variable first and then forecast using a lookup table. Efficiency is not a concern at this point as we are looking for a baseline of model performance.

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    forecast = list()
    # convert target number into column number
    col_ix = 3 + target_ix
    # enumerate lead times
    for i in range(len(lead_times)):
        # get the hour for this forecast lead time
        hour = chunk_test[i, 2]
        # check for no test data
        if isnan(hour):
            forecast.append(nan)
            continue
        # get all rows in training for this hour
        all_rows = list()
        for rows in train_chunks:
            all_rows.extend(rows[rows[:, 2] == hour])
        # calculate the central tendency for target
        all_rows = array(all_rows)
        value = nanmedian(all_rows[:, col_ix])
        forecast.append(value)
    return forecast
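The lookup-table idea mentioned above could look something like the following sketch. It is not part of the tutorial's framework: the function names *build_hour_medians()* and *forecast_variable_lookup()* are hypothetical, and it assumes the same column layout as the rest of the section (hour in column 2, targets starting at column 3).

```python
from numpy import array, nan, isnan, unique, vstack, nanmedian

# hypothetical helper: pre-calculate the median of every target for every hour
def build_hour_medians(train_chunks, n_targets, hour_ix=2, first_target_ix=3):
    # stack all chunks into one 2D array of rows
    all_rows = vstack(train_chunks)
    hours = all_rows[:, hour_ix]
    table = dict()
    for hour in unique(hours[~isnan(hours)]):
        rows = all_rows[hours == hour]
        targets = rows[:, first_target_ix:first_target_ix + n_targets]
        # one median per target column, ignoring NaN values
        table[hour] = nanmedian(targets, axis=0)
    return table

# hypothetical helper: forecast one variable using the pre-calculated table
def forecast_variable_lookup(table, chunk_test, lead_times, target_ix, hour_ix=2):
    forecast = list()
    for i in range(len(lead_times)):
        hour = chunk_test[i, hour_ix]
        # no test record, or an hour never seen in training: forecast NaN
        if isnan(hour) or hour not in table:
            forecast.append(nan)
            continue
        forecast.append(table[hour][target_ix])
    return forecast

# made-up data: two chunks, two targets, hours 6 and 7
toy_chunks = [array([[1, 0, 6, 1.0, 10.0], [1, 1, 7, 2.0, nan]]),
              array([[2, 0, 6, 3.0, 30.0], [2, 1, 7, nan, 40.0]])]
toy_table = build_hour_medians(toy_chunks, n_targets=2)
toy_test = array([[1, 0, 6], [1, 1, nan], [1, 2, 7]])
toy_forecast = forecast_variable_lookup(toy_table, toy_test, [1, 2, 3], target_ix=1)
print(toy_forecast)
```

The table is built once and each forecast becomes a dictionary lookup, so the training data is not re-scanned for every chunk, variable, and lead time. Note one deliberate difference: an hour that never occurs in the training data yields a *NaN* forecast here.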

The complete example of forecasting the global median value by hour of the day across all training chunks is listed below.

# forecast global median by hour of day
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import unique
from numpy import array
from numpy import nanmedian
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    forecast = list()
    # convert target number into column number
    col_ix = 3 + target_ix
    # enumerate lead times
    for i in range(len(lead_times)):
        # get the hour for this forecast lead time
        hour = chunk_test[i, 2]
        # check for no test data
        if isnan(hour):
            forecast.append(nan)
            continue
        # get all rows in training for this hour
        all_rows = list()
        for rows in train_chunks:
            all_rows.extend(rows[rows[:, 2] == hour])
        # calculate the central tendency for target
        all_rows = array(all_rows)
        value = nanmedian(all_rows[:, col_ix])
        forecast.append(value)
    return forecast

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # select the actual and predicted values for this chunk
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('Global Median by Hour', total_mae, times_mae)

Running the example summarizes the performance of the model with a MAE of 0.567, which is an improvement over the global median for each series.

Global Median by Hour: [0.567 MAE] +1 0.573, +2 0.565, +3 0.567, +4 0.579, +5 0.589, +10 0.559, +17 0.565, +24 0.567, +48 0.558, +72 0.551

A line plot of the MAE by forecast lead time is also created, showing that +72 had the lowest overall forecast error. This is interesting and suggests that hour-based information may be useful in more sophisticated models.

It is possible that using information specific to the chunk may have more predictive power than using global information from the entire training dataset.

We can explore this with three local or chunk-specific naive forecasting methods; they are:

- Forecast Last Observation per Series
- Forecast Average Value per Series
- Forecast Average Value for Hour-of-Day per Series

The last two are chunk-specific versions of the global strategies that were evaluated in the previous section.

Forecasting the last non-NaN observation for a chunk is perhaps the simplest model, classically called the persistence model or the naive model.

The *forecast_variable()* function below implements this forecast strategy.

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # extract the history for the series
    history = chunk_train[:, col_ix]
    # persist a nan if we do not find any valid data
    persisted = nan
    # enumerate history in reverse order looking for the first non-nan
    for value in reversed(history):
        if not isnan(value):
            persisted = value
            break
    # persist the same value for all lead times
    forecast = [persisted for _ in range(len(lead_times))]
    return forecast
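The search for the last valid observation can also be expressed with NumPy indexing. The helper below is a minimal sketch with a hypothetical name, *last_valid()*, shown on made-up series; it is an alternative to the reversed loop above, not a replacement required by the framework.

```python
from numpy import array, nan, isnan, flatnonzero

# hypothetical helper: return the last non-NaN value in a 1D series, or NaN if none
def last_valid(history):
    # indices of all valid (non-NaN) observations, in time order
    valid = flatnonzero(~isnan(history))
    return history[valid[-1]] if len(valid) > 0 else nan

# made-up series: trailing NaN values are skipped
print(last_valid(array([1.0, 2.0, nan, nan])))
# made-up series with no valid data at all
print(last_valid(array([nan, nan])))
```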

The complete example for evaluating the persistence forecast strategy on the test set is listed below.

# persist last observation
from numpy import loadtxt
from numpy import nan
from numpy import isnan
from numpy import unique
from numpy import array
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # extract the history for the series
    history = chunk_train[:, col_ix]
    # persist a nan if we do not find any valid data
    persisted = nan
    # enumerate history in reverse order looking for the first non-nan
    for value in reversed(history):
        if not isnan(value):
            persisted = value
            break
    # persist the same value for all lead times
    forecast = [persisted for _ in range(len(lead_times))]
    return forecast

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # select the actual and predicted values for this chunk
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('Persistence', total_mae, times_mae)

Running the example prints the overall MAE and the MAE per forecast lead time.

We can see that the persistence forecast appears to out-perform all of the global strategies evaluated in the previous section.

This adds some support to the reasonable assumption that chunk-specific information is important in modeling this problem.

Persistence: [0.520 MAE] +1 0.217, +2 0.330, +3 0.400, +4 0.471, +5 0.515, +10 0.648, +17 0.656, +24 0.589, +48 0.671, +72 0.708

A line plot of MAE per forecast lead time is created.

Importantly, this plot shows the expected behavior of increasing error with the increase in forecast lead time. Namely, the further one predicts into the future, the more challenging it is, and in turn, the more error one would be expected to make.
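One rough way to put a number on this visual impression is to correlate lead time with the per-lead-time MAE. The sketch below uses the scores printed above for the persistence model and, for contrast, those printed earlier for the global median by hour. Pearson correlation is an assumption chosen here for simplicity; it is not part of the tutorial's evaluation.

```python
from numpy import array, corrcoef

# forecast lead times used throughout the tutorial
lead_times = array([1, 2, 3, 4, 5, 10, 17, 24, 48, 72])
# per-lead-time MAE as printed for the persistence model
persistence_mae = array([0.217, 0.330, 0.400, 0.471, 0.515,
                         0.648, 0.656, 0.589, 0.671, 0.708])
# per-lead-time MAE as printed for the global median by hour
global_hour_mae = array([0.573, 0.565, 0.567, 0.579, 0.589,
                         0.559, 0.565, 0.567, 0.558, 0.551])

# Pearson correlation between lead time and error for each strategy
r_persistence = corrcoef(lead_times, persistence_mae)[0, 1]
r_global_hour = corrcoef(lead_times, global_hour_mae)[0, 1]
print('persistence r=%.3f, global median by hour r=%.3f' % (r_persistence, r_global_hour))
```

A clearly positive value for persistence matches the rising curve in the plot, whereas the global strategy shows no such increase.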

Instead of persisting the last observation for the series, we can persist the average value for the series using only the data in the chunk.

Specifically, we can calculate the median of the series, which as we found in the previous section seems to lead to better performance.

The *forecast_variable()* function below implements this local strategy.

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # extract the history for the series
    history = chunk_train[:, col_ix]
    # calculate the central tendency
    value = nanmedian(history)
    # persist the same value for all lead times
    forecast = [value for _ in range(len(lead_times))]
    return forecast

The complete example is listed below.

# forecast local median
from numpy import loadtxt
from numpy import isnan
from numpy import unique
from numpy import array
from numpy import nanmedian
from matplotlib import pyplot

# split the dataset by 'chunkID', return a list of chunks
def to_chunks(values, chunk_ix=0):
    chunks = list()
    # get the unique chunk ids
    chunk_ids = unique(values[:, chunk_ix])
    # group rows by chunk id
    for chunk_id in chunk_ids:
        selection = values[:, chunk_ix] == chunk_id
        chunks.append(values[selection, :])
    return chunks

# return a list of relative forecast lead times
def get_lead_times():
    return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    # convert target number into column number
    col_ix = 3 + target_ix
    # extract the history for the series
    history = chunk_train[:, col_ix]
    # calculate the central tendency
    value = nanmedian(history)
    # persist the same value for all lead times
    forecast = [value for _ in range(len(lead_times))]
    return forecast

# forecast for each chunk, returns [chunk][variable][time]
def forecast_chunks(train_chunks, test_input):
    lead_times = get_lead_times()
    predictions = list()
    # enumerate chunks to forecast
    for i in range(len(train_chunks)):
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(39):
            yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# convert the test dataset in chunks to [chunk][variable][time] format
def prepare_test_forecasts(test_chunks):
    predictions = list()
    # enumerate chunks to forecast
    for rows in test_chunks:
        # enumerate targets for chunk
        chunk_predictions = list()
        for j in range(3, rows.shape[1]):
            yhat = rows[:, j]
            chunk_predictions.append(yhat)
        chunk_predictions = array(chunk_predictions)
        predictions.append(chunk_predictions)
    return array(predictions)

# calculate the error between an actual and predicted value
def calculate_error(actual, predicted):
    # give the full actual value if predicted is nan
    if isnan(predicted):
        return abs(actual)
    # calculate abs difference
    return abs(actual - predicted)

# evaluate a forecast in the format [chunk][variable][time]
def evaluate_forecasts(predictions, testset):
    lead_times = get_lead_times()
    total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
    total_c, times_c = 0, [0 for _ in range(len(lead_times))]
    # enumerate test chunks
    for i in range(len(testset)):
        # select the actual and predicted values for this chunk
        actual = testset[i]
        predicted = predictions[i]
        # enumerate target variables
        for j in range(predicted.shape[0]):
            # enumerate lead times
            for k in range(len(lead_times)):
                # skip if actual is nan
                if isnan(actual[j, k]):
                    continue
                # calculate error
                error = calculate_error(actual[j, k], predicted[j, k])
                # update statistics
                total_mae += error
                times_mae[k] += error
                total_c += 1
                times_c[k] += 1
    # normalize summed absolute errors
    total_mae /= total_c
    times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
    return total_mae, times_mae

# summarize scores
def summarize_error(name, total_mae, times_mae):
    # print summary
    lead_times = get_lead_times()
    formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
    s_scores = ', '.join(formatted)
    print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
    # plot summary
    pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
    pyplot.show()

# load dataset
train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
# group data by chunks
train_chunks = to_chunks(train)
test_chunks = to_chunks(test)
# forecast
test_input = [rows[:, :3] for rows in test_chunks]
forecast = forecast_chunks(train_chunks, test_input)
# evaluate forecast
actual = prepare_test_forecasts(test_chunks)
total_mae, times_mae = evaluate_forecasts(forecast, actual)
# summarize forecast
summarize_error('Local Median', total_mae, times_mae)

Running the example summarizes the performance of this naive strategy, showing a MAE of about 0.568, which is worse than the above persistence strategy.

Local Median: [0.568 MAE] +1 0.535, +2 0.542, +3 0.550, +4 0.568, +5 0.568, +10 0.562, +17 0.567, +24 0.605, +48 0.590, +72 0.593

A line plot of MAE per forecast lead time is also created showing the familiar increasing curve of error per lead time.
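With several baselines evaluated, it can help to line up the overall scores side by side. The sketch below simply ranks the overall MAE values printed so far in this section; the dictionary is typed in by hand from those outputs rather than computed.

```python
# overall MAE values as printed by the examples above
reported_mae = {
    'Global Mean': 0.634,
    'Global Median': 0.598,
    'Global Median by Hour': 0.567,
    'Persistence': 0.520,
    'Local Median': 0.568,
}

# rank strategies from lowest (best) to highest (worst) overall MAE
ranked = sorted(reported_mae.items(), key=lambda item: item[1])
for name, score in ranked:
    print('%-22s %.3f' % (name, score))
```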

Finally, we can dial in the persistence strategy by using the average value per series for the specific hour of day at each forecast lead time.

This approach was found to be effective for the global strategy. It may also be effective using only the data from the chunk, although at the risk of using a much smaller data sample.

The *forecast_variable()* function below implements this strategy, first finding all rows with the hour of the forecast lead time, then calculating the median of those rows for the given target variable.

# forecast all lead times for one variable
def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
    forecast = list()
    # convert target number into column number
    col_ix = 3 + target_ix
    # enumerate lead times
    for i in range(len(lead_times)):
        # get the hour for this forecast lead time
        hour = chunk_test[i, 2]
        # check for no test data
        if isnan(hour):
            forecast.append(nan)
            continue
        # select rows in chunk with this hour
        selected = chunk_train[chunk_train[:, 2] == hour]
        # calculate the central tendency for target
        value = nanmedian(selected[:, col_ix])
        forecast.append(value)
    return forecast
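To get a feel for how small the per-hour sample within a single chunk can be, one could count the valid observations available for each hour. The helper below is a hypothetical diagnostic, *count_obs_per_hour()*, shown on made-up data; it assumes the same column layout as the function above.

```python
from numpy import array, nan, isnan, unique

# hypothetical diagnostic: count non-NaN observations per hour for one target in one chunk
def count_obs_per_hour(chunk_train, target_ix, hour_ix=2, first_target_ix=3):
    col_ix = first_target_ix + target_ix
    hours = chunk_train[:, hour_ix]
    counts = dict()
    for hour in unique(hours[~isnan(hours)]):
        values = chunk_train[hours == hour, col_ix]
        # only non-NaN values contribute to the median
        counts[int(hour)] = int((~isnan(values)).sum())
    return counts

# made-up chunk: chunk id, position in chunk, hour, one target column
toy_chunk = array([[1, 0, 6, 0.1], [1, 1, 7, nan], [1, 2, 6, 0.3]])
toy_counts = count_obs_per_hour(toy_chunk, target_ix=0)
print(toy_counts)
```

An hour with a count of zero would lead *nanmedian()* to return *NaN* for that lead time, which the evaluation then penalizes with the full magnitude of the actual value.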

The complete example is listed below.

    # forecast local median per hour of day
    from numpy import loadtxt
    from numpy import nan
    from numpy import isnan
    from numpy import unique
    from numpy import array
    from numpy import nanmedian
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a list of chunks
    def to_chunks(values, chunk_ix=0):
        chunks = list()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks.append(values[selection, :])
        return chunks

    # return a list of relative forecast lead times
    def get_lead_times():
        return [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]

    # forecast all lead times for one variable
    def forecast_variable(train_chunks, chunk_train, chunk_test, lead_times, target_ix):
        forecast = list()
        # convert target number into column number
        col_ix = 3 + target_ix
        # enumerate lead times
        for i in range(len(lead_times)):
            # get the hour for this forecast lead time
            hour = chunk_test[i, 2]
            # check for no test data
            if isnan(hour):
                forecast.append(nan)
                continue
            # select rows in chunk with this hour
            selected = chunk_train[chunk_train[:,2]==hour]
            # calculate the central tendency for target
            value = nanmedian(selected[:, col_ix])
            forecast.append(value)
        return forecast

    # forecast for each chunk, returns [chunk][variable][time]
    def forecast_chunks(train_chunks, test_input):
        lead_times = get_lead_times()
        predictions = list()
        # enumerate chunks to forecast
        for i in range(len(train_chunks)):
            # enumerate targets for chunk
            chunk_predictions = list()
            for j in range(39):
                yhat = forecast_variable(train_chunks, train_chunks[i], test_input[i], lead_times, j)
                chunk_predictions.append(yhat)
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)

    # convert the test dataset in chunks to [chunk][variable][time] format
    def prepare_test_forecasts(test_chunks):
        predictions = list()
        # enumerate chunks to forecast
        for rows in test_chunks:
            # enumerate targets for chunk
            chunk_predictions = list()
            for j in range(3, rows.shape[1]):
                yhat = rows[:, j]
                chunk_predictions.append(yhat)
            chunk_predictions = array(chunk_predictions)
            predictions.append(chunk_predictions)
        return array(predictions)

    # calculate the error between an actual and predicted value
    def calculate_error(actual, predicted):
        # give the full actual value if predicted is nan
        if isnan(predicted):
            return abs(actual)
        # calculate abs difference
        return abs(actual - predicted)

    # evaluate a forecast in the format [chunk][variable][time]
    def evaluate_forecasts(predictions, testset):
        lead_times = get_lead_times()
        total_mae, times_mae = 0.0, [0.0 for _ in range(len(lead_times))]
        total_c, times_c = 0, [0 for _ in range(len(lead_times))]
        # enumerate test chunks (use the argument, not the global test_chunks)
        for i in range(len(testset)):
            # convert to forecasts
            actual = testset[i]
            predicted = predictions[i]
            # enumerate target variables
            for j in range(predicted.shape[0]):
                # enumerate lead times
                for k in range(len(lead_times)):
                    # skip if actual is nan
                    if isnan(actual[j, k]):
                        continue
                    # calculate error
                    error = calculate_error(actual[j, k], predicted[j, k])
                    # update statistics
                    total_mae += error
                    times_mae[k] += error
                    total_c += 1
                    times_c[k] += 1
        # normalize summed absolute errors
        total_mae /= total_c
        times_mae = [times_mae[i]/times_c[i] for i in range(len(times_mae))]
        return total_mae, times_mae

    # summarize scores
    def summarize_error(name, total_mae, times_mae):
        # print summary
        lead_times = get_lead_times()
        formatted = ['+%d %.3f' % (lead_times[i], times_mae[i]) for i in range(len(lead_times))]
        s_scores = ', '.join(formatted)
        print('%s: [%.3f MAE] %s' % (name, total_mae, s_scores))
        # plot summary
        pyplot.plot([str(x) for x in lead_times], times_mae, marker='.')
        pyplot.show()

    # load dataset
    train = loadtxt('AirQualityPrediction/naive_train.csv', delimiter=',')
    test = loadtxt('AirQualityPrediction/naive_test.csv', delimiter=',')
    # group data by chunks
    train_chunks = to_chunks(train)
    test_chunks = to_chunks(test)
    # forecast
    test_input = [rows[:, :3] for rows in test_chunks]
    forecast = forecast_chunks(train_chunks, test_input)
    # evaluate forecast
    actual = prepare_test_forecasts(test_chunks)
    total_mae, times_mae = evaluate_forecasts(forecast, actual)
    # summarize forecast
    summarize_error('Local Median by Hour', total_mae, times_mae)

Running the example prints the overall MAE of about 0.574, which is worse than the global variation of the same strategy.

As suspected, this is likely due to the small sample size, that is at most five rows of training data contributing to each forecast.

Local Median by Hour: [0.574 MAE] +1 0.561, +2 0.559, +3 0.568, +4 0.577, +5 0.577, +10 0.556, +17 0.551, +24 0.588, +48 0.601, +72 0.608

A line plot of MAE per forecast lead time is also created showing the familiar increasing curve of error per lead time.
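To make the small-sample point concrete, the sketch below builds a hypothetical five-day training chunk and counts how many rows share one hour of day. The two-column layout (hour in column 0, a single target in column 1), the random values, and the pattern of missing observations are all assumptions for illustration; they are not the tutorial's data.

```python
from numpy import arange, column_stack, isnan, nan, nanmedian
from numpy.random import default_rng

# hypothetical chunk: 5 days of hourly rows, hour-of-day in column 0
n_days = 5
hours = arange(n_days * 24) % 24
rng = default_rng(1)
target = rng.random(n_days * 24)
# knock out every 7th observation to mimic patchy data
target[::7] = nan
chunk = column_stack((hours, target))

# rows available to a local median forecast for hour 9
selected = chunk[chunk[:, 0] == 9]
n_rows = selected.shape[0]
n_observed = int((~isnan(selected[:, 1])).sum())
print('rows for hour 9: %d, observed: %d' % (n_rows, n_observed))
print('local median for hour 9: %.3f' % nanmedian(selected[:, 1]))
```

With only a handful of values behind each median, a single missing or unusual observation can move the forecast noticeably.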

We can summarize the performance of all of the naive forecast methods reviewed in this tutorial.

The example below lists each method using a shorthand of ‘*g*‘ and ‘*l*‘ for global and local and ‘*h*‘ for the hour-of-day variations. The example creates a bar chart so that we can compare the naive strategies based on their relative performance.

    # summary of results
    from matplotlib import pyplot
    # results
    results = {
        'g-mean':0.634,
        'g-med':0.598,
        'g-med-h':0.567,
        'l-per':0.520,
        'l-med':0.568,
        'l-med-h':0.574}
    # plot
    pyplot.bar(results.keys(), results.values())
    locs, labels = pyplot.xticks()
    pyplot.setp(labels, rotation=30)
    pyplot.show()

Running the example creates a bar chart comparing the MAE for each of the six strategies.

We can see that the persistence strategy was better than all of the other methods and that the second best strategy was the global median for each series that used the hour of day.

Models evaluated on this train/test separation of the dataset must achieve an overall MAE lower than 0.520 in order to be considered skillful.
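The ranking and the skill threshold can also be read off programmatically. The sketch below reuses the MAE scores reported above; the helper name `is_skillful()` is hypothetical and introduced only for this illustration.

```python
# MAE scores reported in this tutorial
results = {
    'g-mean': 0.634, 'g-med': 0.598, 'g-med-h': 0.567,
    'l-per': 0.520, 'l-med': 0.568, 'l-med-h': 0.574}

# rank strategies from best (lowest MAE) to worst
ranked = sorted(results.items(), key=lambda kv: kv[1])
baseline_name, baseline_mae = ranked[0]

# hypothetical helper: a model has skill only if it beats the best naive score
def is_skillful(mae, baseline=baseline_mae):
    return mae < baseline

for name, mae in ranked:
    print('%s: %.3f' % (name, mae))
print('baseline to beat: %s (%.3f)' % (baseline_name, baseline_mae))
```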

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Cross-Site Naive Forecast**. Develop a naive forecast strategy that uses information about each variable across sites, e.g. different target variables for the same variable at different sites.
- **Hybrid Approach**. Develop a hybrid forecast strategy that combines elements of two or more of the naive forecast strategies at different lead times described in this tutorial.
- **Ensemble of Naive Methods**. Develop an ensemble forecast strategy that creates a linear combination of two or more forecast strategies described in this tutorial.
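As a starting point for the ensemble idea, here is a minimal sketch of a weighted linear combination of two forecasts. The function name `combine_forecasts()`, the weight `w1`, and the fallback rule for NaN values are assumptions made for this illustration rather than part of the tutorial's test harness.

```python
from numpy import allclose, array, isnan, nan, where

# hypothetical helper: blend two forecasts of equal shape
def combine_forecasts(f1, f2, w1=0.5):
    f1, f2 = array(f1, dtype=float), array(f2, dtype=float)
    # weighted linear combination
    combined = w1 * f1 + (1.0 - w1) * f2
    # fall back to whichever forecast is available when the other is NaN
    combined = where(isnan(f1), f2, combined)
    combined = where(isnan(f2), f1, combined)
    return combined

# toy forecasts for three lead times (made-up values)
persistence = [0.4, 0.4, nan]
median_by_hour = [0.6, 0.2, 0.5]
blended = combine_forecasts(persistence, median_by_hour, w1=0.5)
print(blended)
```

The weight could be chosen per lead time, which would also cover the hybrid approach listed above.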

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem
- How to Make Baseline Predictions for Time Series Forecasting with Python

- EMC Data Science Global Hackathon (Air Quality Prediction)
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?

In this tutorial, you discovered how to develop naive forecasting methods for the multistep multivariate air pollution time series forecasting problem.

Specifically, you learned:

- How to develop a test harness for evaluating forecasting strategies for the air pollution dataset.
- How to develop global naive forecast strategies that use data from the entire training dataset.
- How to develop local naive forecast strategies that use data from the specific interval that is being forecasted.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Baseline Forecasts for Multi-Site Multivariate Air Pollution Time Series Forecasting appeared first on Machine Learning Mastery.

The post How to Load, Visualize, and Explore a Complex Multivariate Multistep Time Series Forecasting Dataset appeared first on Machine Learning Mastery.

The EMC Data Science Global Hackathon dataset, or the ‘*Air Quality Prediction*‘ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

In this tutorial, you will discover and explore the Air Quality Prediction dataset that represents a challenging multivariate, multi-site, and multi-step time series forecasting problem.

After completing this tutorial, you will know:

- How to load and explore the chunk-structure of the dataset.
- How to explore and visualize the input and target variables for the dataset.
- How to use the new understanding to outline a suite of methods for framing the problem, preparing the data, and modeling the dataset.

Let’s get started.

This tutorial is divided into seven parts; they are:

- Problem Description
- Load Dataset
- Chunk Data Structure
- Input Variables
- Target Variables
- A Wrinkle With Target Variables
- Thoughts on Modeling

The EMC Data Science Global Hackathon dataset, or the ‘*Air Quality Prediction*‘ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.

Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the next three days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72 hour forecast period; they are:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
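Because the lead times are sparse, only 10 of the 72 forecast hours are actually scored. A minimal sketch of picking those hours out of a 72-hour series follows; the toy series is a made-up stand-in for one target variable.

```python
# forecast lead times in hours, relative to the end of the observed period
lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
# convert to zero-offset indices into a 72-hour forecast window
lead_ix = [t - 1 for t in lead_times]

# toy series standing in for 72 hourly values of one target (made-up values)
forecast_hours = 3 * 24
series = list(range(100, 100 + forecast_hours))
selected = [series[i] for i in lead_ix]
print(selected)
print('%d of %d hours are forecast' % (len(selected), forecast_hours))
```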

In this tutorial, we will explore this dataset in order to better understand the nature of the forecast problem and suggest approaches for how it may be modeled.

The first step is to download the dataset and load it into memory.

Download the entire dataset, e.g. “*Download All*”, to your workstation and unzip the archive in your current working directory with the folder named ‘*AirQualityPrediction*‘.

You should have five files in the *AirQualityPrediction/* folder; they are:

- SiteLocations.csv
- SiteLocations_with_more_sites.csv
- SubmissionZerosExceptNAs.csv
- TrainingData.csv
- sample_code.r

Our focus will be the ‘*TrainingData.csv*‘ that contains the training dataset, specifically data in chunks where each chunk is eight contiguous days of observations and target variables.

The test dataset (remaining three days of each chunk) is not available for this dataset at the time of writing.

Open the ‘*TrainingData.csv*‘ file and review the contents. The unzipped data file is relatively small (21 megabytes) and will easily fit into RAM.

Reviewing the contents of the file, we can see that the data file contains a header row.

We can also see that missing data is marked with the ‘*NA*‘ value, which Pandas will automatically convert to *NumPy.NaN*.

We can see that the ‘*weekday*‘ column contains the day as a string, whereas all other data is numeric.
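The small sketch below shows this parsing behavior on a two-row, in-memory sample. The sample text is an abbreviated, hypothetical excerpt using a few of the real column names, so that no file is needed.

```python
from io import StringIO
from pandas import read_csv

# abbreviated, hypothetical excerpt of the file with a header row
sample = StringIO(
    '"rowID","weekday","hour","Solar.radiation_64","Ambient.Max.Temperature_14"\n'
    '1,"Saturday",21,0.01,NA\n'
    '2,"Saturday",22,0.01,NA\n')
df = read_csv(sample, header=0)

# 'NA' strings are parsed as missing values (NaN)
all_missing = bool(df['Ambient.Max.Temperature_14'].isna().all())
# the weekday column stays as text
is_string = isinstance(df['weekday'].iloc[0], str)
print(df.shape, all_missing, is_string)
```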

Below are the first few lines of the data file for reference.

    "rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14","Ambient.Max.Temperature_22","Ambient.Max.Temperature_50","Ambient.Max.Temperature_52","Ambient.Max.Temperature_57","Ambient.Max.Temperature_76","Ambient.Max.Temperature_2001","Ambient.Max.Temperature_3301","Ambient.Max.Temperature_6005","Ambient.Min.Temperature_14","Ambient.Min.Temperature_22","Ambient.Min.Temperature_50","Ambient.Min.Temperature_52","Ambient.Min.Temperature_57","Ambient.Min.Temperature_76","Ambient.Min.Temperature_2001","Ambient.Min.Temperature_3301","Ambient.Min.Temperature_6005","Sample.Baro.Pressure_14","Sample.Baro.Pressure_22","Sample.Baro.Pressure_50","Sample.Baro.Pressure_52","Sample.Baro.Pressure_57","Sample.Baro.Pressure_76","Sample.Baro.Pressure_2001","Sample.Baro.Pressure_3301","Sample.Baro.Pressure_6005","Sample.Max.Baro.Pressure_14","Sample.Max.Baro.Pressure_22","Sample.Max.Baro.Pressure_50","Sample.Max.Baro.Pressure_52","Sample.Max.Baro.Pressure_57","Sample.Max.Baro.Pressure_76","Sample.Max.Baro.Pressure_2001","Sample.Max.Baro.Pressure_3301","Sample.Max.Baro.Pressure_6005","Sample.Min.Baro.Pressure_14","Sample.Min.Baro.Pressure_22","Sample.Min.Baro.Pressure_50","Sample.Min.Baro.Pressure_52","Sample.Min.Baro.Pressure_57","Sample.Min.Baro.Pressure_76","Sample.Min.Baro.Pressure_2001","Sample.Min.Baro.Pressure_3301","Sample.Min.Baro.Pressure_6005","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"
    1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,6.1816228132982,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.56815355612325,0.690015329704154,NA,NA,NA,NA,NA,NA,2.84349016287551,0.0920223353681394,1.69321097077376,0.368089341472558,0.184044670736279,0.368089341472558,0.276067006104418,0.892616653070952,1.74842437199465,NA,NA,5.1306307034019,1.34160578423204,2.13879182993514,3.01375212399952,NA,5.67928016629218,NA
    2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.47583334194495,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.99138023331659,NA,5.56815355612325,0.923259948195698,NA,NA,NA,NA,NA,NA,3.1011527019063,0.0920223353681394,1.94167127626774,0.368089341472558,0.184044670736279,0.368089341472558,0.368089341472558,1.73922213845783,2.14412041407765,NA,NA,5.1306307034019,1.19577906855465,2.72209869264472,3.88871241806389,NA,7.42675098668978,NA
    3,1,3,10,"Saturday",23,0.01,247,227,0.5,1.5,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.92192983362627,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.7524146053186,NA,5.56815355612325,0.680296803933673,NA,NA,NA,NA,NA,NA,3.06434376775904,0.0920223353681394,2.52141198908702,0.460111676840697,0.184044670736279,0.368089341472558,0.368089341472558,1.7852333061419,1.93246904273093,NA,NA,5.13639545700122,1.40965825154816,3.11096993445111,3.88871241806389,NA,7.68373198968942,NA
    4,1,4,10,"Sunday",0,0.01,219,218,0.2,1.2,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,5.09824561921501,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.6776192223642,0.612267123540305,NA,NA,NA,NA,NA,NA,3.21157950434806,0.184044670736279,2.374176252498,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.86805340797323,2.08890701285676,NA,NA,5.21710200739181,1.47771071886428,2.04157401948354,3.20818774490271,NA,4.83124285639335,NA
    ...

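We can load the data file into memory using the Pandas *read_csv()* function and specify the header row on line 0.

    # load dataset
    from pandas import read_csv
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)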

We can also get a quick idea of how much missing data there is in the dataset. We can do that by first trimming the first few columns to remove the string weekday data and convert the remaining columns to floating point values.

    # trim and transform to floats
    values = dataset.values
    data = values[:, 6:].astype('float32')

We can then calculate the total number of missing observations and the percentage of values that are missing.

    # summarize amount of missing data
    # count the NaN entries directly; subtracting from data.size would count observed values
    total_missing = count_nonzero(isnan(data))
    percent_missing = total_missing / data.size * 100
    print('Total Missing: %d/%d (%.1f%%)' % (total_missing, data.size, percent_missing))

The complete example is listed below.

    # load dataset
    from numpy import isnan
    from numpy import count_nonzero
    from pandas import read_csv
    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # summarize
    print(dataset.shape)
    # trim and transform to floats
    values = dataset.values
    data = values[:, 6:].astype('float32')
    # summarize amount of missing data
    # count the NaN entries directly; subtracting from data.size would count observed values
    total_missing = count_nonzero(isnan(data))
    percent_missing = total_missing / data.size * 100
    print('Total Missing: %d/%d (%.1f%%)' % (total_missing, data.size, percent_missing))

Running the example first prints the shape of the loaded dataset.

We can see that we have about 38,000 rows and 95 columns. We know these numbers are misleading given that the data is in fact divided into chunks and the columns are divided into the same observations at different sites.

We can also see that a little over 40% of the data is missing. This is a lot. The data is very patchy and we are going to have to understand this well before modeling the problem.

    (37821, 95)
    Total Missing: 1443977/3366069 (42.9%)
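A single overall percentage hides which columns are patchy. As a minimal sketch, the per-column missing rate can be computed in the same way; the small array below is made-up data standing in for the trimmed float array.

```python
from numpy import allclose, array, isnan, nan

# made-up stand-in for the trimmed float array (rows x variables)
data = array([
    [0.1, nan, 1.0],
    [0.2, nan, nan],
    [0.3, nan, 3.0],
    [0.4, 5.0, nan]], dtype='float32')

# fraction of NaN values in each column, as a percentage
col_missing = isnan(data).mean(axis=0) * 100
for ix, pct in enumerate(col_missing):
    print('column %d: %.1f%% missing' % (ix, pct))
```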


A good starting point is to look at the data in terms of the chunks.

We can group data by the ‘chunkID’ variable (column index 1).

If each chunk is eight days and the observations are hourly, then we would expect (8 * 24) or 192 rows of data per chunk.

If there are 37,821 rows of data, then there must be chunks with more or less than 192 hours as 37,821/192 is about 196.9 chunks.

Let’s first split the data into chunks. We can first get a list of the unique chunk identifiers.

chunk_ids = unique(values[:, 1])

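We can then group the rows by chunk identifier with a function named *to_chunks()* that takes a NumPy array of the loaded data and returns a dictionary of *chunk_id* to rows for the chunk; it is defined in the complete example below.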

The ‘*position_within_chunk*‘ in the data file indicates the order of a row within a chunk. At this stage, we assume that rows are already ordered and do not need to be sorted. A skim of the raw data file seems to confirm this assumption.
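Rather than relying on a skim of the file, the ordering assumption could be checked directly. The sketch below is one way to do it on a toy array; the function name `chunk_is_ordered()` is hypothetical, and the three-column layout (rowID, chunkID, position) is a simplification of the real file.

```python
from numpy import array, diff, unique

# hypothetical helper: report whether positions strictly increase within each chunk
def chunk_is_ordered(values, chunk_ix=1, pos_ix=2):
    ordered = dict()
    for chunk_id in unique(values[:, chunk_ix]):
        selection = values[:, chunk_ix] == chunk_id
        positions = values[selection, pos_ix].astype('int64')
        ordered[int(chunk_id)] = bool((diff(positions) > 0).all())
    return ordered

# toy rows: rowID, chunkID, position_within_chunk (chunk 2 is out of order)
values = array([
    [1, 1, 1], [2, 1, 2], [3, 1, 4],
    [4, 2, 3], [5, 2, 1]])
result = chunk_is_ordered(values)
print(result)
```

Any chunk reported as unordered could then be sorted by its position column before further processing.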

Once the data is sorted into chunks, we can calculate the number of rows in each chunk and have a look at the distribution, such as with a box and whisker plot.

    # plot distribution of chunk durations
    def plot_chunk_durations(chunks):
        # chunk durations in hours
        chunk_durations = [len(v) for k,v in chunks.items()]
        # boxplot
        pyplot.subplot(2, 1, 1)
        pyplot.boxplot(chunk_durations)
        # histogram
        pyplot.subplot(2, 1, 2)
        pyplot.hist(chunk_durations)
        # show plot
        pyplot.show()

The complete example that ties all of this together is listed below.

    # split data into chunks
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot distribution of chunk durations
    def plot_chunk_durations(chunks):
        # chunk durations in hours
        chunk_durations = [len(v) for k,v in chunks.items()]
        # boxplot
        pyplot.subplot(2, 1, 1)
        pyplot.boxplot(chunk_durations)
        # histogram
        pyplot.subplot(2, 1, 2)
        pyplot.hist(chunk_durations)
        # show plot
        pyplot.show()

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    print('Total Chunks: %d' % len(chunks))
    # plot chunk durations
    plot_chunk_durations(chunks)

Running the example first prints the number of chunks in the dataset.

We can see that there are 208, which suggests that indeed the number of hourly observations must vary across the chunks.

Total Chunks: 208

A box and whisker plot and a histogram plot of chunk durations is created. We can see that indeed the median is 192, meaning that most chunks have eight days of observations or close to it.

We can also see a long tail of durations down to about 25 rows. Although there are not many of these cases, we would expect them to be challenging to forecast given the lack of data.

The distribution also raises questions about how contiguous the observations within each chunk may be.

It may be helpful to get an idea of how contiguous (or not) the observations are within those chunks that do not have the full eight days of data.

One approach to considering this is to create a line plot for each discontiguous chunk and show the gaps in the observations.

We can do this on a single plot. Each chunk has a unique identifier, from 1 to 208, and we can use this as the value for the series and mark missing observations within the eight day interval via *NaN* values that will not appear on the plot.

Inverting this, we can assume that we have NaN values for all time steps within a chunk, then use the ‘*position_within_chunk*‘ column (index 2) to determine the time steps that do have values and mark them with the chunk id.

The *plot_discontiguous_chunks()* function below implements this behavior, creating one series or line for each chunk with missing rows all on the same plot. The expectation is that breaks in the line will help us see how contiguous or discontiguous these incomplete chunks happen to be.

    # plot chunks that do not have all data
    def plot_discontiguous_chunks(chunks, row_in_chunk_ix=2):
        n_steps = 8 * 24
        for c_id,rows in chunks.items():
            # skip chunks with all data
            if rows.shape[0] == n_steps:
                continue
            # create empty series
            series = [nan for _ in range(n_steps)]
            # mark all rows with data
            for row in rows:
                # convert to zero offset
                r_id = row[row_in_chunk_ix] - 1
                # mark value
                series[r_id] = c_id
            # plot
            pyplot.plot(series)
        pyplot.show()

The complete example is listed below.

    # plot discontiguous chunks
    from numpy import nan
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot chunks that do not have all data
    def plot_discontiguous_chunks(chunks, row_in_chunk_ix=2):
        n_steps = 8 * 24
        for c_id,rows in chunks.items():
            # skip chunks with all data
            if rows.shape[0] == n_steps:
                continue
            # create empty series
            series = [nan for _ in range(n_steps)]
            # mark all rows with data
            for row in rows:
                # convert to zero offset
                r_id = row[row_in_chunk_ix] - 1
                # mark value
                series[r_id] = c_id
            # plot
            pyplot.plot(series)
        pyplot.show()

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # plot discontiguous chunks
    plot_discontiguous_chunks(chunks)

Running the example creates a single figure with one line for each of the chunks with missing data.

The number and lengths of the breaks in the line for each chunk give an idea of how discontiguous the observations within each chunk happen to be.

Many of the chunks do have long stretches of contiguous data, which is a good sign for modeling.

There are cases where chunks have very few observations and those observations that are present are in small contiguous patches. These may be challenging to model.

Further, not all of these chunks have observations at the end of chunk: the period right before a forecast is required. These specifically will be a challenge for those models that seek to persist recent observations.
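One way to quantify this is to measure how many hours separate the last available row from the end of the eight-day window. The sketch below assumes a list of 1-based ‘position_within_chunk’ values; the function name and the example positions are hypothetical.

```python
# hypothetical helper: hours between the last observed row and the end of the chunk
def hours_since_last_observation(positions, n_steps=8 * 24):
    # positions are the 1-based 'position_within_chunk' values present in a chunk
    return n_steps - max(positions)

# a complete chunk ends right before the forecast period
complete = list(range(1, 193))
# a made-up incomplete chunk whose last row is at position 150
patchy = [1, 2, 3, 100, 149, 150]

gap_complete = hours_since_last_observation(complete)
gap_patchy = hours_since_last_observation(patchy)
print(gap_complete, gap_patchy)
```

A persistence forecast for the patchy chunk would be carrying forward a value that is already 42 hours old before the first lead time.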

The discontiguous nature of the series data within the chunks will also make it challenging to evaluate models. For example, one cannot simply split chunk data in half, train on the first half, and test on the second when the observations are patchy, at least when the incomplete chunks are considered.
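One possible workaround, sketched below, is to split each chunk by its position within the eight-day window rather than by row count, so that a patchy chunk is still cut at the same point in time. The five-day cut-off, the function name, and the toy rows are assumptions for this illustration only.

```python
from numpy import array

# hypothetical helper: split rows at a fixed hour offset within the chunk
def split_chunk_by_position(rows, pos_ix=2, n_train_hours=5 * 24):
    positions = rows[:, pos_ix].astype('int64')
    train = rows[positions <= n_train_hours]
    test = rows[positions > n_train_hours]
    return train, test

# toy patchy chunk: rowID, chunkID, position_within_chunk
rows = array([
    [1, 7, 1], [2, 7, 2], [3, 7, 119],
    [4, 7, 120], [5, 7, 121], [6, 7, 190]])
train, test = split_chunk_by_position(rows)
print(train.shape, test.shape)
```

Splitting the same six rows in half by count would instead place position 120 in the test set, mixing the two periods.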

The discontiguous nature of the chunks also suggests that it may be important to look at the hours covered by each chunk.

The time of day is important in environmental data, and models that assume that each chunk covers the same daily or weekly cycle may stumble if the start and end time of day vary across chunks.

We can quickly check this by plotting the distribution of the first hour (in a 24 hour day) of each chunk.

The number of bins in the histogram is set to 24 so we can clearly see the distribution for each hour of the day in 24-hour time.

Further, when collecting the first hour of the chunk, we are careful to only collect it from those chunks that have all eight days of data, in case a chunk with missing data does not have observations at the beginning of the chunk, which we know happens.

    # plot distribution of chunk start hour
    def plot_chunk_start_hour(chunks, hour_in_chunk_ix=5):
        # chunk start hour
        chunk_start_hours = [v[0, hour_in_chunk_ix] for k,v in chunks.items() if len(v)==192]
        # boxplot
        pyplot.subplot(2, 1, 1)
        pyplot.boxplot(chunk_start_hours)
        # histogram
        pyplot.subplot(2, 1, 2)
        pyplot.hist(chunk_start_hours, bins=24)
        # show plot
        pyplot.show()

The complete example is listed below.

    # plot distribution of chunk start hour
    from numpy import nan
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot distribution of chunk start hour
    def plot_chunk_start_hour(chunks, hour_in_chunk_ix=5):
        # chunk start hour
        chunk_start_hours = [v[0, hour_in_chunk_ix] for k,v in chunks.items() if len(v)==192]
        # boxplot
        pyplot.subplot(2, 1, 1)
        pyplot.boxplot(chunk_start_hours)
        # histogram
        pyplot.subplot(2, 1, 2)
        pyplot.hist(chunk_start_hours, bins=24)
        # show plot
        pyplot.show()

    # load dataset
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # plot distribution of chunk start hour
    plot_chunk_start_hour(chunks)

Running the example creates a box and whisker plot and a histogram of the first hour within each chunk.

We can see a reasonably uniform distribution of the start time across the 24 hours in the day.

Further, this means that the interval to be forecast for each chunk will also vary across the 24 hour period. This adds a wrinkle for models that might expect a standard three day forecast period (midnight to midnight).
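To see how the start hour shifts the forecast interval, the sketch below maps each lead time to an hour of day. It assumes a complete, contiguous eight-day chunk, so that position 1 falls on the start hour; the function name is hypothetical.

```python
# hypothetical helper: hour of day for each lead time, given the chunk start hour
def forecast_hours_of_day(start_hour, lead_times, n_steps=8 * 24):
    # position p within the chunk has hour (start_hour + p - 1) % 24,
    # and lead time +t corresponds to position n_steps + t
    return [(start_hour + n_steps + t - 1) % 24 for t in lead_times]

lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]
# chunk 1 in the sample rows above starts at hour 21
late_start = forecast_hours_of_day(21, lead_times)
midnight_start = forecast_hours_of_day(0, lead_times)
print(late_start)
print(midnight_start)
```

The same lead time therefore lands on different hours of the day in different chunks, which matters for any model that relies on the daily cycle.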

Now that we have some idea of the chunk-structure of the data, let’s take a closer look at the input variables that describe the meteorological observations.

There are 56 input variables.

The first six (indexes 0 to 5) are metadata information for the chunk and time of the observations. They are:

rowID chunkID position_within_chunk month_most_common weekday hour

The remaining 50 describe meteorological information for specific sites; they are:

Solar.radiation_64 WindDirection..Resultant_1 WindDirection..Resultant_1018 WindSpeed..Resultant_1 WindSpeed..Resultant_1018 Ambient.Max.Temperature_14 Ambient.Max.Temperature_22 Ambient.Max.Temperature_50 Ambient.Max.Temperature_52 Ambient.Max.Temperature_57 Ambient.Max.Temperature_76 Ambient.Max.Temperature_2001 Ambient.Max.Temperature_3301 Ambient.Max.Temperature_6005 Ambient.Min.Temperature_14 Ambient.Min.Temperature_22 Ambient.Min.Temperature_50 Ambient.Min.Temperature_52 Ambient.Min.Temperature_57 Ambient.Min.Temperature_76 Ambient.Min.Temperature_2001 Ambient.Min.Temperature_3301 Ambient.Min.Temperature_6005 Sample.Baro.Pressure_14 Sample.Baro.Pressure_22 Sample.Baro.Pressure_50 Sample.Baro.Pressure_52 Sample.Baro.Pressure_57 Sample.Baro.Pressure_76 Sample.Baro.Pressure_2001 Sample.Baro.Pressure_3301 Sample.Baro.Pressure_6005 Sample.Max.Baro.Pressure_14 Sample.Max.Baro.Pressure_22 Sample.Max.Baro.Pressure_50 Sample.Max.Baro.Pressure_52 Sample.Max.Baro.Pressure_57 Sample.Max.Baro.Pressure_76 Sample.Max.Baro.Pressure_2001 Sample.Max.Baro.Pressure_3301 Sample.Max.Baro.Pressure_6005 Sample.Min.Baro.Pressure_14 Sample.Min.Baro.Pressure_22 Sample.Min.Baro.Pressure_50 Sample.Min.Baro.Pressure_52 Sample.Min.Baro.Pressure_57 Sample.Min.Baro.Pressure_76 Sample.Min.Baro.Pressure_2001 Sample.Min.Baro.Pressure_3301 Sample.Min.Baro.Pressure_6005

Really, there are only eight meteorological input variables:

Solar.radiation WindDirection..Resultant WindSpeed..Resultant Ambient.Max.Temperature Ambient.Min.Temperature Sample.Baro.Pressure Sample.Max.Baro.Pressure Sample.Min.Baro.Pressure

These variables are recorded across 12 unique sites; they are:

1, 14, 22, 50, 52, 57, 64, 76, 1018, 2001, 3301, 6005

The data is beautifully complex.

Not all variables are recorded at all sites.

There is some overlap in the site identifiers used in the target variables, such as 1, 50, 64, etc.

There are site identifiers used in the target variables that are not used in the input variables, such as 4002. There are also site identifiers that are used in the input that are not used in the target identifiers, such as 14.

This suggests, at the very least, that not all variables are recorded at all locations and that recording stations are heterogeneous across sites. Further, there might be something special about sites that only collect measures of a given type or collect all measurements.
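The site bookkeeping can be checked by parsing the column names from the file header. The sketch below rebuilds the 50 input names and 39 target names compactly and assumes the trailing number after the last underscore is the site identifier; the helper name `site_of()` is hypothetical.

```python
# rebuild the 50 meteorological column names from the file header
nine_sites = [14, 22, 50, 52, 57, 76, 2001, 3301, 6005]
grouped = ['Ambient.Max.Temperature', 'Ambient.Min.Temperature',
           'Sample.Baro.Pressure', 'Sample.Max.Baro.Pressure',
           'Sample.Min.Baro.Pressure']
inputs = ['Solar.radiation_64',
          'WindDirection..Resultant_1', 'WindDirection..Resultant_1018',
          'WindSpeed..Resultant_1', 'WindSpeed..Resultant_1018']
inputs += ['%s_%d' % (v, s) for v in grouped for s in nine_sites]

# the 39 target column names, abbreviated here as 'variable_site' pairs
pairs = ('1_57 10_4002 10_8003 11_1 11_32 11_50 11_64 11_1003 11_1601 '
         '11_4002 11_8003 14_4002 14_8003 15_57 2_57 3_1 3_50 3_57 3_1601 '
         '3_4002 3_6006 4_1 4_50 4_57 4_1018 4_1601 4_2001 4_4002 4_4101 '
         '4_6006 4_8003 5_6006 7_57 8_57 8_4002 8_6004 8_8003 9_4002 '
         '9_8003').split()
targets = ['target_' + p for p in pairs]

# hypothetical helper: assume the number after the last underscore is the site id
def site_of(name):
    return int(name.rsplit('_', 1)[1])

input_vars = sorted(set(c.rsplit('_', 1)[0] for c in inputs))
input_sites = set(site_of(c) for c in inputs)
target_sites = set(site_of(c) for c in targets)

shared = sorted(input_sites & target_sites)
only_inputs = sorted(input_sites - target_sites)
only_targets = sorted(target_sites - input_sites)
print('%d inputs, %d variable types, %d input sites' % (
    len(inputs), len(input_vars), len(input_sites)))
print('shared sites:', shared)
print('input-only sites:', only_inputs)
print('target-only sites:', only_targets)
```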

Let’s take a closer look at the data for the input variables.

We can start off by looking at the structure and distribution of inputs per chunk.

The first few chunks that have all eight days of observations have the chunkId of 1, 3, and 5.

We can enumerate all of the input columns and create one line plot for each. This will create a time series line plot for each input variable to give a rough idea of how each varies across time.

We can repeat this for a few chunks to get an idea how the temporal structure may differ across chunks.

The function below named *plot_chunk_inputs()* takes the data in chunk format and a list of chunk ids to plot. It will create a figure with 50 line plots, one for each input variable, and n lines per plot, one for each chunk.

    # plot all inputs for one or more chunk ids
    def plot_chunk_inputs(chunks, c_ids):
        pyplot.figure()
        inputs = range(6, 56)
        for i in range(len(inputs)):
            ax = pyplot.subplot(len(inputs), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = inputs[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                pyplot.plot(rows[:,column])
        pyplot.show()

The complete example is listed below.

    # plot inputs for a chunk
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot all inputs for one or more chunk ids
    def plot_chunk_inputs(chunks, c_ids):
        pyplot.figure()
        inputs = range(6, 56)
        for i in range(len(inputs)):
            ax = pyplot.subplot(len(inputs), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = inputs[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                pyplot.plot(rows[:,column])
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # plot inputs for some chunks
    plot_chunk_inputs(chunks, [1])

Running the example creates a single figure with 50 line plots, one for each of the meteorological input variables.

The plots are hard to see, so you may want to increase the size of the created figure.

We can see that the observations for the first five variables look pretty complete; these are solar radiation, wind speed, and wind direction. The rest of the variables appear pretty patchy, at least for this chunk.

We can update the example and plot the input variables for the first three chunks with the full eight days of observations.

    plot_chunk_inputs(chunks, [1, 3, 5])

Running the example creates the same 50 line plots, each with three series or lines per plot, one for each chunk.

Again, the figure makes the individual plots hard to see, so you may need to increase the size of the figure to better review the patterns.

We can see that the three series do show similar structures within each line plot. This is a helpful finding as it suggests that it may be useful to model the same variables across multiple chunks.

It does raise the question as to whether the distribution of the variables differs greatly across sites.

We can look at the distribution of input variables crudely using box and whisker plots.

The *plot_chunk_input_boxplots()* function below will create one box and whisker per input feature for the data for one chunk.

    # boxplot for input variables for a chunk
    def plot_chunk_input_boxplots(chunks, c_id):
        rows = chunks[c_id]
        pyplot.boxplot(rows[:,6:56])
        pyplot.show()

The complete example is listed below.

    # boxplots of inputs for a chunk
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # boxplot for input variables for a chunk
    def plot_chunk_input_boxplots(chunks, c_id):
        rows = chunks[c_id]
        pyplot.boxplot(rows[:,6:56])
        pyplot.show()

    # load data (same path as the other examples)
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # boxplot for input variables
    plot_chunk_input_boxplots(chunks, 1)

Running the example creates 50 boxplots, one for each input variable for the observations in the first chunk in the training dataset.

We can see that variables of the same type may have the same spread of observations, and each group of variables appears to have differing units: perhaps degrees for wind direction, hectopascals for pressure, degrees Celsius for temperature, and so on.

It may be interesting to further investigate the distribution and spread of observations for each of the eight variable types. This is left as a further exercise.

We have some rough ideas about the input variables, and perhaps they may be useful in predicting the target variables. We cannot be sure.

We can now turn our attention to the target variables.

The goal of the forecast problem is to predict multiple variables across multiple sites for three days.

There are 39 time series variables to predict.

From the column header, they are:

"target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"

The naming convention for these column headers is:

    target_[variable identifier]_[site identifier]

We can convert these column headers into a small dataset of variable ids and site ids with a little regex.

The results are as follows:

    var, site
    1,57
    10,4002
    10,8003
    11,1
    11,32
    11,50
    11,64
    11,1003
    11,1601
    11,4002
    11,8003
    14,4002
    14,8003
    15,57
    2,57
    3,1
    3,50
    3,57
    3,1601
    3,4002
    3,6006
    4,1
    4,50
    4,57
    4,1018
    4,1601
    4,2001
    4,4002
    4,4101
    4,6006
    4,8003
    5,6006
    7,57
    8,57
    8,4002
    8,6004
    8,8003
    9,4002
    9,8003
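The conversion from column headers to variable and site ids can be sketched with the standard *re* module. This is only one way to do it; the pattern, the *parse_targets()* helper, and the short list of headers used here are illustrative assumptions.

```python
# sketch: convert target column headers into (variable id, site id) pairs
import re

# a few of the target column headers, copied from the list above
headers = ["target_1_57", "target_10_4002", "target_11_1", "target_4_8003"]

# naming convention: target_[variable identifier]_[site identifier]
pattern = re.compile(r'^target_(\d+)_(\d+)$')

# hypothetical helper: return a list of (var, site) integer pairs
def parse_targets(names):
    pairs = list()
    for name in names:
        match = pattern.match(name)
        if match is None:
            # skip any column that is not a target
            continue
        pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs

pairs = parse_targets(headers)
# print in the same 'var, site' layout
print('var, site')
for var, site in pairs:
    print('%d,%d' % (var, site))
```

Passing the full list of 39 column headers instead of the four used here would give the complete set of variable and site pairs.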

Helpfully, the targets are grouped by variable id.

We can see that one variable may have to be predicted across multiple sites; for example, variable 11 predicted at sites 1, 32, 50, and so on:

    var, site
    11,1
    11,32
    11,50
    11,64
    11,1003
    11,1601
    11,4002
    11,8003

We can see that different variables may need to be predicted for a given site. For example, site 50 requires variables 11, 3, and 4:

    var, site
    11,50
    3,50
    4,50

We can save the small dataset of targets to a file called ‘*targets.txt*‘ and load it up for some quick analysis.

    # summarize targets
    from numpy import unique
    from pandas import read_csv

    # load dataset
    dataset = read_csv('targets.txt', header=0)
    values = dataset.values
    # summarize unique
    print('Unique Variables: %d' % len(unique(values[:, 0])))
    print('Unique Sites: %d' % len(unique(values[:, 1])))

Running the example prints the number of unique variables and sites.

We can see that 39 target variables is far fewer than the (12*14) 168 we would have if we were predicting all variables for all sites.

    Unique Variables: 12
    Unique Sites: 14

Let’s take a closer look at the data for the target variables.

We can start off by looking at the structure and distribution of targets per chunk.

The first few chunks that have all eight days of observations have the chunkId of 1, 3, and 5.

We can enumerate all of the target columns and create one line plot for each. This will create a time series line plot for each target variable to give a rough idea of how it varies across time.

We can repeat this for a few chunks to get a rough idea of how the temporal structure may vary across chunks.

The function below, named *plot_chunk_targets()*, takes the data in chunk format and a list of chunk ids to plot. It will create a figure with 39 line plots, one for each target variable, and n lines per plot, one for each chunk.

    # plot all targets for one or more chunk ids
    def plot_chunk_targets(chunks, c_ids):
        pyplot.figure()
        targets = range(56, 95)
        for i in range(len(targets)):
            ax = pyplot.subplot(len(targets), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = targets[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                pyplot.plot(rows[:,column])
        pyplot.show()

The complete example is listed below.

    # plot targets for a chunk
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot all targets for one or more chunk ids
    def plot_chunk_targets(chunks, c_ids):
        pyplot.figure()
        targets = range(56, 95)
        for i in range(len(targets)):
            ax = pyplot.subplot(len(targets), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = targets[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                pyplot.plot(rows[:,column])
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # plot targets for some chunks
    plot_chunk_targets(chunks, [1])

Running the example creates a single figure with 39 line plots for chunk identifier “1”.

The plots are small, but give a rough idea of the temporal structure for the variables.

We can see that there are more than a few variables that have no data for this chunk. These cannot be forecasted directly, and probably not indirectly.

This suggests that, in addition to not having all variables for all sites, even those specified in the column header may not be present for some chunks.

We can also see breaks in some of the series where values are missing. This suggests that even though we may have observations for each time step within the chunk, we may not have a contiguous series for all variables in the chunk.

There is a cyclic structure to many of the plots. Most have eight peaks, very likely corresponding to the eight days of observations within the chunk. This seasonal structure could be modeled directly, and perhaps removed from the data when modeling and added back to the forecasted interval.
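Removing a daily cycle and adding it back to a forecast can be sketched with seasonal differencing. The example below is a minimal illustration on a synthetic hourly series; the period of 24 hours, the function names, and the synthetic data are all assumptions rather than anything taken from the dataset.

```python
# sketch: remove a daily cycle by seasonal differencing, then add it back
import numpy as np

# subtract the observation from one period (e.g. 24 hours) earlier
def seasonal_difference(series, period=24):
    series = np.asarray(series, dtype='float64')
    return series[period:] - series[:-period]

# invert the differencing for values that follow the known history
def invert_seasonal_difference(history, diffs, period=24):
    restored = list(history)
    for d in diffs:
        # add each difference to the value one period earlier
        restored.append(restored[-period] + d)
    return np.array(restored[len(history):])

# synthetic series: eight days of hourly values with a pure daily cycle
hours = np.arange(8 * 24)
series = 10.0 + 3.0 * np.sin(2.0 * np.pi * hours / 24.0)

# the differenced series has the daily cycle removed (here it is all zeros)
diffs = seasonal_difference(series)
# pretend the last day of differences is a forecast and restore the cycle
restored = invert_seasonal_difference(series[:-24], diffs[-24:])
print(np.allclose(diffs, 0.0), np.allclose(restored, series[-24:]))
```

On real data, missing values would propagate through the subtraction, so some imputation would likely be needed first.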

There does not appear to be any trend to the series.

We can re-run the example and plot the target variables for the first three chunks with complete data.

    # plot targets for some chunks
    plot_chunk_targets(chunks, [1, 3, 5])

Running the example creates a figure with 39 plots and three time series per plot, one for the targets for each chunk.

The plot is busy, and you may want to increase the size of the plot window to better see the comparison across the chunks for the target variables.

For many of the variables that have a cyclic daily structure, we can see the structure repeated across the chunks.

This is encouraging as it suggests that modeling a variable for a site may be helpful across chunks.

Further, plots 4-to-11 (counting from one) correspond to variable 11 across eight different sites. The strong similarity in temporal structure across these plots suggests that modeling the data per variable, pooled across sites, may be beneficial.

It is also useful to take a look at the distribution of the target variables.

We can start by taking a look at the distribution of each target variable for one chunk by creating box and whisker plots for each target variable.

A separate boxplot can be created for each target side-by-side, allowing the shape and range of values to be directly compared on the same scale.

    # boxplot for target variables for a chunk
    def plot_chunk_targets_boxplots(chunks, c_id):
        rows = chunks[c_id]
        pyplot.boxplot(rows[:,56:])
        pyplot.show()

The complete example is listed below.

    # boxplots of targets for a chunk
    from numpy import unique
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # boxplot for target variables for a chunk
    def plot_chunk_targets_boxplots(chunks, c_id):
        rows = chunks[c_id]
        pyplot.boxplot(rows[:,56:])
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # boxplot for target variables
    plot_chunk_targets_boxplots(chunks, 1)

Running the example creates a figure containing 39 boxplots, one for each of the 39 target variables for the first chunk.

We can see that many of the variables have a median close to zero or one; we can also see a large asymmetrical spread for most variables, suggesting the variables likely have a skew with outliers.

It is encouraging that boxplots 4-to-11 for variable 11 across eight sites show a similar distribution. This is further supporting evidence that data may be grouped by variable and used to fit a model that could be used across sites.

We can re-create this plot using data across all chunks to see dataset-wide patterns.

The complete example is listed below.

    # boxplots of targets for all chunks
    from pandas import read_csv
    from matplotlib import pyplot

    # boxplot for all target variables
    def plot_target_boxplots(values):
        pyplot.boxplot(values[:,56:])
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # boxplot for target variables
    values = dataset.values
    plot_target_boxplots(values)

Running the example creates a new figure showing 39 box and whisker plots for the entire training dataset regardless of chunk.

It is a little bit of a mess, where the circle outliers obscure the main data distributions.

We can see that outlier values do extend into the range 5-to-10 units. This suggests there might be some use in standardizing and/or rescaling the targets when modeling.
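Because the target columns contain missing values, any standardization needs to ignore NaNs when estimating the mean and standard deviation. The snippet below is a minimal sketch using NumPy on a small made-up array; the *standardize_columns()* helper is hypothetical.

```python
# sketch: standardize each column while ignoring NaN values
import numpy as np

# hypothetical helper: returns scaled values plus the statistics used
def standardize_columns(values):
    values = np.asarray(values, dtype='float64')
    # column statistics computed from non-NaN observations only
    mean = np.nanmean(values, axis=0)
    std = np.nanstd(values, axis=0)
    # avoid division by zero for constant columns
    std[std == 0.0] = 1.0
    return (values - mean) / std, mean, std

# made-up target data with gaps
targets = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 30.0]])
scaled, mean, std = standardize_columns(targets)
print(mean)
print(scaled)
```

The returned mean and standard deviation would be kept so that forecasts can be converted back to the original scale.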

Perhaps the most useful finding is that there are some targets that do not have any (or very much) data regardless of chunk. These columns probably should be excluded from the dataset.

We can investigate the apparent missing data further by creating a bar chart of the ratio of missing data per column, excluding the metadata columns at the beginning (e.g. the first five columns).

The *plot_col_percentage_missing()* function below implements this.

    # bar chart of the ratio of missing data per column
    def plot_col_percentage_missing(values, ix_start=5):
        ratios = list()
        # skip early columns, with meta data or strings
        for col in range(ix_start, values.shape[1]):
            col_data = values[:, col].astype('float32')
            ratio = count_nonzero(isnan(col_data)) / len(col_data) * 100
            ratios.append(ratio)
            if ratio > 90.0:
                print(col, ratio)
        col_id = [x for x in range(ix_start, values.shape[1])]
        pyplot.bar(col_id, ratios)
        pyplot.show()

The complete example is listed below.

    # summarize missing data per column
    from numpy import isnan
    from numpy import count_nonzero
    from pandas import read_csv
    from matplotlib import pyplot

    # bar chart of the ratio of missing data per column
    def plot_col_percentage_missing(values, ix_start=5):
        ratios = list()
        # skip early columns, with meta data or strings
        for col in range(ix_start, values.shape[1]):
            col_data = values[:, col].astype('float32')
            ratio = count_nonzero(isnan(col_data)) / len(col_data) * 100
            ratios.append(ratio)
            # report the column index and percentage for sparse columns
            if ratio > 90.0:
                print(col, ratio)
        col_id = [x for x in range(ix_start, values.shape[1])]
        pyplot.bar(col_id, ratios)
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # plot ratio of missing data per column
    values = dataset.values
    plot_col_percentage_missing(values)

Running the example first prints the column id (zero offset) and the ratio of missing data, if the ratio is above 90%.

We can see that there are in fact no columns with zero non-NaN data, but a dozen (12) that have above 90% missing data.

Interestingly, seven of these are target variables (index 56 or higher).

    11 91.48885539779488
    20 91.48885539779488
    29 91.48885539779488
    38 91.48885539779488
    47 91.48885539779488
    58 95.38880516115385
    66 96.9805134713519
    68 95.38880516115385
    72 97.31630575606145
    86 95.38880516115385
    92 95.38880516115385
    94 95.38880516115385
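If we did decide to exclude very sparse columns, as suggested earlier, the filtering step might look something like the sketch below. The 90% threshold, the *drop_sparse_columns()* helper, and the small made-up array are assumptions for illustration only.

```python
# sketch: drop columns whose percentage of missing values exceeds a threshold
import numpy as np

# hypothetical helper: returns the kept data and the indexes of the kept columns
def drop_sparse_columns(values, threshold=90.0):
    values = np.asarray(values, dtype='float64')
    # percentage of NaN values in each column
    percent_missing = np.isnan(values).mean(axis=0) * 100.0
    keep = percent_missing <= threshold
    return values[:, keep], np.flatnonzero(keep)

# made-up data: 20 rows and 3 columns
data = np.ones((20, 3))
data[1:, 1] = np.nan   # column 1 is 95% missing
data[:5, 2] = np.nan   # column 2 is 25% missing
kept, kept_ix = drop_sparse_columns(data)
print(kept.shape, kept_ix)
```

Returning the kept column indexes makes it possible to map results back to the original column headers.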

A bar chart of column index number to ratio of missing data is created.

We can see that there might be some stratification to the ratio of missing data, with a cluster below 10%, a cluster around 70%, and a cluster above 90%.

We can also see a separation between input variables and target variables, where the former are quite regular as they show the same variable type measured across different sites.

Such small amounts of data for some target variables suggest the need to leverage other factors besides past observations in order to make predictions.

The distributions of the target variables are not neat and may be non-Gaussian at the least, or highly multimodal at worst.

We can check this by looking at histograms of the target variables, for the data for a single chunk.

A problem with the *hist()* function in matplotlib is that it is not robust to NaN values. We can overcome this by checking that each column has non-NaN values prior to plotting and excluding the rows with NaN values.

The function below does this and creates one histogram for each target variable for one or more chunks.

    # plot distribution of targets for one or more chunk ids
    def plot_chunk_targets_hist(chunks, c_ids):
        pyplot.figure()
        targets = range(56, 95)
        for i in range(len(targets)):
            ax = pyplot.subplot(len(targets), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = targets[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                # extract column of interest
                col = rows[:,column].astype('float32')
                # check for some data to plot
                if count_nonzero(isnan(col)) < len(rows):
                    # only plot non-nan values
                    pyplot.hist(col[~isnan(col)], bins=100)
        pyplot.show()

The complete example is listed below.

    # plot distribution of targets for a chunk
    from numpy import unique
    from numpy import isnan
    from numpy import count_nonzero
    from pandas import read_csv
    from matplotlib import pyplot

    # split the dataset by 'chunkID', return a dict of id to rows
    def to_chunks(values, chunk_ix=1):
        chunks = dict()
        # get the unique chunk ids
        chunk_ids = unique(values[:, chunk_ix])
        # group rows by chunk id
        for chunk_id in chunk_ids:
            selection = values[:, chunk_ix] == chunk_id
            chunks[chunk_id] = values[selection, :]
        return chunks

    # plot distribution of targets for one or more chunk ids
    def plot_chunk_targets_hist(chunks, c_ids):
        pyplot.figure()
        targets = range(56, 95)
        for i in range(len(targets)):
            ax = pyplot.subplot(len(targets), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = targets[i]
            for chunk_id in c_ids:
                rows = chunks[chunk_id]
                # extract column of interest
                col = rows[:,column].astype('float32')
                # check for some data to plot
                if count_nonzero(isnan(col)) < len(rows):
                    # only plot non-nan values
                    pyplot.hist(col[~isnan(col)], bins=100)
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # group data by chunks
    values = dataset.values
    chunks = to_chunks(values)
    # plot targets for some chunks
    plot_chunk_targets_hist(chunks, [1])

Running the example creates a figure with 39 histograms, one for each target variable for the first chunk.

The plot is hard to read, but the large number of bins goes to show the distribution of the variables.

It might be fair to say that perhaps none of the target variables have an obvious Gaussian distribution. Many may have a skewed distribution with a long right tail.

Other variables have what appears to be quite a discrete distribution that might be an artifact of the chosen measurement device or measurement scale.

We can re-create the same plot with target variables for all chunks.

The complete example is listed below.

    # plot distribution of all targets
    from numpy import isnan
    from numpy import count_nonzero
    from pandas import read_csv
    from matplotlib import pyplot

    # plot histogram for each target variable
    def plot_target_hist(values):
        pyplot.figure()
        targets = range(56, 95)
        for i in range(len(targets)):
            ax = pyplot.subplot(len(targets), 1, i+1)
            ax.set_xticklabels([])
            ax.set_yticklabels([])
            column = targets[i]
            # extract column of interest
            col = values[:,column].astype('float32')
            # check for some data to plot
            if count_nonzero(isnan(col)) < len(values):
                # only plot non-nan values
                pyplot.hist(col[~isnan(col)], bins=100)
        pyplot.show()

    # load data
    dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
    # plot targets for all chunks
    values = dataset.values
    plot_target_hist(values)

Running the example creates a figure with 39 histograms, one for each of the target variables in the training dataset.

We can see fuller distributions, which are more insightful.

The first handful of plots perhaps show a highly skewed distribution, the core of which may or may not be Gaussian.

We can see many Gaussian-like distributions with gaps, suggesting discrete measurements imposed on a Gaussian-distributed continuous variable.

We can also see some variables that show an exponential distribution.

Together, this suggests either the use of power transforms to explore reshaping the data to be more Gaussian, and/or the use of nonparametric modeling methods that are not dependent upon a Gaussian distribution of the variables. For example, classical linear methods may be expected to have a hard time.
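As a rough illustration of what a power transform can do, the sketch below applies a log transform (a simple member of the power transform family) to synthetic right-skewed data and compares the sample skewness before and after. The synthetic exponential data and the *skewness()* helper are assumptions; the real targets may respond differently.

```python
# sketch: effect of a log transform on right-skewed data
import numpy as np

# sample skewness, ignoring NaN values
def skewness(x):
    x = x[~np.isnan(x)]
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# synthetic data with a long right tail, standing in for a target variable
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=2.0, size=5000)

# log(1 + x) is defined for zero values, unlike a plain log
transformed = np.log1p(skewed)

print('Skew before: %.3f' % skewness(skewed))
print('Skew after: %.3f' % skewness(transformed))
```

A skewness closer to zero after the transform indicates a more symmetric distribution, although not necessarily a Gaussian one.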

After the end of the competition, the person who provided the data, David Chudzicki, summarized the true meaning of the 12 output variables.

This was provided in a forum post titled “what the target variables really were”, reproduced partially below:

    Description, Target Variable
    Carbon monoxide, 8
    Sulfur dioxide, 4
    SO2 max 5-min avg, 3
    Nitric oxide (NO), 10
    Nitrogen dioxide (NO2), 14
    Oxides of nitrogen (NOx), 9
    Ozone, 11
    PM10 Total 0-10um STP, 5
    OC CSN Unadjusted PM2.5 LC TOT, 15
    Total Nitrate PM2.5 LC, 2
    EC CSN PM2.5 LC TOT, 1
    Total Carbon PM2.5 LC TOT, 7
    Sulfate PM2.5 LC, 8
    PM2.5 Raw Data, 4
    PM2.5 AQI & Speciation Mass, 3

This is interesting as we can see that the target variables are meteorological in nature and related to air quality as the name of the competition suggests.

A problem is that there are 15 variables and only 12 different types of target variables in the dataset.

The cause of this problem is that the same target variable in the dataset may be used to represent different target variables. Specifically:

- Target 8 could be data for ‘*Carbon monoxide*‘ or ‘*Sulfate PM2.5 LC*‘.
- Target 4 could be data for ‘*Sulfur dioxide*‘ or ‘*PM2.5 Raw Data*‘.
- Target 3 could be data for ‘*SO2 max 5-min avg*‘ or ‘*PM2.5 AQI & Speciation Mass*‘.

Judging from the names of the variables, the doubling-up of data into the same target variable combines variables with differing chemical characteristics and perhaps even units of measure; that is, it appears to be accidental rather than strategic.

It is not clear, but it is likely that a target represents one variable within a chunk but may represent different variables across chunks. Alternately, it may be possible that the variables differ across sites within each chunk. In the former case, it means that models that expect consistency in these target variables across chunks, which is a very reasonable assumption, may have difficulty. In the latter, models can treat the variable-site combinations as distinct variables.

It may be possible to tease out the differences by comparing the distribution and scales of these variables across chunks.
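A first pass at such a comparison could be to compute a robust summary, such as the median, of one target column within each chunk and look for chunks on a very different scale. The snippet below is a minimal sketch on made-up data; the *per_chunk_median()* helper and the tiny array are assumptions, and a large difference in scale would only be a hint, not proof, that a target holds different quantities.

```python
# sketch: compare the scale of one target column across chunks
import numpy as np

# hypothetical helper: median of a column within each chunk, skipping empty chunks
def per_chunk_median(values, column, chunk_ix=1):
    medians = dict()
    for chunk_id in np.unique(values[:, chunk_ix]):
        rows = values[values[:, chunk_ix] == chunk_id]
        col = rows[:, column].astype('float64')
        if np.all(np.isnan(col)):
            # no observations for this target in this chunk
            continue
        medians[chunk_id] = np.nanmedian(col)
    return medians

# made-up data with columns: row id, chunk id, one target
values = np.array([
    [0, 1, 1.0],
    [1, 1, 2.0],
    [2, 1, 3.0],
    [3, 2, 100.0],
    [4, 2, 200.0],
    [5, 2, np.nan],
    [6, 3, np.nan]])
medians = per_chunk_median(values, column=2)
for chunk_id, median in medians.items():
    print(chunk_id, median)
```

In this made-up case, the two-orders-of-magnitude gap between chunks 1 and 2 is the kind of pattern that might be worth a closer look.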

This is disappointing, and depending on how consequential it is to model skill, it may require the removal of these variables from the dataset, which account for a large share of the target variables (20 of 39).

In this section, we will harness what we have discovered about the problem and suggest some approaches to modeling this problem.

I like this dataset; it is messy, realistic, and resists naive approaches.

This section is divided into four parts; they are:

- Framing.
- Data Preparation.
- Modeling.
- Evaluation.

The problem is generally framed as a multivariate multi-step time series forecasting problem.

Further, the multiple variables are required to be forecasted across multiple sites, which is a common structural breakdown for time series forecasting problems, e.g. predict the same variable at different physical locations such as stores or stations.

Let’s walk through some possible framings of the data.

A first-cut approach might be to treat each variable at each site as a univariate time series forecasting problem.

A model is given eight days of hourly observations for a variable and is asked to forecast three days, from which a specific subset of forecast lead times are taken and used or evaluated.

It may be possible in a few select cases, and this could be confirmed with some further data analysis.

Nevertheless, the data generally resists this framing because not all chunks have eight days of observations for each target variable. Further, the time series for the target variable can be dramatically discontiguous, if not mostly (90%-to-95%) incomplete.

We could relax the expectation of the structure and amount of prior data required by the model, designing the model to make use of whatever is available.

This approach would require 39 models per chunk and a total of (208 * 39) or 8,112 separate models. It sounds possible, but perhaps less scalable than we may prefer from an engineering perspective.

The variable-site combinations could be modeled across chunks, requiring only 39 models.

The target variables can be aggregated across sites.

We can also relax which lag observations are used to make a forecast and present whatever is available, either with zero-padding or imputation for missing values, or even with lag observations that disregard lead time.

We can then frame the problem as given some prior observations for a given variable, forecast the following three days.

The models may have more to work with, but will disregard any variable differences based on site. This may or may not be reasonable and could be checked by comparing variable distributions across sites.

There are 12 unique variables.

We could model each variable per chunk, giving (208 * 12) or 2,496 models. It might make more sense to model the 12 variables across chunks, requiring only 12 models.

Perhaps one or more target variables are dependent on one or more of the meteorological variables, or even on the other target variables.

This could be explored by investigating the correlation between each target variable and each input variable, as well as with the other target variables.
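A minimal sketch of such a check is shown below using the pandas *corr()* function, which excludes missing values pair by pair. The column names and the synthetic data are hypothetical and stand in for the real input and target columns.

```python
# sketch: correlation between a target and candidate input variables
import numpy as np
import pandas as pd

# synthetic data: the target depends on wind speed but not on temperature
rng = np.random.default_rng(7)
n = 200
wind = rng.normal(size=n)
temp = rng.normal(size=n)
target = 2.0 * wind + rng.normal(scale=0.1, size=n)
# knock out some target observations to mimic missing data
target[::10] = np.nan

# hypothetical column names
frame = pd.DataFrame({'wind_speed': wind, 'temperature': temp, 'target_x': target})

# Pearson correlation of each input with the target, ignoring NaN pairs
corr = frame.corr()['target_x'].drop('target_x')
print(corr)
```

Pearson correlation only captures linear relationships, so a low score does not rule out a nonlinear dependency.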

If such dependencies exist, or could be assumed, it may be possible to not only forecast the variables with more complete data, but also those target variables with above 90% missing data.

Such models could use some subset of prior meteorological observations and/or target variable observations as input. The discontiguous nature of the data may require the relaxing of the traditional lag temporal structure for the input variables, allowing the model to use whatever was available for a specific forecast.

Depending on the choice of model, the input and target variables may benefit from some data preparation, such as:

- Standardization.
- Normalization.
- Power Transform, where distributions are skewed or otherwise non-Gaussian.
- Seasonal Differencing, where seasonal structures are present.

To address the missing data, in some cases imputing may be required with simple persistence or averaging.
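Both simple strategies can be sketched in a few lines of NumPy. The helper names below are hypothetical, and the example series is made up.

```python
# sketch: two simple imputation strategies for a series with gaps
import numpy as np

# persistence: carry the last seen observation forward
def impute_persistence(series):
    filled = np.array(series, dtype='float64')
    for i in range(1, len(filled)):
        if np.isnan(filled[i]):
            filled[i] = filled[i - 1]
    return filled

# averaging: replace gaps with the mean of the observed values
def impute_mean(series):
    filled = np.array(series, dtype='float64')
    filled[np.isnan(filled)] = np.nanmean(filled)
    return filled

series = [1.0, np.nan, np.nan, 4.0, np.nan]
print(impute_persistence(series))
print(impute_mean(series))
```

Note that persistence leaves any missing values at the very start of a series untouched, as there is no earlier observation to carry forward.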

In other cases, and depending on the choice of model, it may be possible to learn directly from the NaN values as observations (e.g. XGBoost can do this) or to fill with 0 values and mask the inputs (e.g. LSTMs can do this).

It may be interesting to investigate downsampling the input to 2-, 4-, or 12-hourly data or similar in an attempt to fill the gaps in discontiguous data, e.g. forecasting hourly values from 12-hourly data.
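One way to reduce the time resolution while tolerating gaps is to average each block of hours and ignore the NaN values within the block. The snippet below is a minimal sketch; the *downsample_mean()* helper and the short example series are assumptions.

```python
# sketch: reduce hourly data to a coarser resolution, ignoring gaps
import numpy as np

# hypothetical helper: average each block of 'factor' consecutive observations
def downsample_mean(series, factor):
    series = np.asarray(series, dtype='float64')
    # drop any trailing observations that do not fill a whole block
    n_blocks = len(series) // factor
    blocks = series[:n_blocks * factor].reshape(n_blocks, factor)
    # mean of the non-NaN values in each block
    return np.nanmean(blocks, axis=1)

# eight hourly observations with two gaps, reduced to 4-hourly data
hourly = [1.0, np.nan, 3.0, np.nan, 5.0, 6.0, 7.0, 8.0]
print(downsample_mean(hourly, 4))
```

A block with no observations at all would still come out as NaN (NumPy emits a warning in that case), so this only fills gaps that are shorter than the block length.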

Modeling may require some prototyping to discover what works well in terms of methods and chosen input observations.

There may be rare examples of chunks with complete data where classical methods like ETS or SARIMA could be used for univariate forecasting.

Generally, the problem resists the classical methods.

A good choice would be the use of nonlinear machine learning methods that are agnostic about the temporal structure of the input data, making use of whatever is available.

Such models could be used in a recursive or direct strategy to forecast the lead times. A direct strategy may make more sense, with one model per required lead time.

There are 10 lead times, and 39 target variables, in which case a direct strategy would require (39 * 10) or 390 models.
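To make the direct strategy concrete, the sketch below fits one separate model per lead time for a single series. It uses a plain least-squares linear model from NumPy on a synthetic series purely for illustration; the helper names, the 24 lag observations, and the three lead times are assumptions, and a nonlinear learner would be swapped in for real use.

```python
# sketch: direct multi-step forecasting with one model per lead time
import numpy as np

# frame a series as (lag observations -> value 'lead' steps ahead)
def make_supervised(series, n_lags, lead):
    X, y = list(), list()
    for t in range(n_lags, len(series) - lead + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + lead - 1])
    return np.array(X), np.array(y)

# fit a separate linear model for each lead time
def fit_direct_models(series, n_lags, lead_times):
    models = dict()
    for lead in lead_times:
        X, y = make_supervised(series, n_lags, lead)
        # add a bias column and solve by least squares
        X = np.column_stack([X, np.ones(len(X))])
        coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        models[lead] = coef
    return models

# forecast every lead time from the most recent observations
def forecast_direct(models, recent):
    row = np.append(recent, 1.0)
    return {lead: float(row @ coef) for lead, coef in models.items()}

# synthetic series: eight days of hourly data with a daily cycle
hours = np.arange(8 * 24)
series = np.sin(2.0 * np.pi * hours / 24.0)
lead_times = [1, 5, 24]
models = fit_direct_models(series, n_lags=24, lead_times=lead_times)
forecast = forecast_direct(models, series[-24:])
for lead in lead_times:
    print('+%d: %.3f' % (lead, forecast[lead]))
```

Repeating this for each of the 39 targets and each of the 10 lead times is what gives the 390 models mentioned above.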

A downside of the direct approach to modeling the problem is the inability of the model to leverage any dependencies between target variables in the forecast interval, specifically across sites, across variables, or across lead times. If these dependencies exist (and some surely do), it may be possible to add a flavor of them using a second tier of ensemble models.

Feature selection could be used to discover the variables and/or the lag lead times that may provide the most value in forecasting each target variable and lead time.

This approach would provide a lot of flexibility, and as was shown in the competition, ensembles of decision trees perform well with little tuning.

Like machine learning methods, deep learning methods may be able to use whatever multivariate data is available in order to make a prediction.

Two classes of neural networks may be worth exploring for this problem:

- Convolutional Neural Networks or CNNs.
- Recurrent Neural Networks, specifically Long Short-Term Memory networks or LSTMs.

CNNs are capable of distilling long sequences of multivariate input time series data into small feature maps, and in essence learn the features from the sequences that are most relevant for forecasting. Their ability to handle noise and feature invariance across the input sequences may be useful. Like other neural networks, CNNs can output a vector in order to predict the forecast lead times.

LSTMs are designed to work with sequence data and can directly support missing data via masking. They too are capable of automatic feature learning from long input sequences and alone or combined with CNNs may perform well on this problem. Together with an encoder-decoder architecture, the LSTM network can be used to natively forecast multiple lead times.

A naive approach that mirrors that used in the competition might be best for evaluating models.

That is, splitting each chunk into train and test sets, in this case using the first five days of data for training and the remaining three for test.
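A minimal sketch of this split is shown below. It assumes the chunks are stored in a dict of chunk id to rows and that one column holds the hour position of each row within its chunk, counting from one; the column index of 2, the helper name, and the made-up chunks are all assumptions.

```python
# sketch: split each chunk into the first five days (train) and the rest (test)
import numpy as np

# hypothetical helper; row_in_chunk_ix is the assumed position-within-chunk column
def split_train_test(chunks, row_in_chunk_ix=2, cut_point=5 * 24):
    train, test = list(), list()
    for chunk_id, rows in chunks.items():
        position = rows[:, row_in_chunk_ix].astype('float64')
        train_rows = rows[position <= cut_point]
        test_rows = rows[position > cut_point]
        # skip chunks that cannot provide both a train and a test portion
        if len(train_rows) == 0 or len(test_rows) == 0:
            continue
        train.append(train_rows)
        test.append(test_rows)
    return train, test

# made-up chunk with columns: row id, chunk id, position within chunk
def make_chunk(chunk_id, n_hours):
    return np.column_stack([
        np.arange(n_hours),
        np.full(n_hours, chunk_id),
        np.arange(1, n_hours + 1)])

# one chunk with eight full days and one with only 100 hours
chunks = {1: make_chunk(1, 192), 2: make_chunk(2, 100)}
train, test = split_train_test(chunks)
print(len(train), train[0].shape, test[0].shape)
```

Splitting on the position within the chunk, rather than on row counts, keeps the cut point at the end of day five even when rows are missing.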

It may be possible and interesting to finalize a model by training it on the entire dataset and submitting a forecast to the Kaggle website for evaluation on the held out test dataset.

This section provides more resources on the topic if you are looking to go deeper.

- A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem
- EMC Data Science Global Hackathon (Air Quality Prediction)
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?

In this tutorial, you discovered and explored the Air Quality Prediction dataset that represents a challenging multivariate, multi-site, and multi-step time series forecasting problem.

Specifically, you learned:

- How to load and explore the chunk-structure of the dataset.
- How to explore and visualize the input and target variables for the dataset.
- How to use the new understanding to outline a suite of methods for framing the problem, preparing the data, and modeling the dataset.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Load, Visualize, and Explore a Complex Multivariate Multistep Time Series Forecasting Dataset appeared first on Machine Learning Mastery.

The post How to Develop LSTM Models for Multi-Step Time Series Forecasting of Household Power Consumption appeared first on Machine Learning Mastery.

This data represents a multivariate time series of power-related variables that in turn could be used to model and even forecast future electricity consumption.

Unlike other machine learning algorithms, long short-term memory recurrent neural networks are capable of automatically learning features from sequence data, support multivariate data, and can output variable-length sequences that can be used for multi-step forecasting.

In this tutorial, you will discover how to develop long short-term memory recurrent neural networks for multi-step time series forecasting of household power consumption.

After completing this tutorial, you will know:

- How to develop and evaluate univariate and multivariate Encoder-Decoder LSTMs for multi-step time series forecasting.
- How to develop and evaluate a CNN-LSTM Encoder-Decoder model for multi-step time series forecasting.
- How to develop and evaluate a ConvLSTM Encoder-Decoder model for multi-step time series forecasting.

Let’s get started.

This tutorial is divided into nine parts; they are:

- Problem Description
- Load and Prepare Dataset
- Model Evaluation
- LSTMs for Multi-Step Forecasting
- LSTM Model With Univariate Input and Vector Output
- Encoder-Decoder LSTM Model With Univariate Input
- Encoder-Decoder LSTM Model With Multivariate Input
- CNN-LSTM Encoder-Decoder Model With Univariate Input
- ConvLSTM Encoder-Decoder Model With Univariate Input

The ‘Household Power Consumption‘ dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

- **global_active_power**: The total active power consumed by the household (kilowatts).
- **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
- **voltage**: Average voltage (volts).
- **global_intensity**: Average current intensity (amps).
- **sub_metering_1**: Active energy for kitchen (watt-hours of active energy).
- **sub_metering_2**: Active energy for laundry (watt-hours of active energy).
- **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

A fourth sub-metering variable can be created by subtracting the sum of three defined sub-metering variables from the total active energy as follows:

sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
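As a quick check of the units in this formula, here is a minimal sketch that uses made-up readings for a single minute (the numbers are illustrative and are not taken from the dataset). Multiplying kilowatts by 1000 gives watts, and dividing by 60 converts one minute of consumption into watt-hours, the unit used by the sub-metering variables.

```python
# hypothetical readings for a single one-minute observation
global_active_power = 4.2   # kilowatts, averaged over the minute
sub_metering_1 = 0.0        # watt-hours (kitchen)
sub_metering_2 = 1.0        # watt-hours (laundry)
sub_metering_3 = 17.0       # watt-hours (climate control)

# kilowatts -> watts (x 1000), then one minute of use -> watt-hours (/ 60)
total_wh = global_active_power * 1000 / 60
sub_metering_remainder = total_wh - (sub_metering_1 + sub_metering_2 + sub_metering_3)
print(round(total_wh, 1), round(sub_metering_remainder, 1))
```

With these example values, the household used about 70 watt-hours in the minute, of which 52 were not captured by the three sub-meters.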

The dataset can be downloaded from the UCI Machine Learning repository as a single 20 megabyte .zip file:

Download the dataset and unzip it into your current working directory. You will now have the file “*household_power_consumption.txt*” that is about 127 megabytes in size and contains all of the observations.

We can use the *read_csv()* function to load the data and combine the first two columns into a single date-time column that we can use as an index.

# load all data
dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])

Next, we can mark all missing values indicated with a ‘*?*‘ character with a *NaN* value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which would be less efficient.

# mark all missing values
dataset.replace('?', nan, inplace=True)
# make dataset numeric
dataset = dataset.astype('float32')

We also need to fill in the missing values now that they have been marked.

A very simple approach would be to copy the observation from the same time the day before. We can implement this in a function named *fill_missing()* that will take the NumPy array of the data and copy values from exactly 24 hours ago.

# fill missing values with a value at the same time one day ago
def fill_missing(values):
    one_day = 60 * 24
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]
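To see the copy-from-one-day-ago idea on something small, the sketch below repeats the same loop with the look-back distance passed in as an argument. The function name *fill_missing_with_period()* and its *period* argument are assumptions added for this illustration only; the tutorial's function hard-codes one day of minutes.

```python
from numpy import array, isnan, nan

# same idea as fill_missing(), but with the look-back distance as a parameter
# (the function name and 'period' argument are hypothetical, for illustration)
def fill_missing_with_period(values, period):
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - period, col]

# toy data: 6 observations of 1 variable, pretending a "day" is 3 observations
data = array([[1.0], [2.0], [3.0], [nan], [5.0], [nan]])
fill_missing_with_period(data, period=3)
print(data.ravel())
```

Each gap is replaced by the value three rows earlier. One thing to keep in mind with this scheme is that a gap within the first period would produce a negative row index, which NumPy interprets as counting back from the end of the array.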

We can apply this function directly to the data within the DataFrame.

# fill missing
fill_missing(dataset.values)

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

# add a column for the remainder of sub metering
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as ‘*household_power_consumption.csv*‘.

# save updated dataset
dataset.to_csv('household_power_consumption.csv')

Tying all of this together, the complete example of loading, cleaning-up, and saving the dataset is listed below.

# load and clean-up data
from numpy import nan
from numpy import isnan
from pandas import read_csv

# fill missing values with a value at the same time one day ago
def fill_missing(values):
    one_day = 60 * 24
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]

# load all data
dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
# mark all missing values
dataset.replace('?', nan, inplace=True)
# make dataset numeric
dataset = dataset.astype('float32')
# fill missing
fill_missing(dataset.values)
# add a column for the remainder of sub metering
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
# save updated dataset
dataset.to_csv('household_power_consumption.csv')

Running the example creates the new file ‘*household_power_consumption.csv*‘ that we can use as the starting point for our modeling project.

In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

- Problem Framing
- Evaluation Metric
- Train and Test Sets
- Walk-Forward Validation

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household in planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense, given that we are interested in total power per day.

We can achieve this easily using the resample() function on the pandas DataFrame. Calling this function with the argument ‘*D*‘ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.

The complete example is listed below.

# resample minute data to total for each day
from pandas import read_csv
# load the new file
dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# resample data to daily
daily_groups = dataset.resample('D')
daily_data = daily_groups.sum()
# summarize
print(daily_data.shape)
print(daily_data.head())
# save
daily_data.to_csv('household_power_consumption_days.csv')

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named ‘*household_power_consumption_days.csv*‘.

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.
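If you want to confirm what the daily resampling does without loading the full file, the sketch below applies the same *resample()* and *sum()* calls to synthetic per-minute data. The frame, its single column, and the constant readings are made up for the illustration.

```python
from numpy import ones
from pandas import DataFrame, date_range

# synthetic per-minute data: two full days, every reading equal to 1.0
index = date_range('2010-01-01', periods=2 * 24 * 60, freq='min')
minute_data = DataFrame({'global_active_power': ones(len(index))}, index=index)

# group by calendar day and sum the observations within each day
daily_data = minute_data.resample('D').sum()
print(daily_data.shape)
print(daily_data['global_active_power'].tolist())
```

Each day collapses to a single row whose value is the sum of its 1,440 per-minute readings.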

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

- To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
- To contrast models based on their skills at different lead times (e.g. models good at +1 day vs models good at days +5).

The units of the total power are kilowatts and it would be useful to have an error metric that was also in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Unlike MAE, RMSE squares each error before averaging, so it punishes large forecast errors more heavily.
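The difference between the two metrics is easiest to see on a small example. The sketch below uses two hypothetical sets of forecast errors with the same total absolute error: one spread evenly, one concentrated in a single bad forecast.

```python
from math import sqrt

# two hypothetical sets of forecast errors with the same total absolute error
even_errors = [10.0, 10.0, 10.0, 10.0]
spiky_errors = [0.0, 0.0, 0.0, 40.0]

def mae(errors):
    # mean of the absolute errors
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # square root of the mean of the squared errors
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

print(mae(even_errors), rmse(even_errors))    # 10.0 10.0
print(mae(spiky_errors), rmse(spiky_errors))  # 10.0 20.0
```

MAE scores both cases the same, whereas RMSE doubles when the error is concentrated in one large miss.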

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a short-cut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used would be the RMSE across all forecast days.

The function *evaluate_forecasts()* below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

Running the function will first return the overall RMSE regardless of day, then an array of RMSE scores for each day.
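As an aside, the same two calculations can be written with NumPy array operations instead of explicit loops. The sketch below is one possible equivalent, not the version used in the rest of the tutorial; the function name *evaluate_forecasts_vectorized()* and the toy arrays are made up for the illustration.

```python
from numpy import array, sqrt

# hypothetical vectorized equivalent of evaluate_forecasts()
def evaluate_forecasts_vectorized(actual, predicted):
    squared_errors = (actual - predicted) ** 2
    # one RMSE per forecast day (column)
    scores = sqrt(squared_errors.mean(axis=0))
    # one RMSE pooled over every week and day
    score = sqrt(squared_errors.mean())
    return float(score), scores.tolist()

# toy data: 2 weeks x 3 "days" (shortened weeks keep the example small)
actual = array([[10.0, 20.0, 30.0], [10.0, 20.0, 30.0]])
predicted = array([[10.0, 23.0, 30.0], [10.0, 17.0, 24.0]])
score, scores = evaluate_forecasts_vectorized(actual, predicted)
print(score, scores)
```

Note that the overall score pools the squared errors across all days before taking the square root, so it is generally not the same as the average of the per-day RMSE scores.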

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way for using the chosen framing of the model, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is in 2010 and the first Sunday for 2010 was January 3rd. The data ends in mid November 2010 and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

2010-01-03,2083.4539999999984,191.61000000000055,350992.12000000034,8703.600000000033,3842.0,4920.0,10074.0,15888.233355799992
...
2010-11-20,2197.006000000004,153.76800000000028,346475.9999999998,9320.20000000002,4367.0,2947.0,11433.0,17869.76663959999

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

2006-12-17,3390.46,226.0059999999994,345725.32000000024,14398.59999999998,2033.0,4187.0,13341.0,36946.66673200004
...
2010-01-02,1309.2679999999998,199.54600000000016,352332.8399999997,5489.7999999999865,801.0,298.0,6425.0,14297.133406600002

The function *split_dataset()* below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets are used to split the data using knowledge of the dataset. The split datasets are then organized into weekly data using the NumPy split() function.

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test
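The row offsets can be sanity-checked without the real file. The sketch below applies the same slicing to a placeholder array of zeros. It assumes the daily dataset has 1,442 rows, a figure inferred from the offsets and week counts above (1 skipped row + 159 weeks + 46 weeks + 6 dropped days) rather than read from the file.

```python
from numpy import array, split, zeros

# placeholder for the daily dataset: 1,442 rows (days) by 8 columns
# (the row count is inferred from the offsets and week counts in the text)
data = zeros((1442, 8))

# skip the first row, hold back the last 328 rows, and drop the final 6 days
train, test = data[1:-328], data[-328:-6]
print(len(train), len(train) // 7)
print(len(test), len(test) // 7)

# regroup the rows into standard weeks of 7 days each
train_weeks = array(split(train, len(train) // 7))
test_weeks = array(split(test, len(test) // 7))
print(train_weeks.shape, test_weeks.shape)
```

Under that assumption, the slices contain 1,113 and 322 days, which divide evenly into 159 and 46 standard weeks.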

We can test this function out by loading the daily dataset and printing the first and last rows of data from both the train and test sets to confirm they match the expectations above.

The complete code example is listed below.

# split into standard weeks
from numpy import split
from numpy import array
from pandas import read_csv

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
train, test = split_dataset(dataset.values)
# validate train data
print(train.shape)
print(train[0, 0, 0], train[-1, -1, 0])
# validate test
print(test.shape)
print(test[0, 0, 0], test[-1, -1, 0])

Running the example shows that indeed the train dataset has 159 weeks of data, whereas the test dataset has 46 weeks.

We can see that the total active power for the train and test dataset for the first and last rows match the data for the specific dates that we defined as the bounds on the standard weeks for each set.

(159, 7, 8)
3390.46 1309.2679999999998
(46, 7, 8)
2083.4539999999984 2197.006000000004

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one-week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models, allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

Input,                      Predict
[Week1]                     Week2
[Week1 + Week2]             Week3
[Week1 + Week2 + Week3]     Week4
...

The walk-forward validation approach to evaluating predictive models on this dataset is provided below named *evaluate_model()*.

The train and test datasets in standard-week format are provided to the function as arguments. An additional argument n_input is provided that is used to define the number of prior observations that the model will use as input in order to make a prediction.

Two new functions are called: one to build a model from the training data called *build_model()* and another that uses the model to make forecasts for each new standard week called *forecast()*. These will be covered in subsequent sections.

We are working with neural networks, and as such, they are generally slow to train but fast to evaluate. This means that the preferred usage of the models is to build them once on historical data and to use them to forecast each step of the walk-forward validation. The models are static (i.e. not updated) during their evaluation.

This is different to other models that are faster to train where a model may be re-fit or updated each step of the walk-forward validation as new data is made available. With sufficient resources, it is possible to use neural networks this way, but we will not in this tutorial.

The complete *evaluate_model()* function is listed below.

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores
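To make the walk-forward loop concrete before any neural network is involved, the sketch below runs the same loop with two placeholder functions. Both *naive_build_model()* and *naive_forecast()* are hypothetical stand-ins invented for this illustration: there is nothing to fit, and each forecast simply repeats the last observed week. The toy data is also made up.

```python
from numpy import array, arange

# hypothetical stand-in for build_model(): a persistence forecast needs no fitting
def naive_build_model(train, n_input):
    return None

# hypothetical stand-in for forecast(): repeat the last observed week
def naive_forecast(model, history, n_input):
    return history[-1][:, 0]

# toy data: 5 weeks x 7 days x 1 variable, values 0..34
weeks = arange(35, dtype=float).reshape((5, 7, 1))
train, test = weeks[:3], weeks[3:]

model = naive_build_model(train, 7)
history = [x for x in train]
predictions = list()
for i in range(len(test)):
    # forecast the coming week using only data seen so far
    predictions.append(naive_forecast(model, history, 7))
    # then reveal the real week so it can inform the next forecast
    history.append(test[i, :])
predictions = array(predictions)
print(predictions.shape)
print(predictions[0], test[0, :, 0])
```

Because the toy series rises by one each day, every persistence forecast is exactly seven units below the true value, which shows that each forecast was made from the week before it and never from the week being predicted.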

Once we have the evaluation for a model, we can summarize the performance.

The function below named *summarize_scores()* will display the performance of a model as a single line for easy comparison with other models.

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

We now have all of the elements to begin evaluating predictive models on the dataset.

Recurrent neural networks, or RNNs, are specifically designed to work, learn, and predict sequence data.

A recurrent neural network is a neural network where the output of the network from one time step is provided as an input in the subsequent time step. This allows the model to make a decision as to what to predict based on both the input for the current time step and direct knowledge of what was output in the prior time step.

Perhaps the most successful and widely used RNN is the long short-term memory network, or LSTM for short. It is successful because it overcomes the challenges involved in training a recurrent neural network, resulting in stable models. In addition to harnessing the recurrent connection of the outputs from the prior time step, LSTMs also have an internal memory that operates like a local variable, allowing them to accumulate state over the input sequence.

LSTMs offer a number of benefits when it comes to multi-step time series forecasting; they are:

- **Native Support for Sequences**. LSTMs are a type of recurrent network, and as such are designed to take sequence data as input, unlike other models where lag observations must be presented as input features.
- **Multivariate Inputs**. LSTMs directly support multiple parallel input sequences for multivariate inputs, unlike other models where multivariate inputs are presented in a flat structure.
- **Vector Output**. Like other neural networks, LSTMs are able to map input data directly to an output vector that may represent multiple output time steps.

Further, specialized architectures have been developed that are specifically designed to make multi-step sequence predictions, generally referred to as sequence-to-sequence prediction, or seq2seq for short. This is useful as multi-step time series forecasting is a type of seq2seq prediction.

An example of a recurrent neural network architecture designed for seq2seq problems is the encoder-decoder LSTM.

An encoder-decoder LSTM is a model comprised of two sub-models: one called the encoder that reads the input sequence and compresses it to a fixed-length internal representation, and another called the decoder that interprets the internal representation and uses it to predict the output sequence.

The encoder-decoder approach to sequence prediction has proven much more effective than outputting a vector directly and is the preferred approach.

Generally, LSTMs have been found to not be very effective at auto-regression type problems. These are problems where forecasting the next time step is a function of recent time steps.

One-dimensional convolutional neural networks, or CNNs, have proven effective at automatically learning features from input sequences.

A popular approach has been to combine CNNs with LSTMs, where the CNN acts as an encoder to learn features from sub-sequences of input data, which are provided as time steps to an LSTM. This architecture is called a CNN-LSTM.

A powerful variation on the CNN-LSTM architecture is the ConvLSTM that uses the convolutional reading of input subsequences directly within an LSTM’s units. This approach has proven very effective for time series classification and can be adapted for use in multi-step time series forecasting.

In this tutorial, we will explore a suite of LSTM architectures for multi-step time series forecasting. Specifically, we will look at how to develop the following models:

- **LSTM** model with vector output for multi-step forecasting with univariate input data.
- **Encoder-Decoder LSTM** model for multi-step forecasting with univariate input data.
- **Encoder-Decoder LSTM** model for multi-step forecasting with multivariate input data.
- **CNN-LSTM Encoder-Decoder** model for multi-step forecasting with univariate input data.
- **ConvLSTM Encoder-Decoder** model for multi-step forecasting with univariate input data.

The models will be developed and demonstrated on the household power prediction problem. A model is considered skillful if it achieves performance better than a naive model, which is an overall RMSE of about 465 kilowatts across a seven day forecast.

We will not focus on the tuning of these models to achieve optimal performance; instead, we will stop short at skillful models as compared to a naive forecast. The structures and hyperparameters were chosen with a little trial and error. The scores should be taken as just an example rather than a study of the optimal model or configuration for the problem.

Given the stochastic nature of the models, it is good practice to evaluate a given model multiple times and report the mean performance on a test dataset. In the interest of brevity and keeping the code simple, we will instead present single-runs of models in this tutorial.

We cannot know which approach will be the most effective for a given multi-step forecasting problem. It is a good idea to explore a suite of methods in order to discover what works best on your specific dataset.

We will start off by developing a simple or vanilla LSTM model that reads in a sequence of days of total daily power consumption and predicts a vector output of the next standard week of daily power consumption.

This will provide the foundation for the more elaborate models developed in subsequent sections.

The number of prior days used as input defines the one-dimensional (1D) subsequence of data that the LSTM will read and learn to extract features. Some ideas on the size and nature of this input include:

- All prior days, up to years worth of data.
- The prior seven days.
- The prior two weeks.
- The prior one month.
- The prior one year.
- The prior week and the week to be predicted from one year ago.

There is no right answer; instead, each approach and more can be tested and the performance of the model can be used to choose the nature of the input that results in the best model performance.

These choices define a few things:

- How the training data must be prepared in order to fit the model.
- How the test data must be prepared in order to evaluate the model.
- How to use the model to make predictions with a final model in the future.

A good starting point would be to use the prior seven days.

An LSTM model expects data to have the shape:

[samples, timesteps, features]

One sample will be comprised of seven time steps with one feature for the seven days of total daily power consumed.

The training dataset has 159 weeks of data, so the shape of the training dataset would be:

[159, 7, 1]

This is a good start. The data in this format would use the prior standard week to predict the next standard week. A problem is that 159 instances is not a lot to train a neural network.

A way to create a lot more training data is to change the problem during training to predict the next seven days given the prior seven days, regardless of the standard week.

This only impacts the training data, and the test problem remains the same: predict the daily power consumption for the next standard week given the prior standard week.

This will require a little preparation of the training data.

The training data is provided in standard weeks with eight variables, specifically in the shape [*159, 7, 8*]. The first step is to flatten the data so that we have eight time series sequences.

# flatten data
data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))

We then need to iterate over the time steps and divide the data into overlapping windows; each iteration moves along one time step and predicts the subsequent seven days.

For example:

Input, Output
[d01, d02, d03, d04, d05, d06, d07], [d08, d09, d10, d11, d12, d13, d14]
[d02, d03, d04, d05, d06, d07, d08], [d09, d10, d11, d12, d13, d14, d15]
...

We can do this by keeping track of start and end indexes for the inputs and outputs as we iterate across the length of the flattened data in terms of time steps.

We can also do this in a way where the number of inputs and outputs are parameterized (e.g. *n_input*, *n_out*) so that you can experiment with different values or adapt it for your own problem.

Below is a function named *to_supervised()* that takes a list of weeks (history) and the number of time steps to use as inputs and outputs and returns the data in the overlapping moving window format.

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

When we run this function on the entire training dataset, we transform 159 samples into 1,099; specifically, the transformed dataset has the shapes *X=[1099, 7, 1]* and *y=[1099, 7].*
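These shapes can be checked without the real data. The sketch below repeats the same windowing logic in a compact form (so that it runs on its own) and applies it to a placeholder array of zeros with the training shape of 159 weeks, seven days, and eight variables.

```python
from numpy import array, zeros

# same windowing logic as to_supervised(), repeated so the sketch is self-contained
def to_supervised(train, n_input, n_out=7):
    data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    for _ in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        # only keep windows that fit within the data
        if out_end < len(data):
            X.append(data[in_start:in_end, 0].reshape((n_input, 1)))
            y.append(data[in_end:out_end, 0])
        in_start += 1
    return array(X), array(y)

# placeholder training data: 159 weeks x 7 days x 8 variables
train = zeros((159, 7, 8))
X, y = to_supervised(train, n_input=7)
print(X.shape, y.shape)
X14, y14 = to_supervised(train, n_input=14)
print(X14.shape, y14.shape)
```

With 1,113 flattened days, the function yields 1,113 - 7 - 7 = 1,099 windows for seven-day inputs, and 1,113 - 14 - 7 = 1,092 windows if the input is widened to 14 days.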

Next, we can define and fit the LSTM model on the training data.

This multi-step time series forecasting problem is an autoregression. That means it is likely best modeled as a problem where the next seven days are some function of observations at prior time steps. This, together with the relatively small amount of data, means that a small model is required.

We will develop a model with a single hidden LSTM layer with 200 units. The number of units in the hidden layer is unrelated to the number of time steps in the input sequences. The LSTM layer is followed by a fully connected layer with 100 nodes that will interpret the features learned by the LSTM layer. Finally, an output layer will directly predict a vector with seven elements, one for each day in the output sequence.

We will use the mean squared error loss function as it is a good match for our chosen error metric of RMSE. We will use the efficient Adam implementation of stochastic gradient descent and fit the model for 70 epochs with a batch size of 16.

The small batch size and the stochastic nature of the algorithm mean that the same model will learn a slightly different mapping of inputs to outputs each time it is trained. This means results may vary when the model is evaluated. You can try running the model multiple times and calculate an average of model performance.

The *build_model()* function below prepares the training data, defines the model, and fits the model on the training data, returning the fit model ready for making predictions.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 70, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model
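To get a feel for the size of this network, we can count its trainable weights by hand. The sketch below is a back-of-the-envelope calculation that uses the layer sizes from the *build_model()* code (200 LSTM units, 100 dense nodes, seven outputs, one input feature) and assumes the standard LSTM formulation with four gates and a bias for each gate.

```python
# layer sizes as used in build_model() with univariate input
n_features, n_units, n_dense, n_outputs = 1, 200, 100, 7

# an LSTM layer has four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (n_units * (n_features + n_units) + n_units)
# a fully connected layer has one weight per input-output pair plus one bias per output
dense_params = n_units * n_dense + n_dense
output_params = n_dense * n_outputs + n_outputs

total = lstm_params + dense_params + output_params
print(lstm_params, dense_params, output_params, total)
```

Under these assumptions, the model has roughly 182 thousand weights, most of them in the LSTM layer. The count does not depend on the number of input time steps, only on the number of features and units.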

Now that we know how to fit the model, we can look at how the model can be used to make a prediction.

Generally, the model expects data to have the same three dimensional shape when making a prediction.

In this case, the expected shape of an input pattern is one sample, seven days of one feature for the daily power consumed:

[1, 7, 1]

Data must have this shape when making predictions for the test set and when a final model is being used to make predictions in the future. If you change the number of input days to 14, then the shape of the training data and the shape of new samples when making predictions must be changed accordingly to have 14 time steps. It is a modeling choice that you must carry forward when using the model.

We are using walk-forward validation to evaluate the model as described in the previous section.

This means that we have the observations available for the prior week in order to predict the coming week. These are collected into an array of standard weeks called history.

In order to predict the next standard week, we need to retrieve the last days of observations. As with the training data, we must first flatten the history data to remove the weekly structure so that we end up with eight parallel time series.

# flatten data
data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))

Next, we need to retrieve the last seven days of daily total power consumed (feature index 0).

We will parameterize this as we did for the training data so that the number of prior days used as input by the model can be modified in the future.

# retrieve last observations for input data
input_x = data[-n_input:, 0]

Next, we reshape the input into the expected three-dimensional structure.

# reshape into [1, n_input, 1]
input_x = input_x.reshape((1, len(input_x), 1))

We then make a prediction using the fit model and the input data and retrieve the vector of seven days of output.

# forecast the next week
yhat = model.predict(input_x, verbose=0)
# we only want the vector forecast
yhat = yhat[0]

The *forecast()* function below implements this and takes as arguments the model fit on the training dataset, the history of data observed so far, and the number of input time steps expected by the model.

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat
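The reshaping steps can be traced on toy data without training anything. In the sketch below, *EchoModel* is a hypothetical stand-in for a fitted Keras model, invented for this illustration: its *predict()* method simply returns the input window, which makes it easy to see exactly which values the real model would receive.

```python
from numpy import arange, array

# hypothetical stand-in for a fitted Keras model: echoes the input window back
class EchoModel:
    def predict(self, input_x, verbose=0):
        # input has shape [1, n_input, 1]; return shape [1, n_input]
        return input_x.reshape((input_x.shape[0], input_x.shape[1]))

# toy history: a list of 3 weeks, each 7 days x 2 variables
history = [week for week in arange(42, dtype=float).reshape((3, 7, 2))]
n_input = 7

# flatten the weeks into one long [days, variables] array
data = array(history)
data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
# keep the last n_input days of the first variable, shaped [1, n_input, 1]
input_x = data[-n_input:, 0].reshape((1, n_input, 1))
yhat = EchoModel().predict(input_x, verbose=0)[0]
print(input_x.shape, yhat)
```

The echoed values are the first variable for the final seven days of the history, confirming that only the most recent week of total power reaches the model when *n_input* is seven.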

That’s it; we now have everything we need to make multi-step time series forecasts with an LSTM model on the daily total power consumed univariate dataset.

We can tie all of this together. The complete example is listed below.

# univariate multi-step lstm
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 70, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 7
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('lstm', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='lstm')
pyplot.show()

Running the example fits and evaluates the model, printing the overall RMSE across all seven days, and the per-day RMSE for each lead time.
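To make the two kinds of score concrete, here is a minimal pure-Python sketch of the same calculation on a tiny made-up example. The helper name `rmse_summary` and the toy values are assumptions for illustration only; they are not part of the tutorial code or its results.

```python
from math import sqrt

def rmse_summary(actual, predicted):
    """Return (overall_rmse, per_day_rmse) for lists of equally sized forecasts."""
    n_weeks, n_days = len(actual), len(actual[0])
    per_day = []
    for day in range(n_days):
        # mean squared error for this lead time across all weeks
        mse = sum((actual[w][day] - predicted[w][day]) ** 2 for w in range(n_weeks)) / n_weeks
        per_day.append(sqrt(mse))
    # overall RMSE pools every squared error before taking the root
    total = sum((actual[w][d] - predicted[w][d]) ** 2
                for w in range(n_weeks) for d in range(n_days))
    overall = sqrt(total / (n_weeks * n_days))
    return overall, per_day

# two made-up "weeks" of three days each (toy values, not real data)
actual = [[10.0, 20.0, 30.0], [12.0, 18.0, 33.0]]
predicted = [[11.0, 20.0, 27.0], [11.0, 20.0, 30.0]]
overall, per_day = rmse_summary(actual, predicted)
print(overall, per_day)
```

Note that the overall score is not the mean of the per-day scores: it pools the squared errors first, so days with large errors weigh more heavily.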

Your specific results may vary given the stochastic nature of the algorithm. You may want to try running the example a few times.

We can see that in this case, the model was skillful compared to a naive forecast, achieving an overall RMSE of about 399 kilowatts, lower than the 465 kilowatts achieved by a naive model.

lstm: [399.456] 419.4, 422.1, 384.5, 395.1, 403.9, 317.7, 441.5

A plot of the daily RMSE is also created.

The plot shows that perhaps Tuesdays and Fridays are easier days to forecast than the other days and that perhaps Saturday at the end of the standard week is the hardest day to forecast.

We can increase the number of prior days to use as input from seven to 14 by changing the *n_input* variable.

# evaluate model and get scores
n_input = 14

Re-running the example with this change first prints a summary of performance of the model.

Your specific results may vary; try running the example a few times.

In this case, we can see a further drop in the overall RMSE to about 370 kilowatts, suggesting that further tuning of the input size and perhaps the number of nodes in the model may result in better performance.

lstm: [370.028] 387.4, 377.9, 334.0, 371.2, 367.1, 330.4, 415.1

Comparing the per-day RMSE scores, we see that most days are better than with seven-day inputs, although Friday is slightly worse.

This may suggest benefit in using the two different sized inputs in some way, such as an ensemble of the two approaches or perhaps a single model (e.g. a multi-headed model) that reads the training data in different ways.
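As a rough illustration of the ensemble idea, the sketch below averages two weekly forecasts element by element. The function name `average_forecasts` and the numbers are hypothetical; in practice the two inputs would be the outputs of the seven-day and 14-day models for the same week.

```python
def average_forecasts(forecast_a, forecast_b):
    """Combine two forecasts of equal length by simple averaging."""
    if len(forecast_a) != len(forecast_b):
        raise ValueError('forecasts must cover the same number of days')
    return [(a + b) / 2.0 for a, b in zip(forecast_a, forecast_b)]

# made-up three-day forecasts from two hypothetical models
forecast_7day_input = [400.0, 420.0, 380.0]
forecast_14day_input = [380.0, 400.0, 360.0]
ensemble = average_forecasts(forecast_7day_input, forecast_14day_input)
print(ensemble)
```

Simple averaging is only one option; whether it actually helps on this dataset would need to be checked with the same walk-forward evaluation.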

In this section, we can update the vanilla LSTM to use an encoder-decoder model.

This means that the model will not output a vector sequence directly. Instead, the model will be comprised of two sub models, the encoder to read and encode the input sequence, and the decoder that will read the encoded input sequence and make a one-step prediction for each element in the output sequence.

The difference is subtle, as in practice both approaches do in fact predict a sequence output.

The important difference is that an LSTM model is used in the decoder, allowing it to both know what was predicted for the prior day in the sequence and accumulate internal state while outputting the sequence.

Let’s take a closer look at how this model is defined.

As before, we define an LSTM hidden layer with 200 units. This is the encoder model that will read the input sequence and will output a 200 element vector (one output per unit) that captures features from the input sequence. We will use 14 days of total power consumption as input.

# define model
model = Sequential()
model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))

We will use a simple encoder-decoder architecture that is easy to implement in Keras, that has a lot of similarity to the architecture of an LSTM autoencoder.

First, the internal representation of the input sequence is repeated multiple times, once for each time step in the output sequence. This sequence of vectors will be presented to the LSTM decoder.

model.add(RepeatVector(7))
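The effect of this step can be sketched without Keras. The hypothetical helper `repeat_vector()` below copies one encoded vector once per output time step, which is the shape change described above: a single vector of length k becomes a sequence of n vectors of length k. A three-element vector stands in for the 200-element encoding.

```python
def repeat_vector(encoded, n_steps):
    """Return n_steps copies of the encoded vector, one per output time step."""
    return [list(encoded) for _ in range(n_steps)]

# a short stand-in for the 200-element encoder output (toy values)
encoded = [0.1, 0.5, 0.9]
decoder_input = repeat_vector(encoded, 7)
print(len(decoder_input), len(decoder_input[0]))
```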

We then define the decoder as an LSTM hidden layer with 200 units. Importantly, the decoder will output the entire sequence, not just the output at the end of the sequence as we did with the encoder. This means that each of the 200 units will output a value for each of the seven days, representing the basis for what to predict for each day in the output sequence.

model.add(LSTM(200, activation='relu', return_sequences=True))

We will then use a fully connected layer to interpret each time step in the output sequence before the final output layer. Importantly, the output layer predicts a single step in the output sequence, not all seven days at a time.

This means that the same fully connected layer and output layer will be used to process each time step provided by the decoder. To achieve this, we will wrap the interpretation layer and the output layer in a TimeDistributed wrapper that allows the wrapped layers to be used for each time step from the decoder.

model.add(TimeDistributed(Dense(100, activation='relu')))
model.add(TimeDistributed(Dense(1)))

This allows the LSTM decoder to figure out the context required for each step in the output sequence and the wrapped dense layers to interpret each time step separately, yet reusing the same weights to perform the interpretation. An alternative would be to flatten all of the structure created by the LSTM decoder and to output the vector directly. You can try this as an extension to see how it compares.
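The weight sharing can be illustrated with a small pure-Python sketch. The helpers `dense()` and `time_distributed()` are hypothetical stand-ins, not the Keras implementation: a single set of weights and one bias are applied independently to every time step of the decoder output.

```python
def dense(vector, weights, bias):
    """A single linear unit: dot product plus bias."""
    return sum(v * w for v, w in zip(vector, weights)) + bias

def time_distributed(sequence, weights, bias):
    """Apply the same dense unit, with the same weights, to each time step."""
    return [dense(step, weights, bias) for step in sequence]

# three time steps of a two-element decoder output (toy values)
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs = time_distributed(sequence, weights=[0.5, -1.0], bias=1.0)
print(outputs)
```

There is one output per time step, and changing the weights changes every step's output at once, which is what "reusing the same weights" means here.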

The network therefore outputs a three-dimensional vector with the same structure as the input, with the dimensions [*samples, timesteps, features*].

There is a single feature, the daily total power consumed, and there are always seven time steps. A single one-week prediction will therefore have the size: [*1, 7, 1*].

Therefore, when training the model, we must restructure the output data (*y*) to have the three-dimensional structure instead of the two-dimensional structure of [*samples, features*] used in the previous section.

# reshape output into [samples, timesteps, features]
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
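The following sketch shows the same restructuring on nested Python lists, so that the change in shape is visible without NumPy. The helper names `add_feature_axis()` and `shape()` are assumptions made for this illustration.

```python
def add_feature_axis(y):
    """Turn [samples, timesteps] into [samples, timesteps, 1]."""
    return [[[value] for value in sample] for sample in y]

def shape(nested):
    """Report the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

# two made-up output samples, each covering seven days
y = [[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8]]
y3d = add_feature_axis(y)
print(shape(y), shape(y3d))
```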

We can tie all of this together into the updated *build_model()* function listed below.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

The complete example with the encoder-decoder model is listed below.

# univariate multi-step encoder-decoder lstm
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import RepeatVector
from keras.layers import TimeDistributed

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 14
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('lstm', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='lstm')
pyplot.show()

Running the example fits the model and summarizes the performance on the test dataset.

Your specific results may vary given the stochastic nature of the algorithm. You may want to try running the example a few times.

We can see that in this case, the model is skillful, achieving an overall RMSE score of about 372 kilowatts.

lstm: [372.595] 379.5, 399.8, 339.6, 372.2, 370.9, 309.9, 424.8

A line plot of the per-day RMSE is also created showing a similar pattern in error as was seen in the previous section.

In this section, we will update the Encoder-Decoder LSTM developed in the previous section to use each of the eight time series variables to predict the next standard week of daily total power consumption.

We will do this by providing each one-dimensional time series to the model as a separate sequence of input.

The LSTM will in turn create an internal representation of each input sequence that will together be interpreted by the decoder.

Using multivariate inputs is helpful for those problems where the output sequence is some function of the observations at prior time steps from multiple different features, not just (or including) the feature being forecasted. It is unclear whether this is the case in the power consumption problem, but we can explore it nonetheless.

First, we must update the preparation of the training data to include all of the eight features, not just the one total daily power consumed. It requires a single line change:

X.append(data[in_start:in_end, :])

The complete *to_supervised()* function with this change is listed below.

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)
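To see what this windowing produces, the sketch below mirrors the same loop on plain Python lists with made-up data. It is an illustration only: the list-based `to_supervised()` here skips the flattening step and assumes `data` is already a list of per-day rows, with the first column being the value to forecast.

```python
def to_supervised(data, n_input, n_out=7):
    """Slide a window over daily rows: all features in, first column out."""
    X, y = [], []
    in_start = 0
    for _ in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        # same boundary check as the tutorial function
        if out_end < len(data):
            X.append([list(row) for row in data[in_start:in_end]])
            y.append([row[0] for row in data[in_end:out_end]])
        in_start += 1
    return X, y

# 30 made-up days with two features each (toy values)
data = [[float(day), float(day) * 10.0] for day in range(30)]
X, y = to_supervised(data, n_input=14)
print(len(X), len(X[0]), len(X[0][0]), len(y[0]))
```

Each input sample keeps every feature for 14 days, while each output sample holds only the first column for the following seven days.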

We also must update the function used to make forecasts with the fit model to use all eight features from the prior time steps.

Again, another small change:

# retrieve last observations for input data
input_x = data[-n_input:, :]
# reshape into [1, n_input, n]
input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))

The complete *forecast()* function with this change is listed below:

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n]
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

The same model architecture and configuration is used directly, although we will increase the number of training epochs from 20 to 50 given the 8-fold increase in the amount of input data.

The complete example is listed below.

# multivariate multi-step encoder-decoder lstm
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import RepeatVector
from keras.layers import TimeDistributed

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 50, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n]
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 14
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('lstm', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='lstm')
pyplot.show()

Running the example fits the model and summarizes the performance on the test dataset.

Experimentation found that this model appears less stable than the univariate case, which may be related to the differing scales of the eight input variables.
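If differing scales are indeed the cause, one common remedy is to rescale each input variable to a similar range before training. The tutorial does not do this; the sketch below is only an assumption about how it might be tried, using hypothetical helpers `fit_min_max()` and `scale_rows()` on plain lists. The minimum and maximum are taken from the training rows only, so that no information from the test set leaks into the scaling.

```python
def fit_min_max(rows):
    """Find the per-column minimum and maximum of the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def scale_rows(rows, mins, maxs):
    """Rescale each column to [0, 1] using previously fitted bounds."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# made-up training rows with two variables on very different scales
train_rows = [[100.0, 1.0], [300.0, 3.0], [200.0, 2.0]]
mins, maxs = fit_min_max(train_rows)
scaled = scale_rows(train_rows, mins, maxs)
print(scaled)
```

If the forecast target is scaled as well, predictions would need to be mapped back to the original units before computing RMSE in kilowatts.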

Your specific results may vary given the stochastic nature of the algorithm. You may want to try running the example a few times.

We can see that in this case, the model is skillful, achieving an overall RMSE score of about 376 kilowatts.

lstm: [376.273] 378.5, 381.5, 328.4, 388.3, 361.2, 308.0, 467.2

A line plot of the per-day RMSE is also created.

A convolutional neural network, or CNN, can be used as the encoder in an encoder-decoder architecture.

The CNN does not directly support sequence input; instead, a 1D CNN is capable of reading across sequence input and automatically learning the salient features. These can then be interpreted by an LSTM decoder as per normal. We refer to hybrid models that use a CNN and LSTM as CNN-LSTM models, and in this case we are using them together in an encoder-decoder architecture.

The CNN expects the input data to have the same 3D structure as the LSTM model, although multiple features are read as different channels that ultimately have the same effect.

We will simplify the example and focus on the CNN-LSTM with univariate input, but it can just as easily be updated to use multivariate input, which is left as an exercise.

As before, we will use input sequences comprised of 14 days of daily total power consumption.

We will define a simple but effective CNN architecture for the encoder that is comprised of two convolutional layers followed by a max pooling layer, the results of which are then flattened.

The first convolutional layer reads across the input sequence and projects the results onto feature maps. The second performs the same operation on the feature maps created by the first layer, attempting to amplify any salient features. We will use 64 feature maps per convolutional layer and read the input sequences with a kernel size of three time steps.

The max pooling layer simplifies the feature maps by keeping only the largest (max) value in each pool of two values, halving their length. The distilled feature maps after the pooling layer are then flattened into one long vector that can then be used as input to the decoding process.

model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
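It can help to trace how the sequence length changes through these layers. The arithmetic below is a sketch that assumes the Keras defaults for the layers as configured above (no padding, a convolution stride of one, and a pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv1d_length(n_in, kernel_size):
    """Output length of a 1D convolution with no padding and stride 1."""
    return n_in - kernel_size + 1

def maxpool1d_length(n_in, pool_size):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return n_in // pool_size

n_filters = 64
after_conv1 = conv1d_length(14, 3)            # 14 days in
after_conv2 = conv1d_length(after_conv1, 3)
after_pool = maxpool1d_length(after_conv2, 2)
flattened = after_pool * n_filters
print(after_conv1, after_conv2, after_pool, flattened)
```

Under these assumptions, the 14-day input shrinks to 12 and then 10 time steps, pooling leaves 5, and flattening 5 steps of 64 feature maps gives a vector of 320 values for the decoder.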

The decoder is the same as was defined in previous sections.

The only other change is to set the number of training epochs to 20.

The *build_model()* function with these changes is listed below.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

We are now ready to try the encoder-decoder architecture with a CNN encoder.

The complete code listing is provided below.

# univariate multi-step encoder-decoder cnn-lstm
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import LSTM
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 14
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('lstm', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='lstm')
pyplot.show()

Running the example fits the model and summarizes the performance on the test dataset.

A little experimentation showed that using two convolutional layers made the model more stable than using just a single layer.

We can see that in this case the model is skillful, achieving an overall RMSE score of about 372 kilowatts.

lstm: [372.055] 383.8, 381.6, 339.1, 371.8, 371.8, 319.6, 427.2

A line plot of the per-day RMSE is also created.

A further extension of the CNN-LSTM approach is to perform the convolutions of the CNN (e.g. how the CNN reads the input sequence data) as part of the LSTM for each time step.

This combination is called a Convolutional LSTM, or ConvLSTM for short, and like the CNN-LSTM is also used for spatio-temporal data.

Unlike an LSTM that reads the data in directly in order to calculate internal state and state transitions, and unlike the CNN-LSTM that is interpreting the output from CNN models, the ConvLSTM is using convolutions directly as part of reading input into the LSTM units themselves.

For more information for how the equations for the ConvLSTM are calculated within the LSTM unit, see the paper:

The Keras library provides the ConvLSTM2D class that supports the ConvLSTM model for 2D data. It can be configured for 1D multivariate time series forecasting.

The ConvLSTM2D class, by default, expects input data to have the shape:

[samples, timesteps, rows, cols, channels]

Where each time step of data is defined as an image of (*rows * columns*) data points.

We are working with a one-dimensional sequence of total power consumption, which we can interpret as one row with 14 columns, if we assume that we are using two weeks of data as input.

For the ConvLSTM, this would be a single read: that is, the LSTM would read one time step of 14 days and perform a convolution across those time steps.

This is not ideal.

Instead, we can split the 14 days into two subsequences with a length of seven days. The ConvLSTM can then read across the two time steps and perform the CNN process on the seven days of data within each.

For this chosen framing of the problem, the input for the ConvLSTM2D would therefore be:

[n, 2, 1, 7, 1]

Or:

- **Samples**: n, for the number of examples in the training dataset.
- **Time**: 2, for the two subsequences that we split a window of 14 days into.
- **Rows**: 1, for the one-dimensional shape of each subsequence.
- **Columns**: 7, for the seven days in each subsequence.
- **Channels**: 1, for the single feature that we are working with as input.

You can explore other configurations, such as providing 21 days of input split into three subsequences of seven days, and/or providing all eight features or channels as input.

We can now prepare the data for the ConvLSTM2D model.

First, we must reshape the training dataset into the expected structure of [*samples, timesteps, rows, cols, channels*].

# reshape into subsequences [samples, time steps, rows, cols, channels]
train_x = train_x.reshape((train_x.shape[0], n_steps, 1, n_length, n_features))
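To show where each day ends up after this reshape, here is a small list-based sketch for a single univariate sample. The helper `to_subsequences()` is hypothetical and only mimics the row-major ordering of the reshape: the first seven days form the first time step and the next seven days form the second.

```python
def to_subsequences(window, n_steps, n_length):
    """Split one univariate window into [time steps, rows, cols, channels]."""
    if len(window) != n_steps * n_length:
        raise ValueError('window length must equal n_steps * n_length')
    sample = []
    for step in range(n_steps):
        days = window[step * n_length:(step + 1) * n_length]
        # one row per time step, one channel per day
        sample.append([[[value] for value in days]])
    return sample

# a made-up 14-day window, numbered so positions are easy to follow
window = list(range(1, 15))
sample = to_subsequences(window, n_steps=2, n_length=7)
print(len(sample), len(sample[0]), len(sample[0][0]), len(sample[0][0][0]))
```

Stacking n such samples gives the five-dimensional input [n, 2, 1, 7, 1] described above.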

We can then define the encoder as a ConvLSTM hidden layer followed by a flatten layer ready for decoding.

model.add(ConvLSTM2D(filters=64, kernel_size=(1,3), activation='relu', input_shape=(n_steps, 1, n_length, n_features)))
model.add(Flatten())

We will also parameterize the number of subsequences (*n_steps*) and the length of each subsequence (*n_length*) and pass them as arguments.

The rest of the model and training is the same. The *build_model()* function with these changes is listed below.

# train the model
def build_model(train, n_steps, n_length, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape into subsequences [samples, time steps, rows, cols, channels]
    train_x = train_x.reshape((train_x.shape[0], n_steps, 1, n_length, n_features))
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(ConvLSTM2D(filters=64, kernel_size=(1,3), activation='relu', input_shape=(n_steps, 1, n_length, n_features)))
    model.add(Flatten())
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

This model expects five-dimensional data as input. Therefore, we must also update the preparation of a single sample in the *forecast()* function when making a prediction.

    # reshape into [samples, time steps, rows, cols, channels]
    input_x = input_x.reshape((1, n_steps, 1, n_length, 1))
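To make the five-dimensional layout concrete, here is a small NumPy sketch that is not part of the original listing. It reshapes 14 made-up daily values into two subsequences of seven days each. The variable names mirror those used in the tutorial, and the data values are placeholders chosen so the split is easy to see.

```python
from numpy import arange

# illustrative values only: 14 days of made-up observations numbered 0 to 13
n_steps, n_length = 2, 7
n_input = n_steps * n_length
input_x = arange(n_input, dtype='float32')

# reshape into [samples, time steps, rows, cols, channels]
input_x = input_x.reshape((1, n_steps, 1, n_length, 1))
print(input_x.shape)

# the first subsequence holds days 0-6, the second holds days 7-13
first_week = input_x[0, 0, 0, :, 0]
second_week = input_x[0, 1, 0, :, 0]
print(first_week)
print(second_week)
```

Each subsequence becomes one "time step" that the ConvLSTM reads as a 1 x 7 image with a single channel.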

The *forecast()* function with this change and with the parameterized subsequences is provided below.

    # make a forecast
    def forecast(model, history, n_steps, n_length, n_input):
        # flatten data
        data = array(history)
        data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
        # retrieve last observations for input data
        input_x = data[-n_input:, 0]
        # reshape into [samples, time steps, rows, cols, channels]
        input_x = input_x.reshape((1, n_steps, 1, n_length, 1))
        # forecast the next week
        yhat = model.predict(input_x, verbose=0)
        # we only want the vector forecast
        yhat = yhat[0]
        return yhat

We now have all of the elements for evaluating an encoder-decoder architecture for multi-step time series forecasting where a ConvLSTM is used as the encoder.

The complete code example is listed below.

    # univariate multi-step encoder-decoder convlstm
    from math import sqrt
    from numpy import split
    from numpy import array
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from matplotlib import pyplot
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import Flatten
    from keras.layers import LSTM
    from keras.layers import RepeatVector
    from keras.layers import TimeDistributed
    from keras.layers import ConvLSTM2D

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/7))
        test = array(split(test, len(test)/7))
        return train, test

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

    # convert history into inputs and outputs
    def to_supervised(train, n_input, n_out=7):
        # flatten data
        data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
        X, y = list(), list()
        in_start = 0
        # step over the entire history one time step at a time
        for _ in range(len(data)):
            # define the end of the input sequence
            in_end = in_start + n_input
            out_end = in_end + n_out
            # ensure we have enough data for this instance
            if out_end < len(data):
                x_input = data[in_start:in_end, 0]
                x_input = x_input.reshape((len(x_input), 1))
                X.append(x_input)
                y.append(data[in_end:out_end, 0])
            # move along one time step
            in_start += 1
        return array(X), array(y)

    # train the model
    def build_model(train, n_steps, n_length, n_input):
        # prepare data
        train_x, train_y = to_supervised(train, n_input)
        # define parameters
        verbose, epochs, batch_size = 0, 20, 16
        n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
        # reshape into subsequences [samples, time steps, rows, cols, channels]
        train_x = train_x.reshape((train_x.shape[0], n_steps, 1, n_length, n_features))
        # reshape output into [samples, timesteps, features]
        train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
        # define model
        model = Sequential()
        model.add(ConvLSTM2D(filters=64, kernel_size=(1,3), activation='relu', input_shape=(n_steps, 1, n_length, n_features)))
        model.add(Flatten())
        model.add(RepeatVector(n_outputs))
        model.add(LSTM(200, activation='relu', return_sequences=True))
        model.add(TimeDistributed(Dense(100, activation='relu')))
        model.add(TimeDistributed(Dense(1)))
        model.compile(loss='mse', optimizer='adam')
        # fit network
        model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
        return model

    # make a forecast
    def forecast(model, history, n_steps, n_length, n_input):
        # flatten data
        data = array(history)
        data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
        # retrieve last observations for input data
        input_x = data[-n_input:, 0]
        # reshape into [samples, time steps, rows, cols, channels]
        input_x = input_x.reshape((1, n_steps, 1, n_length, 1))
        # forecast the next week
        yhat = model.predict(input_x, verbose=0)
        # we only want the vector forecast
        yhat = yhat[0]
        return yhat

    # evaluate a single model
    def evaluate_model(train, test, n_steps, n_length, n_input):
        # fit model
        model = build_model(train, n_steps, n_length, n_input)
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = forecast(model, history, n_steps, n_length, n_input)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        # evaluate predictions days for each week
        predictions = array(predictions)
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    # split into train and test
    train, test = split_dataset(dataset.values)
    # define the number of subsequences and the length of subsequences
    n_steps, n_length = 2, 7
    # define the total days to use as input
    n_input = n_length * n_steps
    score, scores = evaluate_model(train, test, n_steps, n_length, n_input)
    # summarize scores
    summarize_scores('lstm', score, scores)
    # plot scores
    days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
    pyplot.plot(days, scores, marker='o', label='lstm')
    pyplot.show()

Running the example fits the model and summarizes the performance on the test dataset.

A little experimentation showed that using two convolutional layers made the model more stable than using just a single layer.

We can see that in this case the model is skillful, achieving an overall RMSE score of about 367 kilowatts.

lstm: [367.929] 416.3, 379.7, 334.7, 362.3, 374.7, 284.8, 406.7

A line plot of the per-day RMSE is also created.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Size of Input**. Explore using more or fewer days as input for the model, such as three days, 21 days, 30 days, and more.
- **Model Tuning**. Tune the structure and hyperparameters for a model and further lift model performance on average.
- **Data Scaling**. Explore whether data scaling, such as standardization and normalization, can be used to improve the performance of any of the LSTM models.
- **Learning Diagnostics**. Use diagnostics such as learning curves for the train and validation loss and mean squared error to help tune the structure and hyperparameters of an LSTM model.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- 4 Strategies for Multi-Step Time Series Forecasting
- Crash Course in Recurrent Neural Networks for Deep Learning
- A Gentle Introduction to Long Short-Term Memory Networks by the Experts
- On the Suitability of LSTMs for Time Series Forecasting
- CNN Long Short-Term Memory Networks
- How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

- pandas.read_csv API
- pandas.DataFrame.resample API
- Resample Offset Aliases
- sklearn.metrics.mean_squared_error API
- numpy.split API

- Individual household electric power consumption Data Set, UCI Machine Learning Repository.
- AC power, Wikipedia.
- Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, 2015.

In this tutorial, you discovered how to develop long short-term memory recurrent neural networks for multi-step time series forecasting of household power consumption.

Specifically, you learned:

- How to develop and evaluate univariate and multivariate Encoder-Decoder LSTMs for multi-step time series forecasting.
- How to develop and evaluate a CNN-LSTM Encoder-Decoder model for multi-step time series forecasting.
- How to develop and evaluate a ConvLSTM Encoder-Decoder model for multi-step time series forecasting.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop LSTM Models for Multi-Step Time Series Forecasting of Household Power Consumption appeared first on Machine Learning Mastery.

The post How to Develop Convolutional Neural Networks for Multi-Step Time Series Forecasting appeared first on Machine Learning Mastery.

This data represents a multivariate time series of power-related variables that in turn could be used to model and even forecast future electricity consumption.

Unlike other machine learning algorithms, convolutional neural networks are capable of automatically learning features from sequence data, support multivariate data, and can directly output a vector for multi-step forecasting. As such, one-dimensional CNNs have been demonstrated to perform well and even achieve state-of-the-art results on challenging sequence prediction problems.

In this tutorial, you will discover how to develop 1D convolutional neural networks for multi-step time series forecasting.

After completing this tutorial, you will know:

- How to develop a CNN model for multi-step time series forecasting with univariate data.
- How to develop a multichannel multi-step time series forecasting model for multivariate data.
- How to develop a multi-headed multi-step time series forecasting model for multivariate data.

Let’s get started.

This tutorial is divided into seven parts; they are:

- Problem Description
- Load and Prepare Dataset
- Model Evaluation
- CNNs for Multi-Step Forecasting
- Multi-step Time Series Forecasting With a Univariate CNN
- Multi-step Time Series Forecasting With a Multichannel CNN
- Multi-step Time Series Forecasting With a Multihead CNN

The ‘Household Power Consumption‘ dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

- **global_active_power**: The total active power consumed by the household (kilowatts).
- **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
- **voltage**: Average voltage (volts).
- **global_intensity**: Average current intensity (amps).
- **sub_metering_1**: Active energy for kitchen (watt-hours of active energy).
- **sub_metering_2**: Active energy for laundry (watt-hours of active energy).
- **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

A fourth sub-metering variable can be created by subtracting the sum of three defined sub-metering variables from the total active energy as follows:

sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
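As a worked illustration of this calculation, the sketch below applies the formula to a single made-up reading. It assumes, as the UCI description states, that *global_active_power* is a minute-averaged value in kilowatts; the function name and the input numbers are only for illustration.

```python
def sub_metering_remainder(global_active_power, sub_1, sub_2, sub_3):
    # global_active_power is a one-minute average in kilowatts:
    # * 1000 converts kilowatts to watts,
    # / 60 converts one minute of watts to watt-hours
    total_watt_hours = global_active_power * 1000 / 60
    # subtract the energy already accounted for by the three sub-meters
    return total_watt_hours - (sub_1 + sub_2 + sub_3)

# made-up reading: 3.0 kW average over one minute, with 1, 2 and 17 Wh sub-metered
remainder = sub_metering_remainder(3.0, 1.0, 2.0, 17.0)
print(remainder)
```

Here the household used 50 watt-hours in the minute, of which 20 were sub-metered, leaving 30 watt-hours for everything else.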

The dataset can be downloaded from the UCI Machine Learning repository as a single 20 megabyte .zip file:

Download the dataset and unzip it into your current working directory. You will now have the file “*household_power_consumption.txt*” that is about 127 megabytes in size and contains all of the observations.

We can use the *read_csv()* function to load the data and combine the first two columns into a single date-time column that we can use as an index.

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])

Next, we can mark all missing values indicated with a ‘*?*‘ character with a *NaN* value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which is less efficient.

    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')

We also need to fill in the missing values now that they have been marked.

A very simple approach would be to copy the observation from the same time the day before. We can implement this in a function named *fill_missing()* that will take the NumPy array of the data and copy values from exactly 24 hours ago.

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]
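The behavior of this fill strategy is easier to see on a tiny array. The sketch below uses the same logic, but with the look-back distance exposed as a hypothetical *step* argument (an addition for this illustration only) so that a "day" can be just two observations long.

```python
from numpy import array, isnan, nan

# same logic as fill_missing(), with the look-back distance as a parameter
# (the 'step' argument is an assumption added for this illustration)
def fill_missing_step(values, step):
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - step, col]

# toy series where one "day" is two observations long
values = array([[1.0], [2.0], [nan], [nan], [5.0], [nan]])
fill_missing_step(values, step=2)
print(values.ravel())
```

Because rows are processed in order, a gap can be filled from a value that was itself filled earlier. Note also that a missing value within the first *step* rows would be copied from the end of the array, because a negative index wraps around in NumPy.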

We can apply this function directly to the data within the DataFrame.

    # fill missing
    fill_missing(dataset.values)

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as ‘*household_power_consumption.csv*‘.

    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Tying all of this together, the complete example of loading, cleaning-up, and saving the dataset is listed below.

    # load and clean-up data
    from numpy import nan
    from numpy import isnan
    from pandas import read_csv

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')
    # fill missing
    fill_missing(dataset.values)
    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Running the example creates the new file ‘*household_power_consumption.csv*‘ that we can use as the starting point for our modeling project.

In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

- Problem Framing
- Evaluation Metric
- Train and Test Sets
- Walk-Forward Validation

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household in planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense, given that we are interested in total power per day.

We can achieve this easily using the resample() function on the pandas DataFrame. Calling this function with the argument ‘*D*‘ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.

The complete example is listed below.

    # resample minute data to total for each day
    from pandas import read_csv
    # load the new file
    dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    # resample data to daily
    daily_groups = dataset.resample('D')
    daily_data = daily_groups.sum()
    # summarize
    print(daily_data.shape)
    print(daily_data.head())
    # save
    daily_data.to_csv('household_power_consumption_days.csv')

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named ‘*household_power_consumption_days.csv*‘.

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.
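As a quick check of what the daily resampling does, the sketch below applies the same *resample('D')* and *sum()* calls to two days of made-up per-minute readings. The synthetic index and the constant values are assumptions for illustration and are not taken from the dataset.

```python
from numpy import ones
from pandas import DataFrame, date_range

# two days of made-up per-minute readings, every value equal to 1.0
index = date_range('2006-12-16', periods=2 * 24 * 60, freq='min')
minute_data = DataFrame({'global_active_power': ones(len(index))}, index=index)

# group the observations by calendar day and total each group
daily_data = minute_data.resample('D').sum()
print(daily_data)
```

Each day contains 1,440 minutes, so each daily total in this toy example is 1440.0.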

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

- To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
- To contrast models based on their skills at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts, and it would be useful to have an error metric in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Unlike MAE, RMSE squares each error before averaging, so it punishes large forecast errors more heavily.
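The difference between the two metrics can be seen with a small example. The sketch below computes both on two made-up sets of forecast errors that share the same total absolute error; the helper functions are written out by hand for illustration.

```python
from math import sqrt

def mae(errors):
    # mean of the absolute errors
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    # square root of the mean of the squared errors
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

# made-up errors: the same total error, spread evenly vs concentrated in one day
even_errors = [1.0, 1.0, 1.0, 1.0]
one_large_error = [0.0, 0.0, 0.0, 4.0]
print(mae(even_errors), rmse(even_errors))
print(mae(one_large_error), rmse(one_large_error))
```

MAE is 1.0 in both cases, whereas RMSE rises from 1.0 to 2.0 when the error is concentrated in a single large miss.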

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a short-cut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used would be the RMSE across all forecast days.

The function *evaluate_forecasts()* below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

The function returns the overall RMSE across all days first, followed by a list of RMSE scores, one for each day.
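The same two quantities can also be computed with array operations. The following is a vectorized sketch of the calculation on made-up data, intended as an equivalent illustration rather than a replacement for *evaluate_forecasts()*; it assumes both inputs are 2D arrays of shape [weeks, days].

```python
from numpy import array, sqrt

# made-up values: two weeks (rows) by seven days (columns)
actual = array([[10.0] * 7, [20.0] * 7])
predicted = array([[13.0] * 7, [16.0] * 7])

squared_errors = (actual - predicted) ** 2
# one RMSE per lead time: average over the weeks (axis 0)
scores = sqrt(squared_errors.mean(axis=0))
# a single RMSE over every forecast day
score = sqrt(squared_errors.mean())
print(score, scores)
```

With errors of 3 in the first week and 4 in the second, every per-day score and the overall score equal the square root of 12.5.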

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way for using the chosen framing of the model, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is in 2010 and the first Sunday for 2010 was January 3rd. The data ends in mid November 2010 and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

    2010-01-03,2083.4539999999984,191.61000000000055,350992.12000000034,8703.600000000033,3842.0,4920.0,10074.0,15888.233355799992
    ...
    2010-11-20,2197.006000000004,153.76800000000028,346475.9999999998,9320.20000000002,4367.0,2947.0,11433.0,17869.76663959999

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

    2006-12-17,3390.46,226.0059999999994,345725.32000000024,14398.59999999998,2033.0,4187.0,13341.0,36946.66673200004
    ...
    2010-01-02,1309.2679999999998,199.54600000000016,352332.8399999997,5489.7999999999865,801.0,298.0,6425.0,14297.133406600002
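The week boundaries and week counts stated above can be verified with the standard library. This short check uses only the four dates quoted in the text.

```python
from datetime import date

train_start, train_end = date(2006, 12, 17), date(2010, 1, 2)
test_start, test_end = date(2010, 1, 3), date(2010, 11, 20)

# weekday() returns 6 for Sunday and 5 for Saturday
print(train_start.weekday(), train_end.weekday())
print(test_start.weekday(), test_end.weekday())

# inclusive day counts divided into whole weeks
train_weeks = ((train_end - train_start).days + 1) / 7
test_weeks = ((test_end - test_start).days + 1) / 7
print(train_weeks, test_weeks)
```

Both periods start on a Sunday and end on a Saturday, giving 159 whole weeks for training and 46 for testing.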

The function *split_dataset()* below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets are used to split the data using knowledge of the dataset. The split datasets are then organized into weekly data using the NumPy split() function.

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/7))
        test = array(split(test, len(test)/7))
        return train, test

We can test this function out by loading the daily dataset and printing the first and last rows of data from both the train and test sets to confirm they match the expectations above.

The complete code example is listed below.

    # split into standard weeks
    from numpy import split
    from numpy import array
    from pandas import read_csv

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/7))
        test = array(split(test, len(test)/7))
        return train, test

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    train, test = split_dataset(dataset.values)
    # validate train data
    print(train.shape)
    print(train[0, 0, 0], train[-1, -1, 0])
    # validate test
    print(test.shape)
    print(test[0, 0, 0], test[-1, -1, 0])

Running the example shows that indeed the train dataset has 159 weeks of data, whereas the test dataset has 46 weeks.

We can see that the total active power for the train and test dataset for the first and last rows match the data for the specific dates that we defined as the bounds on the standard weeks for each set.

    (159, 7, 8)
    3390.46 1309.2679999999998
    (46, 7, 8)
    2083.4539999999984 2197.006000000004

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models, allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

    Input,                      Predict
    [Week1]                     Week2
    [Week1 + Week2]             Week3
    [Week1 + Week2 + Week3]     Week4
    ...
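The scheme above can be sketched in a few lines using a stand-in model. In the example below, *naive_forecast()* is a hypothetical placeholder that simply repeats the most recent week; it is not one of the models developed in this tutorial, and the weekly values are made up.

```python
from numpy import array

# hypothetical stand-in for a trained model: repeat the most recent week
def naive_forecast(history):
    return history[-1]

# made-up weekly data: each row is one standard week of seven daily totals
train = array([[1.0] * 7, [2.0] * 7])
test = array([[3.0] * 7, [4.0] * 7])

history = [week for week in train]
predictions = list()
for i in range(len(test)):
    # predict the coming week from everything observed so far
    predictions.append(naive_forecast(history))
    # the real week then becomes available for the next prediction
    history.append(test[i])
predictions = array(predictions)
print(predictions[:, 0])
```

The first test week is predicted from the last training week, and the second test week is predicted from the first test week once its real values have been added to the history.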

The walk-forward validation approach to evaluating predictive models on this dataset is provided below, named *evaluate_model()*.

The train and test datasets in standard-week format are provided to the function as arguments. An additional argument, *n_input*, is provided that is used to define the number of prior observations that the model will use as input in order to make a prediction.

Two new functions are called: one to build a model from the training data called *build_model()* and another that uses the model to make forecasts for each new standard week, called *forecast()*. These will be covered in subsequent sections.

We are working with neural networks and as such they are generally slow to train but fast to evaluate. This means that the preferred usage of the models is to build them once on historical data and to use them to forecast each step of the walk-forward validation. The models are static (i.e. not updated) during their evaluation.

This is different to other models that are faster to train, where a model may be re-fit or updated each step of the walk-forward validation as new data is made available. With sufficient resources, it is possible to use neural networks this way, but we will not in this tutorial.

The complete *evaluate_model()* function is listed below.

    # evaluate a single model
    def evaluate_model(train, test, n_input):
        # fit model
        model = build_model(train, n_input)
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = forecast(model, history, n_input)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        # evaluate predictions days for each week
        predictions = array(predictions)
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

Once we have the evaluation for a model, we can summarize the performance.

The function below, named *summarize_scores()*, will display the performance of a model as a single line for easy comparison with other models.

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

We now have all of the elements to begin evaluating predictive models on the dataset.

Convolutional Neural Network models, or CNNs for short, are a type of deep neural network that was developed for use with image data, such as handwriting recognition.

They have proven very effective on challenging computer vision problems when trained at scale for tasks such as identifying and localizing objects in images and automatically describing the content of images.

CNNs are comprised of two main types of elements: convolutional layers and pooling layers.

**Convolutional layers** read an input, such as a 2D image or a 1D signal, using a kernel that reads in small segments at a time and steps across the entire input field. Each read results in an interpretation of that segment, which is projected onto a filter map.

**Pooling layers** take the feature map projections and distill them to the most essential elements, such as using a signal averaging or signal maximizing process.

The convolution and pooling layers can be repeated at depth, providing multiple layers of abstraction of the input signals.
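To illustrate the two operations on a 1D signal, here is a minimal NumPy sketch. The helper functions are hypothetical and written by hand for this example. In a real CNN layer the kernel weights are learned during training, whereas here the kernel is fixed so the arithmetic can be followed.

```python
from numpy import array, dot

def conv1d_valid(signal, kernel):
    # slide the kernel across the input one step at a time (no padding)
    width = len(kernel)
    return array([dot(signal[i:i + width], kernel)
                  for i in range(len(signal) - width + 1)])

def max_pool1d(feature_map, pool_size):
    # keep the largest value in each non-overlapping window
    n_windows = len(feature_map) // pool_size
    return array([feature_map[i * pool_size:(i + 1) * pool_size].max()
                  for i in range(n_windows)])

# made-up signal and a hand-picked kernel that responds to the local rise
signal = array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0])
kernel = array([-1.0, 0.0, 1.0])
feature_map = conv1d_valid(signal, kernel)
pooled = max_pool1d(feature_map, pool_size=2)
print(feature_map)
print(pooled)
```

The seven inputs become a feature map of five values, which pooling then distills to two values; the leftover fifth value is dropped because it does not fill a complete window.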

The output of these networks is often one or more fully-connected layers that interpret what has been read and map this internal representation to a class value.

For more information on convolutional neural networks, you can see the post:

Convolutional neural networks can be used for multi-step time series forecasting.

- The convolutional layers can read sequences of input data and automatically extract features.
- The pooling layers can distill the extracted features and focus attention on the most salient elements.
- The fully connected layers can interpret the internal representation and output a vector representing multiple time steps.

The key benefits of the approach are the automatic feature learning and the ability of the model to output a multi-step vector directly.

CNNs can be used in either a recursive or a direct forecast strategy. In the recursive strategy, the model makes one-step predictions and its outputs are fed back as inputs for subsequent predictions; in the direct strategy, one model is developed for each time step to be predicted. Alternately, CNNs can be used to predict the entire output sequence as a one-step prediction of the entire vector. This is a general benefit of feed-forward neural networks.
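For contrast with the vector-output approach used in this tutorial, the recursive strategy can be sketched as follows. The *one_step_model()* function is a hypothetical placeholder (a mean of the last three values) standing in for any trained one-step model, and the input values are made up.

```python
# hypothetical one-step model: predict the mean of the last three values
def one_step_model(window):
    return sum(window[-3:]) / 3.0

def recursive_forecast(history, n_steps):
    window = list(history)
    forecasts = list()
    for _ in range(n_steps):
        yhat = one_step_model(window)
        forecasts.append(yhat)
        # feed the prediction back in as if it were an observation
        window.append(yhat)
    return forecasts

forecasts = recursive_forecast([3.0, 6.0, 9.0], n_steps=2)
print(forecasts)
```

The second forecast depends on the first, so any error in an early step carries forward into later steps. A model that outputs the whole vector at once avoids this feedback loop.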

An important secondary benefit of using CNNs is that they can support multiple 1D inputs in order to make a prediction. This is useful if the multi-step output sequence is a function of more than one input sequence. This can be achieved using two different model configurations.

- **Multiple Input Channels**. This is where each input sequence is read as a separate channel, like the different channels of an image (e.g. red, green and blue).
- **Multiple Input Heads**. This is where each input sequence is read by a different CNN sub-model and the internal representations are combined before being interpreted and used to make a prediction.
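The practical difference between the two configurations shows up in how the input data is arranged. The NumPy sketch below uses made-up data and illustrative variable names to show the input shapes only; it does not define either model.

```python
from numpy import arange

# made-up data: 4 samples, 7 time steps, 3 parallel input series
n_samples, n_timesteps, n_series = 4, 7, 3
X = arange(n_samples * n_timesteps * n_series, dtype='float32')
X = X.reshape((n_samples, n_timesteps, n_series))

# multiple input channels: one array, each series is a channel (last axis)
channels_input = X

# multiple input heads: one [samples, timesteps, 1] array per sub-model
heads_input = [X[:, :, i].reshape((n_samples, n_timesteps, 1))
               for i in range(n_series)]
print(channels_input.shape)
print(len(heads_input), heads_input[0].shape)
```

A multichannel model receives the single 3D array, whereas a multi-headed model receives a list with one array per input series.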

In this tutorial, we will explore how to develop three different types of CNN models for multi-step time series forecasting; they are:

- A CNN for multi-step time series forecasting with univariate input data.
- A CNN for multi-step time series forecasting with multivariate input data via channels.
- A CNN for multi-step time series forecasting with multivariate input data via submodels.

The models will be developed and demonstrated on the household power prediction problem. A model is considered skillful if it achieves performance better than a naive model, which is an overall RMSE of about 465 kilowatts across a seven day forecast.

We will not focus on the tuning of these models to achieve optimal performance; instead, we will stop short at skillful models as compared to a naive forecast. The structures and hyperparameters were chosen with a little trial and error.

In this section, we will develop a convolutional neural network for multi-step time series forecasting using only the univariate sequence of daily power consumption.

Specifically, the framing of the problem is:

Given some number of prior days of total daily power consumption, predict the next standard week of daily power consumption.

The number of prior days used as input defines the one-dimensional (1D) subsequence of data that the CNN will read and learn to extract features from. Some ideas on the size and nature of this input include:

- All prior days, up to years worth of data.
- The prior seven days.
- The prior two weeks.
- The prior one month.
- The prior one year.
- The prior week and the week to be predicted from one year ago.

There is no right answer; instead, each approach and more can be tested and the performance of the model can be used to choose the nature of the input that results in the best model performance.

These choices define a few things about the implementation, such as:

- How the training data must be prepared in order to fit the model.
- How the test data must be prepared in order to evaluate the model.
- How to use the model to make predictions with a final model in the future.

A good starting point would be to use the prior seven days.

A 1D CNN model expects data to have the shape of:

[samples, timesteps, features]

One sample will be comprised of seven time steps with one feature for the seven days of total daily power consumed.

The training dataset has 159 weeks of data, so the shape of the training dataset would be:

[159, 7, 1]

This is a good start. The data in this format would use the prior standard week to predict the next standard week. A problem is that 159 instances is not a lot for a neural network.

A way to create a lot more training data is to change the problem during training to predict the next seven days given the prior seven days, regardless of the standard week.

This only impacts the training data; the test problem remains the same: predict the daily power consumption for the next standard week given the prior standard week.

This will require a little preparation of the training data.

The training data is provided in standard weeks with eight variables, specifically in the shape [159, 7, 8]. The first step is to flatten the data so that we have eight time series sequences.

# flatten data
data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
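The effect of this reshape can be checked on a small array. The sketch below is only an illustration: it uses a made-up toy array of 3 weeks, 7 days, and 2 variables in place of the real [159, 7, 8] training data, and the names *weeks* and *flat* are chosen for this example.

```python
import numpy as np

# toy stand-in for the weekly data: 3 weeks, 7 days, 2 variables
# (the training set described in the text has the shape [159, 7, 8])
n_weeks, n_days, n_vars = 3, 7, 2
weeks = np.arange(n_weeks * n_days * n_vars).reshape((n_weeks, n_days, n_vars))

# collapse the week and day axes into a single time axis
flat = weeks.reshape((weeks.shape[0] * weeks.shape[1], weeks.shape[2]))

print(flat.shape)
# the first day of week 1 directly follows the last day of week 0
print(flat[6], flat[7])
```

Each column of the flattened array is one continuous daily series, which is what the windowing step that follows relies on.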

We then need to iterate over the time steps and divide the data into overlapping windows; each iteration moves along one time step and predicts the subsequent seven days.

For example:

Input, Output
[d01, d02, d03, d04, d05, d06, d07], [d08, d09, d10, d11, d12, d13, d14]
[d02, d03, d04, d05, d06, d07, d08], [d09, d10, d11, d12, d13, d14, d15]
...

We can do this by keeping track of start and end indexes for the inputs and outputs as we iterate across the length of the flattened data in terms of time steps.

We can also do this in a way where the number of inputs and outputs are parameterized (e.g. *n_input*, *n_out*) so that you can experiment with different values or adapt it for your own problem.

Below is a function named *to_supervised()* that takes a list of weeks (history) and the number of time steps to use as inputs and outputs and returns the data in the overlapping moving window format.

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

When we run this function on the entire training dataset, we transform 159 samples into 1,099; specifically, the transformed dataset has the shapes *X=[1099, 7, 1]* and *y=[1099, 7].*
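The sample count can be reproduced with simple arithmetic. The sketch below applies the same *out_end < len(data)* rule used in *to_supervised()* to the 1,113 flattened days (159 weeks of 7 days); the helper name *count_windows()* is introduced here only for illustration.

```python
def count_windows(n_steps, n_input, n_out=7):
    """Count the windows kept by the rule `out_end < n_steps` used in to_supervised()."""
    count = 0
    for in_start in range(n_steps):
        out_end = in_start + n_input + n_out
        if out_end < n_steps:
            count += 1
    return count

# 159 standard weeks flattened into daily time steps
n_steps = 159 * 7

print(count_windows(n_steps, n_input=7))
print(count_windows(n_steps, n_input=14))
```

With seven input days this gives the 1,099 samples reported above; a longer input window leaves slightly fewer samples.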

Next, we can define and fit the CNN model on the training data.

This multi-step time series forecasting problem is an autoregression. That means it is likely best modeled such that the next seven days are some function of observations at prior time steps. This, together with the relatively small amount of data, means that a small model is required.

We will use a model with one convolution layer with 16 filters and a kernel size of 3. This means that the input sequence of seven days will be read with a convolutional operation three time steps at a time and this operation will be performed 16 times. A pooling layer with a pool size of 2 will then roughly halve the length of these feature maps before the internal representation is flattened to one long vector. This is then interpreted by a fully connected layer before the output layer predicts the next seven days in the sequence.
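To make the sizes concrete, the sequence length after each layer can be worked out by hand. The sketch below assumes the Keras defaults of 'valid' padding and a stride of 1 for the convolution, and non-overlapping pooling windows; the helper names are made up for this example.

```python
def conv1d_len(n_steps, kernel_size):
    # 'valid' convolution with stride 1: the kernel must fit entirely inside the input
    return n_steps - kernel_size + 1

def maxpool1d_len(n_steps, pool_size):
    # non-overlapping pooling windows; leftover time steps are dropped
    return n_steps // pool_size

n_filters = 16
after_conv = conv1d_len(7, kernel_size=3)
after_pool = maxpool1d_len(after_conv, pool_size=2)
flat_size = after_pool * n_filters

print(after_conv, after_pool, flat_size)
```

Under these assumptions, the seven input days become five time steps after the convolution and two after pooling, so the flattened vector passed to the fully connected layer has 32 values.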

We will use the mean squared error loss function as it is a good match for our chosen error metric of RMSE. We will use the efficient Adam implementation of stochastic gradient descent and fit the model for 20 epochs with a batch size of 4.

The small batch size and the stochastic nature of the algorithm mean that the same model will learn a slightly different mapping of inputs to outputs each time it is trained. This means results may vary when the model is evaluated. You can try running the model multiple times and calculating an average of model performance.
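One way to organize such repeated runs is sketched below. The function *evaluate_once()* is a placeholder that returns a made-up noisy number so that the sketch runs without Keras; in practice it would be replaced by a call that fits and evaluates the model, and the values it produces here are not real results.

```python
import random
from statistics import mean, stdev

def evaluate_once(seed):
    # placeholder for one fit-and-evaluate run of the model;
    # returns a made-up noisy score, not a real RMSE
    rng = random.Random(seed)
    return 100.0 + rng.uniform(-10.0, 10.0)

def repeat_evaluation(n_repeats=5):
    # collect one overall score per independent run
    return [evaluate_once(seed) for seed in range(n_repeats)]

scores = repeat_evaluation(n_repeats=5)
print('mean=%.1f std=%.1f' % (mean(scores), stdev(scores)))
```

Reporting the standard deviation alongside the mean gives a sense of how much a single run can be trusted.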

The *build_model()* function below prepares the training data, defines the model, and fits the model on the training data, returning the fit model ready for making predictions.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 4
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(Conv1D(filters=16, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

Now that we know how to fit the model, we can look at how the model can be used to make a prediction.

Generally, the model expects data to have the same three dimensional shape when making a prediction.

In this case, the expected shape of an input pattern is one sample, seven days of one feature for the daily power consumed:

[1, 7, 1]

Data must have this shape when making predictions for the test set and when a final model is being used to make predictions in the future. If you change the number of input days to 14, then the shape of the training data and the shape of new samples when making predictions must be changed accordingly to have 14 time steps. It is a modeling choice that you must carry forward when using the model.

We are using walk-forward validation to evaluate the model as described in the previous section.

This means that we have the observations available for the prior week in order to predict the coming week. These are collected into an array of standard weeks, called history.
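The walk-forward loop itself is independent of the model. The sketch below shows the idea with a stand-in naive forecast that simply repeats the last observed week; the function names and the toy numbers are made up for this illustration and are not part of the tutorial's code.

```python
def naive_weekly_forecast(history):
    # stand-in model: repeat the most recent observed week
    return list(history[-1])

def walk_forward(train_weeks, test_weeks, forecast_fn):
    # history starts as the training weeks and grows by one week per step
    history = [list(week) for week in train_weeks]
    predictions = []
    for week in test_weeks:
        # forecast the coming week from everything observed so far
        predictions.append(forecast_fn(history))
        # the true week becomes available before the next forecast
        history.append(list(week))
    return predictions

# toy weekly data with made-up values
train_weeks = [[1, 2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7, 8]]
test_weeks = [[3, 4, 5, 6, 7, 8, 9], [4, 5, 6, 7, 8, 9, 10]]

predictions = walk_forward(train_weeks, test_weeks, naive_weekly_forecast)
print(predictions)
```

The second forecast is made from the first test week, not from the first forecast, which is the defining property of walk-forward validation.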

In order to predict the next standard week, we need to retrieve the last days of observations. As with the training data, we must first flatten the history data to remove the weekly structure so that we end up with eight parallel time series.

# flatten data
data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))

Next, we need to retrieve the last seven days of daily total power consumed (feature number 0). We will parameterize as we did for the training data so that the number of prior days used as input by the model can be modified in the future.

# retrieve last observations for input data
input_x = data[-n_input:, 0]

Next, we reshape the input into the expected three-dimensional structure.

# reshape into [1, n_input, 1]
input_x = input_x.reshape((1, len(input_x), 1))

We then make a prediction using the fit model and the input data and retrieve the vector of seven days of output.

# forecast the next week
yhat = model.predict(input_x, verbose=0)
# we only want the vector forecast
yhat = yhat[0]

The *forecast()* function below implements this and takes as arguments the model fit on the training dataset, the history of data observed so far, and the number of input time steps expected by the model.

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

That’s it; we now have everything we need to make multi-step time series forecasts with a CNN model on the daily total power consumed univariate dataset.

We can tie all of this together. The complete example is listed below.

# univariate multi-step cnn
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            x_input = data[in_start:in_end, 0]
            x_input = x_input.reshape((len(x_input), 1))
            X.append(x_input)
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 20, 4
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(Conv1D(filters=16, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, 0]
    # reshape into [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 7
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('cnn', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='cnn')
pyplot.show()

Running the example fits and evaluates the model, printing the overall RMSE across all seven days, and the per-day RMSE for each lead time.

We can see that in this case, the model was skillful as compared to a naive forecast, achieving an overall RMSE of about 404 kilowatts, less than 465 kilowatts achieved by a naive model.

cnn: [404.411] 436.1, 400.6, 346.2, 388.2, 405.5, 326.0, 502.9

A plot of the daily RMSE is also created. The plot shows that perhaps Tuesdays and Fridays are easier days to forecast than the other days and that perhaps Saturday at the end of the standard week is the hardest day to forecast.
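The two kinds of score reported here are related: the overall RMSE pools the squared errors from every day and week before taking the square root, so it is not the plain average of the per-day RMSE values. The sketch below mirrors the calculation in *evaluate_forecasts()* using plain Python lists and made-up numbers; the function name is chosen for this example.

```python
from math import sqrt

def rmse_overall_and_per_day(actual, predicted):
    n_weeks, n_days = len(actual), len(actual[0])
    # one RMSE per forecast day, taken across weeks
    per_day = []
    for d in range(n_days):
        mse = sum((actual[w][d] - predicted[w][d]) ** 2 for w in range(n_weeks)) / n_weeks
        per_day.append(sqrt(mse))
    # one RMSE across every day of every week
    total = sum((actual[w][d] - predicted[w][d]) ** 2
                for w in range(n_weeks) for d in range(n_days))
    overall = sqrt(total / (n_weeks * n_days))
    return overall, per_day

# made-up values: 2 weeks of 3 days each, to keep the arithmetic easy to follow
actual = [[10, 20, 30], [12, 22, 32]]
predicted = [[11, 20, 27], [13, 22, 35]]

overall, per_day = rmse_overall_and_per_day(actual, predicted)
print(per_day)
print(round(overall, 3))
```

Because errors are squared before averaging, a single hard day pulls the overall score up by more than it would in a simple mean of the daily scores.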

We can increase the number of prior days to use as input from seven to 14 by changing the *n_input* variable.

# evaluate model and get scores
n_input = 14

Re-running the example with this change first prints a summary of the performance of the model.

Your specific results may vary; try running the example a few times.

In this case, we can see a further drop in the overall RMSE, suggesting that further tuning of the input size and perhaps the kernel size of the model may result in better performance.

cnn: [396.497] 392.2, 412.8, 384.0, 389.0, 387.3, 381.0, 427.1

Comparing the per-day RMSE scores, we see some are better and some are worse than using seven days of input.

This may suggest a benefit in using the two different sized inputs in some way, such as an ensemble of the two approaches or perhaps a single model (e.g. a multi-headed model) that reads the training data in different ways.
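As a rough illustration of the ensemble idea, the simplest combination is an unweighted average of the two weekly forecasts, day by day. This is only one possible approach and was not evaluated in this tutorial; the function name and the numbers below are made up.

```python
def average_forecasts(forecasts):
    # average the forecasts element-wise, one value per forecast day
    n_models = len(forecasts)
    return [sum(values) / n_models for values in zip(*forecasts)]

# made-up weekly forecasts from a 7-day-input model and a 14-day-input model
yhat_7 = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 15.0]
yhat_14 = [12.0, 12.0, 13.0, 11.0, 14.0, 14.0, 17.0]

yhat_ensemble = average_forecasts([yhat_7, yhat_14])
print(yhat_ensemble)
```

A weighted average, with weights chosen on held-out data, would be a natural next step if one input size is consistently stronger on particular days.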

In this section, we will update the CNN developed in the previous section to use each of the eight time series variables to predict the next standard week of daily total power consumption.

We will do this by providing each one-dimensional time series to the model as a separate channel of input.

The first convolutional layer will then learn, for each filter, a separate set of kernel weights for each input channel and combine them into one feature map, essentially learning features from all of the input time series variables together.

This is helpful for those problems where the output sequence is some function of the observations at prior time steps from multiple different features, not just (or including) the feature being forecasted. It is unclear whether this is the case in the power consumption problem, but we can explore it nonetheless.

First, we must update the preparation of the training data to include all of the eight features, not just the one total daily power consumed. It requires a single line:

X.append(data[in_start:in_end, :])

The complete *to_supervised()* function with this change is listed below.

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)
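A quick way to confirm the new shapes is to run the function on a small array. The sketch below repeats the function so that it is self-contained and applies it to made-up toy data of 4 weeks, 7 days, and 3 variables rather than the real dataset.

```python
from numpy import arange, array

# convert history into inputs and outputs (multivariate version from the text)
def to_supervised(train, n_input, n_out=7):
    # flatten the weekly structure into one continuous time axis
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    for _ in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        if out_end < len(data):
            # inputs keep every variable, outputs keep only variable 0
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        in_start += 1
    return array(X), array(y)

# toy data: 4 weeks x 7 days x 3 variables
train = arange(4 * 7 * 3).reshape((4, 7, 3))
X, y = to_supervised(train, n_input=14)
print(X.shape, y.shape)
```

The input now carries every variable as a channel, while the target remains the single series in column 0.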

We also must update the function used to make forecasts with the fit model to use all eight features from the prior time steps. Again, another small change:

# retrieve last observations for input data
input_x = data[-n_input:, :]
# reshape into [1, n_input, n]
input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))

The complete *forecast()* with this change is listed below:

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n]
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

We will use 14 days of prior observations across all eight input variables, as we did at the end of the prior section, where the longer input resulted in slightly better performance.

n_input = 14

Finally, the model used in the previous section does not perform well on this new framing of the problem.

The increase in the amount of data requires a larger and more sophisticated model that is trained for longer.

With a little trial and error, one model that performs well uses two convolutional layers with 32 filter maps followed by pooling, then another convolutional layer with 16 feature maps and pooling. The fully connected layer that interprets the features is increased to 100 nodes and the model is fit for 70 epochs with a batch size of 16 samples.
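It can help to trace how the 14 input time steps shrink through this deeper stack. The sketch below assumes 'valid' padding with a stride of 1 for each convolution and non-overlapping pooling windows (the Keras defaults); the helper *trace_lengths()* is made up for this example.

```python
def trace_lengths(n_steps, layers):
    """Return the sequence length after each layer.

    layers is a list of ('conv', kernel_size) or ('pool', pool_size) tuples.
    """
    lengths = [n_steps]
    for kind, size in layers:
        if kind == 'conv':
            # 'valid' convolution, stride 1
            n_steps = n_steps - size + 1
        elif kind == 'pool':
            # non-overlapping pooling windows
            n_steps = n_steps // size
        else:
            raise ValueError('unknown layer kind: %s' % kind)
        if n_steps < 1:
            raise ValueError('sequence too short for layer (%s, %d)' % (kind, size))
        lengths.append(n_steps)
    return lengths

# two conv layers, pooling, one more conv layer, pooling
layers = [('conv', 3), ('conv', 3), ('pool', 2), ('conv', 3), ('pool', 2)]
print(trace_lengths(14, layers))
```

Under these assumptions a 14-day input is reduced to a single time step of 16 feature maps before flattening, and the same stack would run out of time steps on a seven-day input, so a shorter input window would need a shallower model or padding.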

The updated *build_model()* function that defines and fits the model on the training dataset is listed below.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 70, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Conv1D(filters=16, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

We now have all of the elements required to develop a multi-channel CNN for multivariate input data to make multi-step time series forecasts.

The complete example is listed below.

# multichannel multi-step cnn
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test

# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores

# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))

# convert history into inputs and outputs
def to_supervised(train, n_input, n_out=7):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end < len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
        # move along one time step
        in_start += 1
    return array(X), array(y)

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 70, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # define model
    model = Sequential()
    model.add(Conv1D(filters=32, kernel_size=3, activation='relu', input_shape=(n_timesteps,n_features)))
    model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Conv1D(filters=16, kernel_size=3, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into [1, n_input, n]
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

# evaluate a single model
def evaluate_model(train, test, n_input):
    # fit model
    model = build_model(train, n_input)
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = forecast(model, history, n_input)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    # evaluate predictions days for each week
    predictions = array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# evaluate model and get scores
n_input = 14
score, scores = evaluate_model(train, test, n_input)
# summarize scores
summarize_scores('cnn', score, scores)
# plot scores
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
pyplot.plot(days, scores, marker='o', label='cnn')
pyplot.show()

Running the example fits and evaluates the model, printing the overall RMSE across all seven days, and the per-day RMSE for each lead time.

We can see that in this case, the use of all eight input variables does result in another small drop in the overall RMSE score.

cnn: [385.711] 422.2, 363.5, 349.8, 393.1, 357.1, 318.8, 474.3

For the daily RMSE scores, we do see that some are better and some are worse than the univariate CNN from the previous section.

The final day, Saturday, remains a challenging day to forecast, and Friday an easy day to forecast. There may be some benefit in designing models to focus specifically on reducing the error of the harder to forecast days.

It may be interesting to see if the variance across daily scores could be further reduced with a tuned model or perhaps an ensemble of multiple different models. It may also be interesting to compare the performance for a model that uses seven or even 21 days of input data to see if further gains can be made.

We can further extend the CNN model to have a separate sub-CNN model or head for each input variable, which we can refer to as a multi-headed CNN model.

This requires a modification to the preparation of the model, and in turn, modification to the preparation of the training and test datasets.

Starting with the model, we must define a separate CNN model for each of the eight input variables.

The configuration of the model, including the number of layers and their hyperparameters, was also modified to better suit the new approach. The new configuration is not optimal and was found with a little trial and error.

The multi-headed model is specified using the more flexible functional API for defining Keras models.

We can loop over each variable and create a sub-model that takes a one-dimensional sequence of 14 days of data and outputs a flat vector containing a summary of the learned features from the sequence. Each of these vectors can be merged via concatenation to make one very long vector that is then interpreted by some fully connected layers before a prediction is made.

As we build up the submodels, we keep track of the input layers and flatten layers in lists. This is so that we can specify the inputs in the definition of the model object and use the list of flatten layers in the merge layer.

# create a channel for each variable
in_layers, out_layers = list(), list()
for i in range(n_features):
    inputs = Input(shape=(n_timesteps,1))
    conv1 = Conv1D(filters=32, kernel_size=3, activation='relu')(inputs)
    conv2 = Conv1D(filters=32, kernel_size=3, activation='relu')(conv1)
    pool1 = MaxPooling1D(pool_size=2)(conv2)
    flat = Flatten()(pool1)
    # store layers
    in_layers.append(inputs)
    out_layers.append(flat)
# merge heads
merged = concatenate(out_layers)
# interpretation
dense1 = Dense(200, activation='relu')(merged)
dense2 = Dense(100, activation='relu')(dense1)
outputs = Dense(n_outputs)(dense2)
model = Model(inputs=in_layers, outputs=outputs)
# compile model
model.compile(loss='mse', optimizer='adam')
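The length of the merged vector follows from the layer sizes. The sketch below works it out by hand, assuming 'valid' padding with a stride of 1 for both convolutions and non-overlapping pooling windows (the Keras defaults); the function name and its default arguments are introduced only for this illustration.

```python
def merged_vector_size(n_timesteps, n_heads, n_filters=32, kernel_size=3, pool_size=2):
    # two 'valid' convolutions each shorten the sequence by kernel_size - 1
    after_conv1 = n_timesteps - kernel_size + 1
    after_conv2 = after_conv1 - kernel_size + 1
    # pooling keeps one value per non-overlapping window
    after_pool = after_conv2 // pool_size
    # each head flattens [time steps, filters] into one vector
    per_head = after_pool * n_filters
    # concatenation places the head vectors end to end
    return per_head * n_heads

print(merged_vector_size(n_timesteps=14, n_heads=8))
```

Under these assumptions each head contributes 160 values, so the eight heads are concatenated into a vector of 1,280 values that feeds the 200-node interpretation layer.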

When the model is used, it will require eight arrays as input: one for each of the submodels.

This is required when training the model, when evaluating the model, and when making predictions with a final model.

We can achieve this by creating a list of 3D arrays, where each 3D array contains [*samples, timesteps, 1*], with one feature.

We can prepare the training dataset in this format as follows:

input_data = [train_x[:,:,i].reshape((train_x.shape[0],n_timesteps,1)) for i in range(n_features)]
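The result of this list comprehension can be inspected on a small array. The sketch below uses made-up data with 5 samples in place of the real training set and also checks that the reshape is equivalent to slicing out one feature at a time while keeping the last axis.

```python
import numpy as np

# toy stand-in for train_x: 5 samples, 14 time steps, 8 features
n_samples, n_timesteps, n_features = 5, 14, 8
train_x = np.arange(n_samples * n_timesteps * n_features, dtype=float)
train_x = train_x.reshape((n_samples, n_timesteps, n_features))

# one [samples, timesteps, 1] array per input head
input_data = [train_x[:, :, i].reshape((train_x.shape[0], n_timesteps, 1)) for i in range(n_features)]
print(len(input_data), input_data[0].shape)

# the same arrays can be obtained by slicing with a length-one range
same = all(np.array_equal(input_data[i], train_x[:, :, i:i+1]) for i in range(n_features))
print(same)
```

The order of the arrays in the list must match the order of the input layers passed to the model, since each head only ever sees its own variable.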

The updated *build_model()* function with these changes is listed below.

# train the model
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 0, 25, 16
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # create a channel for each variable
    in_layers, out_layers = list(), list()
    for i in range(n_features):
        inputs = Input(shape=(n_timesteps,1))
        conv1 = Conv1D(filters=32, kernel_size=3, activation='relu')(inputs)
        conv2 = Conv1D(filters=32, kernel_size=3, activation='relu')(conv1)
        pool1 = MaxPooling1D(pool_size=2)(conv2)
        flat = Flatten()(pool1)
        # store layers
        in_layers.append(inputs)
        out_layers.append(flat)
    # merge heads
    merged = concatenate(out_layers)
    # interpretation
    dense1 = Dense(200, activation='relu')(merged)
    dense2 = Dense(100, activation='relu')(dense1)
    outputs = Dense(n_outputs)(dense2)
    model = Model(inputs=in_layers, outputs=outputs)
    # compile model
    model.compile(loss='mse', optimizer='adam')
    # plot the model
    plot_model(model, show_shapes=True, to_file='multiheaded_cnn.png')
    # fit network
    input_data = [train_x[:,:,i].reshape((train_x.shape[0],n_timesteps,1)) for i in range(n_features)]
    model.fit(input_data, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model

When the model is built, a diagram of the structure of the model is created and saved to file.

Note: the call to plot_model() requires that pygraphviz and pydot are installed. If this is a problem, you can comment out this line.

The structure of the network looks as follows.

Next, we can update the preparation of input samples when making a prediction for the test dataset.

We must perform the same change, where an input array of [1, 14, 8] must be transformed into a list of eight 3D arrays each with [1, 14, 1].

input_x = [input_x[:,i].reshape((1,input_x.shape[0],1)) for i in range(input_x.shape[1])]

The *forecast()* function with this change is listed below.

# make a forecast
def forecast(model, history, n_input):
    # flatten data
    data = array(history)
    data = data.reshape((data.shape[0]*data.shape[1], data.shape[2]))
    # retrieve last observations for input data
    input_x = data[-n_input:, :]
    # reshape into n input arrays
    input_x = [input_x[:,i].reshape((1,input_x.shape[0],1)) for i in range(input_x.shape[1])]
    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]
    return yhat

That’s it.

We can tie all of this together; the complete example is listed below.

# multi headed multi-step cnn from math import sqrt from numpy import split from numpy import array from pandas import read_csv from sklearn.metrics import mean_squared_error from matplotlib import pyplot from keras.models import Sequential from keras.layers import Dense from keras.layers import Flatten from keras.layers.convolutional import Conv1D from keras.layers.convolutional import MaxPooling1D from keras.models import Model from keras.layers import Input from keras.layers.merge import concatenate # split a univariate dataset into train/test sets def split_dataset(data): # split into standard weeks train, test = data[1:-328], data[-328:-6] # restructure into windows of weekly data train = array(split(train, len(train)/7)) test = array(split(test, len(test)/7)) return train, test # evaluate one or more weekly forecasts against expected values def evaluate_forecasts(actual, predicted): scores = list() # calculate an RMSE score for each day for i in range(actual.shape[1]): # calculate mse mse = mean_squared_error(actual[:, i], predicted[:, i]) # calculate rmse rmse = sqrt(mse) # store scores.append(rmse) # calculate overall RMSE s = 0 for row in range(actual.shape[0]): for col in range(actual.shape[1]): s += (actual[row, col] - predicted[row, col])**2 score = sqrt(s / (actual.shape[0] * actual.shape[1])) return score, scores # summarize scores def summarize_scores(name, score, scores): s_scores = ', '.join(['%.1f' % s for s in scores]) print('%s: [%.3f] %s' % (name, score, s_scores)) # convert history into inputs and outputs def to_supervised(train, n_input, n_out=7): # flatten data data = train.reshape((train.shape[0]*train.shape[1], train.shape[2])) X, y = list(), list() in_start = 0 # step over the entire history one time step at a time for _ in range(len(data)): # define the end of the input sequence in_end = in_start + n_input out_end = in_end + n_out # ensure we have enough data for this instance if out_end < len(data): X.append(data[in_start:in_end, :]) 
y.append(data[in_end:out_end, 0]) # move along one time step in_start += 1 return array(X), array(y) # plot training history def plot_history(history): # plot loss pyplot.subplot(2, 1, 1) pyplot.plot(history.history['loss'], label='train') pyplot.plot(history.history['val_loss'], label='test') pyplot.title('loss', y=0, loc='center') pyplot.legend() # plot rmse pyplot.subplot(2, 1, 2) pyplot.plot(history.history['rmse'], label='train') pyplot.plot(history.history['val_rmse'], label='test') pyplot.title('rmse', y=0, loc='center') pyplot.legend() pyplot.show() # train the model def build_model(train, n_input): # prepare data train_x, train_y = to_supervised(train, n_input) # define parameters verbose, epochs, batch_size = 0, 25, 16 n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1] # create a channel for each variable in_layers, out_layers = list(), list() for i in range(n_features): inputs = Input(shape=(n_timesteps,1)) conv1 = Conv1D(filters=32, kernel_size=3, activation='relu')(inputs) conv2 = Conv1D(filters=32, kernel_size=3, activation='relu')(conv1) pool1 = MaxPooling1D(pool_size=2)(conv2) flat = Flatten()(pool1) # store layers in_layers.append(inputs) out_layers.append(flat) # merge heads merged = concatenate(out_layers) # interpretation dense1 = Dense(200, activation='relu')(merged) dense2 = Dense(100, activation='relu')(dense1) outputs = Dense(n_outputs)(dense2) model = Model(inputs=in_layers, outputs=outputs) # compile model model.compile(loss='mse', optimizer='adam') # fit network input_data = [train_x[:,:,i].reshape((train_x.shape[0],n_timesteps,1)) for i in range(n_features)] model.fit(input_data, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose) return model # make a forecast def forecast(model, history, n_input): # flatten data data = array(history) data = data.reshape((data.shape[0]*data.shape[1], data.shape[2])) # retrieve last observations for input data input_x = data[-n_input:, :] # reshape into 
n input arrays input_x = [input_x[:,i].reshape((1,input_x.shape[0],1)) for i in range(input_x.shape[1])] # forecast the next week yhat = model.predict(input_x, verbose=0) # we only want the vector forecast yhat = yhat[0] return yhat # evaluate a single model def evaluate_model(train, test, n_input): # fit model model = build_model(train, n_input) # history is a list of weekly data history = [x for x in train] # walk-forward validation over each week predictions = list() for i in range(len(test)): # predict the week yhat_sequence = forecast(model, history, n_input) # store the predictions predictions.append(yhat_sequence) # get real observation and add to history for predicting the next week history.append(test[i, :]) # evaluate predictions days for each week predictions = array(predictions) score, scores = evaluate_forecasts(test[:, :, 0], predictions) return score, scores # load the new file dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime']) # split into train and test train, test = split_dataset(dataset.values) # evaluate model and get scores n_input = 14 score, scores = evaluate_model(train, test, n_input) # summarize scores summarize_scores('cnn', score, scores) # plot scores days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat'] pyplot.plot(days, scores, marker='o', label='cnn') pyplot.show()

We can see that in this case, the overall RMSE is skillful compared to a naive forecast, but with the chosen configuration it may not perform better than the multi-channel model in the previous section.

cnn: [396.116] 414.5, 385.5, 377.2, 412.1, 371.1, 380.6, 428.1

We can also see a different, more pronounced profile for the daily RMSE scores where perhaps Mon-Tue and Thu-Fri are easier for the model to predict than the other forecast days.

These results may be useful when combined with another forecast model.

It may be interesting to explore alternate methods in the architecture for merging the output of each sub-model.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Size of Input**. Explore more or fewer numbers of days used as input for the model, such as three days, 21 days, 30 days and more.
- **Model Tuning**. Tune the structure and hyperparameters for a model and further lift model performance on average.
- **Data Scaling**. Explore whether data scaling, such as standardization and normalization, can be used to improve the performance of any of the CNN models.
- **Learning Diagnostics**. Use diagnostics such as learning curves for the train and validation loss and mean squared error to help tune the structure and hyperparameters of a CNN model.
- **Vary Kernel Size**. Combine the multichannel CNN with the multi-headed CNN and use a different kernel size for each head to see if this configuration can further improve performance.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API
- pandas.DataFrame.resample API
- Resample Offset Aliases
- sklearn.metrics.mean_squared_error API
- numpy.split API

- Individual household electric power consumption Data Set, UCI Machine Learning Repository.
- AC power, Wikipedia.
- 4 Strategies for Multi-Step Time Series Forecasting
- Crash Course in Convolutional Neural Networks for Machine Learning

In this tutorial, you discovered how to develop 1D convolutional neural networks for multi-step time series forecasting.

Specifically, you learned:

- How to develop a CNN model for multi-step time series forecasting with univariate data.
- How to develop a multichannel multi-step time series forecasting model for multivariate data.
- How to develop a multi-headed multi-step time series forecasting model for multivariate data.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Convolutional Neural Networks for Multi-Step Time Series Forecasting appeared first on Machine Learning Mastery.

]]>The post Multi-step Time Series Forecasting with Machine Learning for Household Electricity Consumption appeared first on Machine Learning Mastery.

]]>This data represents a multivariate time series of power-related variables that in turn could be used to model and even forecast future electricity consumption.

Machine learning algorithms predict a single value and cannot be used directly for multi-step forecasting. Two strategies that can be used to make multi-step forecasts with machine learning algorithms are the recursive and the direct methods.

In this tutorial, you will discover how to develop recursive and direct multi-step forecasting models with machine learning algorithms.

After completing this tutorial, you will know:

- How to develop a framework for evaluating linear, nonlinear, and ensemble machine learning algorithms for multi-step time series forecasting.
- How to evaluate machine learning algorithms using a recursive multi-step time series forecasting strategy.
- How to evaluate machine learning algorithms using a direct per-day and per-lead time multi-step time series forecasting strategy.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem Description
- Load and Prepare Dataset
- Model Evaluation
- Recursive Multi-Step Forecasting
- Direct Multi-Step Forecasting

The ‘Household Power Consumption‘ dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

- **global_active_power**: The total active power consumed by the household (kilowatts).
- **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
- **voltage**: Average voltage (volts).
- **global_intensity**: Average current intensity (amps).
- **sub_metering_1**: Active energy for kitchen (watt-hours of active energy).
- **sub_metering_2**: Active energy for laundry (watt-hours of active energy).
- **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

A fourth sub-metering variable can be created by subtracting the sum of three defined sub-metering variables from the total active energy as follows:

sub_metering_remainder = (global_active_power * 1000 / 60) - (sub_metering_1 + sub_metering_2 + sub_metering_3)
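
The factor of 1000 / 60 converts one minute of average power in kilowatts into watt-hours, the unit used by the sub-metering readings. Below is a minimal sketch of that arithmetic for a single one-minute observation. The function name and the readings are made up for illustration and are not taken from the dataset.

```python
# convert one minute of average active power (kW) into energy (Wh)
def active_energy_wh(global_active_power_kw):
    # kW -> W is a factor of 1000; one minute is 1/60 of an hour
    return global_active_power_kw * 1000 / 60

# hypothetical one-minute reading, not taken from the dataset
power_kw = 3.0
sub_1, sub_2, sub_3 = 0.0, 1.0, 17.0
# energy not captured by the three sub-meters
remainder = active_energy_wh(power_kw) - (sub_1 + sub_2 + sub_3)
print(remainder)
```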

The dataset can be downloaded from the UCI Machine Learning repository as a single 20 megabyte .zip file:

Download the dataset and unzip it into your current working directory. You will now have the file “*household_power_consumption.txt*” that is about 127 megabytes in size and contains all of the observations.

We can use the *read_csv()* function to load the data and combine the first two columns into a single date-time column that we can use as an index.

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False,
        infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])

Next, we can mark all missing values indicated with a ‘*?*‘ character with a *NaN* value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which are less efficient.

    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')

We also need to fill in the missing values now that they have been marked.

A very simple approach would be to copy the observation from the same time the day before. We can implement this in a function named *fill_missing()* that will take the NumPy array of the data and copy values from exactly 24 hours ago.

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]

We can apply this function directly to the data within the DataFrame.

    # fill missing
    fill_missing(dataset.values)
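
To see the effect of this fill strategy on a small scale, the sketch below applies the same logic to a toy array. The look-back distance is exposed as a parameter purely for the demonstration (an assumption; the tutorial's function fixes it at one day of minutes). Note that for a missing value within the first day of data, the negative index would wrap around to the end of the array.

```python
from numpy import array, isnan, nan

# same idea as fill_missing(), with the look-back distance as a parameter (demo only)
def fill_missing_demo(values, offset):
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                # copy the value observed one "day" (offset steps) earlier
                values[row, col] = values[row - offset, col]

# toy data: six time steps, one variable, with a "day" of three steps
data = array([[1.0], [2.0], [3.0], [nan], [5.0], [nan]])
fill_missing_demo(data, offset=3)
print(data.ravel())
```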

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as ‘*household_power_consumption.csv*‘.

    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Tying all of this together, the complete example of loading, cleaning-up, and saving the dataset is listed below.

    # load and clean-up data
    from numpy import nan
    from numpy import isnan
    from pandas import read_csv

    # fill missing values with a value at the same time one day ago
    def fill_missing(values):
        one_day = 60 * 24
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if isnan(values[row, col]):
                    values[row, col] = values[row - one_day, col]

    # load all data
    dataset = read_csv('household_power_consumption.txt', sep=';', header=0, low_memory=False,
        infer_datetime_format=True, parse_dates={'datetime':[0,1]}, index_col=['datetime'])
    # mark all missing values
    dataset.replace('?', nan, inplace=True)
    # make dataset numeric
    dataset = dataset.astype('float32')
    # fill missing
    fill_missing(dataset.values)
    # add a column for the remainder of sub metering
    values = dataset.values
    dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

Running the example creates the new file ‘*household_power_consumption.csv*‘ that we can use as the starting point for our modeling project.


In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

- Problem Framing
- Evaluation Metric
- Train and Test Sets
- Walk-Forward Validation

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household in planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense, given that we are interested in total power per day.

We can achieve this easily using the resample() function on the pandas DataFrame. Calling this function with the argument ‘*D*‘ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.

The complete example is listed below.

    # resample minute data to total for each day
    from pandas import read_csv
    # load the new file
    dataset = read_csv('household_power_consumption.csv', header=0, infer_datetime_format=True,
        parse_dates=['datetime'], index_col=['datetime'])
    # resample data to daily
    daily_groups = dataset.resample('D')
    daily_data = daily_groups.sum()
    # summarize
    print(daily_data.shape)
    print(daily_data.head())
    # save
    daily_data.to_csv('household_power_consumption_days.csv')

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named ‘*household_power_consumption_days.csv*‘.
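
The grouping behavior can be checked on a small synthetic series before running it on the full dataset. The sketch below builds two days of made-up per-minute readings (the column name and values are assumptions for illustration) and downsamples them to daily totals in the same way.

```python
from pandas import DataFrame, date_range

# synthetic per-minute readings for two full days (values are made up)
index = date_range('2007-01-01', periods=2 * 24 * 60, freq='min')
frame = DataFrame({'power': 1.0}, index=index)
# group by calendar day and total each group
daily = frame.resample('D').sum()
print(daily)
```

Each day contains 1,440 one-minute readings of 1.0, so each daily total should be 1440.0.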

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

- To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
- To contrast models based on their skills at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts and it would be useful to have an error metric that was also in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Unlike MAE, RMSE squares each error before averaging, so it penalizes large forecast errors more heavily.

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a short-cut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used would be the RMSE across all forecast days.

The function *evaluate_forecasts()* below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

Running the function will first return the overall RMSE regardless of day, then an array of RMSE scores for each day.
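
Because every forecast day has the same number of forecasts, the overall RMSE is the square root of the mean of the squared per-day scores rather than the simple average of the per-day scores. The sketch below illustrates this with a compact NumPy version of the calculation. The function name and the toy forecasts are assumptions for illustration; it is not the function used in the tutorial.

```python
from numpy import array, sqrt, mean

# compact NumPy version of the per-day and overall RMSE calculation
def rmse_by_day(actual, predicted):
    errors = (actual - predicted) ** 2
    # one RMSE per forecast day (column)
    scores = sqrt(mean(errors, axis=0))
    # a single RMSE over every forecast value
    score = sqrt(mean(errors))
    return score, scores

# two made-up weekly forecasts, shortened to three days each for readability
actual = array([[10.0, 20.0, 30.0], [12.0, 18.0, 33.0]])
predicted = array([[11.0, 20.0, 27.0], [12.0, 16.0, 30.0]])
score, scores = rmse_by_day(actual, predicted)
print(score, scores)
# the overall score can be recovered from the per-day scores
print(sqrt(mean(scores ** 2)))
```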

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way of using the chosen framing of the model, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is in 2010 and the first Sunday for 2010 was January 3rd. The data ends in mid November 2010 and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

    2010-01-03,2083.4539999999984,191.61000000000055,350992.12000000034,8703.600000000033,3842.0,4920.0,10074.0,15888.233355799992
    ...
    2010-11-20,2197.006000000004,153.76800000000028,346475.9999999998,9320.20000000002,4367.0,2947.0,11433.0,17869.76663959999

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

    2006-12-17,3390.46,226.0059999999994,345725.32000000024,14398.59999999998,2033.0,4187.0,13341.0,36946.66673200004
    ...
    2010-01-02,1309.2679999999998,199.54600000000016,352332.8399999997,5489.7999999999865,801.0,298.0,6425.0,14297.133406600002
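
The week counts and weekdays quoted above can be confirmed with the standard library. The sketch below uses only the four boundary dates given in the text; the variable names are illustrative.

```python
from datetime import date

# boundaries of the train and test sets quoted in the text
train_start, train_end = date(2006, 12, 17), date(2010, 1, 2)
test_start, test_end = date(2010, 1, 3), date(2010, 11, 20)

# weekday(): Monday is 0, Saturday is 5 and Sunday is 6
print(train_start.weekday(), train_end.weekday())
print(test_start.weekday(), test_end.weekday())
# inclusive day counts, then the number of standard weeks
train_days = (train_end - train_start).days + 1
test_days = (test_end - test_start).days + 1
print(train_days // 7, test_days // 7)
```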

The function *split_dataset()* below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets are used to split the data using knowledge of the dataset. The split datasets are then organized into weekly data using the NumPy split() function.

    # split the dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train) // 7))
        test = array(split(test, len(test) // 7))
        return train, test

We can test this function out by loading the daily dataset and printing the first and last rows of data from both the train and test sets to confirm they match the expectations above.

The complete code example is listed below.

    # split into standard weeks
    from numpy import split
    from numpy import array
    from pandas import read_csv

    # split the dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train) // 7))
        test = array(split(test, len(test) // 7))
        return train, test

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True,
        parse_dates=['datetime'], index_col=['datetime'])
    train, test = split_dataset(dataset.values)
    # validate train data
    print(train.shape)
    print(train[0, 0, 0], train[-1, -1, 0])
    # validate test
    print(test.shape)
    print(test[0, 0, 0], test[-1, -1, 0])

Running the example shows that indeed the train dataset has 159 weeks of data, whereas the test dataset has 46 weeks.

We can see that the total active power for the train and test dataset for the first and last rows match the data for the specific dates that we defined as the bounds on the standard weeks for each set.

    (159, 7, 8)
    3390.46 1309.2679999999998
    (46, 7, 8)
    2083.4539999999984 2197.006000000004

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models, allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

    Input,                      Predict
    [Week1]                     Week2
    [Week1 + Week2]             Week3
    [Week1 + Week2 + Week3]     Week4
    ...
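
The growing-history pattern in this table can be sketched in a few lines. The example below uses a naive "repeat the last observed week" forecast purely as a placeholder (an assumption for illustration, not a model from this tutorial) so that the mechanics of walk-forward validation are visible on their own.

```python
# walk-forward validation with a naive "repeat last week" placeholder forecast
def walk_forward(train, test):
    # history starts as the training weeks
    history = [week for week in train]
    predictions = []
    for week in test:
        # forecast the coming week using only the data seen so far
        predictions.append(history[-1])
        # the real observations then become available for the next forecast
        history.append(week)
    return predictions

# toy weekly data, shortened to two days per week
train = [[1, 2], [3, 4]]
test = [[5, 6], [7, 8]]
print(walk_forward(train, test))
```

The first forecast repeats the last training week and the second repeats the first test week, which only became available after it had been predicted.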

The walk-forward validation approach to evaluating predictive models on this dataset is provided in a function below, named *evaluate_model()*.

A scikit-learn model object is provided as an argument to the function, along with the train and test datasets. An additional argument *n_input* is provided that is used to define the number of prior observations that the model will use as input in order to make a prediction.

The specifics of how a scikit-learn model is fit and makes predictions are covered in later sections.

The forecasts made by the model are then evaluated against the test dataset using the previously defined *evaluate_forecasts()* function.

    # evaluate a single model
    def evaluate_model(model, train, test, n_input):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = ...
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

Once we have the evaluation for a model we can summarize the performance.

The function below, named *summarize_scores()*, will display the performance of a model as a single line for easy comparison with other models.

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

We now have all of the elements to begin evaluating predictive models on the dataset.

Most predictive modeling algorithms will take some number of observations as input and predict a single output value.

As such, they cannot be used directly to make a multi-step time series forecast.

This applies to most linear, nonlinear, and ensemble machine learning algorithms.

One approach where machine learning algorithms can be used to make a multi-step time series forecast is to use them recursively.

This involves making a prediction for one time step, taking the prediction, and feeding it into the model as an input in order to predict the subsequent time step. This process is repeated until the desired number of steps have been forecasted.

For example:

    X = [x1, x2, x3]
    y1 = model.predict(X)
    X = [x2, x3, y1]
    y2 = model.predict(X)
    X = [x3, y1, y2]
    y3 = model.predict(X)
    ...
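
The pseudocode above can be made runnable with a stand-in one-step model. In the sketch below, `MeanModel` and `recursive_forecast` are hypothetical names invented for illustration; the stand-in simply predicts the mean of its inputs so that each recursive step is easy to follow by hand.

```python
# stand-in one-step model: predicts the mean of its inputs (illustrative only)
class MeanModel:
    def predict(self, X):
        return sum(X) / len(X)

# feed each prediction back in as an input for the next step
def recursive_forecast(model, window, n_steps):
    window = list(window)
    n_input = len(window)
    forecasts = []
    for _ in range(n_steps):
        # use the most recent n_input values, which may include earlier predictions
        yhat = model.predict(window[-n_input:])
        forecasts.append(yhat)
        window.append(yhat)
    return forecasts

forecasts = recursive_forecast(MeanModel(), [3.0, 6.0, 9.0], 3)
print(forecasts)
```

The first forecast uses only real observations, the second uses one prior prediction and the third uses two, which is why errors can compound over the forecast horizon.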

In this section, we will develop a test harness for fitting and evaluating machine learning algorithms provided in scikit-learn using a recursive model for multi-step forecasting.

The first step is to convert the prepared training data in window format into a single univariate series.

The *to_series()* function below will convert a list of weekly multivariate data into a single univariate series of daily total power consumed.

    # convert windows of weekly multivariate data into a series of total power
    def to_series(data):
        # extract just the total power from each week
        series = [week[:, 0] for week in data]
        # flatten into a single series
        series = array(series).flatten()
        return series

Next, the sequence of daily power needs to be transformed into inputs and outputs suitable for fitting a supervised learning problem.

The prediction will be some function of the total power consumed on prior days. We can choose the number of prior days to use as inputs, such as one or two weeks. There will always be a single output: the total power consumed on the next day.

The model will be fit on the true observations from prior time steps. We need to iterate through the sequence of daily power consumed and split it into inputs and outputs. This is called a sliding window data representation.

The *to_supervised()* function below implements this behavior.

It takes a list of weekly data as input as well as the number of prior days to use as inputs for each sample that is created.

The first step is to convert the history into a single data series. The series is then enumerated, creating one input and output pair per time step. This framing of the problem will allow a model to learn to predict any day of the week given the observations of prior days. The function returns the inputs (X) and outputs (y) ready for training a model.

    # convert history into inputs and outputs
    def to_supervised(history, n_input):
        # convert history to a univariate series
        data = to_series(history)
        X, y = list(), list()
        ix_start = 0
        # step over the entire history one time step at a time
        for i in range(len(data)):
            # define the end of the input sequence
            ix_end = ix_start + n_input
            # ensure we have enough data for this instance
            if ix_end < len(data):
                X.append(data[ix_start:ix_end])
                y.append(data[ix_end])
            # move along one time step
            ix_start += 1
        return array(X), array(y)
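
To make the sliding window concrete, the sketch below applies the same windowing to a short, already flattened series. The helper name `series_to_supervised` is an assumption for illustration; it skips the *to_series()* step and is otherwise intended to produce the same input and output pairs.

```python
from numpy import array, arange

# sliding window over an already flattened series
def series_to_supervised(data, n_input):
    X, y = list(), list()
    for ix_start in range(len(data) - n_input):
        ix_end = ix_start + n_input
        # n_input prior observations as input, the next observation as output
        X.append(data[ix_start:ix_end])
        y.append(data[ix_end])
    return array(X), array(y)

# toy series with the values 1 to 6
series = arange(1, 7)
X, y = series_to_supervised(series, n_input=3)
for inputs, output in zip(X, y):
    print(inputs, output)
```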

The scikit-learn library allows a model to be used as part of a pipeline. This allows data transforms to be applied automatically prior to fitting the model. More importantly, the transforms are prepared in the correct way, where they are prepared or fit on the training data and applied on the test data. This prevents data leakage when evaluating models.

We can use this capability when evaluating models by creating a pipeline prior to fitting each model on the training dataset. We will both standardize and normalize the data prior to using the model.

The *make_pipeline()* function below implements this behavior, returning a Pipeline that can be used just like a model, e.g. it can be fit and it can make predictions.

The standardization and normalization operations are performed per column. In the *to_supervised()* function, we have essentially split one column of data (total power) into multiple columns, e.g. seven for seven days of input observations. This means that each of the seven columns in the input data will have a different mean and standard deviation for standardization and a different min and max for normalization.

Given that we used a sliding window, almost all values will appear in each column, so this is not likely to be an issue. But it is important to note that it would be more rigorous to scale the data as a single column prior to splitting it into inputs and outputs.

    # create a feature preparation pipeline for a model
    def make_pipeline(model):
        steps = list()
        # standardization
        steps.append(('standardize', StandardScaler()))
        # normalization
        steps.append(('normalize', MinMaxScaler()))
        # the model
        steps.append(('model', model))
        # create pipeline
        pipeline = Pipeline(steps=steps)
        return pipeline
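
For reference, the more rigorous alternative of scaling the series as a single column could look something like the sketch below. This is not what the tutorial's pipeline does: the function name `standardize_series` is hypothetical, only standardization is shown, and the statistics are taken from the training series alone so that nothing leaks from later data.

```python
from numpy import array

# standardize whole series using statistics from the training data only
def standardize_series(train_series, other_series):
    mean, std = train_series.mean(), train_series.std()
    return (train_series - mean) / std, (other_series - mean) / std

# toy daily totals (made-up values)
train_series = array([2.0, 4.0, 6.0, 8.0])
test_series = array([10.0])
train_scaled, test_scaled = standardize_series(train_series, test_series)
print(train_scaled, test_scaled)
```

The scaled series would then be split into inputs and outputs with the sliding window, and any forecasts would need to be converted back to the original units before scoring.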

We can tie these elements together into a function called *sklearn_predict()*, listed below.

The function takes a scikit-learn model object, the training data, called history, and a specified number of prior days to use as inputs. It transforms the training data into inputs and outputs, wraps the model in a pipeline, fits it, and uses it to make a prediction.

    # fit a model and make a forecast
    def sklearn_predict(model, history, n_input):
        # prepare data
        train_x, train_y = to_supervised(history, n_input)
        # make pipeline
        pipeline = make_pipeline(model)
        # fit the model
        pipeline.fit(train_x, train_y)
        # predict the week, recursively
        yhat_sequence = forecast(pipeline, train_x[-1, :], n_input)
        return yhat_sequence

The model will use the last row from the training dataset as input in order to make the prediction.
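
It may be worth checking which observations that last row actually contains. With the sliding window as defined in *to_supervised()*, the final observation in the history is only ever used as an output, so the last input row ends one day earlier. The sketch below shows this on a toy series; the `windows` helper is a simplified stand-in for *to_supervised()* written for this check.

```python
from numpy import array, arange

# simplified stand-in for to_supervised() on a flat series
def windows(data, n_input):
    n_samples = len(data) - n_input
    X = [data[i:i + n_input] for i in range(n_samples)]
    y = [data[i + n_input] for i in range(n_samples)]
    return array(X), array(y)

# toy series representing days 1 to 10
series = arange(1, 11)
X, y = windows(series, n_input=7)
# last training input row
print(X[-1])
# the seven most recent observations
print(series[-7:])
```

If the most recent day should be included in the forecast input, the last *n_input* values of the series could be passed to the forecast function instead. That would be a variation on the approach, not what the code in this tutorial does.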

The *forecast()* function will use the model to make a recursive multi-step forecast.

The recursive forecast involves iterating over each of the seven days required of the multi-step forecast.

The input data to the model is taken as the last few observations of the *input_data* list. This list is seeded with all of the observations from the last row of the training data, and as we make predictions with the model, they are added to the end of this list. Therefore, we can take the last *n_input* observations from this list in order to achieve the effect of providing prior outputs as inputs.

The model is used to make a prediction for the prepared input data and the output is added both to the list for the actual output sequence that we will return and the list of input data from which we will draw observations as input for the model on the next iteration.

    # make a recursive multi-step forecast
    def forecast(model, input_x, n_input):
        yhat_sequence = list()
        input_data = [x for x in input_x]
        for j in range(7):
            # prepare the input data
            X = array(input_data[-n_input:]).reshape(1, n_input)
            # make a one-step forecast
            yhat = model.predict(X)[0]
            # add to the result
            yhat_sequence.append(yhat)
            # add the prediction to the input
            input_data.append(yhat)
        return yhat_sequence

We now have all of the elements to fit and evaluate scikit-learn models using a recursive multi-step forecasting strategy.

We can update the *evaluate_model()* function defined in the previous section to call the *sklearn_predict()* function. The updated function is listed below.

    # evaluate a single model
    def evaluate_model(model, train, test, n_input):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = sklearn_predict(model, history, n_input)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

An important final function is *get_models()*, which defines a dictionary of scikit-learn model objects mapped to a shorthand name we can use for reporting.

We will start off by evaluating a suite of linear algorithms. We would expect that these would perform similarly to an autoregression model (e.g. AR(7) if seven days of inputs were used).

The *get_models()* function with ten linear models is defined below.

This is a spot check where we are interested in the general performance of a diverse range of algorithms rather than optimizing any given algorithm.

    # prepare a list of ml models
    def get_models(models=dict()):
        # linear models
        models['lr'] = LinearRegression()
        models['lasso'] = Lasso()
        models['ridge'] = Ridge()
        models['en'] = ElasticNet()
        models['huber'] = HuberRegressor()
        models['lars'] = Lars()
        models['llars'] = LassoLars()
        models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
        models['ranscac'] = RANSACRegressor()
        models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
        print('Defined %d models' % len(models))
        return models

Finally, we can tie all of this together.

First, the dataset is loaded and split into train and test sets.

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True,
        parse_dates=['datetime'], index_col=['datetime'])
    # split into train and test
    train, test = split_dataset(dataset.values)

We can then prepare the dictionary of models and define the number of prior days of observations to use as inputs to the model.

    # prepare the models to evaluate
    models = get_models()
    n_input = 7

The models in the dictionary are then enumerated, evaluating each, summarizing their scores, and adding the results to a line plot.

The complete example is listed below.

    # recursive multi-step forecast with linear algorithms
    from math import sqrt
    from numpy import split
    from numpy import array
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from matplotlib import pyplot
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import Lasso
    from sklearn.linear_model import Ridge
    from sklearn.linear_model import ElasticNet
    from sklearn.linear_model import HuberRegressor
    from sklearn.linear_model import Lars
    from sklearn.linear_model import LassoLars
    from sklearn.linear_model import PassiveAggressiveRegressor
    from sklearn.linear_model import RANSACRegressor
    from sklearn.linear_model import SGDRegressor

    # split the dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train) // 7))
        test = array(split(test, len(test) // 7))
        return train, test

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

    # prepare a list of ml models
    def get_models(models=dict()):
        # linear models
        models['lr'] = LinearRegression()
        models['lasso'] = Lasso()
        models['ridge'] = Ridge()
        models['en'] = ElasticNet()
        models['huber'] = HuberRegressor()
        models['lars'] = Lars()
        models['llars'] = LassoLars()
        models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
        models['ranscac'] = RANSACRegressor()
        models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
        print('Defined %d models' % len(models))
        return models

    # create a feature preparation pipeline for a model
    def make_pipeline(model):
        steps = list()
        # standardization
        steps.append(('standardize', StandardScaler()))
        # normalization
        steps.append(('normalize', MinMaxScaler()))
        # the model
        steps.append(('model', model))
        # create pipeline
        pipeline = Pipeline(steps=steps)
        return pipeline

    # make a recursive multi-step forecast
    def forecast(model, input_x, n_input):
        yhat_sequence = list()
        input_data = [x for x in input_x]
        for j in range(7):
            # prepare the input data
            X = array(input_data[-n_input:]).reshape(1, n_input)
            # make a one-step forecast
            yhat = model.predict(X)[0]
            # add to the result
            yhat_sequence.append(yhat)
            # add the prediction to the input
            input_data.append(yhat)
        return yhat_sequence

    # convert windows of weekly multivariate data into a series of total power
    def to_series(data):
        # extract just the total power from each week
        series = [week[:, 0] for week in data]
        # flatten into a single series
        series = array(series).flatten()
        return series

    # convert history into inputs and outputs
    def to_supervised(history, n_input):
        # convert history to a univariate series
        data = to_series(history)
        X, y = list(), list()
        ix_start = 0
        # step over the entire history one time step at a time
        for i in range(len(data)):
            # define the end of the input sequence
            ix_end = ix_start + n_input
            # ensure we have enough data for this instance
            if ix_end < len(data):
                X.append(data[ix_start:ix_end])
                y.append(data[ix_end])
            # move along one time step
            ix_start += 1
        return array(X), array(y)

    # fit a model and make a forecast
    def sklearn_predict(model, history, n_input):
        # prepare data
        train_x, train_y = to_supervised(history, n_input)
        # make pipeline
        pipeline = make_pipeline(model)
        # fit the model
        pipeline.fit(train_x, train_y)
        # predict the week, recursively
        yhat_sequence = forecast(pipeline, train_x[-1, :], n_input)
        return yhat_sequence

    # evaluate a single model
    def evaluate_model(model, train, test, n_input):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = sklearn_predict(model, history, n_input)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True,
        parse_dates=['datetime'], index_col=['datetime'])
    # split into train and test
    train, test = split_dataset(dataset.values)
    # prepare the models to evaluate
    models = get_models()
    n_input = 7
    # evaluate each model
    days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
    for name, model in models.items():
        # evaluate and get scores
        score, scores = evaluate_model(model, train, test, n_input)
        # summarize scores
        summarize_scores(name, score, scores)
        # plot scores
        pyplot.plot(days, scores, marker='o', label=name)
    # show plot
    pyplot.legend()
    pyplot.show()

Running the example evaluates the ten linear algorithms and summarizes the results.

Each of the algorithms is evaluated and its performance is reported with a one-line summary, including the overall RMSE as well as the per-time step RMSE.

We can see that most of the evaluated models performed well, below 400 kilowatts in error over the whole week, with perhaps the Stochastic Gradient Descent (SGD) regressor performing the best with an overall RMSE of about 383.

    Defined 10 models
    lr: [388.388] 411.0, 389.1, 338.0, 370.8, 408.5, 308.3, 471.1
    lasso: [386.838] 403.6, 388.9, 337.3, 371.1, 406.1, 307.6, 471.6
    ridge: [387.659] 407.9, 388.6, 337.5, 371.2, 407.0, 307.7, 471.7
    en: [469.337] 452.2, 451.9, 435.8, 485.7, 460.4, 405.8, 575.1
    huber: [392.465] 412.1, 388.0, 337.9, 377.3, 405.6, 306.9, 492.5
    lars: [388.388] 411.0, 389.1, 338.0, 370.8, 408.5, 308.3, 471.1
    llars: [388.406] 396.1, 387.8, 339.3, 377.8, 402.9, 310.3, 481.9
    pa: [399.402] 410.0, 391.7, 342.2, 389.7, 409.8, 315.9, 508.4
    ranscac: [439.945] 454.0, 424.0, 369.5, 421.5, 457.5, 409.7, 526.9
    sgd: [383.177] 400.3, 386.0, 333.0, 368.9, 401.5, 303.9, 466.9

A line plot of the daily RMSE for each of the 10 models is also created.

We can see that all but two of the methods cluster together, performing about equally well across the seven-day forecasts.

Better results may be achieved by tuning the hyperparameters of some of the better performing algorithms. Further, it may be interesting to update the example to test a suite of nonlinear and ensemble algorithms.

An interesting experiment may be to evaluate the performance of one or a few of the better performing algorithms with more or fewer prior days as input.

An alternative to the recursive strategy for multi-step forecasting is to use a different model for each of the days to be forecasted.

This is called a direct multi-step forecasting strategy.

Because we are interested in forecasting seven days, this would require preparing seven different models, each specialized for forecasting a different day.

There are two approaches to training such a model:

- **Predict Day**. Models can be prepared to predict a specific day of the standard week, e.g. Monday.
- **Predict Lead Time**. Models can be prepared to predict a specific lead time, e.g. day 1.

Predicting a day will be more specific, but will mean that less of the training data can be used for each model. Predicting a lead time makes use of more of the training data, but requires the model to generalize across the different days of the week.
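The difference in available training data can be made concrete with a small sketch. It uses a made-up history of 10 standard weeks and two hypothetical helper functions (`per_day_samples()` and `per_lead_time_samples()` are illustrative names, not part of the tutorial code) to count how many input/output pairs each framing produces.

```python
import numpy as np

# Hypothetical toy history: 10 standard weeks of 7 daily values each.
weeks = np.arange(70, dtype=float).reshape(10, 7)

def per_day_samples(weeks, output_ix):
    """Predict Day: the prior week is the input, a fixed day of the next week is the output."""
    X = weeks[:-1]
    y = weeks[1:, output_ix]
    return X, y

def per_lead_time_samples(weeks, n_input, output_ix):
    """Predict Lead Time: slide a window one day at a time over the flattened series."""
    series = weeks.flatten()
    X, y = [], []
    for start in range(len(series)):
        end = start + n_input
        out = end + output_ix
        # keep the sample only if the output index exists
        if out < len(series):
            X.append(series[start:end])
            y.append(series[out])
    return np.array(X), np.array(y)

X_day, y_day = per_day_samples(weeks, output_ix=0)
X_lead, y_lead = per_lead_time_samples(weeks, n_input=7, output_ix=0)
print('per-day samples:', X_day.shape[0])
print('per-lead time samples:', X_lead.shape[0])
```

With 10 weeks, the per-day framing yields one sample per pair of consecutive weeks, whereas the per-lead time framing yields one sample per day for which a full input window and an output value exist.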

We will explore both approaches in this section.

First, we must update the *to_supervised()* function to prepare the data so that the prior week of observations is used as input and an observation from a specific day in the following week is used as the output.

The updated *to_supervised()* function that implements this behavior is listed below. It takes an argument *output_ix* that defines the day [0,6] in the following week to use as the output.

    # convert history into inputs and outputs
    def to_supervised(history, output_ix):
        X, y = list(), list()
        # step over the entire history one time step at a time
        for i in range(len(history)-1):
            X.append(history[i][:,0])
            y.append(history[i + 1][output_ix,0])
        return array(X), array(y)

This function can be called seven times, once for each of the seven models required.

Next, we can update the *sklearn_predict()* function to create a new dataset and a new model for each day in the one-week forecast.

The body of the function is mostly unchanged, only it is used within a loop over each day in the output sequence, where the index of the day “*i*” is passed to the call to *to_supervised()* in order to prepare a specific dataset for training a model to predict that day.

The function no longer takes an *n_input* argument, as we have fixed the input to be the seven days of the prior week.

    # fit a model and make a forecast
    def sklearn_predict(model, history):
        yhat_sequence = list()
        # fit a model for each forecast day
        for i in range(7):
            # prepare data
            train_x, train_y = to_supervised(history, i)
            # make pipeline
            pipeline = make_pipeline(model)
            # fit the model
            pipeline.fit(train_x, train_y)
            # forecast
            x_input = array(train_x[-1, :]).reshape(1,7)
            yhat = pipeline.predict(x_input)[0]
            # store
            yhat_sequence.append(yhat)
        return yhat_sequence

The complete example is listed below.

    # direct multi-step forecast by day
    from math import sqrt
    from numpy import split
    from numpy import array
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from matplotlib import pyplot
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import Lasso
    from sklearn.linear_model import Ridge
    from sklearn.linear_model import ElasticNet
    from sklearn.linear_model import HuberRegressor
    from sklearn.linear_model import Lars
    from sklearn.linear_model import LassoLars
    from sklearn.linear_model import PassiveAggressiveRegressor
    from sklearn.linear_model import RANSACRegressor
    from sklearn.linear_model import SGDRegressor

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/7))
        test = array(split(test, len(test)/7))
        return train, test

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

    # prepare a list of ml models
    def get_models(models=dict()):
        # linear models
        models['lr'] = LinearRegression()
        models['lasso'] = Lasso()
        models['ridge'] = Ridge()
        models['en'] = ElasticNet()
        models['huber'] = HuberRegressor()
        models['lars'] = Lars()
        models['llars'] = LassoLars()
        models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
        models['ranscac'] = RANSACRegressor()
        models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
        print('Defined %d models' % len(models))
        return models

    # create a feature preparation pipeline for a model
    def make_pipeline(model):
        steps = list()
        # standardization
        steps.append(('standardize', StandardScaler()))
        # normalization
        steps.append(('normalize', MinMaxScaler()))
        # the model
        steps.append(('model', model))
        # create pipeline
        pipeline = Pipeline(steps=steps)
        return pipeline

    # convert history into inputs and outputs
    def to_supervised(history, output_ix):
        X, y = list(), list()
        # step over the entire history one time step at a time
        for i in range(len(history)-1):
            X.append(history[i][:,0])
            y.append(history[i + 1][output_ix,0])
        return array(X), array(y)

    # fit a model and make a forecast
    def sklearn_predict(model, history):
        yhat_sequence = list()
        # fit a model for each forecast day
        for i in range(7):
            # prepare data
            train_x, train_y = to_supervised(history, i)
            # make pipeline
            pipeline = make_pipeline(model)
            # fit the model
            pipeline.fit(train_x, train_y)
            # forecast
            x_input = array(train_x[-1, :]).reshape(1,7)
            yhat = pipeline.predict(x_input)[0]
            # store
            yhat_sequence.append(yhat)
        return yhat_sequence

    # evaluate a single model
    def evaluate_model(model, train, test):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = sklearn_predict(model, history)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    # split into train and test
    train, test = split_dataset(dataset.values)
    # prepare the models to evaluate
    models = get_models()
    # evaluate each model
    days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
    for name, model in models.items():
        # evaluate and get scores
        score, scores = evaluate_model(model, train, test)
        # summarize scores
        summarize_scores(name, score, scores)
        # plot scores
        pyplot.plot(days, scores, marker='o', label=name)
    # show plot
    pyplot.legend()
    pyplot.show()

Running the example first summarizes the performance of each model.

We can see that the performance is slightly worse than the recursive model on this problem.

    Defined 10 models
    lr: [410.927] 463.8, 381.4, 351.9, 430.7, 387.8, 350.4, 488.8
    lasso: [408.440] 458.4, 378.5, 352.9, 429.5, 388.0, 348.0, 483.5
    ridge: [403.875] 447.1, 377.9, 347.5, 427.4, 384.1, 343.4, 479.7
    en: [454.263] 471.8, 433.8, 415.8, 477.4, 434.4, 373.8, 551.8
    huber: [409.500] 466.8, 380.2, 359.8, 432.4, 387.0, 351.3, 470.9
    lars: [410.927] 463.8, 381.4, 351.9, 430.7, 387.8, 350.4, 488.8
    llars: [406.490] 453.0, 378.8, 357.3, 428.1, 388.0, 345.0, 476.9
    pa: [402.476] 428.4, 380.9, 356.5, 426.7, 390.4, 348.6, 471.4
    ranscac: [497.225] 456.1, 423.0, 445.9, 547.6, 521.9, 451.5, 607.2
    sgd: [403.526] 441.4, 378.2, 354.5, 423.9, 382.4, 345.8, 480.3

A line plot of the per-day RMSE scores for each model is also created, showing a similar grouping of models as was seen with the recursive model.

The direct lead time approach is the same, except that the *to_supervised()* function makes use of more of the training dataset.

The function is the same as it was defined in the recursive model example, except it takes an additional *output_ix* argument to define the day in the following week to use as the output.

The updated *to_supervised()* function for the direct per-lead time strategy is listed below.

Unlike the per-day strategy, this version of the function does support variable sized inputs (not just seven days), allowing you to experiment if you like.

    # convert history into inputs and outputs
    def to_supervised(history, n_input, output_ix):
        # convert history to a univariate series
        data = to_series(history)
        X, y = list(), list()
        ix_start = 0
        # step over the entire history one time step at a time
        for i in range(len(data)):
            # define the end of the input sequence
            ix_end = ix_start + n_input
            ix_output = ix_end + output_ix
            # ensure we have enough data for this instance
            if ix_output < len(data):
                X.append(data[ix_start:ix_end])
                y.append(data[ix_output])
            # move along one time step
            ix_start += 1
        return array(X), array(y)

The complete example is listed below.

    # direct multi-step forecast by lead time
    from math import sqrt
    from numpy import split
    from numpy import array
    from pandas import read_csv
    from sklearn.metrics import mean_squared_error
    from matplotlib import pyplot
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import Lasso
    from sklearn.linear_model import Ridge
    from sklearn.linear_model import ElasticNet
    from sklearn.linear_model import HuberRegressor
    from sklearn.linear_model import Lars
    from sklearn.linear_model import LassoLars
    from sklearn.linear_model import PassiveAggressiveRegressor
    from sklearn.linear_model import RANSACRegressor
    from sklearn.linear_model import SGDRegressor

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/7))
        test = array(split(test, len(test)/7))
        return train, test

    # evaluate one or more weekly forecasts against expected values
    def evaluate_forecasts(actual, predicted):
        scores = list()
        # calculate an RMSE score for each day
        for i in range(actual.shape[1]):
            # calculate mse
            mse = mean_squared_error(actual[:, i], predicted[:, i])
            # calculate rmse
            rmse = sqrt(mse)
            # store
            scores.append(rmse)
        # calculate overall RMSE
        s = 0
        for row in range(actual.shape[0]):
            for col in range(actual.shape[1]):
                s += (actual[row, col] - predicted[row, col])**2
        score = sqrt(s / (actual.shape[0] * actual.shape[1]))
        return score, scores

    # summarize scores
    def summarize_scores(name, score, scores):
        s_scores = ', '.join(['%.1f' % s for s in scores])
        print('%s: [%.3f] %s' % (name, score, s_scores))

    # prepare a list of ml models
    def get_models(models=dict()):
        # linear models
        models['lr'] = LinearRegression()
        models['lasso'] = Lasso()
        models['ridge'] = Ridge()
        models['en'] = ElasticNet()
        models['huber'] = HuberRegressor()
        models['lars'] = Lars()
        models['llars'] = LassoLars()
        models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3)
        models['ranscac'] = RANSACRegressor()
        models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3)
        print('Defined %d models' % len(models))
        return models

    # create a feature preparation pipeline for a model
    def make_pipeline(model):
        steps = list()
        # standardization
        steps.append(('standardize', StandardScaler()))
        # normalization
        steps.append(('normalize', MinMaxScaler()))
        # the model
        steps.append(('model', model))
        # create pipeline
        pipeline = Pipeline(steps=steps)
        return pipeline

    # convert windows of weekly multivariate data into a series of total power
    def to_series(data):
        # extract just the total power from each week
        series = [week[:, 0] for week in data]
        # flatten into a single series
        series = array(series).flatten()
        return series

    # convert history into inputs and outputs
    def to_supervised(history, n_input, output_ix):
        # convert history to a univariate series
        data = to_series(history)
        X, y = list(), list()
        ix_start = 0
        # step over the entire history one time step at a time
        for i in range(len(data)):
            # define the end of the input sequence
            ix_end = ix_start + n_input
            ix_output = ix_end + output_ix
            # ensure we have enough data for this instance
            if ix_output < len(data):
                X.append(data[ix_start:ix_end])
                y.append(data[ix_output])
            # move along one time step
            ix_start += 1
        return array(X), array(y)

    # fit a model and make a forecast
    def sklearn_predict(model, history, n_input):
        yhat_sequence = list()
        # fit a model for each forecast day
        for i in range(7):
            # prepare data
            train_x, train_y = to_supervised(history, n_input, i)
            # make pipeline
            pipeline = make_pipeline(model)
            # fit the model
            pipeline.fit(train_x, train_y)
            # forecast
            x_input = array(train_x[-1, :]).reshape(1,n_input)
            yhat = pipeline.predict(x_input)[0]
            # store
            yhat_sequence.append(yhat)
        return yhat_sequence

    # evaluate a single model
    def evaluate_model(model, train, test, n_input):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = sklearn_predict(model, history, n_input)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores

    # load the new file
    dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
    # split into train and test
    train, test = split_dataset(dataset.values)
    # prepare the models to evaluate
    models = get_models()
    n_input = 7
    # evaluate each model
    days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
    for name, model in models.items():
        # evaluate and get scores
        score, scores = evaluate_model(model, train, test, n_input)
        # summarize scores
        summarize_scores(name, score, scores)
        # plot scores
        pyplot.plot(days, scores, marker='o', label=name)
    # show plot
    pyplot.legend()
    pyplot.show()

Running the example summarizes the overall and per-day RMSE for each of the evaluated linear models.

We can see that generally the per-lead time approach resulted in better performance than the per-day version. This is likely because the approach made more of the training data available to the model.

    Defined 10 models
    lr: [394.983] 411.0, 400.7, 340.2, 382.9, 385.1, 362.8, 469.4
    lasso: [391.767] 403.6, 394.4, 336.1, 382.7, 384.2, 360.4, 468.1
    ridge: [393.444] 407.9, 397.8, 338.9, 383.2, 383.2, 360.4, 469.6
    en: [461.986] 452.2, 448.3, 430.3, 480.4, 448.9, 396.0, 560.6
    huber: [394.287] 412.1, 394.0, 333.4, 384.1, 383.1, 364.3, 474.4
    lars: [394.983] 411.0, 400.7, 340.2, 382.9, 385.1, 362.8, 469.4
    llars: [390.075] 396.1, 390.1, 334.3, 384.4, 385.2, 355.6, 470.9
    pa: [389.340] 409.7, 380.6, 328.3, 388.6, 370.1, 351.8, 478.4
    ranscac: [439.298] 387.2, 462.4, 394.4, 427.7, 412.9, 447.9, 526.8
    sgd: [390.184] 396.7, 386.7, 337.6, 391.4, 374.0, 357.1, 473.5

A line plot of the per-day RMSE scores was again created.

It may be interesting to explore a blending of the per-day and per-lead time approaches to modeling the problem.

It may also be interesting to see if increasing the number of prior days used as input for the per-lead time improves performance, e.g. using two weeks of data instead of one week.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Tune Models**. Select one well-performing model and tune the model hyperparameters in order to further improve performance.
- **Tune Data Preparation**. All data was standardized and normalized prior to fitting each model; explore whether these methods are necessary and whether more or different combinations of data scaling methods can result in better performance.
- **Explore Input Size**. The input size was limited to seven days of prior observations; explore more and fewer days of observations as input and their impact on model performance.
- **Nonlinear Algorithms**. Explore a suite of nonlinear and ensemble machine learning algorithms to see if they can lift performance, such as SVM and Random Forest.
- **Multivariate Direct Models**. Develop direct models that make use of all input variables for the prior week, not just the total daily power consumed. This will require flattening the 2D arrays of seven days of eight variables into 1D vectors.
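As a starting point for the multivariate extension, the flattening step might look like the sketch below. The function name `to_supervised_multivariate()` and the random stand-in history are assumptions made for illustration; the sketch only shows how a week of seven days by eight variables can become a single 56-element input vector, with the total power in column 0 of the target day used as the output.

```python
import numpy as np

# Hypothetical stand-in history: 5 weeks x 7 days x 8 variables (total power in column 0).
rng = np.random.default_rng(1)
history = rng.random((5, 7, 8))

def to_supervised_multivariate(history, output_ix):
    """Pair each flattened week (7 * 8 = 56 values) with one day of total power in the next week."""
    X, y = [], []
    for i in range(len(history) - 1):
        # flatten the 2D week (days x variables) into a 1D input vector
        X.append(history[i].flatten())
        # output is the total power on the chosen day of the following week
        y.append(history[i + 1][output_ix, 0])
    return np.array(X), np.array(y)

X, y = to_supervised_multivariate(history, output_ix=2)
print(X.shape, y.shape)
```

The flattening is row-major, so the first eight values of each input vector are the eight variables for the first day of the week, followed by the eight variables for the second day, and so on.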

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API
- pandas.DataFrame.resample API
- Resample Offset Aliases
- sklearn.metrics.mean_squared_error API
- numpy.split API

- Individual household electric power consumption Data Set, UCI Machine Learning Repository.
- AC power, Wikipedia.
- 4 Strategies for Multi-Step Time Series Forecasting

In this tutorial, you discovered how to develop recursive and direct multi-step forecasting models with machine learning algorithms.

Specifically, you learned:

- How to develop a framework for evaluating linear, nonlinear, and ensemble machine learning algorithms for multi-step time series forecasting.
- How to evaluate machine learning algorithms using a recursive multi-step time series forecasting strategy.
- How to evaluate machine learning algorithms using a direct per-day and per-lead time multi-step time series forecasting strategy.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Multi-step Time Series Forecasting with Machine Learning for Household Electricity Consumption appeared first on Machine Learning Mastery.

The post How to Develop an Autoregression Forecast Model for Household Electricity Consumption appeared first on Machine Learning Mastery.

Autoregression models are very simple and can provide a fast and effective way to make skillful one-step and multi-step forecasts for electricity consumption.

In this tutorial, you will discover how to develop and evaluate an autoregression model for multi-step forecasting household power consumption.

After completing this tutorial, you will know:

- How to create and analyze autocorrelation and partial autocorrelation plots for univariate time series data.
- How to use the findings from autocorrelation plots to configure an autoregression model.
- How to develop and evaluate an autoregression model used to make one-week forecasts.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem Description
- Load and Prepare Dataset
- Model Evaluation
- Autocorrelation Analysis
- Develop an Autoregression Model

The household power consumption dataset is a multivariate series comprised of seven variables (besides the date and time); they are:

- **global_active_power**: The total active power consumed by the household (kilowatts).
- **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
- **voltage**: Average voltage (volts).
- **global_intensity**: Average current intensity (amps).
- **sub_metering_1**: Active energy for kitchen (watt-hours of active energy).
- **sub_metering_2**: Active energy for laundry (watt-hours of active energy).
- **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

The raw data is provided in the file “*household_power_consumption.txt*” that is about 127 megabytes in size and contains all of the observations.

We can use the *read_csv()* function to load the data and combine the first two columns into a single date-time column that we can use as an index.

Next, we can mark all missing values indicated with a ‘*?*’ character with a *NaN* value, which is a float.
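A minimal sketch of these loading steps is given below. It runs on a three-row in-memory stand-in rather than the real file, and the ‘;’ separator and the `Date` and `Time` column names are assumptions about the raw file layout. Instead of replacing ‘?’ after loading, the sketch passes `na_values` to *read_csv()* so that the marker becomes *NaN* during parsing, and it builds the date-time index explicitly with `to_datetime()`.

```python
from io import StringIO
import pandas as pd

# Tiny stand-in for the raw file; separator and column names are assumed, not confirmed.
raw = StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# treat '?' as missing while parsing
dataset = pd.read_csv(raw, sep=';', na_values=['?'])
# combine the first two columns into a single date-time index
dataset['datetime'] = pd.to_datetime(
    dataset['Date'] + ' ' + dataset['Time'], format='%d/%m/%Y %H:%M:%S')
dataset = dataset.drop(columns=['Date', 'Time']).set_index('datetime')
# make sure every remaining column is numeric
dataset = dataset.astype('float32')
print(dataset)
```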

We also need to fill in the missing values now that they have been marked.

We can implement this in a function named *fill_missing()* that will take the NumPy array of the data and copy values from exactly 24 hours ago.
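One way such a function might look is sketched below. It assumes one observation per minute, so that 24 hours corresponds to 60 * 24 rows; the `one_day` parameter is an addition made here so the behavior can be checked on a tiny array.

```python
import numpy as np

def fill_missing(values, one_day=60 * 24):
    """Replace each NaN, in place, with the value one_day rows earlier in the same column."""
    for row in range(values.shape[0]):
        for col in range(values.shape[1]):
            if np.isnan(values[row, col]):
                # note: for rows earlier than one_day the negative index wraps to the end of the array
                values[row, col] = values[row - one_day, col]

# toy check with a two-row "day" instead of 1,440 rows
demo = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan]])
fill_missing(demo, one_day=2)
print(demo.ravel())
```

Because the array is filled in place from top to bottom, a gap longer than one day is filled from values that were themselves filled earlier in the same pass.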

We can apply this function directly to the data within the DataFrame.

    # fill missing
    fill_missing(dataset.values)

Finally, we can save the cleaned-up dataset to a new file named ‘*household_power_consumption.csv*’.

    # save updated dataset
    dataset.to_csv('household_power_consumption.csv')

This creates the new file ‘*household_power_consumption.csv*’ that we can use as the starting point for our modeling project.


This section is divided into four parts; they are:

- Problem Framing
- Evaluation Metric
- Train and Test Sets
- Walk-Forward Validation

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

Resampling with the offset alias ‘*D*’ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.
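The grouping and summing can be sketched with the pandas `resample()` function on a small synthetic frame. The three days of constant per-minute data used here are made up purely so that the daily totals are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical per-minute data: three full days with a constant reading of 1.0.
index = pd.date_range('2006-12-16', periods=3 * 24 * 60, freq='min')
frame = pd.DataFrame({'global_active_power': np.ones(len(index))}, index=index)

# group the date-time indexed rows by day and sum each group
daily = frame.resample('D').sum()
print(daily)
```

Each of the three daily rows holds the sum of that day's 1,440 per-minute values.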

The complete example is listed below.

The daily data can then be saved to the file ‘*household_power_consumption_days.csv*’.

A forecast will be comprised of seven values, one for each day of the week ahead.

- To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

One possible score that could be used would be the RMSE across all forecast days.

The function *evaluate_forecasts()* below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.
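A compact sketch of such a function is shown here, written with vectorized NumPy operations. It assumes that both arguments are 2D arrays with one row per forecast and one column per lead time, and the toy arrays use two lead times rather than seven to keep the arithmetic easy to follow.

```python
import numpy as np

def evaluate_forecasts(actual, predicted):
    """Return the overall RMSE and a list of per-lead-time RMSE scores."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2
    # one RMSE per column (lead time), averaged over all forecasts
    scores = np.sqrt(squared_error.mean(axis=0)).tolist()
    # single RMSE across every forecast and lead time
    score = float(np.sqrt(squared_error.mean()))
    return score, scores

# toy check: two forecasts, two lead times
actual = np.array([[1.0, 2.0], [3.0, 4.0]])
predicted = np.array([[1.0, 4.0], [3.0, 0.0]])
score, scores = evaluate_forecasts(actual, predicted)
print(score, scores)
```

In the toy check the first lead time is predicted perfectly, so its RMSE is zero, while the overall score pools the squared errors from both lead times before taking the square root.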

We will split the data into standard weeks, working backwards from the test dataset.

The first and last rows of daily data for the test dataset are provided below for confirmation.

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

The function *split_dataset()* below splits the daily data into train and test sets and organizes each into standard weeks.

The complete code example is listed below.

    (159, 7, 8) 3390.46 1309.2679999999998
    (46, 7, 8) 2083.4539999999984 2197.006000000004

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

    Input,                      Predict
    [Week1]                     Week2
    [Week1 + Week2]             Week3
    [Week1 + Week2 + Week3]     Week4
    ...

The walk-forward validation approach to evaluating predictive models on this dataset is implemented below, named *evaluate_model()*.

The name of a function is provided for the model as the argument “*model_func*“. This function is responsible for defining the model, fitting the model on the training data, and making a one-week forecast.

The forecasts made by the model are then evaluated against the test dataset using the previously defined *evaluate_forecasts()* function.

    # evaluate a single model
    def evaluate_model(model_func, train, test):
        # history is a list of weekly data
        history = [x for x in train]
        # walk-forward validation over each week
        predictions = list()
        for i in range(len(test)):
            # predict the week
            yhat_sequence = model_func(history)
            # store the predictions
            predictions.append(yhat_sequence)
            # get real observation and add to history for predicting the next week
            history.append(test[i, :])
        predictions = array(predictions)
        # evaluate predictions days for each week
        score, scores = evaluate_forecasts(test[:, :, 0], predictions)
        return score, scores
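To illustrate the interface expected of the *model_func* argument, here is a hypothetical function that is not part of the tutorial: a weekly persistence forecast that simply repeats the total power of the most recent week in the history. It assumes, as in the rest of the code, that each week is a 2D array with seven rows and the total power in column 0.

```python
import numpy as np

def weekly_persistence(history):
    """Hypothetical model_func: forecast next week as a copy of the last observed week."""
    last_week = history[-1]
    # column 0 holds the total daily power
    return last_week[:, 0].tolist()

# toy history of two weeks with eight variables each
history = [np.full((7, 8), 1.0), np.arange(56, dtype=float).reshape(7, 8)]
print(weekly_persistence(history))
```

Any function with this shape, taking the list of weekly arrays and returning seven values, could be passed as the model function to the walk-forward validation.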

Once we have the evaluation for a model, we can summarize the performance.

The function below named *summarize_scores()* will display the performance of a model as a single line for easy comparison with other models.

We now have all of the elements to begin evaluating predictive models on the dataset.

Statistical correlation summarizes the strength of the relationship between two variables.

We can assume the distribution of each variable fits a Gaussian (bell curve) distribution. If this is the case, we can use the Pearson’s correlation coefficient to summarize the correlation between the variables.

The Pearson’s correlation coefficient is a number between -1 and 1 that describes a negative or positive correlation respectively. A value of zero indicates no correlation.

We can calculate the correlation between time series observations and observations at previous time steps, called lags. Because the correlation of the time series observations is calculated with values of the same series at previous times, this is called a serial correlation, or an autocorrelation.
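The idea can be sketched directly with NumPy by correlating a series with a shifted copy of itself. The helper name `autocorrelation()` and the synthetic weekly sine wave are illustrative assumptions. Note that this Pearson-based calculation is a simplification: the estimator typically used for ACF plots normalizes by the mean and variance of the whole series, so values may differ slightly.

```python
import numpy as np

def autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps (lag >= 1)."""
    series = np.asarray(series, dtype=float)
    return float(np.corrcoef(series[lag:], series[:-lag])[0, 1])

# synthetic series with an exact seven-step cycle
t = np.arange(70)
weekly = np.sin(2 * np.pi * t / 7)
print('lag 1: %.3f' % autocorrelation(weekly, 1))
print('lag 7: %.3f' % autocorrelation(weekly, 7))
```

For a series that repeats every seven steps, the lag-7 correlation is essentially perfect, whereas the lag-1 correlation is noticeably weaker.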

A plot of the autocorrelation of a time series by lag is called the AutoCorrelation Function, or the acronym ACF. This plot is sometimes called a correlogram, or an autocorrelation plot.

A partial autocorrelation function or PACF is a summary of the relationship between an observation in a time series with observations at prior time steps with the relationships of intervening observations removed.

The autocorrelation for an observation and an observation at a prior time step is comprised of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation, with observations at intervening time steps.

It is these indirect correlations that the partial autocorrelation function seeks to remove. Without going into the math, this is the intuition for the partial autocorrelation.
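One common way to make this intuition concrete is by regression: the partial autocorrelation at lag k can be estimated as the coefficient on the lag-k term when the series is regressed on all of its lags from 1 to k. The sketch below does this with ordinary least squares in NumPy. The helper name `partial_autocorrelation()` and the simulated series are assumptions for illustration, and the estimate may differ slightly from the default method used by plotting libraries.

```python
import numpy as np

def partial_autocorrelation(series, lag):
    """Estimate the lag-`lag` partial autocorrelation via least squares on lags 1..lag."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    # column j holds the series shifted back by j + 1 steps, aligned with series[lag:]
    lagged = np.column_stack(
        [series[lag - j - 1: n - j - 1] for j in range(lag)])
    design = np.column_stack([np.ones(n - lag), lagged])
    target = series[lag:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    # the last coefficient belongs to the longest lag
    return float(coef[-1])

# simulated series where each value depends directly only on the previous one
rng = np.random.default_rng(0)
noise = rng.normal(size=5000)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = 0.7 * ar1[t - 1] + noise[t]

pacf_lag1 = partial_autocorrelation(ar1, 1)
pacf_lag2 = partial_autocorrelation(ar1, 2)
print('lag 1: %.2f, lag 2: %.2f' % (pacf_lag1, pacf_lag2))
```

In the simulated series, the value two steps back is correlated with the current value only through the intervening step, so the lag-2 partial autocorrelation should be close to zero once the lag-1 relationship is accounted for.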

We can calculate autocorrelation and partial autocorrelation plots using the plot_acf() and plot_pacf() statsmodels functions respectively.

In order to calculate and plot the autocorrelation, we must convert the data into a univariate time series, specifically the observed daily total power consumed.

The *to_series()* function below will take the multivariate data divided into weekly windows and will return a single univariate time series.

    # convert windows of weekly multivariate data into a series of total power
    def to_series(data):
        # extract just the total power from each week
        series = [week[:, 0] for week in data]
        # flatten into a single series
        series = array(series).flatten()
        return series

We can call this function for the prepared training dataset.

First, the daily power consumption dataset must be loaded.

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])

The dataset must then be split into train and test sets with the standard week window structure.

# split into train and test
train, test = split_dataset(dataset.values)

A univariate time series of daily power consumption can then be extracted from the training dataset.

# convert training data into a series
series = to_series(train)

We can then create a single figure that contains both an ACF and a PACF plot. The number of lag time steps can be specified. We will fix this to be one year of daily observations, or 365 days.

# plots
pyplot.figure()
lags = 365
# acf
axis = pyplot.subplot(2, 1, 1)
plot_acf(series, ax=axis, lags=lags)
# pacf
axis = pyplot.subplot(2, 1, 2)
plot_pacf(series, ax=axis, lags=lags)
# show plot
pyplot.show()

We would expect that the power consumed tomorrow and in the coming week will be dependent upon the power consumed in the prior days. As such, we would expect to see a strong autocorrelation signal in the ACF and PACF plots.

The complete example is listed below.

# acf and pacf plots of total power
from numpy import split
from numpy import array
from pandas import read_csv
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test
# convert windows of weekly multivariate data into a series of total power
def to_series(data):
    # extract just the total power from each week
    series = [week[:, 0] for week in data]
    # flatten into a single series
    series = array(series).flatten()
    return series
# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# convert training data into a series
series = to_series(train)
# plots
pyplot.figure()
lags = 365
# acf
axis = pyplot.subplot(2, 1, 1)
plot_acf(series, ax=axis, lags=lags)
# pacf
axis = pyplot.subplot(2, 1, 2)
plot_pacf(series, ax=axis, lags=lags)
# show plot
pyplot.show()

Running the example creates a single figure with both ACF and PACF plots.

The plots are very dense and hard to read. Nevertheless, we might be able to see a familiar autoregression pattern.

We might also see some significant lag observations at one year out. Further investigation may suggest a seasonal autocorrelation component, which would not be a surprising finding.

We can zoom in on the plot by changing the number of lag observations from 365 to 50.

lags = 50

Re-running the code example with this change results in a zoomed-in version of the plots with much less clutter.

We can clearly see a familiar autoregression pattern across the two plots. This pattern is comprised of two elements:

- **ACF**: A large number of significant lag observations that slowly degrade as the lag increases.
- **PACF**: A few significant lag observations that abruptly drop as the lag increases.

The ACF plot indicates that there is a strong autocorrelation component, whereas the PACF plot indicates that this component is distinct for the first approximately seven lag observations.

This suggests that a good starting model would be an AR(7); that is, an autoregression model with seven lag observations used as input.

We can develop an autoregression model for the univariate series of daily power consumption.

The Statsmodels library provides multiple ways of developing an AR model, such as using the AR, ARMA, ARIMA, and SARIMAX classes.

We will use the ARIMA implementation as it allows for easy expandability into differencing and moving average.

First, the history data comprised of weeks of prior observations must be converted into a univariate time series of daily power consumption. We can use the *to_series()* function developed in the previous section.

# convert history into a univariate series
series = to_series(history)

Next, an ARIMA model can be defined by passing arguments to the constructor of the ARIMA class.

We will specify an AR(7) model, which in ARIMA notation is ARIMA(7,0,0).

# define the model
model = ARIMA(series, order=(7,0,0))

Next, the model can be fit on the training data. We will use the defaults and disable all debugging information during the fit by setting *disp=False*.

# fit the model
model_fit = model.fit(disp=False)

Now that the model has been fit, we can make a prediction.

A prediction can be made by calling the *predict()* function and passing it either an interval of dates or indices relative to the training data. We will use indices starting with the first time step beyond the training data and extending it six more days, giving a total of a seven day forecast period beyond the training dataset.

# make forecast
yhat = model_fit.predict(len(series), len(series)+6)
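Conceptually, a multi-step forecast from a pure autoregression model can be viewed as recursive: each predicted value is fed back in as an input for the next step. The sketch below illustrates that idea in plain Python. The function *ar_forecast()*, the constant, and the coefficients are all hypothetical and are not the values estimated by the ARIMA class.

```python
# minimal sketch: recursive multi-step forecast for an AR(p) model
def ar_forecast(history, const, coefs, steps):
    # coefs[0] weights the most recent value, coefs[1] the one before, and so on
    values = list(history)
    forecast = []
    for _ in range(steps):
        # most recent observations (or earlier predictions) first
        lags = values[-len(coefs):][::-1]
        yhat = const + sum(c * v for c, v in zip(coefs, lags))
        forecast.append(yhat)
        # feed the prediction back in for the next step
        values.append(yhat)
    return forecast

# hypothetical AR(2) coefficients that extrapolate a straight line
print(ar_forecast([1.0, 2.0, 3.0], 0.0, [2.0, -1.0], 3))
```

Because later steps depend on earlier predictions rather than on real observations, we might expect errors to grow with the lead time.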

We can wrap all of this up into a function below named *arima_forecast()* that takes the history and returns a one week forecast.

# arima forecast
def arima_forecast(history):
    # convert history into a univariate series
    series = to_series(history)
    # define the model
    model = ARIMA(series, order=(7,0,0))
    # fit the model
    model_fit = model.fit(disp=False)
    # make forecast
    yhat = model_fit.predict(len(series), len(series)+6)
    return yhat

This function can be used directly in the test harness described previously.

The complete example is listed below.

# arima forecast
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
from statsmodels.tsa.arima_model import ARIMA
# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test
# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores
# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))
# evaluate a single model
def evaluate_model(model_func, train, test):
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = model_func(history)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    predictions = array(predictions)
    # evaluate predictions days for each week
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores
# convert windows of weekly multivariate data into a series of total power
def to_series(data):
    # extract just the total power from each week
    series = [week[:, 0] for week in data]
    # flatten into a single series
    series = array(series).flatten()
    return series
# arima forecast
def arima_forecast(history):
    # convert history into a univariate series
    series = to_series(history)
    # define the model
    model = ARIMA(series, order=(7,0,0))
    # fit the model
    model_fit = model.fit(disp=False)
    # make forecast
    yhat = model_fit.predict(len(series), len(series)+6)
    return yhat
# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# define the names and functions for the models we wish to evaluate
models = dict()
models['arima'] = arima_forecast
# evaluate each model
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
for name, func in models.items():
    # evaluate and get scores
    score, scores = evaluate_model(func, train, test)
    # summarize scores
    summarize_scores(name, score, scores)
    # plot scores
    pyplot.plot(days, scores, marker='o', label=name)
# show plot
pyplot.legend()
pyplot.show()

Running the example first prints the performance of the AR(7) model on the test dataset.

We can see that the model achieves an overall RMSE of about 381 kilowatts.

This model has skill when compared to naive forecast models, such as a model that forecasts the week ahead using observations from the same time one year ago, which achieved an overall RMSE of about 465 kilowatts.

arima: [381.636] 393.8, 398.9, 357.0, 377.2, 393.9, 306.1, 432.2

A line plot of the forecast error is also created, showing the RMSE in kilowatts for each of the seven lead times of the forecast.

We can see an interesting pattern.

We might expect that earlier lead times are easier to forecast than later lead times, as the error at each successive lead time compounds.

Instead, we see that Friday (lead time +6) is the easiest to forecast and Saturday (lead time +7) is the most challenging to forecast. We can also see that the remaining lead times all have a similar error in the mid- to high-300 kilowatt range.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Tune ARIMA**. The parameters of the ARIMA model were not tuned. Explore or search a suite of ARIMA parameters (p, d, q) to see if performance can be further improved.
- **Explore Seasonal AR**. Explore whether the performance of the AR model can be improved by including seasonal autoregression elements. This may require the use of a SARIMA model.
- **Explore Data Preparation**. The model was fit on the raw data directly. Explore whether standardization or normalization or even power transforms can further improve the skill of the AR model.
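As a starting point for the first idea, the sketch below enumerates candidate ARIMA orders and keeps the one with the lowest error. Both function names are hypothetical, and *score_func* is an assumed callable that takes an order and returns an error such as the overall RMSE from the test harness; the toy scoring function at the end is a stand-in, not a real model evaluation.

```python
# minimal sketch: enumerate ARIMA orders and keep the best scoring one
from itertools import product

def arima_orders(p_values, d_values, q_values):
    # every (p, d, q) combination, in the order the ARIMA class expects
    return list(product(p_values, d_values, q_values))

def grid_search(orders, score_func):
    # score_func(order) is assumed to return an error where lower is better
    best_order, best_score = None, float('inf')
    for order in orders:
        try:
            score = score_func(order)
        except Exception:
            # some configurations may fail to fit, so skip them
            continue
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score

orders = arima_orders([7, 14], [0, 1], [0, 1])
# toy stand-in for a real evaluation, for illustration only
best = grid_search(orders, lambda order: abs(order[0] - 7) + order[1] + order[2])
print(best)
```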

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API
- pandas.DataFrame.resample API
- Resample Offset Aliases
- sklearn.metrics.mean_squared_error API
- numpy.split API
- statsmodels.graphics.tsaplots.plot_acf API
- statsmodels.graphics.tsaplots.plot_pacf API
- statsmodels.tsa.arima_model.ARIMA API

- Individual household electric power consumption Data Set, UCI Machine Learning Repository.
- AC power, Wikipedia.
- Correlogram, Wikipedia.

In this tutorial, you discovered how to develop and evaluate an autoregression model for multi-step forecasting household power consumption.

Specifically, you learned:

- How to create and analyze autocorrelation and partial autocorrelation plots for univariate time series data.
- How to use the findings from autocorrelation plots to configure an autoregression model.
- How to develop and evaluate an autoregression model used to make one-week forecasts.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop an Autoregression Forecast Model for Household Electricity Consumption appeared first on Machine Learning Mastery.

The post How to Develop and Evaluate Naive Methods for Forecasting Household Electricity Consumption appeared first on Machine Learning Mastery.

In this tutorial, you will discover how to develop a test harness for the ‘household power consumption’ dataset and evaluate three naive forecast strategies that provide a baseline for more sophisticated algorithms.

After completing this tutorial, you will know:

- How to load, prepare, and downsample the household power consumption dataset ready for developing models.
- How to develop metrics, dataset split, and walk-forward validation elements for a robust test harness for evaluating forecasting models.
- How to develop, evaluate, and compare the performance of a suite of naive persistence forecasting methods.

Let’s get started.

This tutorial is divided into four parts; they are:

- Problem Description
- Load and Prepare Dataset
- Model Evaluation
- Naive Forecast Models

The household power consumption dataset is a multivariate series comprised of seven variables (besides the date and time); they are:

- **global_active_power**: The total active power consumed by the household (kilowatts).
- **global_reactive_power**: The total reactive power consumed by the household (kilowatts).
- **voltage**: Average voltage (volts).
- **global_intensity**: Average current intensity (amps).
- **sub_metering_1**: Active energy for kitchen (watt-hours of active energy).
- **sub_metering_2**: Active energy for laundry (watt-hours of active energy).
- **sub_metering_3**: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

The dataset is distributed as a single file named “*household_power_consumption.txt*” that is about 127 megabytes in size and contains all of the observations.

We can use the *read_csv()* function to load the data and combine the first two columns into a single date-time column that we can use as an index.

Next, we can mark all missing values, indicated with a ‘*?*’ character, with a *NaN* value, which is a float.

We also need to fill in the missing values now that they have been marked.

One simple approach is a function named *fill_missing()* that will take the NumPy array of the data and copy values from exactly 24 hours ago.
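The idea of copying values from exactly 24 hours earlier can be sketched as follows. This is a minimal illustration on plain Python lists rather than the NumPy array used in the tutorial. The name *fill_missing_lists()* is hypothetical, and the default of 60 * 24 rows per day assumes one observation per minute.

```python
# minimal sketch: fill NaN values with the value from one day earlier
from math import isnan

def fill_missing_lists(values, steps_per_day=60 * 24):
    # values is a list of rows (lists of floats) and is modified in place
    for row in range(len(values)):
        for col in range(len(values[row])):
            if isnan(values[row][col]):
                # copy the observation from the same time one day before
                values[row][col] = values[row - steps_per_day][col]

# toy data with two observations per "day" (an assumption for brevity)
data = [[1.0], [2.0], [float('nan')], [4.0]]
fill_missing_lists(data, steps_per_day=2)
print(data)
```

One caveat of this simple approach is that a missing value within the first day has no earlier day to copy from; with negative indexing, the lookup wraps around to the end of the data instead.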

We can apply this function directly to the data within the DataFrame.

# fill missing
fill_missing(dataset.values)

The cleaned dataset can then be saved to a new file named ‘*household_power_consumption.csv*’.

# save updated dataset
dataset.to_csv('household_power_consumption.csv')

This gives us the file ‘*household_power_consumption.csv*’ that we can use as the starting point for our modeling project.

This section is divided into four parts; they are:

- Problem Framing
- Evaluation Metric
- Train and Test Sets
- Walk-Forward Validation

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

Calling the pandas *resample()* function with the argument ‘*D*’ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.
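To show what grouping by day and summing amounts to, the sketch below does the same thing by hand with the standard library. It is an illustration of the idea rather than the pandas call itself. The function name *daily_totals()* and the toy rows are assumptions.

```python
# minimal sketch: group timestamped rows by calendar day and sum each column
from datetime import datetime

def daily_totals(rows):
    # rows is an iterable of (timestamp, list of values) pairs
    totals = dict()
    for stamp, values in rows:
        day = stamp.date()
        if day not in totals:
            totals[day] = [0.0] * len(values)
        # accumulate each variable into the running total for its day
        for i, value in enumerate(values):
            totals[day][i] += value
    return totals

# toy rows (assumed values) spanning two days
rows = [
    (datetime(2006, 12, 16, 17, 24), [1.0, 2.0]),
    (datetime(2006, 12, 16, 17, 25), [0.5, 1.0]),
    (datetime(2006, 12, 17, 0, 0), [3.0, 4.0]),
]
totals = daily_totals(rows)
for day, sums in totals.items():
    print(day.isoformat(), sums)
```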

The complete example is listed below.

The daily data is saved to a file named ‘*household_power_consumption_days.csv*’.

A forecast will be comprised of seven values, one for each day of the week ahead.

- To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

One possible score that could be used would be the RMSE across all forecast days.

The function *evaluate_forecasts()* below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.
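The calculation can be illustrated on a tiny example. The sketch below is a plain-Python variant written for illustration; the name *rmse_by_lead_time()* and the toy numbers are assumptions, and the forecasts here cover two days rather than seven to keep the arithmetic easy to follow.

```python
# minimal sketch: RMSE per lead time and overall RMSE (standard library only)
from math import sqrt

def rmse_by_lead_time(actual, predicted):
    # actual and predicted are lists of forecasts, one inner list per week
    n_weeks, n_days = len(actual), len(actual[0])
    scores = []
    for day in range(n_days):
        # mean squared error for this lead time across all weeks
        mse = sum((actual[w][day] - predicted[w][day]) ** 2 for w in range(n_weeks)) / n_weeks
        scores.append(sqrt(mse))
    # overall RMSE across every week and every lead time
    total = sum((actual[w][d] - predicted[w][d]) ** 2
                for w in range(n_weeks) for d in range(n_days))
    overall = sqrt(total / (n_weeks * n_days))
    return overall, scores

actual = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 4.0], [3.0, 0.0]]
overall, scores = rmse_by_lead_time(actual, predicted)
print(overall, scores)
```

Here the first lead time is forecast perfectly, so all of the error comes from the second lead time.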

We will split the data into standard weeks, working backwards from the test dataset.

The first and last rows of daily data for the test dataset are provided below for confirmation.

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

The function *split_dataset()* below splits the daily data into train and test sets and organizes each into standard weeks.
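The arithmetic of the split can be checked with a small sketch on plain lists. The slice offsets are taken from the tutorial's code. The helper names are hypothetical, and the figure of 1,442 daily rows is an assumption chosen to be consistent with the 159 training weeks and 46 test weeks reported for this split.

```python
# minimal sketch: split daily rows into standard weeks (standard library only)
def split_into_weeks(rows):
    # consecutive, non-overlapping windows of seven days
    return [rows[i:i + 7] for i in range(0, len(rows), 7)]

def split_dataset_sketch(data):
    # skip the first row so that the data starts on a Sunday,
    # and drop the last six rows so that the data ends on a Saturday
    train, test = data[1:-328], data[-328:-6]
    if len(train) % 7 != 0 or len(test) % 7 != 0:
        raise ValueError('data does not divide into whole weeks')
    return split_into_weeks(train), split_into_weeks(test)

# stand-in for the daily rows (assumed count, see above)
days = list(range(1442))
train, test = split_dataset_sketch(days)
print(len(train), len(test))  # prints "159 46"
```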

The complete code example is listed below.

(159, 7, 8)
3390.46 1309.2679999999998
(46, 7, 8)
2083.4539999999984 2197.006000000004

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models, allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

Input,                      Predict
[Week1]                     Week2
[Week1 + Week2]             Week3
[Week1 + Week2 + Week3]     Week4
...

The walk-forward validation approach to evaluating predictive models on this dataset is implemented below, named *evaluate_model()*.

The name of a function is provided for the model as the argument “*model_func*”. This function is responsible for defining the model, fitting the model on the training data, and making a one-week forecast.

The forecasts made by the model are then evaluated against the test dataset using the previously defined *evaluate_forecasts()* function.

# evaluate a single model
def evaluate_model(model_func, train, test):
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = model_func(history)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    predictions = array(predictions)
    # evaluate predictions days for each week
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores

Once we have the evaluation for a model, we can summarize the performance.

The function below named *summarize_scores()* will display the performance of a model as a single line for easy comparison with other models.
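The formatting is easy to see with made-up numbers. The sketch below mirrors the *summarize_scores()* function used in the complete example later in this tutorial, with the string building moved into a hypothetical *format_scores()* helper so that the result can be inspected; the name and scores passed in are toy values.

```python
# minimal sketch: one-line summary of an overall score and per-day scores
def format_scores(name, score, scores):
    # per-day scores with one decimal place, separated by commas
    s_scores = ', '.join(['%.1f' % s for s in scores])
    return '%s: [%.3f] %s' % (name, score, s_scores)

def summarize_scores(name, score, scores):
    print(format_scores(name, score, scores))

# toy values for illustration only
summarize_scores('demo', 1.5, [1.0, 2.0])  # prints "demo: [1.500] 1.0, 2.0"
```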

We now have all of the elements to begin evaluating predictive models on the dataset.

It is important to test naive forecast models on any new prediction problem.

The results from naive models provide a quantitative idea of how difficult the forecast problem is and provide a baseline performance by which more sophisticated forecast methods can be evaluated.

In this section, we will develop and compare three naive forecast methods for the household power prediction problem; they are:

- Daily Persistence Forecast.
- Weekly Persistence Forecast.
- Weekly One-Year-Ago Persistence Forecast.

The first naive forecast that we will develop is a daily persistence model.

This model takes the active power from the last day prior to the forecast period (e.g. Saturday) and uses it as the value of the power for each day in the forecast period (Sunday to Saturday).

The *daily_persistence()* function below implements the daily persistence forecast strategy.

# daily persistence model
def daily_persistence(history):
    # get the data for the prior week
    last_week = history[-1]
    # get the total active power for the last day
    value = last_week[-1, 0]
    # prepare 7 day forecast
    forecast = [value for _ in range(7)]
    return forecast

Another good naive forecast when forecasting a standard week is to use the entire prior week as the forecast for the week ahead.

It is based on the idea that next week will be very similar to this week.

The *weekly_persistence()* function below implements the weekly persistence forecast strategy.

# weekly persistence model
def weekly_persistence(history):
    # get the data for the prior week
    last_week = history[-1]
    return last_week[:, 0]

Similar to the idea of using last week to forecast next week is the idea of using the same week last year to predict next week.

That is, use the week of observations from 52 weeks ago as the forecast, based on the idea that next week will be similar to the same week one year ago.

The *week_one_year_ago_persistence()* function below implements the week one year ago forecast strategy.

# week one year ago persistence model
def week_one_year_ago_persistence(history):
    # get the data for the same week one year ago (52 weeks back)
    last_week = history[-52]
    return last_week[:, 0]

We can compare each of the forecast strategies using the test harness developed in the previous section.

First, the dataset can be loaded and split into train and test sets.

# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)

Each of the strategies can be stored in a dictionary against a unique name. This name can be used in printing and in creating a plot of the scores.

# define the names and functions for the models we wish to evaluate
models = dict()
models['daily'] = daily_persistence
models['weekly'] = weekly_persistence
models['week-oya'] = week_one_year_ago_persistence

We can then enumerate each of the strategies, evaluating it using walk-forward validation, printing the scores, and adding the scores to a line plot for visual comparison.

# evaluate each model
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
for name, func in models.items():
    # evaluate and get scores
    score, scores = evaluate_model(func, train, test)
    # summarize scores, labeled with the name of the strategy
    summarize_scores(name, score, scores)
    # plot scores
    pyplot.plot(days, scores, marker='o', label=name)

Tying all of this together, the complete example evaluating the three naive forecast strategies is listed below.

# naive forecast strategies
from math import sqrt
from numpy import split
from numpy import array
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot
# split a univariate dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    train, test = data[1:-328], data[-328:-6]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    test = array(split(test, len(test)/7))
    return train, test
# evaluate one or more weekly forecasts against expected values
def evaluate_forecasts(actual, predicted):
    scores = list()
    # calculate an RMSE score for each day
    for i in range(actual.shape[1]):
        # calculate mse
        mse = mean_squared_error(actual[:, i], predicted[:, i])
        # calculate rmse
        rmse = sqrt(mse)
        # store
        scores.append(rmse)
    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col])**2
    score = sqrt(s / (actual.shape[0] * actual.shape[1]))
    return score, scores
# summarize scores
def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s' % (name, score, s_scores))
# evaluate a single model
def evaluate_model(model_func, train, test):
    # history is a list of weekly data
    history = [x for x in train]
    # walk-forward validation over each week
    predictions = list()
    for i in range(len(test)):
        # predict the week
        yhat_sequence = model_func(history)
        # store the predictions
        predictions.append(yhat_sequence)
        # get real observation and add to history for predicting the next week
        history.append(test[i, :])
    predictions = array(predictions)
    # evaluate predictions days for each week
    score, scores = evaluate_forecasts(test[:, :, 0], predictions)
    return score, scores
# daily persistence model
def daily_persistence(history):
    # get the data for the prior week
    last_week = history[-1]
    # get the total active power for the last day
    value = last_week[-1, 0]
    # prepare 7 day forecast
    forecast = [value for _ in range(7)]
    return forecast
# weekly persistence model
def weekly_persistence(history):
    # get the data for the prior week
    last_week = history[-1]
    return last_week[:, 0]
# week one year ago persistence model
def week_one_year_ago_persistence(history):
    # get the data for the same week one year ago (52 weeks back)
    last_week = history[-52]
    return last_week[:, 0]
# load the new file
dataset = read_csv('household_power_consumption_days.csv', header=0, infer_datetime_format=True, parse_dates=['datetime'], index_col=['datetime'])
# split into train and test
train, test = split_dataset(dataset.values)
# define the names and functions for the models we wish to evaluate
models = dict()
models['daily'] = daily_persistence
models['weekly'] = weekly_persistence
models['week-oya'] = week_one_year_ago_persistence
# evaluate each model
days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
for name, func in models.items():
    # evaluate and get scores
    score, scores = evaluate_model(func, train, test)
    # summarize scores
    summarize_scores(name, score, scores)
    # plot scores
    pyplot.plot(days, scores, marker='o', label=name)
# show plot
pyplot.legend()
pyplot.show()

Running the example first prints the total and daily scores for each model.

We can see that the weekly strategy performs better than the daily strategy and that the week one year ago (*week-oya*) performs slightly better again.

We can see this in both the overall RMSE scores for each model and in the daily scores for each forecast day. One exception is the forecast error for the first day (Sunday) where it appears that the daily persistence model performs better than the two weekly strategies.

We can use the week-oya strategy with an overall RMSE of 465.294 kilowatts as the baseline in performance for more sophisticated models to be considered skillful on this specific framing of the problem.

daily: [511.886] 452.9, 596.4, 532.1, 490.5, 534.3, 481.5, 482.0
weekly: [469.389] 567.6, 500.3, 411.2, 466.1, 471.9, 358.3, 482.0
week-oya: [465.294] 550.0, 446.7, 398.6, 487.0, 459.3, 313.5, 555.1

A line plot of the daily forecast error is also created.

We can see the same observed pattern of the weekly strategies performing better than the daily strategy in general, except in the case of the first day.

It is surprising (to me) that the week one-year-ago strategy performs better than using the prior week. I would have expected the power consumption from last week to be more relevant.

Reviewing all strategies on the same plot suggests possible combinations of the strategies that may result in even better performance.
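One simple combination is to average the forecasts of the three strategies day by day. The sketch below shows the mechanics on plain Python lists rather than NumPy arrays. The function *average_ensemble()* is hypothetical and has not been evaluated here, so no claim is made about its performance on this dataset.

```python
# minimal sketch: average several naive forecasts day by day
def daily_persistence(history):
    # repeat the total power of the last day of the prior week
    value = history[-1][-1][0]
    return [value for _ in range(7)]

def weekly_persistence(history):
    # total power for each day of the prior week
    return [row[0] for row in history[-1]]

def week_one_year_ago_persistence(history):
    # total power for each day of the week 52 weeks back
    return [row[0] for row in history[-52]]

def average_ensemble(history, members):
    # one forecast per member, then the mean across members for each day
    forecasts = [list(member(history)) for member in members]
    return [sum(day) / len(day) for day in zip(*forecasts)]

# toy history (an assumption): 52 weeks, where week w has value w on every day
history = [[[float(w)] for _ in range(7)] for w in range(52)]
members = [daily_persistence, weekly_persistence, week_one_year_ago_persistence]
forecast = average_ensemble(history, members)
print(forecast)
```

A function like this takes two arguments, so it would need to be wrapped to accept only the history before it could be passed to the test harness as a model function.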

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Additional Naive Strategy**. Propose, develop, and evaluate one more naive strategy for forecasting the next week of power consumption.
- **Naive Ensemble Strategy**. Develop an ensemble strategy that combines the predictions from the three proposed naive forecast methods.
- **Optimized Direct Persistence Models**. Test and find the optimal relative prior day (e.g. -1 or -7) to use for each forecast day in a direct persistence model.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API
- pandas.DataFrame.resample API
- Resample Offset Aliases
- sklearn.metrics.mean_squared_error API
- numpy.split API

- Individual household electric power consumption Data Set, UCI Machine Learning Repository.
- AC power, Wikipedia.

In this tutorial, you discovered how to develop a test harness for the household power consumption dataset and evaluate three naive forecast strategies that provide a baseline for more sophisticated algorithms.

Specifically, you learned:

- How to load, prepare, and downsample the household power consumption dataset ready for modeling.
- How to develop metrics, dataset split, and walk-forward validation elements for a robust test harness for evaluating forecasting models.
- How to develop, evaluate, and compare the performance of a suite of naive persistence forecasting methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop and Evaluate Naive Methods for Forecasting Household Electricity Consumption appeared first on Machine Learning Mastery.
