A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem

By Jason Brownlee on August 21, 2019 in Time Series

Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.

In this post, you will discover a standardized yet complex time series forecasting problem that has these properties, but is small and sufficiently well understood that it can be used to explore and better understand methods for developing forecasting models on challenging datasets.

After reading this post, you will know:

The competition and motivation for addressing the air-quality dataset.
An overview of the defined prediction problem and the data challenges it covers.
A description of the free data files that you can download and start working with immediately.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

EMC Data Science Global Hackathon

The dataset was used as the center of a Kaggle competition.

Specifically, it was a 24-hour hackathon hosted by Data Science London and Data Science Global as part of a Big Data Week event. Neither organization seems to exist now, 6 years later.

The competition involved a multi-thousand-dollar cash prize, and the dataset was provided by the Cook County, Illinois local government, suggesting all locations mentioned in the dataset are in that locality.

The motivation for the challenge is to develop a better model for predicting air quality, taken from the competition description:

The EPA’s Air Quality Index is used daily by people suffering from asthma and other respiratory diseases to avoid dangerous levels of outdoor air pollutants, which can trigger attacks. According to the World Health Organisation there are now estimated to be 235 million people suffering from asthma. Globally, it is now the most common chronic disease among children, with incidence in the US doubling since 1980.

The competition description suggests that winning models could be used as the basis for a new air-quality prediction system, although it is not clear if any models were ever transitioned for this purpose.

The competition was won by a Kaggle employee, Ben Hamner, who presumably did not collect the prize given the conflict of interest. Ben described his winning approach in the blog post titled “Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon” and provided his code on GitHub.

There is also a good discussion of solutions and related code in this forum post titled “General approaches to partitioning the models?“.

Predictive Modeling Problem

The data describes a multi-step forecasting problem given a multivariate time series across multiple sites or physical locations.

Given multiple weather measurements over time, predict a sequence of air quality measurements at specific future time intervals across multiple physical locations.

It is a challenging time series forecasting problem that has a lot of the qualities of real-world forecasting:

Incomplete data: Not all weather and air quality measures are available for all locations.
Missing data: Not all available measures have a complete history.
Multivariate inputs: The model inputs for each forecast consist of multiple weather observations.
Multi-step outputs: The model outputs are a discontiguous sequence of forecasted air quality measures.
Multi-site outputs: The model must output a multi-step forecast for multiple physical sites.

Download the Dataset Files

The dataset is available for free from the Kaggle website.

You must create an account and sign in to Kaggle before you can download the dataset.

The dataset can be downloaded from here:

Competition Data

Description of the Dataset Files

There are 4 files of interest that you must download separately; they are:

File: SiteLocations.csv

This file contains a list of site locations marked by unique identifiers and their precise location on Earth measured by longitude and latitude.

All coordinates appear to be relatively close together in the northern and western hemispheres, e.g. in North America.

Below is a sample of the file.

"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
32,41.755832412403,-87.545349670582
50,41.7075695897648,-87.5685738570845
57,41.9128621248178,-87.7227234452095
64,41.7907868783739,-87.6016464917605
...
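
As a quick check on how close together the sites are, the sample rows can be parsed and the coordinate spread measured. This is a minimal sketch using only the Python standard library; `load_sites` is a hypothetical helper name introduced here, and the inline `SAMPLE` text holds just the five rows shown above rather than the full file.

```python
import csv
import io

# The five sample rows of SiteLocations.csv shown above (not the full file).
SAMPLE = '''"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
32,41.755832412403,-87.545349670582
50,41.7075695897648,-87.5685738570845
57,41.9128621248178,-87.7227234452095
64,41.7907868783739,-87.6016464917605
'''


def load_sites(handle):
    """Return {site_id: (latitude, longitude)} from a site-locations CSV."""
    reader = csv.DictReader(handle)
    return {
        int(row["SITE_ID"]): (float(row["LATITUDE"]), float(row["LONGITUDE"]))
        for row in reader
    }


sites = load_sites(io.StringIO(SAMPLE))

# Spread of the coordinates, in degrees.
lats = [lat for lat, _ in sites.values()]
lons = [lon for _, lon in sites.values()]
lat_span = max(lats) - min(lats)
lon_span = max(lons) - min(lons)

print(f"{len(sites)} sites")
print(f"latitude span:  {lat_span:.3f} degrees")
print(f"longitude span: {lon_span:.3f} degrees")
```

For these five rows the spread is about a quarter of a degree of latitude and under a fifth of a degree of longitude, which is consistent with all sites sitting in one metropolitan area. To run it on the downloaded file, pass an open file handle to `load_sites` instead of the `io.StringIO` wrapper.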

File: SiteLocations_with_more_sites.csv

This file has the same format as SiteLocations.csv and appears to list all of the same locations as that file with some additional locations.

As the filename suggests, it is just an updated version of the list of sites.

Below is a sample of the file.

"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
14,41.834243,-87.6238
22,41.6871654376343,-87.5393154841479
32,41.755832412403,-87.545349670582
50,41.7075695897648,-87.5685738570845
...
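
One way to confirm that the larger file extends the smaller one is to compare the two site lists by identifier. The sketch below does this for the sample rows only; `read_sites` and `compare_sites` are hypothetical helper names, and because both samples are truncated, a site missing from the second sample here (such as 57 or 64) says nothing about the full file.

```python
import csv
import io

# Sample rows of SiteLocations.csv shown earlier in this post.
BASE = '''"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
32,41.755832412403,-87.545349670582
50,41.7075695897648,-87.5685738570845
57,41.9128621248178,-87.7227234452095
64,41.7907868783739,-87.6016464917605
'''

# Sample rows of SiteLocations_with_more_sites.csv shown above.
EXTENDED = '''"SITE_ID","LATITUDE","LONGITUDE"
1,41.6709918952829,-87.7324568962847
14,41.834243,-87.6238
22,41.6871654376343,-87.5393154841479
32,41.755832412403,-87.545349670582
50,41.7075695897648,-87.5685738570845
'''


def read_sites(text):
    """Parse CSV text into {site_id: (latitude, longitude)}."""
    return {
        int(row["SITE_ID"]): (float(row["LATITUDE"]), float(row["LONGITUDE"]))
        for row in csv.DictReader(io.StringIO(text))
    }


def compare_sites(base, extended):
    """Return (shared ids, shared ids whose coordinates differ, ids only in extended)."""
    shared = sorted(set(base) & set(extended))
    changed = [site for site in shared if base[site] != extended[site]]
    added = sorted(set(extended) - set(base))
    return shared, changed, added


shared, changed, added = compare_sites(read_sites(BASE), read_sites(EXTENDED))
print("shared sites:", shared)
print("shared sites with different coordinates:", changed)
print("sites only in the extended list:", added)
```

In the samples, the sites that appear in both lists have identical coordinates, and sites 14 and 22 appear only in the extended list. Running the same comparison on the two full files would show whether every original site is carried over.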

File: TrainingData.csv

This file contains the training data for modeling.

The data is presented in an unnormalized manner. Each row of data contains one set of meteorological measurements for one hour across multiple locations as well as the targets or outcomes for each location for that hour.

The measures include:

Time information, including the block of time, the index within the contiguous block of time, the most common month, the day of the week, and the hour of the day.
Wind measurements such as direction and speed.
Temperature measurements such as minimum and maximum ambient temperature.
Pressure measurements such as minimum and maximum barometric pressure.

The target variables are a collection of different air quality or pollution measures at different physical locations.

Not all locations have all weather measurements and not all locations are concerned with all target measures. Further, for those recorded variables, there are missing values marked as NA.

Below is a sample of the file.

"rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14","Ambient.Max.Temperature_22","Ambient.Max.Temperature_50","Ambient.Max.Temperature_52","Ambient.Max.Temperature_57","Ambient.Max.Temperature_76","Ambient.Max.Temperature_2001","Ambient.Max.Temperature_3301","Ambient.Max.Temperature_6005","Ambient.Min.Temperature_14","Ambient.Min.Temperature_22","Ambient.Min.Temperature_50","Ambient.Min.Temperature_52","Ambient.Min.Temperature_57","Ambient.Min.Temperature_76","Ambient.Min.Temperature_2001","Ambient.Min.Temperature_3301","Ambient.Min.Temperature_6005","Sample.Baro.Pressure_14","Sample.Baro.Pressure_22","Sample.Baro.Pressure_50","Sample.Baro.Pressure_52","Sample.Baro.Pressure_57","Sample.Baro.Pressure_76","Sample.Baro.Pressure_2001","Sample.Baro.Pressure_3301","Sample.Baro.Pressure_6005","Sample.Max.Baro.Pressure_14","Sample.Max.Baro.Pressure_22","Sample.Max.Baro.Pressure_50","Sample.Max.Baro.Pressure_52","Sample.Max.Baro.Pressure_57","Sample.Max.Baro.Pressure_76","Sample.Max.Baro.Pressure_2001","Sample.Max.Baro.Pressure_3301","Sample.Max.Baro.Pressure_6005","Sample.Min.Baro.Pressure_14","Sample.Min.Baro.Pressure_22","Sample.Min.Baro.Pressure_50","Sample.Min.Baro.Pressure_52","Sample.Min.Baro.Pressure_57","Sample.Min.Baro.Pressure_76","Sample.Min.Baro.Pressure_2001","Sample.Min.Baro.Pressure_3301","Sample.Min.Baro.Pressure_6005","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"
1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,6.1816228132982,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.56815355612325,0.690015329704154,NA,NA,NA,NA,NA,NA,2.84349016287551,0.0920223353681394,1.69321097077376,0.368089341472558,0.184044670736279,0.368089341472558,0.276067006104418,0.892616653070952,1.74842437199465,NA,NA,5.1306307034019,1.34160578423204,2.13879182993514,3.01375212399952,NA,5.67928016629218,NA
2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.47583334194495,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.99138023331659,NA,5.56815355612325,0.923259948195698,NA,NA,NA,NA,NA,NA,3.1011527019063,0.0920223353681394,1.94167127626774,0.368089341472558,0.184044670736279,0.368089341472558,0.368089341472558,1.73922213845783,2.14412041407765,NA,NA,5.1306307034019,1.19577906855465,2.72209869264472,3.88871241806389,NA,7.42675098668978,NA
3,1,3,10,"Saturday",23,0.01,247,227,0.5,1.5,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.92192983362627,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.7524146053186,NA,5.56815355612325,0.680296803933673,NA,NA,NA,NA,NA,NA,3.06434376775904,0.0920223353681394,2.52141198908702,0.460111676840697,0.184044670736279,0.368089341472558,0.368089341472558,1.7852333061419,1.93246904273093,NA,NA,5.13639545700122,1.40965825154816,3.11096993445111,3.88871241806389,NA,7.68373198968942,NA
4,1,4,10,"Sunday",0,0.01,219,218,0.2,1.2,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,5.09824561921501,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.6776192223642,0.612267123540305,NA,NA,NA,NA,NA,NA,3.21157950434806,0.184044670736279,2.374176252498,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.86805340797323,2.08890701285676,NA,NA,5.21710200739181,1.47771071886428,2.04157401948354,3.20818774490271,NA,4.83124285639335,NA
5,1,5,10,"Sunday",1,0.01,2,216,0.2,0.3,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,4.87519737337435,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.31000107064725,NA,5.6776192223642,0.694874592589394,NA,NA,NA,NA,NA,NA,3.67169118118876,0.184044670736279,2.46619858786614,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.70241320431058,2.60423209091834,NA,NA,5.21710200739181,1.45826715677396,2.13879182993514,3.4998411762575,NA,4.62565805399363,NA
...
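
The wide layout is easier to work with once the column names are split into a variable and a site, and the NA markers are turned into real missing values. The sketch below does both on a small excerpt: the first 15 columns of the first three sample rows. The helper names `parse_column` and `to_value` are hypothetical, and reading `target_<measure>_<site>` and `<variable>_<site>` as "measure at site" is an assumption based on the column names, not something stated in the competition files shown here.

```python
import csv
import io

# First 15 columns of the first three sample rows above (truncated for brevity).
EXCERPT = '''"rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14","Ambient.Max.Temperature_22","Ambient.Max.Temperature_50","Ambient.Max.Temperature_52"
1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA,NA,NA,14.9
2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA,NA,NA,14.9
3,1,3,10,"Saturday",23,0.01,247,227,0.5,1.5,NA,NA,NA,14.9
'''


def parse_column(name):
    """Split a column name into (kind, variable, site_id).

    Assumed naming: 'target_<measure>_<site>' for outputs and
    '<variable>_<site>' for weather inputs; anything else is metadata.
    """
    if name.startswith("target_"):
        _, measure, site = name.split("_")
        return ("target", measure, int(site))
    variable, sep, site = name.rpartition("_")
    if sep and site.isdigit():
        return ("input", variable, int(site))
    return ("meta", name, None)


def to_value(text):
    """Map the NA marker to None, parse numbers, keep other text (e.g. weekday)."""
    if text == "NA":
        return None
    try:
        return float(text)
    except ValueError:
        return text


rows = [
    {name: to_value(text) for name, text in row.items()}
    for row in csv.DictReader(io.StringIO(EXCERPT))
]

# Fraction of missing values per column in the excerpt.
missing = {
    name: sum(row[name] is None for row in rows) / len(rows) for name in rows[0]
}
fully_missing = [name for name, fraction in missing.items() if fraction == 1.0]

for name in rows[0]:
    print(parse_column(name), f"missing={missing[name]:.0%}")
```

In this excerpt, the maximum temperature columns for sites 14, 22, and 50 are missing in every row, while site 52 reports a value. That is the "incomplete data" property in miniature: a measure exists as a column but carries no observations for some sites over some periods.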

File: SubmissionZerosExceptNAs.csv

This file contains a sample of the submission for the prediction problem.

Each row specifies the prediction for each target measure across all target locations for a given hour in a chunk of contiguous time.

Below is a sample of the file.

"rowID","chunkID","position_within_chunk","hour","month_most_common","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"
193,1,193,21,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06
194,1,194,22,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06
195,1,195,23,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06
196,1,196,0,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06
197,1,197,1,10,0,0,-1e+06,0,0,0,0,0,0,0,-1e+06,0,-1e+06,0,0,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,-1e+06,0,0,0,0,0,0,0,0,0,-1e+06,-1e+06,0,0,0,0,-1e+06,0,-1e+06
...
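
The sample submission mixes two values: 0 and -1e+06. Going by the filename (zeros except NAs), a reasonable reading is that -1e+06 flags a target that is not available for that row and 0 is a placeholder to be replaced by a forecast. That interpretation is an assumption, as are the `NA_MARKER` constant and the `targets_to_predict` helper in this minimal sketch, which uses the first 8 columns of the first two sample rows.

```python
import csv
import io

# First 8 columns of the first two sample rows above (truncated for brevity).
EXCERPT = '''"rowID","chunkID","position_within_chunk","hour","month_most_common","target_1_57","target_10_4002","target_10_8003"
193,1,193,21,10,0,0,-1e+06
194,1,194,22,10,0,0,-1e+06
'''

# Assumption: -1e+06 flags a target with no forecast required for that row.
NA_MARKER = -1e6


def targets_to_predict(row):
    """Return the target columns in one submission row not flagged with NA_MARKER."""
    return [
        name
        for name, value in row.items()
        if name.startswith("target_") and float(value) != NA_MARKER
    ]


rows = list(csv.DictReader(io.StringIO(EXCERPT)))
wanted = targets_to_predict(rows[0])
print("row", rows[0]["rowID"], "needs forecasts for:", wanted)
```

Filtering the template this way gives, for each row, the list of targets a model actually has to produce, which is a convenient starting point for writing forecasts back into the submission layout.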

Framing the Prediction Problem

A large part of the challenge of this prediction problem is the vast number of ways that the problem can be framed for modeling.

This is challenging because it is not clear which framing may be the best for this specific modeling problem.

For example, below are some questions to provoke thought about how the problem could be framed.

Is it better to impute or ignore missing observations?
Is it better to feed in a time series of weather observations or only the observations for the current hour?
Is it better to use weather observations from one or multiple source locations for a forecast?
Is it better to have one model for each location or one model for all locations?
Is it better to have one model for each forecast time or one for all forecast times?
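
Several of these questions come down to how a series is cut into input and output pairs. The sketch below shows one possible framing: a fixed window of lagged observations as input, a discontiguous set of lead times as output, and a switch for ignoring windows that contain missing values. The function name `make_windows`, the toy series, and the lead times `[1, 3]` are all hypothetical choices for illustration, not the competition's settings.

```python
def make_windows(series, n_lags, lead_times, skip_missing=True):
    """Frame one hourly series as (inputs, outputs) pairs.

    inputs:  the n_lags most recent observations up to and including time t.
    outputs: the values at t + lead for each lead in lead_times.
    With skip_missing=True, any pair containing None is dropped (the
    "ignore" option); with False, pairs are kept for later imputation.
    """
    samples = []
    horizon = max(lead_times)
    for t in range(n_lags - 1, len(series) - horizon):
        inputs = series[t - n_lags + 1 : t + 1]
        outputs = [series[t + lead] for lead in lead_times]
        if skip_missing and any(value is None for value in inputs + outputs):
            continue
        samples.append((inputs, outputs))
    return samples


# Toy series with one missing observation (None stands in for NA).
series = [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
LEAD_TIMES = [1, 3]  # hypothetical, discontiguous forecast lead times

complete = make_windows(series, n_lags=2, lead_times=LEAD_TIMES)
everything = make_windows(series, n_lags=2, lead_times=LEAD_TIMES, skip_missing=False)

print(len(complete), "complete pairs out of", len(everything))
for inputs, outputs in complete:
    print(inputs, "->", outputs)
```

A single missing observation removes half of the candidate pairs in this toy case, which illustrates why the impute-or-ignore question matters on a dataset with many gaps. Training one model on the full output vector corresponds to "one model for all forecast times", while training on one output position at a time corresponds to "one model for each forecast time".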

Summary

In this post, you discovered the Kaggle air-quality dataset that provides a standard dataset for complex time series forecasting.

Specifically, you learned:

The competition and motivation for addressing the air-quality dataset.
An overview of the defined prediction problem and the data challenges it covers.
A description of the free data files that you can download and start working with immediately.

Have you worked on this dataset, or do you intend to?
Share your experiences in the comments below.

35 Responses to A Standard Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem

Paris Tzou January 19, 2018 at 9:13 am #

In this book, “Multivariate” is not dealt with.

- Jason Brownlee January 20, 2018 at 8:12 am #
  
  How so?
  
  Multiple inputs or multiple input series is multivariate.
  
  - Tom April 5, 2018 at 1:35 am #
    
    I believe he is referring to your book ad at the end of the post.
    
    - Jason Brownlee April 5, 2018 at 6:13 am #
      
      Yes. I hope to write a book dedicated to the topic next.
      
Hitendra Khairnar January 19, 2018 at 5:44 pm #

How can a road traffic path recommendation problem be framed as a Multivariate, Multi-Step, and Multi-Site Time Series Forecasting Problem? I’m thinking about different input parameters as follows: 1. Distance between source and destination 2. Speed and type of vehicle 3. Traffic volume and density 4. Environmental conditions 5. Infrastructure conditions, etc., and looking for a suitable path for travelling from source to destination with minimum distance, feasible speed, etc. Are there any datasets related to traffic data available? Please suggest. Thanks

- Jason Brownlee January 20, 2018 at 8:17 am #
  
  I don’t know of any, sorry.
  
- Sidhu February 27, 2020 at 5:18 pm #
  
  Hi Jason, thanks for your articles on sequence prediction using LSTM. I have a problem where multiple users are measured on 6 different metrics over time.
  I am thinking of it as a “Multiple Time Series” problem.
  
  Specifically, here’s how the problem is.
  
  User Id Energy sleep quality nutritional quality …
  1 4 3 4
  1 5 3 2
  2 1 5 3
  2 4 3 2
  
  So, I need to consider the above sequence for each user Id and predict the next sequence.
  I want to know how I can proceed with a problem like this.
  Please help me with this.
  
  - Jason Brownlee February 28, 2020 at 5:59 am #
    
    This will help you figure out the type of problem:
    https://machinelearningmastery.com/taxonomy-of-time-series-forecasting-problems/
    
Tobi Adeyemi January 23, 2018 at 8:23 pm #

Hi Jason, it will be interesting to see how you will approach this problem. Do you think a reliable predictive model can be built from a dataset with this many missing variables? I look forward to your feedback.

- Jason Brownlee January 24, 2018 at 9:54 am #
  
  Perhaps, it really depends on the specifics of the data.
  
Alex Ludert January 28, 2018 at 5:04 am #

Hi Jason!

Great post! I’m trying to imitate this analysis in Python and am having trouble dealing with the missing values in the random forest regression. From what I’m reading, the random forest can deal with the NaNs for the tree splitting but not for the regression.

Looking at Ben’s matlab code it seems he’s “imputing” the missing values to 0. Am I correct in this? I’m not sure imputing the missing values to 0 for a regression is best. Any thoughts?

Thank you in advance.

- Jason Brownlee January 28, 2018 at 8:27 am #
  
  It really depends on the specific implementation.
  
  Perhaps try the xgboost library? or the sklearn implementation?
  
karamba February 14, 2018 at 11:15 am #

Hi Jason,

Recently I have read Your article on this blog. Since that time I have read a lot of Your articles published here. Your knowledge and experience are impressive.

I’m trying to solve a multivariate (multiple input data) multi-step (multiple output) forecasting problem and I don’t know which method I should use.

I have time series data that represent weather (temperature, pressure, humidity, wind speed, wind direction, etc.), pollution (PM10, PM2.5, NO2, etc.), and traffic (average speed, street traffic congestion factor). I have data from about 100 locations inside a middle-size city (327 km^2). Measurements are performed each hour.

I’m trying to implement software that can perform a 24h prediction of pollution for a specific pollutant (for example PM10) using current data and data from history. Is there any preferred method to perform this type of analysis? Initially, I took into account two types of methods:

a) Multivariate Time Series Forecasting with LSTMs
b) Method using RandomForest algorithms described in this article

Maybe there are other methods I have not heard about. If You had to solve this type of problem, what methods would You use?

Thanks for the work You do on this blog 🙂

- Jason Brownlee February 14, 2018 at 2:43 pm #
  
  I would recommend a deep MLP on this type of problem.
  
  - karamba February 15, 2018 at 2:50 am #
    
    Thanks a lot.
    
    I have one more question. Is there any article on Your blog about making multivariate/multi-step time series forecasting using MLP?
    
    I was trying to find some materials on the Internet about this method, but I cannot find anything interesting. I have seen some scientific papers, but none of them described exactly the same problem.
    
    - Jason Brownlee February 15, 2018 at 8:47 am #
      
      I hope to write a book on this topic next.
      
      - karamba February 15, 2018 at 6:10 pm #
        
        It would be a very good idea. I would like to read such a book.
        
        Regardless, I will try to solve this problem myself. I just thought you had materials that might be useful to me, but now I understand why You have not published them yet.
        
        I am grateful for the information provided so far. Good luck 🙂
      - Soren P March 14, 2018 at 2:19 am #
        
        Hi Jason, what is your ETA for this book covering deep MLP for multivariate time series forecasting?
        
        I bet fixed windowed data is key, what’s your take on that?
        
        I know it’s VERY implementation dependent, but in these cases just how deep is deep in your view?
      - Jason Brownlee March 14, 2018 at 6:32 am #
        
        No fixed ETA at this stage. In progress.
LuoXin April 15, 2018 at 11:43 am #

Hi Jason:
Thank you very much for sharing the LSTM time series prediction posts; they helped me a lot in learning LSTM. But I still have some confusion about multi-step prediction with LSTM.
The problem is that I have 12 years of one-dimensional time series data and want to use 84 hours of data to predict the next 12 hours with a BP network and an LSTM. I find the BP network has the better performance, but I think the LSTM should do better. Perhaps I have something wrong; could you tell me how to build a multi-step prediction LSTM network?
I am looking forward to hearing from you!

- Jason Brownlee April 16, 2018 at 6:04 am #
  
  I answer the question of how to prepare time series data for LSTMs here:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-prepare-my-data-for-an-lstm
  
  I answer the question of why LSTMs are poor at time series forecasting here:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-use-lstms-for-time-series-forecasting
  
song June 4, 2018 at 6:03 pm #

Hi Jason,
Thank you very much for sharing the LSTM time series prediction posts, which helped me a lot in learning time series prediction. However, I still have some problems with time series prediction. They are listed as follows:
1) I want to know the difference between sequence prediction and time series prediction. Can we use a model that deals with sequence prediction to make time series predictions or not?
2) As I saw in your past blogs, you have applied an LSTM model to the Shampoo-Sales dataset, which contains just 36 records. As we know, deep learning models are often used on big datasets, so your experiment may get poor performance. I have a dataset of 1000 records. In this case, can I use an LSTM to build a model?
3) Can you provide some useful examples of variable time steps with LSTM?

I hope to receive your reply.

- Jason Brownlee June 5, 2018 at 6:36 am #
  
  Good questions.
  
  Time series is a type of sequence prediction. Sequence prediction models can be used for time series.
  
  LSTMs may be good for your problem, but I would encourage you to test a suite of other methods as well to confirm that the LSTMs are skillful.
  
  When working with LSTMs with a variable number of input time steps, you must pad the inputs and use a masking layer to ignore the padding. I have examples on the blog.
  
  Does that help?
  
Dwayne June 26, 2018 at 9:57 am #

Hey Jason!
Could you elaborate more on using random forest to do a multi-step time series forecast in Python? One of your further reading topics covers this in MATLAB. However, I don’t have a lot of experience in MATLAB and I am finding it difficult to understand.

- Jason Brownlee June 26, 2018 at 2:27 pm #
  
  You must split the data into windows with input vectors of past obs mapped to the next observation.
  
  This post will help to prepare the data:
  https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
  
  Does that help?
  
César Bouyssi February 15, 2019 at 2:59 am #

Hi Jason,
I’ve been looking at your blog all day and found plenty of very interesting articles.
I’m currently working on the following problem:

Forecasting the power production (in different fields: solar, wind, etc.) with historical production data. I was thinking of improving the predictions by using the dates as an input too (month can be good information for solar exposure and for the wind). I was also thinking of using a weather dataset as an additional input.

Do you think a multivariate, multi-step time series could be a good approach (maybe with an LSTM network)?

- Jason Brownlee February 15, 2019 at 8:15 am #
  
  I recommend this process in order to be systematic:
  https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
  
Laura March 20, 2019 at 8:13 pm #

Hi Jason,

I have a more general question. In one of your tutorials you said that one should make the time series stationary. If I have a multivariate and test each variable for stationarity and get the result that some variables should be differenced and some are already stationary, what would you do? Probably it is not reasonable to difference only some of the variables..?

- Jason Brownlee March 21, 2019 at 8:05 am #
  
  Start by modeling the data directly, then try making each/all stationary and see how it impacts the model’s ability to learn the problem.
  
Liuzy December 23, 2019 at 1:15 pm #

Hello, I am a Chinese researcher and have a strong interest in applying machine learning algorithms in the field of air quality, but I cannot download the data you use. Can you send me a copy?
thank you

- Jason Brownlee December 24, 2019 at 6:37 am #
  
  You must sign up to Kaggle, then you can download the dataset.
  
  Sorry, I cannot send it to you.
  
Rahul Joarder July 30, 2021 at 5:16 pm #

Hi Jason,

In general multi-site time series, can we forecast for the multiple sites using a single model (I want to use LSTM), or would we need different models for different sites? (How practical would this approach be if there are 1000+ sites, like customer data?)

- Jason Brownlee July 31, 2021 at 5:35 am #
  
  Good question, this will help:
  https://machinelearningmastery.com/faq/single-faq/how-to-develop-forecast-models-for-multiple-sites
  
Rahul Joarder August 3, 2021 at 8:15 pm #

How can we train a single model (LSTM or SARIMAX) on a group of sites or all the sites? We would have different values (for different sites) for the same feature at a particular time step. Do you have an article on this, as this approach is not clear to me? Or could you just give a brief overview of it? I would be able to take it from there!

Thanks anyway for the quick reply! Love your articles!

- Jason Brownlee August 4, 2021 at 5:13 am #
  
  Yes, start here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
