Machine learning can be applied to time series datasets.
These are problems where a numeric or categorical value must be predicted, but the rows of data are ordered by time.
A problem when getting started in time series forecasting with machine learning is finding good quality standard datasets on which to practice.
In this post, you will discover 7 standard time series datasets that you can use to get started and practice time series forecasting with machine learning.
After reading this post, you will know:
- 4 univariate time series datasets.
- 3 multivariate time series datasets.
- Websites that you can use to search and download more datasets.
Let’s get started.
- Updated Apr/2019: Updated the links to the datasets.
Univariate Time Series Datasets
Time series datasets that only have one variable are called univariate datasets.
These datasets are a great place to get started because:
- They are so simple and easy to understand.
- You can plot them easily in Excel or your favorite plotting tool.
- You can easily plot the predictions compared to the expected results.
- You can quickly try and evaluate a suite of traditional and newer methods.
There are many sources of time series datasets, such as the “Time Series Data Library” created by Rob Hyndman, Professor of Statistics at Monash University, Australia.
Below are 4 univariate time series datasets that you can download from a range of fields such as Sales, Meteorology, Physics and Demography.
Shampoo Sales Dataset
This dataset describes the monthly number of sales of shampoo over a 3 year period.
The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright and Hyndman (1998).
Below is a sample of the first 5 rows of data including the header row.
"Month","Sales of shampoo over a three year period"
"1-01",266.0
"1-02",145.9
"1-03",183.1
"1-04",119.3
"1-05",180.3
Below is a plot of the entire dataset.
The dataset shows an increasing trend and possibly some seasonal component.
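As a minimal sketch of how a file in this shape might be loaded, the snippet below parses the sample rows shown above with pandas. It assumes pandas is installed and reads from an in-memory string so that it runs without a download; the short column name `sales` is a choice made here for illustration and is not part of the dataset.

```python
from io import StringIO

import pandas as pd

# The first rows of the shampoo sales file, copied from the sample above.
# In practice you would pass the path of the downloaded CSV file instead.
sample = StringIO(
    '"Month","Sales of shampoo over a three year period"\n'
    '"1-01",266.0\n'
    '"1-02",145.9\n'
    '"1-03",183.1\n'
    '"1-04",119.3\n'
    '"1-05",180.3\n'
)

# header=0 uses the first row as column names; index_col=0 makes the
# month label the index, leaving a single column of observations.
frame = pd.read_csv(sample, header=0, index_col=0)

# Reduce the one-column DataFrame to a Series and give it a short name.
series = frame.iloc[:, 0].rename("sales")

print(series.head())
print(series.describe())
```

The same pattern should carry over to the other univariate files in this post, since each has a header row, a time column and one value column. With matplotlib installed, calling `series.plot()` is one way to produce a line plot of the loaded data.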
Minimum Daily Temperatures Dataset
This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.
The units are in degrees Celsius and there are 3650 observations. The source of the data is credited as the Australian Bureau of Meteorology.
Below is a sample of the first 5 rows of data including the header row.
"Date","Daily minimum temperatures in Melbourne, Australia, 1981-1990"
"1981-01-01",20.7
"1981-01-02",17.9
"1981-01-03",18.8
"1981-01-04",14.6
"1981-01-05",15.8
Below is a plot of the entire dataset.
The dataset shows a strong seasonality component and offers fine-grained daily detail to work with.
Monthly Sunspot Dataset
This dataset describes a monthly count of the number of observed sunspots for just over 230 years (1749-1983).
The units are a count and there are 2,820 observations. The source of the dataset is credited to Andrews & Herzberg (1985).
Below is a sample of the first 5 rows of data including the header row.
"Month","Zuerich monthly sunspot numbers 1749-1983"
"1749-01",58.0
"1749-02",62.6
"1749-03",70.0
"1749-04",55.7
"1749-05",85.0
Below is a plot of the entire dataset.
The dataset shows seasonality with large differences between seasons.
Daily Female Births Dataset
This dataset describes the number of daily female births in California in 1959.
The units are a count and there are 365 observations. The source of the dataset is credited to Newton (1988).
Below is a sample of the first 5 rows of data including the header row.
"Date","Daily total female births in California, 1959"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
"1959-01-04",31
"1959-01-05",44
Below is a plot of the entire dataset.
Multivariate Time Series Datasets
Multivariate datasets are generally more challenging and are the sweet spot for machine learning methods.
A great source of multivariate time series data is the UCI Machine Learning Repository.
At the time of writing, there are 63 time series datasets that you can download for free and work with.
Below is a selection of 3 recommended multivariate time series datasets from Meteorology, Medicine and Monitoring domains.
EEG Eye State Dataset
This dataset describes EEG data for an individual and whether their eyes were open or closed. The objective of the problem is to predict whether eyes are open or closed given EEG data alone.
This is a classification predictive modeling problem and there are a total of 14,980 observations and 14 input variables. The class value of ‘1’ indicates the eye-closed and ‘0’ the eye-open state. Data is ordered by time and observations were recorded over a period of 117 seconds.
Below is a sample of the first 5 rows with no header row.
4329.23,4009.23,4289.23,4148.21,4350.26,4586.15,4096.92,4641.03,4222.05,4238.46,4211.28,4280.51,4635.9,4393.85,0
4324.62,4004.62,4293.85,4148.72,4342.05,4586.67,4097.44,4638.97,4210.77,4226.67,4207.69,4279.49,4632.82,4384.1,0
4327.69,4006.67,4295.38,4156.41,4336.92,4583.59,4096.92,4630.26,4207.69,4222.05,4206.67,4282.05,4628.72,4389.23,0
4328.72,4011.79,4296.41,4155.9,4343.59,4582.56,4097.44,4630.77,4217.44,4235.38,4210.77,4287.69,4632.31,4396.41,0
4326.15,4011.79,4292.31,4151.28,4347.69,4586.67,4095.9,4627.69,4210.77,4244.1,4212.82,4288.21,4632.82,4398.46,0
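Because the file has no header row, the inputs and the class label have to be separated by position. The sketch below does this with the standard library only, using two rows copied from the sample above. It assumes the data has been saved as plain comma-separated text with the eye state in the last column, as in the sample; the names `X` and `y` are simply conventions chosen here.

```python
import csv
from io import StringIO

# Two rows copied from the sample above; the real file has 14,980 rows.
raw = StringIO(
    "4329.23,4009.23,4289.23,4148.21,4350.26,4586.15,4096.92,4641.03,"
    "4222.05,4238.46,4211.28,4280.51,4635.9,4393.85,0\n"
    "4324.62,4004.62,4293.85,4148.72,4342.05,4586.67,4097.44,4638.97,"
    "4210.77,4226.67,4207.69,4279.49,4632.82,4384.1,0\n"
)

X, y = [], []
for row in csv.reader(raw):
    # Every column except the last is an EEG reading (input).
    X.append([float(value) for value in row[:-1]])
    # The last column is the eye state: 0 = open, 1 = closed.
    y.append(int(row[-1]))

print(len(X), "rows,", len(X[0]), "inputs per row")
print("labels:", y)
```

Since the rows are ordered by time, any later split into training and test sets would normally keep that order rather than shuffle the rows.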
Occupancy Detection Dataset
This dataset describes measurements of a room and the objective is to predict whether or not the room is occupied.
There are 20,560 one-minute observations taken over the period of a few weeks. This is a classification prediction problem. There are 7 attributes including various light and climate properties of the room.
The source for the data is credited to Luis Candanedo from UMONS.
Below is a sample of the first 5 rows of data including the header row.
"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
"2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.00478344094931065,1
"3","2015-02-04 17:53:00",23.15,27.245,426,713.5,0.00477946352442199,1
"4","2015-02-04 17:54:00",23.15,27.2,426,708.25,0.00477150882608175,1
"5","2015-02-04 17:55:00",23.1,27.2,426,704.5,0.00475699293331518,1
"6","2015-02-04 17:55:59",23.1,27.2,419,701,0.00475699293331518,1
The data is provided in 3 files that suggest the splits that may be used for training and testing a model.
Ozone Level Detection Dataset
This dataset describes 6 years of ground ozone concentration observations and the objective is to predict whether it is an “ozone day” or not.
The dataset contains 2,536 observations and 73 attributes. This is a classification prediction problem and the final attribute indicates the class value as “1” for an ozone day and “0” for a normal day.
Two versions of the data are provided: an eight-hour peak set and a one-hour peak set. I would suggest using the one-hour peak set for now.
Below is a sample of the first 5 rows with no header row.
1/1/1998,0.8,1.8,2.4,2.1,2,2.1,1.5,1.7,1.9,2.3,3.7,5.5,5.1,5.4,5.4,4.7,4.3,3.5,3.5,2.9,3.2,3.2,2.8,2.6,5.5,3.1,5.2,6.1,6.1,6.1,6.1,5.6,5.2,5.4,7.2,10.6,14.5,17.2,18.3,18.9,19.1,18.9,18.3,17.3,16.8,16.1,15.4,14.9,14.8,15,19.1,12.5,6.7,0.11,3.83,0.14,1612,-2.3,0.3,7.18,0.12,3178.5,-15.5,0.15,10.67,-1.56,5795,-12.1,17.9,10330,-55,0,0.
1/2/1998,2.8,3.2,3.3,2.7,3.3,3.2,2.9,2.8,3.1,3.4,4.2,4.5,4.5,4.3,5.5,5.1,3.8,3,2.6,3,2.2,2.3,2.5,2.8,5.5,3.4,15.1,15.3,15.6,15.6,15.9,16.2,16.2,16.2,16.6,17.8,19.4,20.6,21.2,21.8,22.4,22.1,20.8,19.1,18.1,17.2,16.5,16.1,16,16.2,22.4,17.8,9,0.25,-0.41,9.53,1594.5,-2.2,0.96,8.24,7.3,3172,-14.5,0.48,8.39,3.84,5805,14.05,29,10275,-55,0,0.
1/3/1998,2.9,2.8,2.6,2.1,2.2,2.5,2.5,2.7,2.2,2.5,3.1,4,4.4,4.6,5.6,5.4,5.2,4.4,3.5,2.7,2.9,3.9,4.1,4.6,5.6,3.5,16.6,16.7,16.7,16.8,16.8,16.8,16.9,16.9,17.1,17.6,19.1,21.3,21.8,22,22.1,22.2,21.3,19.8,18.6,18,18,18.2,18.3,18.4,22.2,18.7,9,0.56,0.89,10.17,1568.5,0.9,0.54,3.8,4.42,3160,-15.9,0.6,6.94,9.8,5790,17.9,41.3,10235,-40,0,0.
1/4/1998,4.7,3.8,3.7,3.8,2.9,3.1,2.8,2.5,2.4,3.1,3.3,3.1,2.3,2.1,2.2,3.8,2.8,2.4,1.9,3.2,4.1,3.9,4.5,4.3,4.7,3.2,18.3,18.2,18.3,18.4,18.6,18.6,18.5,18.7,18.6,18.8,19,19,19.3,19.4,19.6,19.2,18.9,18.8,18.6,18.5,18.3,18.5,18.8,18.9,19.6,18.7,9.9,0.89,-0.34,8.58,1546.5,3,0.77,4.17,8.11,3145.5,-16.8,0.49,8.73,10.54,5775,31.15,51.7,10195,-40,2.08,0.
1/5/1998,2.6,2.1,1.6,1.4,0.9,1.5,1.2,1.4,1.3,1.4,2.2,2,3,3,3.1,3.1,2.7,3,2.4,2.8,2.5,2.5,3.7,3.4,3.7,2.3,18.8,18.6,18.5,18.5,18.6,18.9,19.2,19.4,19.8,20.5,21.1,21.9,23.8,25.1,25.8,26,25.6,24.2,22.9,21.6,20,19.5,19.1,19.1,26,21.1,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,0.58,0.
1/6/1998,3.1,3.5,3.3,2.5,1.6,1.7,1.6,1.6,2.3,1.8,2.5,3.9,3.4,2.7,3.4,2.5,2.2,4.4,4.3,3.2,6.2,6.8,5.1,4,6.8,3.2,18.9,19.5,19.6,19.5,19.5,19.5,19.4,19.2,19.1,19.5,19.6,18.6,18.6,18.9,19.2,19.3,19.2,18.8,17.6,16.9,15.6,15.4,15.9,15.8,19.6,18.5,14.4,0.68,1.52,8.62,1499.5,4.3,0.61,9.04,10.81,3111,-11.8,0.09,11.98,11.28,5770,27.95,46.25,10120,?,5.84,0.
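Some rows in the sample use `?` where a measurement is missing, and left as-is those markers would force the affected columns to be read as text. The sketch below shows one way to handle this with pandas, assuming pandas is installed. To stay short it uses a cut-down stand-in with only a date, three measurements and the class label; the values are adapted from the sample and the two `?` entries were placed by hand for illustration.

```python
from io import StringIO

import pandas as pd

# A shortened stand-in for the ozone file: a date, three measurements and
# the class label. The real rows have many more columns.
raw = StringIO(
    "1/1/1998,0.8,1.8,2.4,0.\n"
    "1/2/1998,2.8,?,3.3,0.\n"
    "1/3/1998,2.9,2.8,?,0.\n"
)

# header=None because the file has no header row; na_values="?" turns the
# "?" placeholders into NaN so the columns stay numeric.
data = pd.read_csv(raw, header=None, na_values="?")

# Count missing entries per column before deciding how to handle them.
print(data.isna().sum())
```

Whether to drop rows with missing values or impute them is a modeling decision that depends on the method being evaluated.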
Summary
In this post, you discovered a suite of standard time series forecast datasets that you can use to get started and practice time series forecasting with machine learning methods.
Specifically, you learned about:
- 4 univariate time series forecasting datasets.
- 3 multivariate time series forecasting datasets.
- Two websites where you can download many more datasets.
Did you use one of the above datasets in your own project?
Share your findings in the comments below.
Hey there, great tutorial! I need your help:
I have to make a weather forecasting project for my college. It has to be based on a time series dataset I guess. But I’m having a difficult time trying to get a suitable multivariate dataset, also I would like to ask you for an ML model to use in this kind of problem. I will appreciate any resource you could provide me.
Consider your government’s meteorological organization. Most give data freely.
Hi, I am looking for industrial time series datasets. Any suggestions? Thanks.
What is wrong with the examples in this post?
In my work on a weather dataset there are 4 classes: clear, partially cloudy, overcast and rain.
I use an LSTM model. Which LSTM model should I use for multi-class classification?
I recommend trying a few different model architectures and compare results to classical ML models in order to discover what works well for your specific dataset.
Hi Jason,
Many thanks for your article, I found a useful dataset.
I did not find any dataset on UCI about temperature and energy consumption inside a building, I was wondering if you could help me in some way.
I hope to hear from you soon
Sorry, I’m not aware of such a dataset off-the-cuff.
I have a multivariate dataset with observations from day 1 to 49 for each of almost 30 patients. The end result is whether the patient has PTSD (1) or not (0). Please suggest how I should approach this problem in terms of data pre-processing.
Sounds like a sequence classification problem.
This post might give you some ideas:
https://machinelearningmastery.com/sequence-prediction/
LSTMs might be a good fit:
https://machinelearningmastery.com/start-here/#lstm
Is there any sample code in Python or C for time series, i.e. preparing data via pandas (separating the needed columns), analysing it for training, preparing the model, training the model, and applying it to test data?
Please excuse me in case I have requested anything wrong.
I have many examples, try searching on the blog.
Hi, I am trying to create a model that uses past data (sales volume + weather conditions, for example) to predict the next 5 days of sales volume, but I would also like to use the weather forecast for the next 5 days to predict the volumes.
Can you tell me about the model to use (I guess RNN) and how to build my dataset.
Regards
I recommend following this process:
https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
You can get code examples for multivariate input and multi-step output here:
https://machinelearningmastery.com/start-here/#deep_learning_time_series
Hi Jason,
My question may seem a bit odd, so I beg your pardon in advance. I am working on short-term load forecasting. As far as I know, AEMO publishes open data about electricity. I can access the half-hourly load demand of past years (from 2006 through 2018); however, I cannot access the half-hourly weather data (temperature and bulb) of Australian regions (QSL, VIC, NSW, etc.). I want to make a comparative analysis with journal papers, so I am looking for this data, and the authors of some papers have not shared their AEMO data yet. How can I get or find this data? Can you direct me on this issue?
My best advice is to contact the authors directly, and perhaps their advisors/colleagues?
I have a dataset of shipping cost per day (over one year); however, not every day has a shipping cost. What’s the best way to deal with the missing daily costs in order to do a time series analysis?
Perhaps start by filling the missing values with the mean/average values of the series?
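A rough sketch of that suggestion with pandas is below. The dates and costs are hypothetical, and it assumes pandas is installed: the series is first placed on a complete daily calendar so that the gaps become explicit, and the gaps are then filled with the mean of the observed values.

```python
import pandas as pd

# Hypothetical shipping costs: only some days have a record.
costs = pd.Series(
    [120.0, 80.0, 100.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-06"]),
)

# Reindex onto a complete daily calendar; days without a record become NaN.
all_days = pd.date_range(costs.index.min(), costs.index.max(), freq="D")
daily = costs.reindex(all_days)

# Fill the gaps with the mean of the observed values.
filled = daily.fillna(costs.mean())

print(filled)
```

If a missing day really means that nothing was shipped, filling with zero via `fillna(0)` may describe the data better than the mean, so it is worth comparing both.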
I need to find a dataset and decompose it for BTS fault prediction from fault history
(total down time and 3 cells/sectors). How could this be possible?
Perhaps check on Kaggle?
Hi. May I get your email address please? I’m also working on a similar project.
Do you think multivariate time series can take advantage of CNNs?
Can you combine CNNs with LSTMs?
How would you build a time series autoencoder for where each instant has 30 variables?
Yes and yes.
I have examples, perhaps start here:
https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/
And here:
https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
Multivariate datasets are generally more challenging, as you said. How can I apply a neural network algorithm to these datasets in WEKA? I am doing something wrong, as I am getting the same result for yearly/monthly/weekly datasets. Please guide.
Good question.
There may be a way, I don’t have an example sorry.
Yes, I found a way. We can use the Overlay for training and test data using the advanced configuration in the time series package. We can set single or multiple dependent parameters in the overlay. While using the overlay, the dataset is separated automatically into training and test data according to the values set in the Evaluation tab.
Ohhhk… Thanks for your prompt reply Jason. I am rendering around it.
You’re welcome.
Hi Jason,
Can you help me on how to convert a txt file to csv file?
Perhaps change the file extension from .txt to .csv?
Is it mandatory to convert the text file into a csv file and then into a pandas dataframe for further work? Or does it provide conflict if it is not done?
No, Pandas does not care about file extensions, only the content.
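To illustrate, the sketch below writes a small comma-separated file with a `.txt` extension to a temporary folder and loads it directly with `read_csv`. The file name and contents are made up for the example, and pandas is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Write a small comma-separated file with a .txt extension.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "observations.txt")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write("date,value\n2020-01-01,1.5\n2020-01-02,2.5\n")

    # read_csv parses the content; a .txt extension is not a problem.
    # For other delimiters, pass sep, e.g. sep="\t" for tab-separated text.
    data = pd.read_csv(path)

print(data)
```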
Does anyone have a discrete dataset?
For time series, yes, there are some examples of time series classification here:
https://machinelearningmastery.com/how-to-model-human-activity-from-smartphone-data/
Dear Jason,
Thank you for the wonderful post. I have a dataset, similar to Occupancy Detection Dataset, which you have described above.
1. Can we apply LSTMs, CNNs on these data?
2. Does this kind of data count as multivariate time series data? What I have understood so far is that in time series data there is a sequence in the rows and columns, i.e. we can’t move any columns or rows, since time series data has a sequence.
3. What kind of models can we apply to such a problem?
Regards,
Aashish
Perhaps try a suite of algorithms and compare results.
Yes, multivariate inputs. More on the types of time series problems here:
https://machinelearningmastery.com/taxonomy-of-time-series-forecasting-problems/
Hey Jason, Great Post.
I deal with system and application monitoring data a lot. I am looking for production ready software that would help me store data in Timeseries Database and apply predictive analytics (RNN, S/ARIMA) continuously. I see there are couple of cool libraries like TICK stack, LoudML and Facebook prophet.
Any tutorial would be great demonstrating the deployment of such continuous predictive system.
Best Regards,
Rajesh
Thanks.
Great suggestion!
Hi Jason,
Where can I get information about RNN or LSTM time series prediction datasets that need improvements, for example in terms of accuracy?
We minimize error for time series, not accuracy.
What do you mean by “need improvement”?
If you want to solve real problems where people care about the outcome, perhaps start with kaggle or take on some consulting work?
Hi there,
Is there any solution to handle 3D data with a “traditional” ML solution?
For example, suppose I have time series generated by 1000 users. In this scenario, we have 1000 time series. How can I make a generalized VARMAX or ARIMAX model for every user if I don’t want to use an LSTM?
You can transform the dataset into a supervised learning problem and test a suite of standard ml algorithms:
https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
And here:
https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
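As a small sketch of the idea, the function below reframes a sequence so that the previous observations become the inputs and the next observation becomes the output. The function name and the `n_lags` parameter are chosen here for illustration and are not taken from the linked tutorials.

```python
def series_to_supervised(values, n_lags=1):
    """Turn a sequence into (inputs, output) pairs using lagged values."""
    rows = []
    for i in range(n_lags, len(values)):
        # Inputs are the n_lags observations before time i; output is time i.
        rows.append((list(values[i - n_lags:i]), values[i]))
    return rows


series = [10, 20, 30, 40, 50]
for inputs, output in series_to_supervised(series, n_lags=2):
    print(inputs, "->", output)
```

With many users, one option is to apply the same framing to each user's series and stack the resulting rows, so that a single standard model can be fit across users.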
Thank you!
You’re welcome.
Hi there.
Do we categorize GPS trajectories as Univariate Time Series?
Perhaps start by assigning categories to your sequences first, then explore modeling it as a time series classification task.
The tutorials here will help:
https://machinelearningmastery.com/start-here/#deep_learning_time_series
Can you post something like “How to prepare time series dataset for machine learning” that is implemented using sklearn?
I have many such tutorials, perhaps start here:
https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
Hi,
My data is in the format timestamp, no of customers. I want to convert it into an hourly time series. How should I do that?
It really depends on your data, sorry, I cannot give better advice than that.
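For one common case, a sketch is possible. It assumes each row records a customer count at an irregular timestamp and that counts falling within the same hour should be added together; the records below are hypothetical and pandas is assumed to be installed. If the values are readings rather than counts, a mean or last-value aggregation may fit better than a sum.

```python
import pandas as pd

# Hypothetical records: a timestamp and the number of customers seen.
records = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-01-01 09:05",
                "2020-01-01 09:40",
                "2020-01-01 11:15",
            ]
        ),
        "customers": [3, 2, 4],
    }
)

# Index by time, then sum the counts that fall inside each hour.
# Hours with no records (10:00 here) appear with a total of 0.
hourly = records.set_index("timestamp")["customers"].resample("h").sum()

print(hourly)
```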
Hi Jason,
I have a dataset with columns as follows “Account Jan Feb Mar Q1 Apr May Jun Q2 Jul Aug Sep Q3 Oct Nov Dec Q4 YearTotal Year”
How am I supposed to consume this data for a forecasting model, as my month columns don’t have any dates; instead they hold the sales figures for each account? E.g.
Account jan feb march Q1 Year
Revision 267829.5 279052.45 260298.54 807180.49 2019
My aim is to predict the Q3 and Q4 for the year 2020.
Please give your thoughts.
Perhaps start with a persistence model, then move on to evaluate a suite of models in order to discover what works well or best for your dataset.
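A persistence model simply predicts that the next value will equal the most recent observation, which makes it a useful baseline. The sketch below illustrates the idea with the standard library only, using the first five shampoo sales observations from this post as stand-in data rather than the account data in the question.

```python
from math import sqrt

# First five observations from the shampoo sales sample in this post.
series = [266.0, 145.9, 183.1, 119.3, 180.3]

# Persistence: the forecast for each step is the previous observation.
predictions = series[:-1]
expected = series[1:]

# Root mean squared error of the persistence forecast.
squared_errors = [(p - e) ** 2 for p, e in zip(predictions, expected)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))

print("RMSE: %.3f" % rmse)
```

Any more elaborate model would need to achieve a lower error than this baseline on the same data to be considered skillful.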
I saw your persistence model, which you used on the shampoo and monthly car sales data. They are both univariate datasets; in my case I have multivariate data. Can you please suggest how to approach a multivariate problem?
How do I do time series forecasting by considering 3 to 5 columns and predict? If there is a way I can share a sample with you, please do suggest.
Yes, the tutorials here will get you started with multivariate time series forecasting:
https://machinelearningmastery.com/start-here/#deep_learning_time_series
Hi Jason,
What I would like to ask is this, I have a time series historical data. It is daily sales data however, I have different product id’s. For example, I have 3 different dates for product 1, but I have 8 different dates for product 2.
I am expected to build an algorithm to forecast the sales of any product for next day.
How should I proceed?
e.g.
productid date soldquantity
1 23.11.2018 0
21 30.11.2018 0
21 27.12.2018 0
21 9.01.2019 0
21 18.12.2018 0
21 5.01.2019 0
21 7.01.2019 0
21 31.12.2018 0
21 26.12.2018 0
21 25.12.2018 0
21 10.01.2019 0
31 1.12.2018 0
31 19.11.2018 0
31 11.11.2018 0
31 27.11.2018 0
31 22.11.2018 0
I would expect each product id is a separate series.
You can use a machine learning or deep learning model to learn per product or across products.
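As a minimal sketch of the first step, the snippet below splits a table in the shape shown in the question into one date-ordered series per product id. The rows reuse dates from the question with made-up quantities, the variable names are chosen here for illustration, and pandas is assumed to be installed.

```python
import pandas as pd

# A few rows in the shape shown in the question (quantities are made up).
sales = pd.DataFrame(
    {
        "productid": [21, 21, 31, 31, 21],
        "date": [
            "30.11.2018",
            "27.12.2018",
            "19.11.2018",
            "22.11.2018",
            "18.12.2018",
        ],
        "soldquantity": [0, 2, 1, 0, 3],
    }
)

# Dates are day-first (DD.MM.YYYY), so give the format explicitly.
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")

# Build one date-ordered series per product id.
series_by_product = {
    product: group.set_index("date")["soldquantity"].sort_index()
    for product, group in sales.groupby("productid")
}

for product, series in series_by_product.items():
    print(product, series.tolist())
```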
Hi again. Thanks for the quick answer.
I considered taking all products as separate series; however, I have more than 10 thousand products.
Which machine learning method could be used? I am very new at this.
I recommend starting with linear models like linear regression or SARIMA then move on to more advanced methods and see if they offer a benefit.
This framework will help:
https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
Hi! you may want to ctrl+f “At the time of writing, there are” and find that you left this sentence twice in a row. Thanks for the article! it helped me find a dataset I needed.
Thanks, fixed!
Jason, can you help us understand FourierFeaturizer from the pmdarima Python package and how to interpret it? I want to use it to forecast seasonal data with long seasonal periods.
The regular approach takes a good amount of time, so based on https://robjhyndman.com/hyndsight/longseasonality/ I am exploring the usage of FourierFeaturizer.
Thank you for the suggestion.
Hey Jason, the examples in this article look great! I’m actually looking for a signal processing dataset to apply time series modelling for a project. Could you suggest any open source datasets in this context?
Thanks.
This may help:
https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___/
I have a satellite time series (multivariate dataset) with images from day 1 to 10 and almost 7 classes. Please suggest how I should approach this problem in terms of data augmentation.
Perhaps you can use a pre-trained model with a custom CNN-LSTM type architecture.
Hello Jason,
I have a GPS dataset (latitude, longitude, timestamp) as a dataset. Each track (series of GPS points) of a participant is compared with another participant who walks on the same track. I want to do time series classification on this data, which kind of data can this be?
Thank you
That sounds like a great project. It is time series classification, try a suite of models and discover what works well or best for your data.
I am working with time series about education in my AI thesis project. The values are by year from 2013 to 2021, so I have nine records. I think it is a small dataset for a PhD. What do you think? Any suggestions?
9 records probably can’t help you go too far, but it should be a good start.
Hello!
Brother, can you provide a supply chain multi-mode (air, truck, ocean, etc.) travel time prediction dataset?
I will be very thankful !
Hi Sham…I do not have such a dataset. You may want to check Kaggle or StackOverflow.
Hello!
Thanks for your post.
Can I get the reference about where datasets on the post came from?
I want to get more about each dataset.
Are they from the “Time Series Data Library” created by Rob Hyndman, Professor of Statistics at Monash University, Australia, as you said in the first part of the post, quoted below?
“There are many sources of time series dataset, such as the “Time Series Data Library” created by Rob Hyndman, Professor of Statistics at Monash University, Australia”
Hi Hanson…each dataset contains a link that you can follow as the source. Also, in some cases the author’s name is provided so that you can perform a search on the author and the datasets they have published.
Additional comment:
I read there is some description about the source, such as
“The source of the data is credited as the Australian Bureau of Meteorology.” for Minimum Daily Temperatures Dataset.
But could you let me know how to get the source of the data in detail?
Thank you!
Hi Hanson…each dataset contains a link that you can follow as the source. Also, in some cases the author’s name is provided so that you can perform a search on the author and the datasets they have published.
Additional comment:
I read the part about where the source data was such as
“The source of the data is credited as the Australian Bureau of Meteorology.” for Minimum Daily Temperatures Dataset.
But could you let me know how I can get the data through Australian Bureau of Meteorology in detail?
Thank you!
Hi Hanson…each dataset contains a link that you can follow as the source. Also, in some cases the author’s name is provided so that you can perform a search on the author and the datasets they have published.