7 Time Series Datasets for Machine Learning

Last Updated on August 21, 2019

Machine learning can be applied to time series datasets.

These are problems where a numeric or categorical value must be predicted, but the rows of data are ordered by time.

A problem when getting started in time series forecasting with machine learning is finding good quality standard datasets on which to practice.

In this post, you will discover 7 standard time series datasets that you can use to get started and practice time series forecasting with machine learning.

After reading this post, you will know:

  • 4 univariate time series datasets.
  • 3 multivariate time series datasets.
  • Websites that you can use to search and download more datasets.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Updated Apr/2019: Updated the links to the datasets.

Univariate Time Series Datasets

Time series datasets that only have one variable are called univariate datasets.

These datasets are a great place to get started because:

  • They are so simple and easy to understand.
  • You can plot them easily in Excel or your favorite plotting tool.
  • You can easily plot the predictions compared to the expected results.
  • You can quickly try and evaluate a suite of traditional and newer methods.
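As a minimal sketch of that workflow in Python, the snippet below loads a two-column CSV (a date column and a value column) into a pandas Series indexed by date. The inline text is a made-up stand-in for a downloaded file, not values from any dataset in this post, and the column names are assumptions.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded CSV file.
# The column names and values are made up for illustration only.
csv_text = """Date,Value
2001-01-01,10.0
2001-02-01,12.5
2001-03-01,11.0
2001-04-01,14.2
2001-05-01,13.8
"""

# Use the first column as a date index and squeeze the single
# remaining column into a Series.
frame = pd.read_csv(StringIO(csv_text), header=0, index_col=0, parse_dates=True)
series = frame.squeeze("columns")

# Show the first 5 observations.
print(series.head())
```

With matplotlib installed, calling series.plot() on the result is usually enough for a first line plot.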

There are many sources of time series datasets, such as the “Time Series Data Library” created by Rob Hyndman, Professor of Statistics at Monash University, Australia.

Below are 4 univariate time series datasets that you can download from a range of fields such as Sales, Meteorology, Physics and Demography.

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3 year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright and Hyndman (1998).

Below is a sample of the first 5 rows of data including the header row.

Below is a plot of the entire dataset.

Shampoo Sales Dataset

The dataset shows an increasing trend and possibly some seasonal component.

Minimum Daily Temperatures Dataset

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.

Below is a sample of the first 5 rows of data including the header row.

Below is a plot of the entire dataset.

Minimum Daily Temperatures

The dataset shows a strong seasonal component and has fine-grained daily detail to work with.

Monthly Sunspot Dataset

This dataset describes a monthly count of the number of observed sunspots for just over 230 years (1749-1983).

The units are a count and there are 2,820 observations. The source of the dataset is credited to Andrews & Herzberg (1985).

Below is a sample of the first 5 rows of data including the header row.

Below is a plot of the entire dataset.

Monthly Sunspot Dataset

The dataset shows seasonality with large differences between seasons.

Daily Female Births Dataset

This dataset describes the number of daily female births in California in 1959.

The units are a count and there are 365 observations. The source of the dataset is credited to Newton (1988).

Below is a sample of the first 5 rows of data including the header row.

Below is a plot of the entire dataset.

Daily Female Births Dataset

Multivariate Time Series Datasets

Multivariate datasets are generally more challenging and are the sweet spot for machine learning methods.

A great source of multivariate time series data is the UCI Machine Learning Repository. At the time of writing, there are 63 time series datasets that you can download for free and work with.

Below is a selection of 3 recommended multivariate time series datasets from Meteorology, Medicine and Monitoring domains.

EEG Eye State Dataset

This dataset describes EEG data for an individual and whether their eyes were open or closed. The objective of the problem is to predict whether eyes are open or closed given EEG data alone.

This is a classification predictive modeling problem with a total of 14,980 observations and 15 input variables. The class value of ‘1’ indicates the eye-closed state and ‘0’ the eye-open state. Data is ordered by time and observations were recorded over a period of 117 seconds.

Below is a sample of the first 5 rows with no header row.

Occupancy Detection Dataset

This dataset describes measurements of a room and the objective is to predict whether or not the room is occupied.

There are 20,560 one-minute observations taken over the period of a few weeks. This is a classification prediction problem. There are 7 attributes including various light and climate properties of the room.

The source for the data is credited to Luis Candanedo from UMONS.

Below is a sample of the first 5 rows of data including the header row.

The data is provided in 3 files that suggest the splits that may be used for training and testing a model.
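If you need to make your own split instead of using the provided files, a time-ordered split keeps every test row later than every training row. The sketch below assumes a small made-up frame of minute-level readings. The column names, the timestamps and the 70/30 ratio are all hypothetical choices, not properties of the dataset.

```python
import pandas as pd

# Hypothetical minute-level readings; names and values are made up.
frame = pd.DataFrame(
    {
        "light": [0, 5, 10, 400, 420, 410, 8, 4, 390, 405],
        "occupancy": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    },
    index=pd.date_range("2015-02-04 17:51", periods=10, freq="min"),
)

# Split by position, not at random, so the time order is preserved.
split_point = int(len(frame) * 0.7)
train = frame.iloc[:split_point]
test = frame.iloc[split_point:]

print(len(train), len(test))
```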

Ozone Level Detection Dataset

This dataset describes 6 years of ground ozone concentration observations and the objective is to predict whether it is an “ozone day” or not.

The dataset contains 2,536 observations and 73 attributes. This is a classification prediction problem and the final attribute indicates the class value as “1” for an ozone day and “0” for a normal day.

Two versions of the data are provided: an eight-hour peak set and a one-hour peak set. I would suggest using the one-hour peak set for now.

Below is a sample of the first 5 rows with no header row.

Summary

In this post, you discovered a suite of standard time series forecast datasets that you can use to get started and practice time series forecasting with machine learning methods.

Specifically, you learned about:

  • 4 univariate time series forecasting datasets.
  • 3 multivariate time series forecasting datasets.
  • Two websites where you can download many more datasets.

Did you use one of the above datasets in your own project?
Share your findings in the comments below.

56 Responses to 7 Time Series Datasets for Machine Learning

  1. R. Edwin July 6, 2017 at 3:27 am #

    Hey there, great tutorial! I need your help:
    I have to make a weather forecasting project for my college. It has to be based on a time series dataset I guess. But I’m having a difficult time trying to get a suitable multivariate dataset, also I would like to ask you for an ML model to use in this kind of problem. I will appreciate any resource you could provide me.

    • Jason Brownlee July 6, 2017 at 10:26 am #

      Consider your government’s meteorological organization. Most give data freely.

  2. Parijat September 29, 2017 at 4:47 am #

    Hi, I am looking for industrial time series datasets. Any suggestions.. Thanks.

    • Jason Brownlee September 29, 2017 at 5:09 am #

      What is wrong with the examples in this post?

  3. Domenico November 4, 2017 at 12:45 am #

    Hi Jason,
    many thanks for your article, I found a useful dataset.
    I did not find any dataset on UCI about temperature and energy consumption inside a building, I was wondering if you could help me in some way.
    I hope to hear from you soon

    • Jason Brownlee November 4, 2017 at 5:31 am #

      Sorry, I’m not aware of such a dataset off-the-cuff.

  4. Nisha Chaube January 21, 2018 at 7:28 am #

    I have a multivariate-dataset with observations from day 1 to 49 for each of the almost 30 patients. The end result is whether the patient has PTSD (1) or not ( 0 ). Please suggest how am I supposed to approach this problem in terms of data pre-processing.

  5. VEERENDRA JONNALAGADDA June 1, 2018 at 5:22 am #

    Is there any sample code in Python or C for time series, i.e. preparing data via pandas (separating the needed columns), analysing it for training, preparing the model, training the model, and applying it to test data?

    Please excuse me in case I have requested anything wrong.

  6. Florent January 20, 2019 at 7:51 pm #

    Hi, I am trying to create a model that uses past data (sales volume + weather conditions, for example) to predict the next 5 days of sales volume, but I would also like to use the weather forecast for the next 5 days to predict the volumes.

    Can you tell me about the model to use (I guess RNN) and how to build my dataset.

    Regards

  7. Avram March 8, 2019 at 11:38 pm #

    Hi Jason,
    My question may seem a bit weird, so I beg your pardon in advance. I am working on short-term load forecasting. As far as I know, AEMO publishes electricity data. I can access the half-hourly load demand of past years (from 2006 through 2018); however, I cannot access the half-hourly weather data (temperature and bulb) of Australian regions (QSL, VIC, NSW, etc.). I will make a comparative analysis with journal papers, so I am looking for these data, and the authors of some papers have not shared their AEMO data yet. How can I get or find these data? Can you direct me on this issue?

    • Jason Brownlee March 9, 2019 at 6:29 am #

      My best advice is to contact the authors directly, and perhaps their advisors/colleagues?

  8. fernando A gutierrez March 12, 2019 at 6:56 am #

    I have a dataset of shipping cost per day (in one year); however, not every day has a shipping cost. What’s the best way to deal with missing daily costs in order to make a time series analysis?

    • Jason Brownlee March 12, 2019 at 7:00 am #

      Perhaps start by filling the missing values with the mean/average values of the series?
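      As a rough illustration of this suggestion (a sketch, not code from the original reply), the snippet below places a made-up set of daily costs onto a complete daily calendar and fills the gaps with the mean of the observed values. The dates and costs are hypothetical.

```python
import pandas as pd

# Hypothetical shipping costs recorded on only some days (made-up values).
observed = pd.Series(
    [100.0, 120.0, 110.0],
    index=pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-06"]),
)

# Reindex onto every calendar day so the missing days appear as NaN.
full_index = pd.date_range(observed.index.min(), observed.index.max(), freq="D")
daily = observed.reindex(full_index)

# Fill the gaps with the mean of the observed values (NaN is ignored by mean()).
filled = daily.fillna(daily.mean())

print(filled)
```

      Mean filling is only a starting point. Whether a missing day means "no shipment" (a cost of zero) or "not recorded" depends on the data and changes which fill is sensible.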

  9. one July 3, 2019 at 1:07 pm #

    I need to find a dataset and decompose it for BTS fault prediction from fault history:
    total down time and 3 cells/sectors. How could that be possible?

    • Jason Brownlee July 4, 2019 at 7:37 am #

      Perhaps check on Kaggle?

    • nandy October 3, 2019 at 5:05 pm #

      Hi one. May I get your email address please? i’m also working on similar project

  10. Abderahmane Bouziane July 23, 2019 at 6:20 am #

    Do you think multivariate time series can take advantage of CNNs?
    Can you combine CNNs with LSTMs?
    How would you build a time series autoencoder for where each instant has 30 variables?

  11. Shital September 19, 2019 at 3:59 pm #

    Multivariate datasets are generally more challenging as you said. How to apply neural network algorithm on these datasets in WEKA? I am doing something wrong as I am getting the same result for yearly/monthly/weekly datasets. Please guide.

    • Jason Brownlee September 20, 2019 at 5:35 am #

      Good question.

      There may be a way, I don’t have an example sorry.

      • Shital Bhojani October 1, 2019 at 2:09 pm #

        Yes, I found a way. We can use the Overlay for training and test data using the advance configuration in time series package. We can set the single or multiple dependent parameters in overlay. While using overlay, data set is separated automatically in training and test data as per the values we have set in Evaluation tab.

  12. Shital Bhojani September 20, 2019 at 2:37 pm #

    Ohhhk… Thanks for your prompt reply Jason. I am rendering around it.

  13. Arjun November 19, 2019 at 3:35 pm #

    Hi jason,
    Can you help me on how to convert a txt file to csv file?

    • Jason Brownlee November 20, 2019 at 6:08 am #

      Perhaps change the file extension from .txt to .csv?

  14. Arjun November 19, 2019 at 4:26 pm #

    Is it mandatory to convert the text file into a csv file and then into a pandas dataframe for further work? Or does it provide conflict if it is not done?

    • Jason Brownlee November 20, 2019 at 6:09 am #

      No, Pandas does not care about file extensions, only the content.
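      A small sketch of that point: the snippet writes comma-separated text to a file named with a .txt extension (a hypothetical file created in a temporary directory) and reads it with pandas.read_csv unchanged. The column names and values are made up.

```python
import os
import tempfile

import pandas as pd

# Write comma-separated content to a file with a .txt extension.
# The file name and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")
    with open(path, "w") as handle:
        handle.write("date,value\n2019-01-01,1.5\n2019-01-02,2.5\n")

    # read_csv parses the content; the extension plays no role here.
    frame = pd.read_csv(path)

print(frame)
```

      If the text file uses a different delimiter, such as tabs or spaces, the sep argument of read_csv needs to be set accordingly.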

  15. adil shahzad November 27, 2019 at 8:02 pm #

    Does anyone have a discrete dataset?

  16. Aashish Agarwal December 21, 2019 at 9:47 am #

    Dear Jason,

    Thank you for the wonderful post. I have a dataset, similar to Occupancy Detection Dataset, which you have described above.

    1. Can we apply LSTMs, CNNs on these data?
    2. Does this kind of data count as multivariate time series data? From what I have understood so far, in time series data there is a sequence in the rows and columns, i.e. we can’t move any columns or rows since time series data have a sequence.
    3. What kind of models can we apply to such a problem?

    Regards,
    Aashish

  17. Rajesh December 22, 2019 at 11:29 pm #

    Hey Jason, Great Post.

    I deal with system and application monitoring data a lot. I am looking for production ready software that would help me store data in Timeseries Database and apply predictive analytics (RNN, S/ARIMA) continuously. I see there are couple of cool libraries like TICK stack, LoudML and Facebook prophet.

    Any tutorial would be great demonstrating the deployment of such continuous predictive system.

    Best Regards,
    Rajesh

  18. Laila January 8, 2020 at 6:54 am #

    Hi Jason,

    Where can I get information about RNN or LSTM time series prediction datasets that need improvements, for example in terms of accuracy?

    • Jason Brownlee January 8, 2020 at 8:36 am #

      We minimize error for time series, not accuracy.

      What do you mean by “need improvement”?

      If you want to solve real problems where people care about the outcome, perhaps start with kaggle or take on some consulting work?

  19. GKboy March 30, 2020 at 7:17 pm #

    Hi there,

    Is there any solution to handle 3D data with a “traditional” ML solution?
    For example, suppose I have time series generated by 1000 users. In this scenario, we have 1000 time series. How can I make a generalized VARMAX or ARIMAX model for every user, if I don’t want to use an LSTM?

  20. Remirab April 13, 2020 at 11:15 pm #

    Hi there.

    Do we categorize GPS trajectories as Univariate Time Series?

  21. Suresh Reshu April 22, 2020 at 1:34 am #

    Can you post something like “How to prepare a time series dataset for machine learning”, implemented using sklearn?

  22. Shubhi Jain May 6, 2020 at 5:39 am #

    Hi,

    My data is in the format timestamp, no of customers. I want to convert it into an hourly time series. How should I do that?

    • Jason Brownlee May 6, 2020 at 6:31 am #

      It really depends on your data, sorry, I cannot give better advice than that.
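      For one common case, the sketch below shows how pandas resampling can turn timestamped records into an hourly series. It assumes each row is an event with a customer count and that summing within the hour is the right aggregation. The records are made up, and a mean or last value may suit other data better.

```python
import pandas as pd

# Hypothetical records: a timestamp and a number of customers (made-up values).
records = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2020-05-01 09:05", "2020-05-01 09:40", "2020-05-01 11:15"]
        ),
        "customers": [3, 2, 5],
    }
)

# Index by time, then total the counts inside each 60-minute bin.
# Hours with no records get a total of 0.
hourly = records.set_index("timestamp")["customers"].resample("60min").sum()

print(hourly)
```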

  23. Sachin Kannan August 31, 2020 at 12:19 am #

    Hi Jason,

    I have a dataset with columns as follows “Account Jan Feb Mar Q1 Apr May Jun Q2 Jul Aug Sep Q3 Oct Nov Dec Q4 YearTotal Year”

    How am I supposed to consume this data for a forecasting model, as my month columns don’t have any dates attached; instead they hold the sales figures for each account. E.g.

    Account jan feb march Q1 Year
    Revision 267829.5 279052.45 260298.54 807180.49 2019

    My aim is to predict the Q3 and Q4 for the year 2020.

    Please give your thoughts.

    • Jason Brownlee August 31, 2020 at 6:16 am #

      Perhaps start with a persistence model, then move on to evaluate a suite of models in order to discover what works well or best for your dataset.
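      A persistence model predicts each value as the previous observed value, which gives a baseline error for other models to beat. The sketch below is one way to write it, using a made-up series and an arbitrary split point. It is not the code from any tutorial on this site.

```python
import numpy as np
import pandas as pd

# Hypothetical series; the values are made up for illustration.
series = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0])

# Time-ordered split: the first 4 observations train, the last 2 test.
train, test = series.iloc[:4], series.iloc[4:]

# Walk forward: predict each test value as the last observed value,
# then reveal the actual value before moving to the next step.
last_observed = train.iloc[-1]
predictions = []
for actual in test:
    predictions.append(last_observed)
    last_observed = actual

# Root mean squared error of the baseline.
errors = test.to_numpy() - np.array(predictions)
rmse = float(np.sqrt(np.mean(errors ** 2)))

print(predictions, rmse)
```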

      • Sachin Kannan September 1, 2020 at 1:34 am #

        I looked at the persistence model which you used on the shampoo and monthly car sales data. They are both univariate datasets; in my case I have multivariate data. Can you please suggest how to approach multivariate data?

        How can I do time series forecasting by considering 3 to 5 columns? If there is a way I can share a sample with you, please let me know.

  24. Beste Karacay September 2, 2020 at 5:44 am #

    Hi Jason,

    What I would like to ask is this: I have historical time series data. It is daily sales data; however, I have different product IDs. For example, I have 3 different dates for product 1, but I have 8 different dates for product 2.
    I am expected to build an algorithm to forecast the sales of any product for next day.
    How should I proceed?

    e.g.
    productid date soldquantity
    1 23.11.2018 0
    21 30.11.2018 0
    21 27.12.2018 0
    21 9.01.2019 0
    21 18.12.2018 0
    21 5.01.2019 0
    21 7.01.2019 0
    21 31.12.2018 0
    21 26.12.2018 0
    21 25.12.2018 0
    21 10.01.2019 0
    31 1.12.2018 0
    31 19.11.2018 0
    31 11.11.2018 0
    31 27.11.2018 0
    31 22.11.2018 0

    • Jason Brownlee September 2, 2020 at 6:34 am #

      I would expect each product id is a separate series.

      You can use a machine learning or deep learning model to learn per product or across products.
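      As a sketch of treating each product id as a separate series, the snippet below splits a long-format table into one date-indexed Series per product with pandas. The column names follow the example in the question; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical rows in the layout shown in the question (made-up values).
sales = pd.DataFrame(
    {
        "productid": [1, 1, 21, 21, 21],
        "date": ["23.11.2018", "24.11.2018", "30.11.2018", "27.12.2018", "09.01.2019"],
        "soldquantity": [0, 2, 0, 1, 3],
    }
)

# Dates are day.month.year, so state the format explicitly.
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")

# One time-ordered Series per product id.
series_by_product = {
    product_id: group.set_index("date")["soldquantity"].sort_index()
    for product_id, group in sales.groupby("productid")
}

print(series_by_product[21])
```

      Each Series can then be modeled on its own, or the frame can be kept in long format to fit a single model across products.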
