Machine Learning Datasets in R (10 datasets you can use right now)

You need standard datasets to practice machine learning.

In this short post you will discover how you can load standard classification and regression datasets in R.

This post will show you 3 R libraries that you can use to load standard datasets and 10 specific datasets that you can use for machine learning in R.

It is invaluable to load standard datasets in R so that you can test, practice and experiment with machine learning techniques and improve your skill with the platform.

Let’s get started.

Practice On Small Well-Understood Datasets

There are hundreds of standard test datasets that you can use to practice and get better at machine learning.

Most of them are hosted for free on the UCI Machine Learning Repository. These datasets are useful because they are well understood, they are well behaved and they are small.

This last point is critical when practicing machine learning because:

  • You can download them fast.
  • You can fit them into memory easily.
  • You can run algorithms on them quickly.

Access Standard Datasets in R

You can load the standard datasets into R as CSV files.

There is a more convenient approach to loading the standard datasets: many have been packaged and are available in third-party R libraries that you can download from the Comprehensive R Archive Network (CRAN).

Which libraries should you use, and which datasets are good to start with?

How To Load Standard Datasets in R

In this section you will discover the libraries that you can use to get access to standard machine learning datasets.

You will also discover specific classification and regression datasets that you can load and use to practice machine learning in R.

Library: datasets

The datasets library comes with base R, which means you do not need to install it or load it explicitly. It includes a large number of datasets that you can use.

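You can load a dataset from this library by typing data(DataSetName), where DataSetName is a placeholder for the name of the dataset.

For example, to load the very commonly used iris dataset, type data(iris).

To see a list of the datasets available in this library, you can type library(help = "datasets").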
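
As a minimal sketch of the steps above, the following loads iris and lists the dataset names in the datasets library. Using data(package = "datasets") rather than library(help = "datasets") is my own choice for this example, because it returns the names as a value you can work with instead of opening a help page.

```r
# The datasets library ships with R and is attached by default,
# so no install.packages() or library() call is needed.
data(iris)

# Confirm the data frame is available and check its size.
print(dim(iris))

# data() with a package argument returns a listing of the datasets;
# the "Item" column of its results matrix holds the dataset names.
available <- data(package = "datasets")$results[, "Item"]
print(head(available, 10))
```

The variable name available is only illustrative. You can search it with %in% to check whether a dataset you have read about is included.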

Some highlight datasets from this library that you could use are below.

Iris Flowers Dataset

  • Description: Predict iris flower species from flower measurements.
  • Type: Multi-class classification
  • Dimensions: 150 instances, 5 attributes
  • Inputs: Numeric
  • Output: Categorical, 3 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

You can load the data with data(iris) and preview the first few rows with head(iris).
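
A short sketch of a first look at the iris data, assuming nothing beyond base R. It previews the rows and counts the instances per class, which is a useful check before practicing any classification technique.

```r
# Load the iris data frame from the built-in datasets library.
data(iris)

# Preview the first six rows: four numeric inputs and the Species label.
print(head(iris))

# Count the instances per class to see how balanced the problem is.
print(table(iris$Species))
```

Each of the three species has 50 rows, so the classes are perfectly balanced.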

Longley’s Economic Regression Data

  • Description: Predict number of people employed from economic variables
  • Type: Regression
  • Dimensions: 16 instances, 7 attributes
  • Inputs: Numeric
  • Output: Numeric

You can load the data with data(longley) and preview the first few rows with head(longley).
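
The sketch below loads the Longley data and separates the inputs from the output. Treating the Employed column as the output follows the description above; the variable names inputs and output are only illustrative.

```r
# Load the Longley economic data from the built-in datasets library.
data(longley)

# Preview the first six rows.
print(head(longley))

# Separate the six economic input variables from the value to predict.
inputs <- longley[, names(longley) != "Employed"]
output <- longley$Employed

# Check the shape of the inputs and the range of the output.
print(dim(inputs))
print(summary(output))
```

With only 16 instances, this dataset is best suited to quick experiments with regression techniques rather than careful model evaluation.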

Library: mlbench

Direct from the manual for the library:

A collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository.

You can learn more about the mlbench library on the mlbench CRAN page.

If it is not installed, you can install this library by typing install.packages("mlbench").

You can then load the library by typing library(mlbench).

To see a list of the datasets available in this library, you can type library(help = "mlbench").
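
The install, load and list steps can be sketched together as below. The helper ensure_package is hypothetical (it is not part of R or mlbench), and the CRAN mirror URL is an assumption that you may want to replace with a mirror closer to you.

```r
# Hypothetical helper (not part of R or mlbench): install a package from CRAN
# only if it is missing, then report whether it can be loaded.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # try() lets the script continue if the install fails (e.g. no network).
    try(install.packages(pkg, repos = repos))
  }
  requireNamespace(pkg, quietly = TRUE)
}

if (ensure_package("mlbench")) {
  library(mlbench)
  # List the names of the datasets that mlbench provides.
  print(data(package = "mlbench")$results[, "Item"])
}
```

If you prefer the plain recipe, install.packages("mlbench") followed by library(mlbench) does the same job without the helper.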

Some highlight datasets from this library that you could use are:

Boston Housing Data

  • Description: Predict the house price in Boston from house details
  • Type: Regression
  • Dimensions: 506 instances, 14 attributes
  • Inputs: Numeric
  • Output: Numeric
  • UCI Machine Learning Repository: Description

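With the mlbench library loaded, you can load the data with data(BostonHousing) and preview the first few rows with head(BostonHousing).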

Wisconsin Breast Cancer Database

  • Description: Predict whether a cancer is malignant or benign from biopsy details.
  • Type: Binary Classification
  • Dimensions: 699 instances, 11 attributes
  • Inputs: Integer (Nominal)
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

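With the mlbench library loaded, you can load the data with data(BreastCancer) and preview the first few rows with head(BreastCancer).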

Glass Identification Database

  • Description: Predict the glass type from chemical properties.
  • Type: Multi-class classification
  • Dimensions: 214 instances, 10 attributes
  • Inputs: Numeric
  • Output: Categorical, 7 class labels defined (6 appear in the data)
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

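With the mlbench library loaded, you can load the data with data(Glass) and preview the first few rows with head(Glass).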

Johns Hopkins University Ionosphere database

  • Description: Predict high-energy structures in the atmosphere from antenna data.
  • Type: Binary classification
  • Dimensions: 351 instances, 35 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

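With the mlbench library loaded, you can load the data with data(Ionosphere) and preview the first few rows with head(Ionosphere).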

Pima Indians Diabetes Database

  • Description: Predict the onset of diabetes in female Pima Indians from medical record data.
  • Type: Binary Classification
  • Dimensions: 768 instances, 9 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

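With the mlbench library loaded, you can load the data with data(PimaIndiansDiabetes) and preview the first few rows with head(PimaIndiansDiabetes).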

Sonar, Mines vs. Rocks

  • Description: Predict metal or rock returns from sonar return data.
  • Type: Binary Classification
  • Dimensions: 208 instances, 61 attributes
  • Inputs: Numeric
  • Output: Categorical, 2 class labels
  • UCI Machine Learning Repository: Description
  • Published accuracy results: Summary

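With the mlbench library loaded, you can load the data with data(Sonar) and preview the first few rows with head(Sonar).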

Soybean Database

  • Description: Predict problems with soybean crops from crop data.
  • Type: Multi-Class Classification
  • Dimensions: 683 instances, 36 attributes
  • Inputs: Integer (Nominal)
  • Output: Categorical, 19 class labels
  • UCI Machine Learning Repository: Description

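With the mlbench library loaded, you can load the data with data(Soybean) and preview the first few rows with head(Soybean).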
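
To check the dimensions listed for the mlbench datasets above in one pass, a sketch like the following can help. It assumes mlbench is already installed and only prints a message if it is not; the names env and shapes are illustrative.

```r
# The seven mlbench datasets highlighted in this section.
dataset_names <- c("BostonHousing", "BreastCancer", "Glass", "Ionosphere",
                   "PimaIndiansDiabetes", "Sonar", "Soybean")

if (requireNamespace("mlbench", quietly = TRUE)) {
  # Load every dataset into a separate environment to keep the workspace tidy.
  env <- new.env()
  data(list = dataset_names, package = "mlbench", envir = env)

  # Collect the number of instances (rows) and attributes (columns) of each.
  shapes <- t(sapply(dataset_names, function(nm) dim(get(nm, envir = env))))
  colnames(shapes) <- c("instances", "attributes")
  print(shapes)
} else {
  message("mlbench is not installed; run install.packages(\"mlbench\") first.")
}
```

The numbers printed are those of the mlbench version you have installed, so treat them as the reference if they differ from a description you have read elsewhere.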

Library: AppliedPredictiveModeling

Many books that use R also include their own R library that provides all of the code and datasets used in the book.

The excellent book Applied Predictive Modeling has its own library called AppliedPredictiveModeling.

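If it is not installed, you can install this library by typing install.packages("AppliedPredictiveModeling").

You can then load the library by typing library(AppliedPredictiveModeling).

To see a list of the datasets available in this library, you can type library(help = "AppliedPredictiveModeling").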

One highlight dataset from this library that you could use is:

Abalone Data

  • Description: Predict abalone age from abalone measurement data.
  • Type: Regression or Classification
  • Dimensions: 4177 instances, 9 attributes
  • Inputs: Numerical and categorical
  • Output: Integer
  • UCI Machine Learning Repository: Description

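With the AppliedPredictiveModeling library loaded, you can load the data with data(abalone) and preview the first few rows with head(abalone).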
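
The sketch below loads the abalone data and shows why it can be framed as either regression or classification. It assumes the AppliedPredictiveModeling library is already installed and that the output column is named Rings, as it is in the version I am familiar with; check names(abalone) if yours differs.

```r
if (requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  # Load the abalone data frame without attaching the whole library.
  data(abalone, package = "AppliedPredictiveModeling")

  # Check the size and preview the first six rows.
  print(dim(abalone))
  print(head(abalone))

  # Regression framing: predict the ring count as a number.
  rings_numeric <- abalone$Rings
  print(summary(rings_numeric))

  # Classification framing: treat each distinct ring count as a class label.
  rings_class <- factor(abalone$Rings)
  print(nlevels(rings_class))
} else {
  message("Install the library first: install.packages(\"AppliedPredictiveModeling\")")
}
```

The same integer output serves both framings, which is why the dataset is listed as regression or classification.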

Summary

In this post you discovered that you do not need to collect or load your own data in order to practice machine learning in R.

You learned about 3 different libraries that provide sample machine learning datasets that you can use:

  • datasets library
  • mlbench library
  • AppliedPredictiveModeling library

You also discovered 10 specific standard machine learning datasets that you can use to practice classification and regression machine learning techniques.

  • Iris Flowers Dataset (multi-class classification)
  • Longley’s Economic Regression Data (regression)
  • Boston Housing Data (regression)
  • Wisconsin Breast Cancer Database (binary classification)
  • Glass Identification Database (multi-class classification)
  • Johns Hopkins University Ionosphere database (binary classification)
  • Pima Indians Diabetes Database (binary classification)
  • Sonar, Mines vs. Rocks (binary classification)
  • Soybean Database (multi-class classification)
  • Abalone Data (regression or classification)

Next Step

Did you try out these recipes?

  1. Start your R interactive environment.
  2. Type or copy-and-paste the recipes above and try them out.
  3. Use the built-in help in R to learn more about the functions used.

Do you have a question? Ask it in the comments and I will do my best to answer it.

19 Responses to Machine Learning Datasets in R (10 datasets you can use right now)

  1. Rotimi February 16, 2016 at 6:11 am #

    Thanks, great post as always

  2. Rotimi February 16, 2016 at 6:11 am #

    Great post! Thank you sir

  3. Asia December 2, 2016 at 12:35 pm #

    Thanks! I’m looking for regression datasets and I didn’t know one of those you wrote about. But in one place you wrote that the set is for regression and in another place you wrote that it is for classification. It is misleading.

  4. V Malsoru May 13, 2017 at 7:06 am #

    I installed R and practiced some algorithms such as Apriori using the “arules” package, but how do I install the “mlbench” package to run the following datasets?

    “Boston Housing Data (regression)
    Wisconsin Breast Cancer Database (binary classification)
    Glass Identification Database (multi-class classification)
    Johns Hopkins University Ionosphere database (binary classification)
    Pima Indians Diabetes Database (binary classification)
    Sonar, Mines vs. Rocks (binary classification)
    Soybean Database (multi-class classification)
    Abalone Data (regression or classification)”. Please suggest.

    • Jason Brownlee May 14, 2017 at 7:21 am #

      You can install the mlbench package by typing install.packages("mlbench").

  5. Malsoru May 13, 2017 at 7:49 am #

    Which packages are needed to run the cancer datasets? Could you please suggest?

  6. Malsoru June 14, 2017 at 5:52 pm #

    Pima Indians Diabetes Database (binary classification).
    Could you please suggest another diabetes dataset whose attributes differ from the Pima Indians Diabetes Database by one or two, or that has one or two more or fewer attributes? I do not want the same Pima Indians Diabetes Database; I only need a diabetes dataset.

    • Jason Brownlee June 15, 2017 at 8:44 am #

      Perhaps you can search kaggle or the uci machine learning repository?

  7. Nate George January 27, 2018 at 2:31 pm #

    This has got to be the only post of yours where the pictures actually match the topic. The rest seem to be random.

  8. Nate George January 27, 2018 at 2:33 pm #

    Also, it’s install.packages(), not install.library()

  9. GEORGE MASON UNIVERSITY September 18, 2018 at 6:35 am #

    May I know how to apply the central limit theorem to a large multivariate dataset?

  10. Rakesh Patel October 8, 2018 at 4:42 am #

    Sir, I want to work with an age dataset to predict age.

    But I am unable to find the CSV file of the age dataset.

    Sir, will you provide me a link so that I can work on my project?
