# 10 Standard Datasets for Practicing Applied Machine Learning

Last Updated on May 20, 2020

The key to getting good at applied machine learning is practicing on lots of different datasets.

This is because each problem is different, requiring subtly different data preparation and modeling methods.

In this post, you will discover 10 top standard machine learning datasets that you can use for practice.

Let’s dive in.

• Update Mar/2018: Added alternate link to download the Pima Indians and Boston Housing datasets as the originals appear to have been taken down.
• Update Feb/2019: Minor update to the expected default RMSE for the insurance dataset.

## Overview

### A Structured Approach

Each dataset is summarized in a consistent way. This makes them easy to compare and navigate when you want to practice a specific data preparation technique or modeling method.

The aspects that you need to know about each dataset are:

1. Name: How to refer to the dataset.
2. Problem Type: Whether the problem is regression or classification.
3. Inputs and Outputs: The numbers and known names of input and output features.
4. Performance: Baseline performance for comparison using the Zero Rule algorithm, as well as best known performance (if known).
5. Sample: A snapshot of the first 5 rows of raw data.
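The Zero Rule algorithm used for the baselines is simple enough to sketch in a few lines. This is an illustrative implementation (not code from the original post): predict the most frequent class for classification, or the mean target for regression.

```python
from collections import Counter

def zero_rule_classification(train_labels, n_test):
    # Predict the most frequent training class for every test row.
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

def zero_rule_regression(train_targets, n_test):
    # Predict the mean of the training targets for every test row.
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test

# Example: a 65/35 class split gives a Zero Rule accuracy of 65%.
labels = ["neg"] * 65 + ["pos"] * 35
preds = zero_rule_classification(labels, len(labels))
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

Any model you build should beat these scores; if it cannot, the model has learned nothing useful.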

### Standard Datasets

Below is a list of the 10 datasets we’ll cover.

Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets consist of tabular data with no (explicitly) missing values.

1. Swedish Auto Insurance Dataset.
2. Wine Quality Dataset.
3. Pima Indians Diabetes Dataset.
4. Sonar Dataset.
5. Banknote Dataset.
6. Iris Flowers Dataset.
7. Abalone Dataset.
8. Ionosphere Dataset.
9. Wheat Seeds Dataset.
10. Boston House Price Dataset.

## 1. Swedish Auto Insurance Dataset

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. It is comprised of 63 observations with one input variable and one output variable. The variable names are as follows:

1. Number of claims.
2. Total payment for all claims in thousands of Swedish Kronor.

The baseline performance of predicting the mean value is an RMSE of approximately 81 thousand Kronor.
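A simple least-squares line is a natural first model here. The sketch below uses the five rows commonly shown as the head of this file; treat the specific values as illustrative, and compare the line's RMSE against the mean-prediction baseline on the same rows.

```python
import numpy as np

# Illustrative sample: (number of claims, total payment in thousands of Kronor).
X = np.array([108, 19, 13, 124, 40], dtype=float)
y = np.array([392.5, 46.2, 15.7, 422.2, 119.4])

# Zero Rule regression baseline: predict the mean payment for every row.
baseline_rmse = np.sqrt(np.mean((y - y.mean()) ** 2))

# A simple least-squares line should do much better than the baseline.
slope, intercept = np.polyfit(X, y, 1)
preds = slope * X + intercept
model_rmse = np.sqrt(np.mean((y - preds) ** 2))
```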

A sample of the first 5 rows is listed below.

Below is a scatter plot of the entire dataset.

Swedish Auto Insurance Dataset

## 2. Wine Quality Dataset

The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.

It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:

1. Fixed acidity.
2. Volatile acidity.
3. Citric acid.
4. Residual sugar.
5. Chlorides.
6. Free sulfur dioxide.
7. Total sulfur dioxide.
8. Density.
9. pH.
10. Sulphates.
11. Alcohol.
12. Quality (score between 0 and 10).

The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.
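In general, the RMSE of predicting the mean equals the population standard deviation of the target, which you can check directly (the quality scores below are made up for illustration):

```python
import numpy as np

y = np.array([5, 6, 6, 7, 5, 8, 6, 5], dtype=float)  # illustrative quality scores
rmse_of_mean = np.sqrt(np.mean((y - y.mean()) ** 2))

# np.std uses the population formula (ddof=0) by default, so the two agree.
assert np.isclose(rmse_of_mean, np.std(y))
```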

A sample of the first 5 rows is listed below.

## 3. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

1. Number of times pregnant.
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
3. Diastolic blood pressure (mm Hg).
4. Triceps skinfold thickness (mm).
5. 2-Hour serum insulin (mu U/ml).
6. Body mass index (weight in kg/(height in m)^2).
7. Diabetes pedigree function.
8. Age (years).
9. Class variable (0 or 1).
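Because zeros encode missing values in several of the medical measurements, a common preparation step is to mark them as missing before imputing. A minimal pandas sketch, using the short column names often used for this file (they are a convention, not part of the raw data) and three illustrative rows:

```python
import numpy as np
import pandas as pd

# Toy rows in the dataset's column order; the real file has 768 rows.
df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
     [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1]],
    columns=["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"],
)

# A zero is physiologically impossible for these measurements, so treat it as missing.
cols_with_hidden_missing = ["plas", "pres", "skin", "test", "mass"]
df[cols_with_hidden_missing] = df[cols_with_hidden_missing].replace(0, np.nan)
n_missing = int(df[cols_with_hidden_missing].isna().sum().sum())
```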

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

## 4. Sonar Dataset

The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:

1. Sonar returns at different angles
2. Class (M for mine and R for rock)

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.

A sample of the first 5 rows is listed below.

## 5. Banknote Dataset

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

1. Variance of Wavelet Transformed image (continuous).
2. Skewness of Wavelet Transformed image (continuous).
3. Kurtosis of Wavelet Transformed image (continuous).
4. Entropy of image (continuous).
5. Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.

A sample of the first 5 rows is listed below.

## 6. Iris Flowers Dataset

The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.

It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

1. Sepal length in cm.
2. Sepal width in cm.
3. Petal length in cm.
4. Petal width in cm.
5. Class (Iris Setosa, Iris Versicolour, Iris Virginica).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33%, as the three classes are evenly balanced.
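With 50 observations per class, the Zero Rule baseline works out to exactly one third. You can verify this with the copy of the dataset bundled with scikit-learn:

```python
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()
labels = list(iris.target)

# Zero Rule: always predict the most frequent class (all three are tied at 50).
most_common_count = Counter(labels).most_common(1)[0][1]
baseline_accuracy = most_common_count / len(labels)  # 50 / 150
```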

A sample of the first 5 rows is listed below.

## 7. Abalone Dataset

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.

It is a multi-class classification problem, but can also be framed as a regression. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:

1. Sex (M, F, I).
2. Length.
3. Diameter.
4. Height.
5. Whole weight.
6. Shucked weight.
7. Viscera weight.
8. Shell weight.
9. Rings.

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.
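The regression framing works because age in years is conventionally taken as the ring count plus 1.5 (a convention noted in the dataset's documentation). A small sketch with illustrative ring counts:

```python
import numpy as np

rings = np.array([15, 7, 9, 10, 7])  # illustrative ring counts
age_years = rings + 1.5              # conventional rings-to-age conversion

# Framed as classification, each distinct ring count is its own class;
# framed as regression, the target is the continuous age (or the ring count itself).
```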

A sample of the first 5 rows is listed below.

## 8. Ionosphere Dataset

The Ionosphere Dataset involves predicting whether radar returns show structure in the ionosphere ("good") or not ("bad"), given radar returns targeting free electrons in the ionosphere.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:

1. 17 pairs of radar return data.
2. Class (g for good and b for bad).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.

A sample of the first 5 rows is listed below.

## 9. Wheat Seeds Dataset

The Wheat Seeds Dataset involves predicting the variety given measurements of seeds from three different varieties of wheat.

It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

1. Area.
2. Perimeter.
3. Compactness.
4. Length of kernel.
5. Width of kernel.
6. Asymmetry coefficient.
7. Length of kernel groove.
8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33%, as the three classes are evenly balanced.

A sample of the first 5 rows is listed below.

## 10. Boston House Price Dataset

The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:

1. CRIM: per capita crime rate by town.
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of nonretail business acres per town.
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. NOX: nitric oxides concentration (parts per 10 million).
6. RM: average number of rooms per dwelling.
7. AGE: proportion of owner-occupied units built prior to 1940.
8. DIS: weighted distances to five Boston employment centers.
9. RAD: index of accessibility to radial highways.
10. TAX: full-value property-tax rate per $10,000.
11. PTRATIO: pupil-teacher ratio by town.
12. B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
13. LSTAT: % lower status of the population.
14. MEDV: Median value of owner-occupied homes in $1000s.

The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.
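A convenient way to reproduce such mean-prediction baselines is scikit-learn's DummyRegressor. The sketch below uses stand-in random data (not the real Boston file) so it is self-contained; swap in your own X and y.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))     # stand-in for the 13 input variables
y = 22 + 9 * rng.normal(size=100)  # stand-in house prices in $1000s

# strategy="mean" is exactly the Zero Rule baseline for regression.
baseline = DummyRegressor(strategy="mean").fit(X, y)
rmse = np.sqrt(mean_squared_error(y, baseline.predict(X)))
```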

A sample of the first 5 rows is listed below.

## Summary

In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.

1. Pick one dataset.
2. Grab your favorite tool (like Weka, scikit-learn, or R).
3. See by how much you can beat the standard scores.
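The three steps above boil down to a short loop: split the data, fit a baseline and a model, and compare scores. A minimal scikit-learn sketch, shown on a contrived problem so the snippet is self-contained (swap in any of the datasets above):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Contrived stand-in for a real dataset; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Zero Rule baseline vs. a simple model on the same held-out split.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)
model_score = model.score(X_test, y_test)  # aim to beat the baseline
```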

### 65 Responses to 10 Standard Datasets for Practicing Applied Machine Learning

1. Benson March 24, 2017 at 1:47 am #

Thanks Jason. I will use these Datasets for practice.

• Jason Brownlee March 24, 2017 at 7:56 am #

Let me know how you go Benson.

2. Jayant Sahewal May 16, 2017 at 4:11 pm #

Format for Swedish Auto Insurance data has changed. It’s not in CSV format anymore and there are extra rows at the beginning of the data.

3. Ujjayant May 28, 2017 at 5:09 am #

Your posts have been a big help. Could you recommend a dataset which I can use to practice clustering and PCA on?

• Jason Brownlee June 2, 2017 at 12:07 pm #

Thanks.

Perhaps something where all features have the same units, like the iris flowers dataset?

4. Joe November 28, 2017 at 11:28 am #

Hello, in reference to the Swedish auto data, is it not possible to use Scikit-Learn to perform linear regression? I get deprecation errors that request that I reshape the data. When I reshape, I get the error that the samples are different sizes. What am I missing please. Thank you.

• Jason Brownlee November 29, 2017 at 8:12 am #

Sorry, I don’t know Joe. Perhaps try posting your code and errors to stackoverflow?

5. shivaprasad November 30, 2017 at 5:14 am #

sir for wheat dataset i got result like this

```
0.97619047619
[[ 9  0  1]
 [ 0 20  0]
 [ 0  0 12]]
             precision    recall  f1-score   support

        1.0       1.00      0.90      0.95        10
        2.0       1.00      1.00      1.00        20
        3.0       0.92      1.00      0.96        12

avg / total       0.98      0.98      0.98        42
```

is it correct sir?

• Jason Brownlee November 30, 2017 at 8:27 am #

What do you mean by correct?

• shivaprasad November 30, 2017 at 4:19 pm #

Sir ,the confusion matrix and the accuracy what i got, is it acceptable?is that right?

• Jason Brownlee December 1, 2017 at 7:24 am #

It really depends on the problem. Sorry, I don’t know the problem well enough, perhaps compare it to the confusion matrix of other algorithms.

• shivaprasad December 1, 2017 at 4:30 pm #

Thank you sir

6. shivaprasad December 1, 2017 at 4:31 pm #

I will do it

7. Aaron April 6, 2018 at 12:04 am #

Thank you very much.

8. Daniel April 25, 2018 at 11:48 pm #

Thanks for the datasets, they’re going to help me as I learn ML.

9. bibhu September 14, 2018 at 10:15 pm #

What is the difference between numeric and clinical cancer data, or are both the same? I need leukemia, lung, and colon datasets for my work. I also need an image dataset for my research. Where can I get the datasets?

• Jason Brownlee September 15, 2018 at 6:08 am #

10. Gui October 23, 2018 at 12:53 am #

Thanks a lot for sharing Jason!

I applied sklearn random forest and SVM classifiers to the wheat seed dataset in my very first Python notebook! 😀 The error oscillates between 10% and 20% from one execution to another. I can share it if anyone is interested.

Bye

• Jason Brownlee October 23, 2018 at 6:28 am #

Nice work!

• Rakhan January 12, 2020 at 8:32 pm #

I’m interested in the SVM classifier for the wheat seed dataset. You said you’re happy to share.

11. sak June 16, 2019 at 5:09 pm #

For the sonar dataset I got 90.47% accuracy.

12. Khuram October 16, 2019 at 5:25 pm #

Thanks Jason

I tried decision tree classifier with 70% training and 30% testing on Banknote dataset.
Achieved accuracy of 99%.

[Accuracy: 0.9902912621359223]

13. Charles Mwashashu November 6, 2019 at 8:59 pm #

Used a k-nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.973684 accuracy.

14. Charles Mwashashu November 6, 2019 at 11:32 pm #

Used a k-nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.9970845481049563 accuracy (99.71%).

15. Enrique December 5, 2019 at 10:27 am #

Do you have any of these solved that I can reference back to?

• Jason Brownlee December 5, 2019 at 1:19 pm #

Yes, I have solutions to most of them on the blog, you can try a blog search.

Hi, I used Support Vector Classifier and KNN classifier on the Wheat Seeds Dataset (80% train data, 20% test data )

Accuracy Score of SVC : 0.9047619047619048
Accuracy Score of KNN : 0.8809523809523809

17. Laynetrain January 18, 2020 at 5:55 am #

Hiya! Found some incredible topological trends in Iris that I am looking to replicate in another multi-class problem.

Are people typically classifying the gender of the species, or the ring number as a discrete output?

• Laynetrain January 18, 2020 at 5:56 am #

In the Abalone dataset*

• Jason Brownlee January 18, 2020 at 8:53 am #

The age is the target on that dataset, but you can frame any predictive modeling problem you like with the dataset for practice.

18. Muhammad Qamar Ijaz February 18, 2020 at 4:11 pm #

Hi sir, I am looking for a dataset on wheat production to use with the SVM regression algorithm. Please point me to a proper dataset for machine learning.

19. Lujaina April 1, 2020 at 9:27 am #

I need a data set that:

• Contains at least 5 dimensions/features, including at least one categorical and one numerical dimension.
• Contains a clear class label attribute (binary or multi-label).
• Is of a simple tabular structure (i.e., no time series, multimedia, etc.).
• Is of reasonable size, and contains at least 2K tuples.

20. Rishav Raj May 11, 2020 at 11:51 pm #

Where can I find the default result for the problems so I can compare with my result?

21. Sebastian May 20, 2020 at 7:57 am #

Thanks for the post – it is very helpful!

I would like to know if anyone knows about a classification dataset where the importance of the features with respect to the output classes is known. For example: Feature 1 is a good indicator for class 1, or Features 3, 4, 5 are good indicators for class 2, …

Hope anyone can help 😉

• Jason Brownlee May 20, 2020 at 1:33 pm #

You’re welcome.

Generally, we let the model discover the importance and how best to use input features.

• Sebastian May 21, 2020 at 9:18 pm #

Thank you very much for your answer. I was asking because I want to validate my approach to assessing feature importance via global sensitivity analysis (Sobol indices). In order to do so, I am searching for a dataset (or a dummy dataset) with the described properties.

• Jason Brownlee May 22, 2020 at 6:07 am #

What are “Sobol Indices”?

• Sebastian May 22, 2020 at 11:07 pm #

It’s a variance-based global sensitivity analysis (ANOVA). It is quite similar to permutation-importance ranking but can reveal cross-correlations of features by calculating the so-called “total effect index”. If you are further interested in the topic, I can recommend the following paper:

https://www.researchgate.net/publication/306326267_Global_Sensitivity_Estimates_for_Neural_Network_Classifiers

Some Python code for straightforward calculation of sobol indices is provided here:

Coming back to my first question: Do you know about a dataset with those properties or do you have any idea how I can build up a dummy dataset with known feature importance for each output?

• Jason Brownlee May 23, 2020 at 6:24 am #

Thanks.

Yes, you can contrive a dataset with relevant/irrelevant inputs via the make_classification() function. I use it all the time.

Beyond that, you will have to contrive your own problem I would expect. Feature importance is not objective!
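The make_classification() suggestion above can be sketched like this; the n_informative and n_redundant parameters control which inputs actually matter, giving a known ground truth for feature-importance experiments:

```python
from sklearn.datasets import make_classification

# 3 informative inputs, 2 redundant (linear combinations of them), 5 pure noise.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=3,
    n_redundant=2,
    shuffle=False,   # keep informative columns first, noise columns last
    random_state=7,
)
# With shuffle=False the first 3 columns are the informative ones, so any
# feature-importance method can be checked against this known ordering.
```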

22. takenira August 19, 2020 at 6:06 pm #

day4 lesson

code:-

```python
import pandas as pd

url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)  # load the CSV into a DataFrame
description = data.describe()
print(description)
```

output:-

```
             preg        plas        pres        skin        test        mass        pedi         age       class
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578    0.471876   33.240885    0.348958
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160    0.331329   11.760232    0.476951
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.078000   21.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000    0.243750   24.000000    0.000000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000    0.372500   29.000000    0.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000    0.626250   41.000000    1.000000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000    2.420000   81.000000    1.000000
```

The output does not fit properly in the comment section.

23. misterybodon February 9, 2021 at 12:51 pm #

Hi there,

thanks a lot for the post!

I’m quite a beginner and there is something I’m not sure about. I’ve fit the data with a straight line (first dataset), but how do we measure accuracy?
Should we leave out some data points and use them to test, or what?

• Jason Brownlee February 9, 2021 at 1:34 pm #

Good question, this may help:
https://machinelearningmastery.com/regression-metrics-for-machine-learning/

• imms February 10, 2021 at 3:57 pm #

Thanks, following the post, but with my own code:

```python
mean = np.mean(Y)
l = Y.shape[1]
res = Y - mean
rmse = np.sqrt(np.dot(res, res.T) / l)
```

I calculated the RMSE, and it yields

```
>>> rmse
array([[86.63170447]])
```

Also the values of wine quality have a max of 8 not 10, at least that’s what I get.

Tx for the help

• imms February 10, 2021 at 3:58 pm #

the rmse is for the swedish kr

24. estevao February 9, 2021 at 7:28 pm #

Would you please tell where 1.48 comes from with the wine dataset? I’ve calculated the mean squared error but it yields 0.034, using np.sqrt(np.mean(Y)/len(Y)).

25. mrnb February 10, 2021 at 5:31 pm #

With a linear model coded from scratch I got

```
naive estimation rmse: [[86.63170447]]
model rmse: [[35.44597361]]
```

I’ve also done a simple visual of the model’s evolution here:

https://imgur.com/1X7h7gC

26. Ashwin April 14, 2021 at 3:44 pm #

Thank you so much, Jason. Was really looking for these datasets for practice today.