10 Standard Datasets for Practicing Applied Machine Learning

The key to getting good at applied machine learning is practicing on lots of different datasets.

This is because each problem is different, requiring subtly different data preparation and modeling methods.

In this post, you will discover 10 top standard machine learning datasets that you can use for practice.

Let’s dive in.

  • Update Mar/2018: Added alternate link to download the Pima Indians and Boston Housing datasets as the originals appear to have been taken down.
  • Update Feb/2019: Minor update to the expected default RMSE for the insurance dataset.
  • Update Oct/2021: Minor update to the description of wheat seed dataset.

Overview

A Structured Approach

Each dataset is summarized in a consistent way. This makes the datasets easy to compare and navigate, so that you can pick one to practice a specific data preparation technique or modeling method.

The aspects that you need to know about each dataset are:

  1. Name: How to refer to the dataset.
  2. Problem Type: Whether the problem is regression or classification.
  3. Inputs and Outputs: The numbers and known names of input and output features.
  4. Performance: Baseline performance for comparison using the Zero Rule algorithm, as well as best known performance (if known).
  5. Sample: A snapshot of the first 5 rows of raw data.
  6. Links: Where you can download the dataset and learn more.
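
The Zero Rule baseline mentioned in item 4 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the exact procedure used to produce the scores in this post: the function names and the tiny lists of values are made up for the example, and no resampling (such as cross-validation) is shown.

```python
from collections import Counter
from math import sqrt


def zero_rule_classification(train_labels, test_labels):
    """Predict the most common training label for every test row; return accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)


def zero_rule_regression(train_targets, test_targets):
    """Predict the training mean for every test row; return the RMSE."""
    mean = sum(train_targets) / len(train_targets)
    squared_errors = [(y - mean) ** 2 for y in test_targets]
    return sqrt(sum(squared_errors) / len(test_targets))


# Made-up values purely for illustration (not taken from any dataset in this post).
labels = [0, 0, 0, 1, 1]
targets = [1.0, 2.0, 3.0, 4.0, 5.0]

print(zero_rule_classification(labels, labels))  # 3 of 5 rows match the majority class
print(zero_rule_regression(targets, targets))    # square root of the mean squared deviation
```

Any model worth keeping should do better than these two numbers on the same data split.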

Standard Datasets

Below is a list of the 10 datasets we’ll cover.

Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets consist of tabular data and have no explicitly marked missing values.

  1. Swedish Auto Insurance Dataset.
  2. Wine Quality Dataset.
  3. Pima Indians Diabetes Dataset.
  4. Sonar Dataset.
  5. Banknote Dataset.
  6. Iris Flowers Dataset.
  7. Abalone Dataset.
  8. Ionosphere Dataset.
  9. Wheat Seeds Dataset.
  10. Boston House Price Dataset.

1. Swedish Auto Insurance Dataset

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. There are 63 observations with 1 input variable and 1 output variable. The variable names are as follows:

  1. Number of claims.
  2. Total payment for all claims in thousands of Swedish Kronor.

The baseline performance of predicting the mean value is an RMSE of approximately 81 thousand Kronor.
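
For reference, the RMSE of a mean-prediction baseline can be written as follows. The symbols are introduced here for illustration: the y_i are the n observed outputs being evaluated, and ȳ is the mean of the outputs the baseline was fit on.

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}
\]
```

If the mean is computed on the same rows it is evaluated on, this is simply the population standard deviation of the output. With a train/test split or cross-validation the value will differ somewhat, so small differences from the figure quoted above are to be expected.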

A sample of the first 5 rows is listed below.

Below is a scatter plot of the entire dataset.

[Figure: Swedish Auto Insurance Dataset]

2. Wine Quality Dataset

The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.

It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and 1 output variable. The variable names are as follows:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).
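
Because the classes are not balanced, it can help to look at the class distribution before choosing a metric or a framing. The sketch below assumes the quality column has already been loaded into a Python list; the values shown are made up for illustration and are not the real distribution.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of rows}, ordered by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Made-up quality scores for illustration only.
quality = [5, 6, 6, 6, 7, 5, 6, 8, 6, 5]

for label, share in class_distribution(quality).items():
    print(f"quality={label}: {share:.0%}")
```

A heavily skewed distribution is one reason to consider treating the score as a number and framing the problem as regression.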

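The baseline performance of predicting the mean value is an RMSE of approximately 0.148 when the quality score is min-max normalized to the range 0 to 1, which corresponds to roughly 0.89 quality points on the raw scale.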

A sample of the first 5 rows is listed below.

3. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-Hour serum insulin (mu U/ml).
  6. Body mass index (weight in kg/(height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).
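
Since missing values are believed to be encoded as zeros, a first data preparation step might be to mark them explicitly. The sketch below is one way to do that in plain Python. Which columns treat zero as missing is an assumption made here (zero glucose, blood pressure, skinfold thickness, insulin or BMI is implausible, whereas zero pregnancies is valid), and the rows are hypothetical.

```python
# Column positions (0-based) where a zero is treated as "missing".
# This choice is an assumption for illustration, not something the dataset documents.
ZERO_MEANS_MISSING = [1, 2, 3, 4, 5]


def mark_missing(rows, columns=ZERO_MEANS_MISSING):
    """Return a copy of rows with zeros in the given columns replaced by None."""
    cleaned = []
    for row in rows:
        new_row = list(row)
        for col in columns:
            if new_row[col] == 0:
                new_row[col] = None
        cleaned.append(new_row)
    return cleaned


def count_missing(rows):
    """Count None values in each column."""
    n_cols = len(rows[0])
    return [sum(1 for row in rows if row[c] is None) for c in range(n_cols)]


# Hypothetical rows in the column order listed above (values are made up).
rows = [
    [2, 120.0, 70.0, 30.0, 0.0, 32.5, 0.45, 35, 0],
    [0, 95.0, 0.0, 0.0, 0.0, 28.1, 0.30, 24, 0],
    [5, 160.0, 80.0, 25.0, 130.0, 0.0, 0.60, 50, 1],
]

cleaned = mark_missing(rows)
print(count_missing(cleaned))
```

Once marked, the missing entries can be imputed or the affected rows dropped, depending on the method you want to practice.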

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

4. Sonar Dataset

The Sonar Dataset involves the prediction of whether or not an object is a mine or a rock given the strength of sonar returns at different angles.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:

  1. Sonar returns at different angles.
  2. Class (M for mine and R for rock).
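
Many algorithm implementations expect numeric class values, so the letter labels usually need to be mapped to integers first. A minimal sketch of that mapping is shown below; the function name and the short label list are made up for illustration.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Made-up label column for illustration.
classes = ["R", "M", "M", "R", "M"]

encoded, mapping = encode_labels(classes)
print(mapping)
print(encoded)
```

Keeping the mapping around makes it easy to translate predictions back to the original letters.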

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.

A sample of the first 5 rows is listed below.

5. Banknote Dataset

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Variance of Wavelet Transformed image (continuous).
  2. Skewness of Wavelet Transformed image (continuous).
  3. Kurtosis of Wavelet Transformed image (continuous).
  4. Entropy of image (continuous).
  5. Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.

A sample of the first 5 rows is listed below.

6. Iris Flowers Dataset

The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.

It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Sepal length in cm.
  2. Sepal width in cm.
  3. Petal length in cm.
  4. Petal width in cm.
  5. Class (Iris Setosa, Iris Versicolour, Iris Virginica).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 26%.

A sample of the first 5 rows is listed below.

7. Abalone Dataset

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.

It is a multi-class classification problem, but can also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:

  1. Sex (M, F, I).
  2. Length.
  3. Diameter.
  4. Height.
  5. Whole weight.
  6. Shucked weight.
  7. Viscera weight.
  8. Shell weight.
  9. Rings.
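
Sex is the only categorical input in this dataset, so it typically needs encoding before it can be used by algorithms that expect numbers. One common option is a one-hot encoding, sketched below. The helper names and the example row are hypothetical, and other encodings may work just as well.

```python
def one_hot(value, categories=("M", "F", "I")):
    """Return a list with 1 at the position of value and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_row(row):
    """Replace the leading Sex field with its one-hot encoding."""
    return one_hot(row[0]) + list(row[1:])


# Hypothetical row: Sex, seven numeric measures, then the ring count (values are made up).
row = ["F", 0.5, 0.4, 0.1, 0.6, 0.25, 0.15, 0.2, 9]

print(encode_row(row))
```

The encoded row has two more columns than the original, one per category in place of the single Sex field.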

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.

A sample of the first 5 rows is listed below.

8. Ionosphere Dataset

The Ionosphere Dataset requires the prediction of structure in the atmosphere given radar returns targeting free electrons in the ionosphere.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:

  1. 17 pairs of radar return data.
  2. Class (g for good and b for bad).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.

A sample of the first 5 rows is listed below.

9. Wheat Seeds Dataset

The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.

It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

  1. Area.
  2. Perimeter.
  3. Compactness.
  4. Length of kernel.
  5. Width of kernel.
  6. Asymmetry coefficient.
  7. Length of kernel groove.
  8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 28%.

A sample of the first 5 rows is listed below.

10. Boston House Price Dataset

The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:

  1. CRIM: per capita crime rate by town.
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town.
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. NOX: nitric oxides concentration (parts per 10 million).
  6. RM: average number of rooms per dwelling.
  7. AGE: proportion of owner-occupied units built prior to 1940.
  8. DIS: weighted distances to five Boston employment centers.
  9. RAD: index of accessibility to radial highways.
  10. TAX: full-value property-tax rate per $10,000.
  11. PTRATIO: pupil-teacher ratio by town.
  12. B: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.
  13. LSTAT: % lower status of the population.
  14. MEDV: Median value of owner-occupied homes in $1000s.

The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.

A sample of the first 5 rows is listed below.

Summary

In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.

Here is your next step:

  1. Pick one dataset.
  2. Grab your favorite tool (like Weka, scikit-learn or R).
  3. See how much you can beat the standard scores.
  4. Report your results in the comments below.
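
As a rough sketch of step 3, the snippet below holds out part of the data, fits a simple straight-line model on the rest, and compares its RMSE with the mean-prediction baseline on the same held-out rows. Everything here is an assumption made for illustration: the data is a made-up noisy line, the 70/30 split and seed are arbitrary, and the helper names are hypothetical.

```python
import random
from math import sqrt


def train_test_split(rows, test_fraction=0.3, seed=1):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Least-squares fit of y = b0 + b1 * x for a list of (x, y) pairs."""
    mean_x = sum(x for x, _ in train) / len(train)
    mean_y = sum(y for _, y in train) / len(train)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in train)
    variance = sum((x - mean_x) ** 2 for x, _ in train)
    b1 = covariance / variance
    return mean_y - b1 * mean_x, b1


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up data: a noisy straight line, not any dataset from this post.
rows = [(float(x), 3.0 * x + 5.0 + (-1) ** x) for x in range(20)]
train, test = train_test_split(rows)

actual = [y for _, y in test]

# Baseline: predict the training mean for every held-out row.
train_mean = sum(y for _, y in train) / len(train)
baseline_rmse = rmse(actual, [train_mean] * len(test))

# Model: straight line fit on the training rows only.
b0, b1 = fit_line(train)
model_rmse = rmse(actual, [b0 + b1 * x for x, _ in test])

print(f"baseline RMSE: {baseline_rmse:.2f}")
print(f"model RMSE:    {model_rmse:.2f}")
```

The same pattern applies to any of the datasets above: swap in the real rows and your own model, and keep the baseline computed on the same split.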

72 Responses to 10 Standard Datasets for Practicing Applied Machine Learning

  1. Benson March 24, 2017 at 1:47 am

    Thanks Jason. I will use these Datasets for practice.

    • Jason Brownlee March 24, 2017 at 7:56 am

      Let me know how you go Benson.

    • Shayan December 21, 2022 at 4:51 am

      Thanks. Please mention some datasets that have more than one output variable.

  2. Jayant Sahewal May 16, 2017 at 4:11 pm

    Format for Swedish Auto Insurance data has changed. It’s not in CSV format anymore and there are extra rows at the beginning of the data

  3. Ujjayant May 28, 2017 at 5:09 am

    Your posts have been a big help. Could you recommend a dataset that I can use to practice clustering and PCA on?

    • Jason Brownlee June 2, 2017 at 12:07 pm

      Thanks.

      Perhaps something where all features have the same units, like the iris flowers dataset?

  4. Joe November 28, 2017 at 11:28 am

    Hello, in reference to the Swedish auto data, is it not possible to use Scikit-Learn to perform linear regression? I get deprecation errors that request that I reshape the data. When I reshape, I get the error that the samples are different sizes. What am I missing, please? Thank you.

    • Jason Brownlee November 29, 2017 at 8:12 am

      Sorry, I don’t know Joe. Perhaps try posting your code and errors to stackoverflow?

  5. shivaprasad November 30, 2017 at 5:14 am

    Sir, for the wheat dataset I got a result like this:

    0.97619047619
    [[ 9  0  1]
     [ 0 20  0]
     [ 0  0 12]]
                 precision    recall  f1-score   support

            1.0       1.00      0.90      0.95        10
            2.0       1.00      1.00      1.00        20
            3.0       0.92      1.00      0.96        12

    avg / total       0.98      0.98      0.98        42

    Is it correct, sir?

    • Jason Brownlee November 30, 2017 at 8:27 am

      What do you mean by correct?

      • shivaprasad November 30, 2017 at 4:19 pm

        Sir, the confusion matrix and the accuracy that I got, are they acceptable? Is that right?

        • Jason Brownlee December 1, 2017 at 7:24 am

          It really depends on the problem. Sorry, I don’t know the problem well enough, perhaps compare it to the confusion matrix of other algorithms.

          • shivaprasad December 1, 2017 at 4:30 pm

            Thank you sir

  6. shivaprasad December 1, 2017 at 4:31 pm

    I will do it

  7. Aaron April 6, 2018 at 12:04 am

    Thank you very much.

  8. Daniel April 25, 2018 at 11:48 pm

    Thanks for the datasets. They are going to help me as I learn ML.

  9. bibhu September 14, 2018 at 10:15 pm

    What is the difference between numeric and clinical cancer datasets, or are they the same? I need leukemia, lung and colon datasets for my work. I also need an image dataset for my research. Where can I get these datasets?

  10. Gui October 23, 2018 at 12:53 am

    Thanks a lot for sharing, Jason!

    I applied sklearn random forest and SVM classifiers to the wheat seed dataset in my very first Python notebook! 😀 The error oscillates between 10% and 20% from one execution to another. I can share it if anyone is interested.

    Bye

    • Jason Brownlee October 23, 2018 at 6:28 am

      Nice work!

    • Rakhan January 12, 2020 at 8:32 pm

      I’m interested in the SVM classifier for the wheat seed dataset. You said you’re happy to share.

  11. sak June 16, 2019 at 5:09 pm

    for sonar dataset got 90.47% accuracy

  12. Khuram October 16, 2019 at 5:25 pm

    Thanks Jason

    I tried decision tree classifier with 70% training and 30% testing on Banknote dataset.
    Achieved accuracy of 99%.

    [Accuracy: 0.9902912621359223]

  13. Charles Mwashashu November 6, 2019 at 8:59 pm

    used k- nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.973684 accuracy.

  14. Charles Mwashashu November 6, 2019 at 11:32 pm

    used k- nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.9970845481049563 accuracy.
    99.71%

  15. Enrique December 5, 2019 at 10:27 am

    Do you have any of these solved that I can reference back to?

    • Jason Brownlee December 5, 2019 at 1:19 pm

      Yes, I have solutions to most of them on the blog, you can try a blog search.

  16. Asad Asadzade December 21, 2019 at 5:46 am

    Hi, I used Support Vector Classifier and KNN classifier on the Wheat Seeds Dataset (80% train data, 20% test data )

    Accuracy Score of SVC : 0.9047619047619048
    Accuracy Score of KNN : 0.8809523809523809

  17. Laynetrain January 18, 2020 at 5:55 am

    Hiya! Found some incredible topological trends in Iris that I am looking to replicate in another multi-class problem.

    Are people typically classifying the gender of the species, or the ring number as a discrete output?

    • Laynetrain January 18, 2020 at 5:56 am

      In the Abalone dataset*

    • Jason Brownlee January 18, 2020 at 8:53 am

      The age is the target on that dataset, but you can frame any predictive modeling problem you like with the dataset for practice.

  18. Muhammad Qamar Ijaz February 18, 2020 at 4:11 pm

    Hi sir, I am looking for a dataset on wheat production to use with an SVM regression algorithm. Please suggest a suitable dataset.

  19. Lujaina April 1, 2020 at 9:27 am

    I need a dataset that:
    • Contains at least 5 dimensions/features, including at least one categorical and one numerical dimension.
    • Contains a clear class label attribute (binary or multi-label).
    • Has a simple tabular structure (i.e., no time series, multimedia, etc.).
    • Is of reasonable size and contains at least 2K tuples.

  20. Rishav Raj May 11, 2020 at 11:51 pm

    Where can I find the default result for the problems so I can compare with my result?

  21. Sebastian May 20, 2020 at 7:57 am

    Thanks for the post – it is very helpful!

    I would like to know if anyone knows about a classification-dataset, where the importances for the features regarding the output classes is known. For example: Feature 1 is a good indicator for class 1, or Feature 3,4,5 are good indicators for class 2, …

    Hope anyone can help 😉

    • Jason Brownlee May 20, 2020 at 1:33 pm

      You’re welcome.

      Generally, we let the model discover the importance and how best to use input features.

      • Sebastian May 21, 2020 at 9:18 pm

        Thank you very much for your answer. I was asking because I want to validate my approach to assess feature importance via global sensitivity analysis (Sobol indices). In order to do so, I am searching for a dataset (or a dummy dataset) with the described properties.

        • Jason Brownlee May 22, 2020 at 6:07 am

          What are “Sobol Indices”?

          • Sebastian May 22, 2020 at 11:07 pm

            It’s a variance-based global sensitivity analysis (ANOVA). It is quite similar to permutation-importance ranking but can reveal cross-correlations of features by calculating the so-called “total effect index”. If you are further interested in the topic, I can recommend the following paper:

            https://www.researchgate.net/publication/306326267_Global_Sensitivity_Estimates_for_Neural_Network_Classifiers

            Some Python code for straightforward calculation of sobol indices is provided here:

            https://salib.readthedocs.io/en/latest/api.html#sobol-sensitivity-analysis

            Coming back to my first question: Do you know about a dataset with those properties or do you have any idea how I can build up a dummy dataset with known feature importance for each output?

          • Jason Brownlee May 23, 2020 at 6:24 am

            Thanks.

            Yes, you can contrive a dataset with relevant/irrelevant inputs via the make_classification() function. I use it all the time.

            Beyond that, you will have to contrive your own problem I would expect. Feature importance is not objective!

  22. takenira August 19, 2020 at 6:06 pm

    day4 lesson

    code:-

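    import pandas as pd

    # Load the Pima Indians Diabetes data; the file has no header row, so column names are supplied.
    url = "https://goo.gl/bDdBiA"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    data = pd.read_csv(url, names=names)
    # Summarize each column (count, mean, std, min, quartiles, max).
    description = data.describe()
    print(description)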

    output:-
                 preg        plas        pres        skin        test        mass        pedi         age       class
    count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
    mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578    0.471876   33.240885    0.348958
    std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160    0.331329   11.760232    0.476951
    min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.078000   21.000000    0.000000
    25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000    0.243750   24.000000    0.000000
    50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000    0.372500   29.000000    0.000000
    75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000    0.626250   41.000000    1.000000
    max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000    2.420000   81.000000    1.000000

    The output not properly fit in comment section

  23. misterybodon February 9, 2021 at 12:51 pm

    Hi there,

    thanks a lot for the post!

    I’m quite a beginner and there is something I’m not sure about. I’ve fit the data with a straight line (first dataset), but how do we measure accuracy?
    Should we leave out some data points and use them to test, or what?

      • imms February 10, 2021 at 3:57 pm

        Thanks, following the post, but with my own code:

        mean = np.mean(Y)
        l = Y.shape[1]
        res = Y-mean
        rmse = np.sqrt(np.dot(res,res.T)/l)

        I calculated the rmse, and it yields

        python
        >>> rmse
        array([[86.63170447]])

        Also the values of wine quality have a max of 8 not 10, at least that’s what I get.

        Tx for the help

        • imms February 10, 2021 at 3:58 pm

          the rmse is for the swedish kr

  24. estevao February 9, 2021 at 7:28 pm

    Would you please tell me where 1.48 comes from in the wine dataset? I’ve calculated the mean squared error but it yields 0.034, using np.sqrt(np.mean(Y)/len(Y))

    • estevao February 9, 2021 at 7:44 pm

      or is this the sqrt(sum^m (y-y_i)*(y-y_i)/m) ?

      • Jason Brownlee February 10, 2021 at 8:05 am

        It is probably the RMSE of a model that predicts the mean value from the training dataset.

        The calculation for RMSE is here:
        https://machinelearningmastery.com/regression-metrics-for-machine-learning/

        • AMIT VISHNU BHISE February 16, 2022 at 3:19 am

          Hi Jason,

          The Wine quality dataset poses a multi-class classification problem. How could we have RMSE as a metric? Is that a mistake or am I missing something?

          • James Carmichael February 16, 2022 at 11:28 am

            Hi Amit…yes the dataset can be utilized for classification, however in order to get that point the RMSE can be used to determine how accurate the predictions are based upon comparing averages of each quantity represented in the features.

  25. mrnb February 10, 2021 at 5:31 pm

    With a linear model coded from scratch got


    naive estimation rmse: [[86.63170447]]
    model rmse: [[35.44597361]]

    I’ve also done a simple visual of the model’s evolution here:

    https://imgur.com/1X7h7gC

  26. Ashwin April 14, 2021 at 3:44 pm

    Thank you so much, Jason. Was really looking for these datasets for practice today.

  27. Egor October 20, 2021 at 3:34 am

    Thanks a lot! I’ll use some of these for practice.
    Btw, the Wheat Seeds Dataset section says that it is a binary classification problem; however, 3 classes are given. Is this a mistake or something? Thank you

    • Adrian Tam October 20, 2021 at 10:25 am

      Yes, that was a mistake. Thanks for pointing out.

    • James Carmichael August 17, 2022 at 6:03 am

      Thank you for the feedback Yao!
