It is a good idea to have small, well-understood datasets on hand when you are getting started in machine learning or learning a new tool.
The Weka machine learning workbench provides a directory of small, well-understood datasets in its installation directory.
In this post you will discover some of these datasets distributed with Weka, their details and where to learn more about them.
We will focus on a handful of datasets of differing types. After reading this post you will know:
- Where the sample datasets are located or where to download them afresh if you need them.
- Specific standard datasets you can use to explore different aspects of classification and regression predictive models.
- Where to go for more information about specific datasets and state of the art results.
Let’s get started.
Standard Weka Datasets
An installation of the open source Weka machine learning workbench includes a data/ directory full of standard machine learning problems.
This is very useful when you are getting started in machine learning or learning the Weka platform. It provides standard machine learning datasets for common classification and regression problems.
All datasets are in the Weka native ARFF file format and can be loaded directly into Weka, meaning you can start developing practice models immediately.
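If you are curious what is inside an ARFF file before loading it into Weka, a short script can help. The sketch below is in Python rather than Weka itself, and the read_arff_header helper and the toy file contents are hypothetical, made up for illustration. It reads only the header section (the @relation and @attribute lines) and assumes attribute names contain no spaces.

```python
import io


def read_arff_header(lines):
    """Return (relation_name, [(attribute_name, attribute_type), ...]).

    Minimal sketch: ignores comments and blank lines, stops at @data,
    and does not handle quoted attribute names that contain spaces.
    """
    relation = None
    attributes = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind.strip()))
        elif keyword == "@data":
            break  # the header ends where the data rows begin
    return relation, attributes


# Hypothetical toy file, not one of the datasets shipped with Weka.
SAMPLE = """\
% a made-up example
@relation toy
@attribute width numeric
@attribute colour {red,green}
@attribute class {yes,no}
@data
1.5,red,yes
2.0,green,no
"""

relation, attributes = read_arff_header(io.StringIO(SAMPLE))
print(relation, len(attributes))
```

To try it on a real file, you could pass an open file handle instead, for example open("data/iris.arff"), adjusting the path to wherever your data/ directory lives.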
There are some special distributions of Weka that may not include the data/ directory. If you have installed one of these distributions, you can download the .zip distribution of Weka, unzip it and copy the data/ directory to somewhere that you can access easily from Weka.
There are many datasets to play with in the data/ directory. In the following sections I will point out a few that you can focus on for practicing and investigating predictive modeling problems.
Binary Classification Datasets
Binary classification is where the output variable to be predicted is nominal and has exactly two classes.
This is perhaps the best-studied type of predictive modeling problem and a good type of problem to start with.
There are three standard binary classification problems in the data/ directory that you can focus on:
- Pima Indians Onset of Diabetes: (diabetes.arff) Each instance represents medical details for one patient and the task is to predict whether the patient will have an onset of diabetes within the next five years. There are 8 numerical input variables all of which have varying scales. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 77% accuracy.
- Breast Cancer: (breast-cancer.arff) Each instance represents medical details of a patient and samples of their tumor tissue and the task is to predict whether or not the patient's breast cancer will recur. There are 9 input variables, all of which are nominal. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 75% accuracy.
- Ionosphere: (ionosphere.arff) Each instance describes the properties of radar returns from the atmosphere and the task is to predict whether or not there is structure in the ionosphere. There are 34 numerical input variables of generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 98% accuracy.
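Accuracy figures like these are easier to judge next to a simple baseline: always predicting the most common class, which is what Weka's ZeroR classifier does. The Python sketch below illustrates the idea. The majority_baseline_accuracy helper and the label list are made up for illustration and are not taken from any of the datasets above.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return (most_common_label, accuracy of always predicting it)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up class labels for a hypothetical binary problem.
labels = ["neg"] * 6 + ["pos"] * 4

label, accuracy = majority_baseline_accuracy(labels)
print(label, accuracy)
```

A model is only interesting if it beats this number, so it is worth checking the class balance of each dataset before comparing results.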
Multi-Class Classification Datasets
Many classification problems have an output variable with more than two classes. These are called multi-class classification problems.
This is a good type of problem to look at after you have some confidence with binary classification.
Three standard multi-class classification problems in the data/ directory that you can focus on are:
- Iris Flowers Classification: (iris.arff) Each instance describes measurements of an iris flower and the task is to predict to which of 3 iris species the observation belongs. There are 4 numerical input variables with the same units and generally the same scale. You can learn more about this dataset on the UCI Machine Learning Repository. Top results are in the order of 96% accuracy.
- Large Soybean Database: (soybean.arff) Each instance describes properties of a crop of soybeans and the task is to predict which of the 19 diseases the crop suffers from. There are 35 nominal input variables. You can learn more about this dataset on the UCI Machine Learning Repository.
- Glass Identification: (glass.arff) Each instance describes the chemical composition of a sample of glass and the task is to predict the type or use of the glass from one of 7 classes. There are 10 numeric attributes that describe the chemical properties of the glass and its refractive index. You can learn more about this dataset on the UCI Machine Learning Repository.
Regression Datasets
Regression problems are those where you must predict a real-valued output.
The selection of regression problems in the data/ directory is small, yet regression is an important class of predictive modeling problem. As such, I recommend downloading the free add-on pack of regression problems collected from the UCI Machine Learning Repository.
It is available from the datasets page on the Weka website, where it is the first item in the list:
- A jar file containing 37 regression problems, obtained from various sources (datasets-numeric.jar)
It is a .jar file, which is a type of compressed Java archive. You should be able to unzip it with most modern unzip programs.
If you have Java installed (which you very likely do, since Weka needs it), you can also unzip the .jar file manually on the command line using the following command in the directory where the jar was downloaded:
jar -xvf datasets-numeric.jar
Unzipping the file will create a new directory called numeric that contains 37 regression datasets in Weka's native ARFF format.
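Because a .jar file is an ordinary zip archive, you can also unpack it from a script. The following is a minimal sketch in Python using the standard zipfile module. The extract_arff_files helper is a hypothetical name, and the script assumes datasets-numeric.jar sits in the current directory.

```python
import zipfile
from pathlib import Path


def extract_arff_files(jar_path, target_dir="."):
    """Extract a .jar (a zip archive) and return the sorted .arff paths it contained."""
    with zipfile.ZipFile(jar_path) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()
    return sorted(
        Path(target_dir) / name
        for name in names
        if name.lower().endswith(".arff")
    )


if __name__ == "__main__":
    jar = Path("datasets-numeric.jar")  # assumed download location
    if jar.exists():
        for path in extract_arff_files(jar):
            print(path)
    else:
        print("datasets-numeric.jar not found in the current directory")
```

Listing the extracted .arff files is a quick way to confirm that the numeric directory was created as expected.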
Three regression datasets in the numeric/ directory that you can focus on are:
- Longley Economic Dataset: (longley.arff) Each instance describes the gross economic properties of a nation for a given year and the task is to predict the number of people employed as an integer. There are 6 numeric input variables of varying scales.
- Boston House Price Dataset: (housing.arff) Each instance describes the properties of a Boston suburb and the task is to predict the house prices in thousands of dollars. There are 13 numerical input variables with varying scales describing the properties of suburbs. You can learn more about this dataset on the UCI Machine Learning Repository.
- Sleep in Mammals Dataset: (sleep.arff) Each instance describes the properties of different mammals and the task is to predict the number of hours of total sleep they require on average. There are 7 numeric input variables of different scales and measures.
Summary
In this post you discovered the standard machine learning datasets distributed with the Weka machine learning platform.
Specifically, you learned:
- Three popular binary classification problems you can use for practice: diabetes, breast-cancer and ionosphere.
- Three popular multi-class classification problems you can use for practice: iris, soybean and glass.
- Three popular regression problems you can use for practice: longley, housing and sleep.
Do you have any questions about standard machine learning datasets in Weka or about this post? Ask your questions in the comments and I will do my best to answer.