BigML Tutorial: Develop Your First Decision Tree and Make Predictions

BigML is a new and interesting machine-learning-as-a-service company based in Corvallis, Oregon, USA.

In a previous post, we reviewed the BigML service, its key features and the ways in which you could use it in your business, on your side project or to present to clients. In this tutorial we will walk step by step through developing a predictive model on the BigML platform and using it to make predictions on data that was not used to create the model. The model will be a decision tree.

You can follow along by signing up for a free trial BigML account. Configure your account to “development mode” and you will not require any credits to complete the tasks in this tutorial.

Iris Species Classification Problem

For this tutorial we will use the well-studied Iris flower dataset. The dataset consists of 150 instances that describe the measurements of iris flowers, each of which is classified as one of three species of iris. The attributes are numeric and the problem is a multi-class classification problem.

Sample of the Iris flower dataset, screenshot from Wikipedia

You can read more about this problem on the Wikipedia page and download the data from the Iris page on the UCI Machine Learning Repository.

1. Load Data and Create Dataset

In this section you will prepare your data source and dataset for use in BigML.

1.1. Create the Data Source

First we need to create a data source. This is the raw data from which we can create datasets, or views of the raw data.

  1. Log in to your BigML account.
  2. Click on the “Dashboard” button to go to your BigML dashboard.
  3. Click on the “Source” tab to list all data sources for your account.
  4. Click on the “Link” button to specify a remote data file.
  5. Enter the URL (http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) and description (“Iris flower data source”) for the Iris flower dataset on the UCI Machine Learning Repository.
  6. Click the “Create” button to create the new data source.
  7. Click on the “Iris flower data source” to review it.

BigML Data Source

You will note that the attribute data types have correctly been identified as numeric and that the class label is the last attribute (field5).
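To make the field summary concrete, here is a small Python sketch of how a headerless file such as iris.data can be given positional field names and checked for numeric columns. The three rows follow the format of the UCI file. The `infer_type` helper is a hypothetical illustration and is not BigML's actual type-detection logic.

```python
import csv
import io

# Three rows in the format of the UCI iris.data file (no header line).
RAW = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def infer_type(values):
    """Return 'numeric' if every value parses as a float, else 'categorical'."""
    try:
        for value in values:
            float(value)
        return "numeric"
    except ValueError:
        return "categorical"


rows = list(csv.reader(io.StringIO(RAW)))
# A file without a header gets positional names: field1, field2, ...
names = [f"field{i + 1}" for i in range(len(rows[0]))]
columns = list(zip(*rows))
field_types = {name: infer_type(col) for name, col in zip(names, columns)}

for name, kind in field_types.items():
    print(name, kind)
```

In this sketch the first four fields come out as numeric and field5, the species label, as categorical, which matches what the data source view shows.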

1.2. Create the Dataset

Now that we have a data source with the raw data loaded, we can create a dataset view of that data source. We will then create two more datasets: one called training that we will use to train a predictive model, and a second called test that we will use to evaluate the model and as the basis for making predictions.

  1. Click on the “Iris flower data source” in the “Sources” tab to open it, if not already open.
  2. Click on the cloud button and select “One-click Dataset”.
  3. This will create a new dataset from the data source. This is a view onto the data source that can be modified in preparation for modeling.
  4. Click or hover on the small sigma to see a summary of a given attribute.
  5. Click the cloud button and select “1 Click Training | Test”.
  6. Click on the “Datasets” tab and review the 3 datasets that we have created.

BigML Dataset
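The idea behind the training and test split can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple seeded shuffle. The `split_rows` function and its parameters are hypothetical, and BigML's own sampling may work differently and produce slightly different counts.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows deterministically and return (training, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(150))  # stand-ins for the 150 Iris instances
training, test = split_rows(rows)
print(len(training), len(test))
```

The key property is that the two parts do not overlap, so the test rows are genuinely unseen when the model is later evaluated.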

2. Create and Evaluate Model

In this section we will create a predictive model from our prepared training dataset and evaluate the model using our prepared test dataset.

2.1. Create Predictive Model

Now you will create a predictive model from the training dataset.

  1. Click on the “Iris flower data source’s dataset | Training (80%)” dataset in the “Datasets” tab.
  2. Click the cloud icon and select “1-Click Model”.
  3. Hover on different nodes in the model to review the flow of the data through the decision tree.
  4. Click the “Sunburst” button to open the sunburst view of the model and explore the decision tree.
  5. Click the “Model Summary Report” button to review a text description of the rules derived from the decision tree model.

BigML Predictive Model
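Following an instance through the tree, as you do when hovering over nodes, can be sketched in Python as below. The tree is hand-written for illustration: its split fields and thresholds are assumptions, not the values BigML learns, and the names `petal_length` and `petal_width` stand in for field3 and field4 of the headerless file.

```python
# Internal nodes test one field against a threshold; leaves carry a label.
# The thresholds below are illustrative assumptions, not BigML's output.
TREE = {
    "field": "petal_length",
    "threshold": 2.45,
    "left": {"label": "Iris-setosa"},
    "right": {
        "field": "petal_width",
        "threshold": 1.75,
        "left": {"label": "Iris-versicolor"},
        "right": {"label": "Iris-virginica"},
    },
}


def predict(node, instance):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "label" not in node:
        go_left = instance[node["field"]] <= node["threshold"]
        node = node["left" if go_left else "right"]
    return node["label"]


sample = {"petal_length": 4.7, "petal_width": 1.4}
print(predict(TREE, sample))
```

Each path from the root to a leaf corresponds to one rule of the kind listed in the model summary report.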

2.2. Evaluate Predictive Model

Now you will evaluate the accuracy of the predictive model you created, using the test dataset.

  1. Click on the iris flower model in the “Models” tab.
  2. Click the cloud button and select “Evaluate”.
  3. The evaluation will automatically select the test dataset that you created previously, which contains the 20% of the original dataset that the predictive model has not seen before.
  4. Click the “Evaluate” button to evaluate the model.
  5. The accuracy of the model is summarized in terms of classification accuracy, precision, recall, F-score and phi score. We can see the accuracy is 93.33%.
  6. Click the “Confusion Matrix” to review the confusion matrix for the model predictions.

BigML Evaluate Predictive Model, Showing Confusion Matrix
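The measures in the evaluation can be reproduced by hand from lists of actual and predicted labels. The Python sketch below uses made-up labels: a hypothetical 30-row test set with two mistakes, chosen only because 28 correct out of 30 gives the same 93.33% accuracy. The real confusion matrix from your evaluation may look different.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def precision_recall(actual, predicted, label):
    """Precision and recall for a single class label."""
    matrix = confusion_matrix(actual, predicted)
    true_positives = matrix[(label, label)]
    predicted_as = sum(n for (_, p), n in matrix.items() if p == label)
    actually = sum(n for (a, _), n in matrix.items() if a == label)
    precision = true_positives / predicted_as if predicted_as else 0.0
    recall = true_positives / actually if actually else 0.0
    return precision, recall


# Hypothetical test labels: 10 of each species, with two mistakes.
actual = ["setosa"] * 10 + ["versicolor"] * 10 + ["virginica"] * 10
predicted = list(actual)
predicted[10] = "virginica"   # one versicolor predicted as virginica
predicted[20] = "versicolor"  # one virginica predicted as versicolor

print(f"accuracy: {accuracy(actual, predicted):.2%}")
print("versicolor precision, recall:",
      precision_recall(actual, predicted, "versicolor"))
```

The off-diagonal entries of the confusion matrix are the pairs where the actual and predicted labels differ, which is where the two mistakes in this example show up.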

3. Make Predictions

Now you will make predictions using the predictive model on data that the model has not seen before.

  1. Click on the iris flower model in the “Models” tab.
  2. Click on the cloud button and select “Batch Prediction”.
  3. Click on the “Search dataset …” drop down and type “iris”.
  4. Select the “Iris flower data source’s dataset | Test 20%” dataset.
  5. Click the “Predict” button.
  6. Click “Download batch prediction” to download a file with the predictions for each row in the test dataset.

BigML Download Model Predictions

Summary

In this tutorial you have learned how to create a data source and a dataset, create a predictive model, evaluate it and finally make predictions on unseen data using the prepared predictive model. BigML is an easy-to-use platform and you should be able to do all of this in 5-10 minutes flat.

From here, there are a number of things you could do to extend this tutorial:

  • You could create a new decision tree with a different pruning method and compare its evaluation to the decision tree that you have already created to see if it is more accurate.
  • You could use ensembles of decision trees to model the problem and see if you can improve the classification accuracy by comparing the evaluation of the ensembles to the evaluation of the decision tree you already created.
  • You could write a script or use the BigML command line tool (called bigmler) to make predictions with new data as it becomes available.
  • You could use the BigML API to integrate use of the remote model into a webpage and make predictions for new data automatically as it is made available from another source.
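As a starting point for the scripting ideas above, here is a rough Python sketch of the same workflow using the BigML Python bindings. It assumes the `bigml` package is installed and that your credentials are set in the `BIGML_USERNAME` and `BIGML_API_KEY` environment variables. The function name `run_workflow`, the seed string and the output file name are made up for illustration, and you should check the calls against the current bindings documentation before relying on them.

```python
import os

IRIS_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"


def run_workflow(output_path="iris_predictions.csv"):
    """Source, dataset, 80/20 split, model, evaluation, batch prediction."""
    # Imported here so this file still loads when the package is missing.
    from bigml.api import BigML

    api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment

    source = api.create_source(IRIS_URL, {"name": "Iris flower data source"})
    api.ok(source)  # wait until the resource has finished building
    dataset = api.create_dataset(source)
    api.ok(dataset)

    # Same rate and seed, with out_of_bag toggled, gives complementary samples.
    split = {"sample_rate": 0.8, "seed": "iris-tutorial"}
    training = api.create_dataset(dataset, dict(split, name="Training"))
    test = api.create_dataset(dataset, dict(split, out_of_bag=True, name="Test"))
    api.ok(training)
    api.ok(test)

    model = api.create_model(training)
    api.ok(model)
    evaluation = api.create_evaluation(model, test)
    api.ok(evaluation)

    batch = api.create_batch_prediction(model, test)
    api.ok(batch)
    api.download_batch_prediction(batch, filename=output_path)
    return evaluation


# Only contact BigML when credentials are actually configured.
if __name__ == "__main__" and os.environ.get("BIGML_API_KEY"):
    run_workflow()
```

Each call mirrors one of the dashboard steps in this tutorial, so the script can be read side by side with the sections above.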

If you have an idea for an interesting tutorial using BigML, please leave a comment. Let your imagination go wild.
