A Gentle Introduction to XGBoost for Applied Machine Learning

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

In this post you will discover XGBoost and get a gentle introduction to what it is, where it came from, and how you can learn more.

After reading this post you will know:

  • What XGBoost is and the goals of the project.
  • Why XGBoost must be a part of your machine learning toolkit.
  • Where you can learn more to start using XGBoost on your next machine learning project.

Let’s get started.

Photo by Sigfrid Lundberg, some rights reserved.


What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting.

The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.

— Tianqi Chen, in answer to the question “What is the difference between the R gbm (gradient boosting machine) and xgboost (extreme gradient boosting)?” on Quora

It is an implementation of gradient boosting machines created by Tianqi Chen, now with contributions from many developers. It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community (DMLC), which also created the popular mxnet deep learning library.

Tianqi Chen provides a brief and interesting back story on the creation of XGBoost in the post Story and Lessons Behind the Evolution of XGBoost.

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:

  • Command Line Interface (CLI).
  • C++ (the language in which the library is written).
  • Python interface, including a scikit-learn compatible wrapper.
  • R interface, which can also be used as a model in the caret package.
  • Julia.
  • Java and JVM languages like Scala and platforms like Hadoop.

XGBoost Features

The library is laser-focused on computational speed and model performance; as such, there are few frills. Nevertheless, it does offer a number of advanced features.

Model Features

The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported (see the parameter sketch after this list):

  • Gradient Boosting algorithm, also called gradient boosting machine, including the learning rate.
  • Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
  • Regularized Gradient Boosting with both L1 and L2 regularization.
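
For example, here is a minimal sketch (not from the original post) of how these three forms map onto parameters of the scikit-learn style XGBoost API. The parameter values below are illustrative only.

    # Illustrative parameter values only; tune them for your own problem.
    from xgboost import XGBClassifier

    model = XGBClassifier(
        n_estimators=100,       # number of boosted trees
        learning_rate=0.1,      # shrinkage, as in the classic gradient boosting machine
        subsample=0.8,          # row sub-sampling (stochastic gradient boosting)
        colsample_bytree=0.8,   # column sub-sampling per tree
        colsample_bylevel=0.8,  # column sub-sampling per split level
        reg_alpha=0.0,          # L1 regularization (regularized gradient boosting)
        reg_lambda=1.0,         # L2 regularization
    )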

System Features

The library provides a system for use in a range of computing environments, including the following (a brief sketch of parallel training follows the list):

  • Parallelization of tree construction using all of your CPU cores during training.
  • Distributed Computing for training very large models using a cluster of machines.
  • Out-of-Core Computing for very large datasets that don’t fit into memory.
  • Cache Optimization of data structures and algorithm to make best use of hardware.
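
For example, parallel tree construction during training can be enabled with a single parameter. This is a minimal sketch using the scikit-learn style API; in recent versions of the library the parameter is named n_jobs (older versions use nthread).

    # Use all available CPU cores when constructing trees during training.
    from xgboost import XGBClassifier

    model = XGBClassifier(n_jobs=-1)  # -1 means use all available cores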

Algorithm Features

The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include the following (two of them are sketched in code after the list):

  • Sparse Aware implementation with automatic handling of missing data values.
  • Block Structure to support the parallelization of tree construction.
  • Continued Training so that you can further boost an already fitted model on new data.
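
As an illustration, here is a small sketch (not from the original post) of two of these features in the Python API: missing values can be passed through as NaN and are handled automatically, and an already fitted model can be boosted further via the xgb_model argument. The data and parameter values are illustrative only.

    # Sparse-aware handling of missing values and continued training.
    import numpy as np
    import xgboost as xgb

    X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
    y = np.array([0, 1, 0, 1])

    # NaN entries are treated as missing values and handled automatically.
    dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
    params = {"objective": "binary:logistic", "max_depth": 2}
    booster = xgb.train(params, dtrain, num_boost_round=10)

    # Continued training: add further boosting rounds to an existing model.
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)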

XGBoost is free open source software available for use under the permissive Apache-2 license.

Why Use XGBoost?

The two reasons to use XGBoost are also the two goals of the project:

  1. Execution Speed.
  2. Model Performance.

1. XGBoost Execution Speed

Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.

Szilard Pafka performed some objective benchmarks comparing the performance of XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post titled “Benchmarking Random Forest Implementations“.

He also provides all the code on GitHub and a more extensive report of results with hard numbers.

Benchmark Performance of XGBoost, taken from Benchmarking Random Forest Implementations.

His results showed that XGBoost was almost always faster than the other benchmarked implementations from R, Python, Spark and H2O.

From his experiment, he commented:

I also tried xgboost, a popular library for boosting which is capable to build random forests as well. It is fast, memory efficient and of high accuracy

— Szilard Pafka, Benchmarking Random Forest Implementations.

2. XGBoost Model Performance

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.

The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

For example, there is an incomplete list of first, second and third place competition winners that used XGBoost, titled: XGBoost: Machine Learning Challenge Winning Solutions.

To make this point more tangible, below are some insightful quotes from Kaggle competition winners:

As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.

Dato Winners’ Interview: 1st place, Mad Professors

When in doubt, use xgboost.

Avito Winner’s Interview: 1st place, Owen Zhang

I love single models that do well, and my best single model was an XGBoost that could get the 10th place by itself.

Caterpillar Winners’ Interview: 1st place

I only used XGBoost.

Liberty Mutual Property Inspection, Winner’s Interview: 1st place, Qingchen Wang

The only supervised learning method I used was gradient boosting, as implemented in the excellent xgboost package.

Recruit Coupon Purchase Winner’s Interview: 2nd place, Halla Yang

What Algorithm Does XGBoost Use?

The XGBoost library implements the gradient boosting decision tree algorithm.

This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.

This approach supports both regression and classification predictive modeling problems.
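
To make this concrete, here is a small sketch (not from the original post) of gradient boosting for regression with a squared error loss, using scikit-learn decision trees as the base learners. Each new tree is fit to the residuals of the current ensemble, which are the negative gradient of the squared error loss.

    # Minimal sketch of gradient boosting for regression with squared error.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
        prediction = np.full(len(y), y.mean())   # start from a constant model
        trees = []
        for _ in range(n_rounds):
            residuals = y - prediction           # negative gradient of squared error
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals)               # new model predicts the errors of the ensemble
            prediction += learning_rate * tree.predict(X)
            trees.append(tree)
        return trees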

For more on boosting and gradient boosting, see Trevor Hastie’s talk on Gradient Boosting Machine Learning.

Official XGBoost Resources

The best source of information on XGBoost is the official GitHub repository for the project.

From there you can get access to the Issue Tracker and the User Group that can be used for asking questions and reporting bugs.

A great source of links with example code and help is the Awesome XGBoost page.

There is also an official documentation page that includes a getting started guide for a range of different languages, tutorials, how-to guides and more.

There are some more formal papers on XGBoost that are worth a read for more background on the library, most notably “XGBoost: A Scalable Tree Boosting System” by Tianqi Chen and Carlos Guestrin (2016).

Talks on XGBoost

When getting started with a new tool like XGBoost, it can be helpful to review a few talks on the topic before diving into the code.

XGBoost: A Scalable Tree Boosting System

Tianqi Chen, the creator of the library, gave a talk to the LA Data Science group in June 2016 titled “XGBoost: A Scalable Tree Boosting System”.

You can review the slides from his talk online; there is more information on the DataScience LA blog.

XGBoost: eXtreme Gradient Boosting

Tong He, a contributor to XGBoost for the R interface, gave a talk at the NYC Data Science Academy in December 2015 titled “XGBoost: eXtreme Gradient Boosting”.

You can review the slides from his talk online; there is more information about this talk on the NYC Data Science Academy blog.

Installing XGBoost

There is a comprehensive installation guide on the XGBoost documentation website.

It covers installation for Linux, Mac OS X and Windows.

It also covers installation for language interfaces such as R and Python.

XGBoost in R

If you are an R user, the best place to get started is the CRAN page for the xgboost package.

From this page you can access the R vignette Package ‘xgboost’ [pdf].

There are also some excellent R tutorials linked from this page to get you started.

There is also the official XGBoost R Tutorial and Understand your dataset with XGBoost tutorial.

XGBoost in Python

Installation instructions are available on the Python section of the XGBoost installation guide.

The official Python Package Introduction is the best place to start when working with XGBoost in Python.

To get started quickly, you can install the package with pip, for example by typing pip install xgboost at the command line (depending on your setup, you may need sudo or a virtual environment).

There is also an excellent list of sample source code in Python on the XGBoost Python Feature Walkthrough.
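
Once installed, a minimal sketch (not from the original post, using an illustrative dataset from scikit-learn) of training and evaluating an XGBoost model in Python looks like this:

    # Train and evaluate an XGBoost classifier on a small example dataset.
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

    model = XGBClassifier(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    print("Accuracy: %.3f" % accuracy_score(y_test, predictions))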

Summary

In this post you discovered the XGBoost algorithm for applied machine learning.

You learned:

  • That XGBoost is a library for developing fast and high-performance gradient boosting tree models.
  • That XGBoost is achieving the best performance on a range of difficult machine learning tasks.
  • That you can use this library from the command line, Python and R, and how to get started.

Have you used XGBoost? Share your experiences in the comments below.

Do you have any questions about XGBoost or about this post? Ask your question in the comments below and I will do my best to answer them.

Want To Learn The Algorithm Winning Competitions?

XGBoost With Python

Develop Your Own XGBoost Models in Minutes

…with just a few lines of Python

Discover how in my new Ebook:
XGBoost With Python

It covers self-study tutorials like:
Algorithm Fundamentals, Scaling, Hyperparameters, and much more…

Bring The Power of XGBoost To
Your Own Projects

Skip the Academics. Just Results.

Click to learn more.

6 Responses to A Gentle Introduction to XGBoost for Applied Machine Learning

  1. Seo Young Jae July 10, 2017 at 6:25 pm #

    Good information, thank you. Just one question.

    Biggest difference from the gbm is normalization?

    Does gbm not normalize, but does xgboost automatically normalize variables and automatically handle missing values? Did I get it right?

    • Jason Brownlee July 11, 2017 at 10:28 am #

      The biggest difference is performance, not normalization.

  2. Seo Young Jae July 10, 2017 at 7:46 pm #

    I ran xgboost on R.

    However, I found that input values can not be performed in the form of factors.

    In case of gbm, it is possible to use factor type variable.

    In that respect, xgboost seems to have some disadvantages.

    • Jason Brownlee July 11, 2017 at 10:29 am #

      You must transform your categorical variables to be integer encoded or one hot encoded.

      • Seo Young Jae July 11, 2017 at 2:32 pm #

        Is it ok to force a categorical variable to be a continuous variable?

        • Jason Brownlee July 12, 2017 at 9:39 am #

          It depends on the variable. It might make sense if the variable is ordinal. If not, a one hot encoding would be the preferred approach.
