A Gentle Introduction to XGBoost for Applied Machine Learning

Last Updated on February 17, 2021

XGBoost is an algorithm that has recently been dominating applied machine learning and Kaggle competitions for structured or tabular data.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

In this post you will discover XGBoost and get a gentle introduction to what it is, where it came from and how you can learn more.

After reading this post you will know:

• What XGBoost is and the goals of the project.
• Why XGBoost must be a part of your machine learning toolkit.


Let’s get started.

• Updated Feb/2021: Fixed broken links.


What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting.

The name xgboost, though, actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms. Which is the reason why many people use xgboost.

— Tianqi Chen, in answer to the question “What is the difference between the R gbm (gradient boosting machine) and xgboost (extreme gradient boosting)?” on Quora

It is an implementation of gradient boosting machines created by Tianqi Chen, now with contributions from many developers. It belongs to a broader collection of tools under the umbrella of the Distributed Machine Learning Community or DMLC who are also the creators of the popular mxnet deep learning library.

Tianqi Chen provides a brief and interesting back story on the creation of XGBoost in the post Story and Lessons Behind the Evolution of XGBoost.

XGBoost is a software library that you can download and install on your machine, then access from a variety of interfaces. Specifically, XGBoost supports the following main interfaces:

• Command Line Interface (CLI).
• C++ (the language in which the library is written).
• Python interface as well as a model in scikit-learn.
• R interface as well as a model in the caret package.
• Julia.
• Java and JVM languages like Scala and platforms like Hadoop.

XGBoost Features

The library is laser focused on computational speed and model performance, so there are few frills. Nevertheless, it does offer a number of advanced features.

Model Features

The implementation of the model supports the features of the scikit-learn and R implementations, with new additions like regularization. Three main forms of gradient boosting are supported:

• Gradient Boosting algorithm, also called gradient boosting machine, including the learning rate.
• Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels.
• Regularized Gradient Boosting with both L1 and L2 regularization.
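As a rough illustration, the three forms above map onto hyperparameters of the xgboost scikit-learn wrapper. The parameter names are the wrapper's own. The values are arbitrary placeholders chosen for this sketch, not recommendations, and the import is guarded so the snippet still runs on a machine without xgboost installed.

```python
# Hyperparameters of the xgboost scikit-learn wrapper, grouped by the three
# forms of gradient boosting listed above. All values are illustrative only.
gradient_boosting = {
    "n_estimators": 100,   # number of boosted trees
    "learning_rate": 0.1,  # shrinkage applied to each new tree
    "max_depth": 3,        # depth of each tree
}
stochastic = {
    "subsample": 0.8,          # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,   # fraction of columns sampled for each tree
    "colsample_bylevel": 1.0,  # fraction of columns sampled at each depth level
    "colsample_bynode": 1.0,   # fraction of columns sampled at each split
}
regularized = {
    "reg_alpha": 0.0,   # L1 penalty on leaf weights
    "reg_lambda": 1.0,  # L2 penalty on leaf weights
}
params = {**gradient_boosting, **stochastic, **regularized}

try:
    from xgboost import XGBClassifier

    model = XGBClassifier(**params)  # configured but not yet fitted
except ImportError:
    model = None  # xgboost is not installed; params is still usable as a reference
```

The same dictionary can be passed to XGBRegressor for regression problems.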

System Features

The library provides a system for use in a range of computing environments, not least:

• Parallelization of tree construction using all of your CPU cores during training.
• Distributed Computing for training very large models using a cluster of machines.
• Out-of-Core Computing for very large datasets that don’t fit into memory.
• Cache Optimization of data structures and algorithm to make best use of hardware.

Algorithm Features

The implementation of the algorithm was engineered for efficiency of compute time and memory resources. A design goal was to make the best use of available resources to train the model. Some key algorithm implementation features include:

• Sparse Aware implementation with automatic handling of missing data values.
• Block Structure to support the parallelization of tree construction.
• Continued Training so that you can further boost an already fitted model on new data.
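To make the sparse aware idea more concrete, below is a simplified sketch of how a default direction for missing values can be chosen at a single split. It is not the library's implementation: XGBoost scores splits with a regularized gain computed from gradients, whereas this sketch uses plain squared error, and the function names and toy data are made up for illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(feature, target, threshold):
    """Choose the side of a split that rows with a missing value should follow.

    feature: list of floats, with None marking a missing value
    target: list of floats, same length as feature
    Returns (direction, loss) where direction is "left" or "right".
    """
    pairs = list(zip(feature, target))
    left = [t for f, t in pairs if f is not None and f < threshold]
    right = [t for f, t in pairs if f is not None and f >= threshold]
    missing = [t for f, t in pairs if f is None]

    # Try both options and keep the one with the lower total error.
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Toy data: the rows with a missing feature have targets close to the right side.
feature = [1.0, 2.0, None, 8.0, 9.0, None]
target = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
direction, loss = best_default_direction(feature, target, threshold=5.0)
print(direction)  # right
```

Because the direction is learned from the training data for each split, no separate imputation step is needed before training.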

XGBoost is free open source software available for use under the permissive Apache-2 license.

Why Use XGBoost?

The two reasons to use XGBoost are also the two goals of the project:

1. Execution Speed.
2. Model Performance.

1. XGBoost Execution Speed

Generally, XGBoost is fast. Really fast when compared to other implementations of gradient boosting.

Szilard Pafka performed some objective benchmarks comparing the performance of XGBoost to other implementations of gradient boosting and bagged decision trees. He wrote up his results in May 2015 in the blog post titled “Benchmarking Random Forest Implementations”.

He also provides all the code on GitHub and a more extensive report of results with hard numbers.

Benchmark Performance of XGBoost, taken from Benchmarking Random Forest Implementations.

His results showed that XGBoost was almost always faster than the other benchmarked implementations from R, Python, Spark and H2O.

From his experiment, he commented:

I also tried xgboost, a popular library for boosting which is capable to build random forests as well. It is fast, memory efficient and of high accuracy

— Szilard Pafka, Benchmarking Random Forest Implementations.

2. XGBoost Model Performance

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems.

The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

For example, there is an incomplete list of first, second and third place competition winners that used XGBoost, titled: XGBoost: Machine Learning Challenge Winning Solutions.

To make this point more tangible, below are some insightful quotes from Kaggle competition winners:

As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.

When in doubt, use xgboost.

I love single models that do well, and my best single model was an XGBoost that could get the 10th place by itself.

I only used XGBoost.

The only supervised learning method I used was gradient boosting, as implemented in the excellent xgboost package.

What Algorithm Does XGBoost Use?

The XGBoost library implements the gradient boosting decision tree algorithm.

This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines.

Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.

Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
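The idea of fitting each new model to the residuals of the previous ones can be sketched in a few lines of plain Python. This is a minimal teaching example under stated assumptions, not how XGBoost is implemented: it uses single-split decision stumps on one input variable, squared error loss (for which the negative gradient is exactly the residual), and a made-up toy dataset. The function names are invented for this sketch, and the input needs at least two distinct values.

```python
def fit_stump(x, residuals):
    """Find the single-threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi < threshold]
        right = [r for xi, r in zip(x, residuals) if xi >= threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi < threshold else right_value


def fit_gradient_boosting(x, y, n_rounds=20, learning_rate=0.3):
    """Boost decision stumps on squared error loss."""
    base = sum(y) / len(y)  # initial prediction: the mean of the targets
    predictions = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new model to the running prediction.
        predictions = [
            pi + learning_rate * stump_predict(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        mse_history.append(mse)
    return {"base": base, "stumps": stumps,
            "learning_rate": learning_rate, "mse_history": mse_history}


def predict(model, xi):
    """Final prediction: base value plus the sum of all shrunken stump outputs."""
    boost = sum(stump_predict(s, xi) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * boost


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]
model = fit_gradient_boosting(x, y)
mse_history = model["mse_history"]
```

The training error in mse_history shrinks from round to round, which is the sequential error correction described above. A smaller learning rate slows that descent and usually needs more rounds.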

This approach supports both regression and classification predictive modeling problems.

For more on boosting and gradient boosting, see Trevor Hastie’s talk on Gradient Boosting Machine Learning.

Official XGBoost Resources

The best source of information on XGBoost is the official GitHub repository for the project.

From there you can get access to the Issue Tracker and the User Group that can be used for asking questions and reporting bugs.

A great source of links with example code and help is the Awesome XGBoost page.

There is also an official documentation page that includes a getting started guide for a range of different languages, tutorials, how-to guides and more.

There are some more formal papers on XGBoost that are worth a read for more background on the library:

Talks on XGBoost

When getting started with a new tool like XGBoost, it can be helpful to review a few talks on the topic before diving into the code.

XGBoost: A Scalable Tree Boosting System

Tianqi Chen, the creator of the library, gave a talk to the LA Data Science group in June 2016 titled “XGBoost: A Scalable Tree Boosting System”.

You can review the slides from his talk here:

Tong He, a contributor to XGBoost for the R interface, gave a talk at the NYC Data Science Academy in December 2015 titled “XGBoost: eXtreme Gradient Boosting”.

You can review the slides from his talk here:

Installing XGBoost

There is a comprehensive installation guide on the XGBoost documentation website.

It covers installation for Linux, Mac OS X and Windows.

It also covers installation on platforms such as R and Python.

XGBoost in R

If you are an R user, the best place to get started is the CRAN page for the xgboost package.

From this page you can access the R vignette Package ‘xgboost’ [pdf].

There is also the official XGBoost R Tutorial and Understand your dataset with XGBoost tutorial.

XGBoost in Python

Installation instructions are available on the Python section of the XGBoost installation guide.

The official Python Package Introduction is the best place to start when working with XGBoost in Python.

To get started quickly, you can install the package with pip by typing:

pip install xgboost

There is also an excellent list of sample source code in Python on the XGBoost Python Feature Walkthrough.
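Once the package is installed, a first model takes only a few lines. The sketch below is one possible starting point rather than an official example: it assumes xgboost and NumPy are installed, uses the scikit-learn style XGBClassifier, and trains on a synthetic dataset invented for this illustration. The hyperparameter values are placeholders.

```python
def train_demo_classifier():
    """Fit a small XGBoost classifier on synthetic data and return its accuracy."""
    # Imports live inside the function so that defining it never fails
    # on a machine where xgboost is not installed.
    import numpy as np
    from xgboost import XGBClassifier

    # Synthetic binary classification problem: the label depends on two features.
    rng = np.random.default_rng(seed=7)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Simple hold-out split: first 150 rows for training, the rest for testing.
    X_train, X_test = X[:150], X[150:]
    y_train, y_test = y[:150], y[150:]

    model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
    model.fit(X_train, y_train)

    predictions = model.predict(X_test)
    return float((predictions == y_test).mean())
```

Calling train_demo_classifier() returns the accuracy on the held-out rows as a float between 0 and 1.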

Summary

In this post you discovered the XGBoost algorithm for applied machine learning.

You learned:

• That XGBoost is a library for developing fast and high performance gradient boosting tree models.
• That XGBoost is achieving the best performance on a range of difficult machine learning tasks.
• That you can use this library from the command line, Python and R and how to get started.

63 Responses to A Gentle Introduction to XGBoost for Applied Machine Learning

1. Seo Young Jae July 10, 2017 at 6:25 pm #

Good information, thank you. Just one question.

Biggest difference from the gbm is normalization?

Does gbm not normalize, but does xgboost automatically normalize variables and automatically handle missing values? Did I get it right?

• Jason Brownlee July 11, 2017 at 10:28 am #

The biggest difference is performance, not normalization.

2. Seo Young Jae July 10, 2017 at 7:46 pm #

I ran xgboost on R.

However, I found that input values can not be performed in the form of factors.

In case of gbm, it is possible to use factor type variable.

In that respect, xgboost seems to have some disadvantages.

• Jason Brownlee July 11, 2017 at 10:29 am #

You must transform your categorical variables to be integer encoded or one hot encoded.

• Seo Young Jae July 11, 2017 at 2:32 pm #

Is it ok to force a categorical variable to be a continuous variable?

• Jason Brownlee July 12, 2017 at 9:39 am #

It depends on the variable. It might make sense if the variable is ordinal. If not, a one hot encoding would be the preferred approach.

• sapan September 18, 2018 at 3:44 am #

this seems to be a limitation of the xgboost implementation you’re using, not of the algorithm itself.

3. dksahu September 6, 2017 at 5:53 pm #

Reference for monotonicity constraint for decision trees in xgboost?

4. Aman Garg September 13, 2017 at 5:44 am #

Could you please tell me if XGBoost can also be used for unsupervised learning – clustering of large datasets?

If yes, does XGBoost provide an edge over other unsupervised algorithms like K-means clustering, DBSCAN, etc.?

• Jason Brownlee September 13, 2017 at 12:37 pm #

Not as far as I know. Gradient boosting is a supervised learning algorithm.

5. Petros Koulouris September 20, 2017 at 5:14 pm #

Jason, I would love to see how to perform repeated cross validation in order to hyper-tune model parameters. I used the caret package and it took 20-30 times longer than to train other models types ie ranger, gbm, glmnet on the same German credit dataset.

It’s been touted as extremely fast, which I haven’t observed, and most tutorials I have found employ caret.

• Jason Brownlee September 21, 2017 at 5:37 am #

Thanks for the suggestion Petros.

6. Sasikanth September 23, 2017 at 10:08 am #

Hello Jason,
Have you tried to install and use LightGBM from Microsoft? It is said to be better and faster than XGBoost.

• Jason Brownlee September 24, 2017 at 5:11 am #

I have not, perhaps in the future.

7. Norbert November 10, 2017 at 6:57 pm #

Yep, XGBoost rocks. One year ago I created a quick free online course on how to use it efficiently in Python – http://education.parrotprediction.teachable.com/p/practical-xgboost-in-python

• Jason Brownlee November 11, 2017 at 9:20 am #

Cool, thanks for the ref Norbert. That is also about the time I released my book on the topic.

8. Frank Ludeña January 18, 2018 at 8:44 am #

Hello good afternoon, with respect to the fact that xgboost does not support categorical variables, I trained the following model in caret with a factor variable with xgbTree and I had no problem (a single variable to exemplify). Am I doing something wrong?

pase0.xgbTree_x=train(as.factor(PASE)~TIPO_CLIENTE,data=pase0,trControl=trainControl(method='repeatedcv',number=5,repeats=10,verboseIter = TRUE),method='xgbTree',allowParallel=TRUE,tuneGrid=xgb.tuning)

The categorical variable is TIPO_CLIENTE

• Jason Brownlee January 18, 2018 at 10:15 am #

9. Abhilash Menon April 5, 2018 at 2:02 am #

Hi Dr. Brownlee, Is there a way to get all the predictions we make into the test dataset with the predictions of our model as a column in the test dataset. I am concerned about how the order of instances will be preserved in this case (as in, the prediction corresponding to an instance should be in the same row as the instance). Could you please shed some light on this issue?

Thanks!

• Jason Brownlee April 5, 2018 at 6:14 am #

The order of the inputs will match the order of the outputs.

10. Brett April 8, 2018 at 5:00 am #

Not sure if this is the place for it, feel free to delete if not… but I just wanted to drop you a note to say thank you for the site… whenever it pops up in a search (which is often) I know I’m going to get some quality info.

• Jason Brownlee April 8, 2018 at 6:30 am #

Thanks Brett, I really appreciate the kind words!

11. IanDz April 17, 2018 at 7:14 pm #

Jason, just wanted to thank you for all the amazing stuff you do! Your articles are some of the best online!

• Jason Brownlee April 18, 2018 at 8:02 am #

Thanks IanDz, I really appreciate your support!

12. GopalKrishna June 3, 2018 at 12:47 pm #

Jason,

Xgb Importance output includes Split, RealCover and RealCover% in addition to Gain, Cover and Frequency when you pass add. parameters – training set ( or its subset) and label.

While Split value is understood, could you help understand/ interpret RealCover and RealCover% that appear against specific features only.

Also, in such expanded output what meaning should be derived from number of entries in the xgb importance table?

Thanks

13. Purvi Prajapati December 4, 2018 at 10:09 pm #

Could we apply XGBoost for Multi-Label Classification Problem?
Kindly reply me. I am working on Tree based approach for Multi-label classification.

• Jason Brownlee December 5, 2018 at 6:16 am #

Perhaps. Sorry, I don’t have any examples of multi-label prediction. I hope to cover it in the future.

• Purvi Prajapati December 5, 2018 at 3:51 pm #

Let me know is it applicable to Multi-Label Classification or not.

• Apoorv January 23, 2019 at 2:50 am #

I think you can by setting the objective function to any of the below as per your requirements (from xgboost documentation: https://xgboost.readthedocs.io/en/latest/parameter.html):

multi:softmax: set XGBoost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes)

multi:softprob: same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata * nclass matrix. The result contains predicted probability of each data point belonging to each class.

14. Mak Wai Keong December 15, 2018 at 7:17 pm #

Hi all

I am very keen to know how Xgb can be used in the context of Learning (such as intelligent tutoring – to pick right context of knowledge for user).

I am new to this area, but am very keen to apply AI to learning.
One way I saw was the use of Dialogs to know what is known and what is not known, and what is to be known.

Looking forward to tips and advice from you experts

15. Monchy January 15, 2019 at 12:48 am #

I see that one-hot encoding of factor variables is required. However, in my R implementation XGBoost performs without any error or warning messages when I include factors. Does the algorithm ignore these variables?

• Jason Brownlee January 15, 2019 at 5:53 am #

I think R handles the factors automatically.

No, they are not ignored.

16. Amal January 22, 2019 at 8:59 pm #

Hello
I need to know what is best to use in the case of binary classification: xgboost or logistic regression with gradient descent, and why
thank you so much

• Jason Brownlee January 23, 2019 at 8:47 am #

It is not knowable. You must test a suite of methods and discover what works best for a specific dataset.

17. Thomas March 22, 2019 at 5:58 am #

All the bells and whistles are there but the meat of the algorithm is extremely poorly presented. Can’t believe this is listed second on Google.

• Jason Brownlee March 22, 2019 at 8:42 am #

Sorry to hear that Thomas.

What do you think was missing exactly? What would you like to see?

18. Soroosh June 13, 2019 at 12:23 pm #

Two main points:

1) Comparing XGBoost and Spark Gradient Boosted Trees using a single node is not the right comparison. Spark GBT is designed for multi-computer processing, if you add more nodes, the processing time dramatically drops while Spark manages the cluster. XGBoost can be run on a distributed cluster, but on a Hadoop cluster.

2) XGBoost and Gradient Boosted Trees are bias-based. They reduce variance too, but not as well as variance-based models like Random Forest, so when you are dealing with Kaggle datasets XGBoost works well, but when you are dealing with the real world and data streaming problems, Random Forest is a more stable model (stability in terms of handling high variance data, which happens a lot in streaming data).

• Jason Brownlee June 13, 2019 at 2:36 pm #

Thanks.

• Abdallah Elbohy July 22, 2019 at 12:50 pm #

Thanks for adding information. But aren’t there all datasets in kaggle in a real-world?
And which datasets will be more stable with random forests than in XGBoost?

• Jason Brownlee July 22, 2019 at 2:08 pm #

I think so.

Tabular data is often best solved with xgboost, compared to neural nets or other methods.

19. Mrudhula September 6, 2019 at 4:39 am #

may i know the disadvantages of xgboost sir???

• Jason Brownlee September 6, 2019 at 5:06 am #

Good question.

It can be slow.
It can create a complex model.

20. ralph September 8, 2019 at 8:03 pm #

Hi, and thanks for this very clear post!

Just to make sure I understand properly: if speed is not a concern, xgboost will bring nothing more than a classical random forest, right?

• Jason Brownlee September 9, 2019 at 5:14 am #

No, it is a different algorithm called stochastic gradient boosting, and it offers both performance (skill) and speed improvements over other implementations.

21. Ashvin November 14, 2019 at 1:38 am #

Thanks for this article. Is it possible to decompose a dependent variable using XGBOOST, like coefficient times variable in a Linear Model?

22. Anthony The Koala November 25, 2019 at 6:14 am #

Dear Dr Jason,
The “pipped” version of xgboost crashed when using the demonstration of “learning_rate on the Pima Indians Onset of Diabetes dataset” in one of your ‘crash courses’.
Definition: “pipped” meaning pip install --upgrade xgboost.

Solution: while the solution worked for me, I cannot guarantee that it will work for you if your xgboost has crashed. The fix is to get the *.whl version at https://www.lfd.uci.edu/~gohlke/pythonlibs/. Search for xgboost and obtain the suitable version of the *.whl file for your particular version of Python and for whether you are using a 32-bit or 64-bit version of the Python interpreter.

Then in your command window, you say:

Thank you,
Anthony of Sydney

23. manuela January 14, 2020 at 5:14 pm #

Several research papers use xgboost for feature engineering. Is it possible to use it in a research paper as just an algorithm for enhancing the prediction of another classifier approach?

• Jason Brownlee January 15, 2020 at 8:19 am #

You can use xgboost any way you want, e.g. for feature selection.

As for describing its use in a research paper, I cannot comment.

24. Apoorv Vishnoi February 27, 2020 at 5:37 pm #

Hi Jason,
Thanks for writing this article. There is a doubt that I have not been able to clear, even after attempting to read the original paper on xgboost. Like AdaBoost, does XGB also weigh each sample differently for subsequent models?

• Jason Brownlee February 28, 2020 at 6:00 am #

I believe so. It is key to “boosting”.

25. Igor Ost April 22, 2020 at 7:00 am #

IMHO ” .. must be a part of your …” not apart

26. Miguel August 16, 2020 at 10:03 am #

Jason, could you please explain what “structured or tabular data” means in this context? As opposed to…

Is XGBoost suitable for time series?

Thank you very much for all your excellent material.

27. Hussain November 16, 2021 at 4:06 am #

Is xgboost a discriminative model?

• Adrian Tam November 16, 2021 at 3:14 pm #

Yes, as you provide data to it rather than it generating data for you.

28. Jack May 24, 2022 at 3:23 am #

Hello, Any Ideas if we can use it in multiple instance classification, So if we have a dataset with multiple systems and each system have multiple rows.

Or would you recommend using a different approach? if so what ?