Stochastic Gradient Boosting with XGBoost and scikit-learn in Python


A simple technique for ensembling decision trees involves training trees on subsamples of the training dataset.

Taking subsets of the rows in the training data to train individual trees is called bagging. When subsets of the features (columns) are also taken when calculating each split point, this is called random forest.

These techniques can also be applied to gradient tree boosting, in a variation called stochastic gradient boosting.

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.

After reading this post you will know:

  • The rationale behind training trees on subsamples of data and how this can be used in gradient boosting.
  • How to tune row-based subsampling in XGBoost using scikit-learn.
  • How to tune column-based subsampling by both tree and split-point in XGBoost.


Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.


Stochastic Gradient Boosting

Gradient boosting is a greedy procedure.

New decision trees are added to the model to correct the residual error of the existing model.

Each decision tree is created using a greedy search procedure to select split points that best minimize an objective function. This can result in trees that use the same attributes and even the same split points again and again.

Bagging is a technique where a collection of decision trees is created, each from a different random subset of rows from the training data. The ensemble performs better than a single tree because the randomness in the sample allows slightly different trees to be created, adding diversity to the ensemble and reducing the variance of its combined predictions.

Random forest takes this one step further, by allowing the features (columns) to be subsampled when choosing split points, adding further diversity to the ensemble of trees.

These same techniques can be used in the construction of decision trees in gradient boosting in a variation called stochastic gradient boosting.

It is common to use aggressive sub-samples of the training data such as 40% to 80%.

Tutorial Overview

In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.

We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

  1. Subsampling of rows in the dataset when creating each tree.
  2. Subsampling of columns in the dataset when creating each tree.
  3. Subsampling of columns for each split in the dataset when creating each tree.

Problem Description: Otto Dataset

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset from the Data page and place the unzipped train.csv file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).
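To make the evaluation metric concrete, here is a minimal sketch of multiclass log loss in plain Python. The function name and the toy three-class example are illustrative assumptions, not part of the competition code, and the clipping constant is a common convention rather than something the post specifies.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: list of integer class indices.
    y_prob: list of per-row probability lists, one entry per class.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)


# A uniform guess over k classes scores log(k), a handy baseline.
n_classes = 3  # toy value for illustration
uniform = [[1.0 / n_classes] * n_classes for _ in range(4)]
baseline = multiclass_log_loss([0, 1, 2, 0], uniform)
```

Lower values are better. A model that does no better than guessing uniformly scores log(k) for k classes, so any useful model should come in well below that.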

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).
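The one preparation step mentioned above, encoding string class labels as integers, can be sketched as follows. The helper name and the example label strings are assumptions for illustration only.

```python
def encode_labels(labels):
    """Map each distinct string label to an integer, in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes


# Illustrative label strings; the real file may use different names.
raw = ["Class_2", "Class_1", "Class_3", "Class_1"]
encoded, classes = encode_labels(raw)
```

In practice, scikit-learn's LabelEncoder performs the same kind of mapping with fit_transform(), so there is usually no need to write this by hand.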

Tuning Row Subsampling in XGBoost

Row subsampling involves selecting a random sample of the training dataset without replacement.

Row subsampling can be specified in the scikit-learn wrapper classes for XGBoost (such as XGBClassifier) using the subsample parameter. The default is 1.0, which means no subsampling.
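As a rough illustration of what row subsampling means, the sketch below draws a fresh sample of row indices, without replacement, for each boosting round. It is a conceptual sketch only and makes no claim about how XGBoost implements sampling internally; the helper name and toy sizes are assumptions.

```python
import random


def sample_rows(n_rows, subsample, rng):
    """Return sorted row indices for one tree, drawn without replacement."""
    n_keep = max(1, int(round(subsample * n_rows)))
    return sorted(rng.sample(range(n_rows), n_keep))


rng = random.Random(7)
# Each boosting round draws its own independent subsample of the rows.
rounds = [sample_rows(n_rows=10, subsample=0.3, rng=rng) for _ in range(3)]
```

Because no index can appear twice in one sample, each tree sees a strict subset of the data whenever subsample is below 1.0.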

We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

There are 9 variations of subsample and each model will be evaluated using 10-fold cross validation, meaning that 9×10 or 90 models need to be trained and tested.

The complete code listing is provided below.
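A minimal sketch of such a grid search is shown here, under a few assumptions: X is a numeric feature matrix and y holds integer-encoded class labels, both already loaded from train.csv; the nine candidate values (with 0.9 left out) are chosen only so that the count matches the text; and the random seed is arbitrary. The function names are hypothetical.

```python
# Nine candidate values; 0.9 is omitted here only so that the count matches
# the "9 variations" mentioned in the text (an assumption).
SUBSAMPLE_VALUES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
N_FOLDS = 10


def count_fits(param_grid, n_folds):
    """Number of models trained: grid combinations times folds."""
    n = 1
    for values in param_grid.values():
        n *= len(values)
    return n * n_folds


def tune_subsample(X, y, seed=7):
    """Grid search over subsample, scored by negative log loss."""
    # Imported here so the grid above can be inspected without these packages.
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from xgboost import XGBClassifier

    param_grid = {"subsample": SUBSAMPLE_VALUES}
    kfold = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=seed)
    search = GridSearchCV(XGBClassifier(), param_grid,
                          scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    result = search.fit(X, y)
    # Report the best configuration, then every tested configuration.
    print("Best: %f using %s" % (result.best_score_, result.best_params_))
    for mean, std, params in zip(result.cv_results_["mean_test_score"],
                                 result.cv_results_["std_test_score"],
                                 result.cv_results_["params"]):
        print("%f (%f) with: %r" % (mean, std, params))
    return result
```

Note that scikit-learn reports log loss negated ("neg_log_loss") so that larger is always better; values closer to zero indicate a better model. Calling count_fits({"subsample": SUBSAMPLE_VALUES}, N_FOLDS) gives the 90 cross-validation fits described above.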

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best result was achieved with a subsample value of 0.3, that is, training each tree on a 30% sample of the training dataset.

We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.
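One way to produce such a plot is sketched below. It assumes the results come from a fitted GridSearchCV object (its cv_results_ dictionary) and that matplotlib is installed; the helper names, axis labels and output filename are illustrative choices.

```python
def scores_by_value(cv_results, name):
    """Return (values, means, stds) for one tuned parameter."""
    values = [params[name] for params in cv_results["params"]]
    means = list(cv_results["mean_test_score"])
    stds = list(cv_results["std_test_score"])
    return values, means, stds


def plot_scores(values, means, stds, name, filename):
    """Save an error-bar plot of mean score against the parameter value."""
    # Imported lazily; matplotlib is only needed when plotting.
    from matplotlib import pyplot

    pyplot.errorbar(values, means, yerr=stds)
    pyplot.title("XGBoost %s vs Log Loss" % name)
    pyplot.xlabel(name)
    pyplot.ylabel("Log Loss")
    pyplot.savefig(filename)
    pyplot.close()
```

For example, values, means, stds = scores_by_value(result.cv_results_, "subsample") followed by plot_scores(values, means, stds, "subsample", "subsample.png"). The error bars show one standard deviation across the cross-validation folds.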

Plot of Tuning Row Sample Rate in XGBoost


We can see that 30% does indeed have the best mean performance, but we can also see that as the ratio increases, the variance in performance grows quite markedly.

It is interesting to note that the mean performance of all subsample values outperforms the mean performance without subsampling (subsample=1.0).

Tuning Column Subsampling in XGBoost By Tree

We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.

In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter.

The default value is 1.0 meaning that all columns are used in each decision tree. We can evaluate values for colsample_bytree between 0.1 and 1.0 incrementing by 0.1.

The full code listing is provided below.
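The parameter grid for this experiment might be built as in the sketch below, which assumes ten values from 0.1 to 1.0 in steps of 0.1. The variable names are illustrative.

```python
# Build 0.1, 0.2, ..., 1.0 by dividing integers rather than by repeatedly
# adding 0.1, which would accumulate floating-point error.
COLSAMPLE_VALUES = [i / 10.0 for i in range(1, 11)]

# Grid keyed by the XGBoost parameter being tuned.
param_grid = {"colsample_bytree": COLSAMPLE_VALUES}
```

This dictionary can be passed to scikit-learn's GridSearchCV together with an XGBClassifier, in the same way as for the subsample parameter. With ten values and 10-fold cross validation, that means 100 models to train and test.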

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best performance for the model was colsample_bytree=1.0. This suggests that subsampling columns on this problem does not add value.

Plotting the results, we can see the performance of the model plateau (at least at this scale) with values between 0.5 and 1.0.

Plot of Tuning Per-Tree Column Sampling in XGBoost


Tuning Column Subsampling in XGBoost By Split

Rather than subsample the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest.

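We can set the size of the sample of columns used at each split with the colsample_bylevel parameter in the XGBoost wrapper classes for scikit-learn. Note that recent XGBoost releases document colsample_bylevel as drawing a new sample of columns once per depth level of a tree, and provide a separate colsample_bynode parameter that draws a new sample at every split.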
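The difference between sampling columns once per tree and redrawing them inside the tree can be sketched in a few lines. This is a conceptual illustration with hypothetical helper names and toy sizes, not XGBoost's implementation; exactly when XGBoost redraws the sample depends on the parameter used and the library version.

```python
import random


def sample_columns(n_cols, fraction, rng):
    """Return sorted column indices drawn without replacement."""
    n_keep = max(1, int(round(fraction * n_cols)))
    return sorted(rng.sample(range(n_cols), n_keep))


def columns_per_tree(n_cols, fraction, n_splits, rng):
    """Per-tree sampling: one draw, reused by every split in the tree."""
    chosen = sample_columns(n_cols, fraction, rng)
    return [chosen for _ in range(n_splits)]


def columns_per_split(n_cols, fraction, n_splits, rng):
    """Per-split sampling: a fresh draw for each split in the tree."""
    return [sample_columns(n_cols, fraction, rng) for _ in range(n_splits)]


rng = random.Random(7)
by_tree = columns_per_tree(n_cols=10, fraction=0.5, n_splits=3, rng=rng)
by_split = columns_per_split(n_cols=10, fraction=0.5, n_splits=3, rng=rng)
```

With per-tree sampling, a column left out of the draw is invisible to the whole tree. With per-split sampling, a column missed at one split can still be chosen at another, so each tree can draw on more of the features overall.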

As before, we will vary the ratio from 10% to the default of 100%.

The full code listing is provided below.

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best results were achieved by setting colsample_bylevel to 70%, resulting in an (inverted) log loss of -0.001062, which is better than -0.001239 seen when setting the per-tree column sampling to 100%.

This suggests that you should not give up on column subsampling when per-tree results favor using 100% of columns; try per-split column subsampling instead.

We can plot the performance of each colsample_bylevel variation. The results show relatively low variance and seemingly a plateau in performance after a value of 0.3 at this scale.

Plot of Tuning Per-Split Column Sampling in XGBoost



In this post you discovered stochastic gradient boosting with XGBoost in Python.

Specifically, you learned:

  • About stochastic boosting and how you can subsample your training data to improve the generalization of your model.
  • How to tune row subsampling with XGBoost in Python and scikit-learn.
  • How to tune column subsampling with XGBoost both per-tree and per-split.

Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.


15 Responses to Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

  1. Omogbhein azeez May 26, 2017 at 3:03 am #

    Hello Dr,

    Good day sir. How can I plot the tuning of two variables simultaneously, e.g. gamma and learning_rate? Kind regards.

    • Jason Brownlee June 2, 2017 at 11:48 am #

      Consider saving the values to file for later plotting.

  2. Niranjan December 28, 2017 at 1:35 am #

    what is the accuracy of the model for different levels?

  3. ben hoyle April 22, 2018 at 5:56 pm #

    Do you know how to obtain a probability distribution function “pdf” from gradient boosted trees? E.g. for Random Forests, we can use the distribution of predictions from each tree as a proxy for the pdf, and for AdaBoost one may weight the prediction from each tree by the weight of the tree, to get a “pdf”.

    I’ve not found anything about this for gradient boosted trees…

    • Jason Brownlee April 23, 2018 at 6:14 am #

      Interesting idea. No sorry, you might have to write some custom code.

  4. Aimee July 4, 2018 at 5:45 am #

    Another great article! Thanks. 🙂

    You mentioned that Stochastic Gradient Boosting which implements column subsampling by split is very similar to how Random Forests operate. It seems that the difference is this:

    For Random Forests, the split is based on selection of the column which results in the most homogenous split outcomes (a greedy algorithm).

    For Stochastic Gradient Boosting implementing column subsampling by split, the split is based upon random selection of a column to be split.

    Is this a correct understanding of the distinction between these two methods?


  5. Sinan Ozdemir December 4, 2018 at 9:00 am #

    Hi Jason,

    Do you have a plan to write another book or a tutorial about using XGBoost in time series problems/predictions?

    • Jason Brownlee December 4, 2018 at 2:32 pm #

      Great suggestion. Not at this stage, but I like the idea!

      • Sinan Ozdemir December 5, 2018 at 1:08 am #

        It will be awesome.

        Thank you so much.

  6. Ted August 25, 2019 at 2:21 pm #

    Man your work is awesome, You deserve a medal. My M.L. skills have greatly improved thanks to you.

    Question: What do I stand to lose if I tune multiple parameters at the same time in the same cell?

    E.g. tuning subsample, colsample_bytree and colsample_bylevel in the same cell, using the same method of selecting the parameter values to try, saving them in a dictionary, then passing them with KFold to GridSearchCV.

    • Jason Brownlee August 26, 2019 at 6:09 am #

      Thanks Ted! I’m glad the posts help.

      You can tune multiple parameters at the same time – it’s a great idea, it can just be slow – computationally expensive to run.

  7. Matthias Luthi February 5, 2020 at 11:39 pm #

    Very nice article. My question is, why did you not “keep” the optimal parameters found (row subsampling of 0.3) for the next steps?
    In the end, we want to find the best combination of the subsampling parameters for rows by tree, columns by tree, and columns by level. Testing everything at once in a GridSearch is very time-consuming. However, if you do it iteratively, wouldn’t it make sense to keep the previously found optimal parameter while testing for other kinds of subsampling?

    • Jason Brownlee February 6, 2020 at 8:27 am #

      Yes, but in this tutorial we are demonstrating the effect of the hyperparameters, not trying to best solve the prediction problem.
