
Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

A simple technique for ensembling decision trees involves training trees on subsamples of the training dataset.

Taking subsets of the rows in the training data to train individual trees is called bagging. When subsets of the columns (features) are also taken when choosing each split point, this is called random forest.

These techniques can also be used in the gradient tree boosting model in a technique called stochastic gradient boosting.

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.

After reading this post you will know:

  • The rationale behind training trees on subsamples of data and how this can be used in gradient boosting.
  • How to tune row-based subsampling in XGBoost using scikit-learn.
  • How to tune column-based subsampling by both tree and split-point in XGBoost.

Kick-start your project with my new book XGBoost With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
Stochastic Gradient Boosting with XGBoost and scikit-learn in Python
Photo by Henning Klokkeråsen, some rights reserved.

Stochastic Gradient Boosting

Gradient boosting is a greedy procedure.

New decision trees are added to the model to correct the residual error of the existing model.

Each decision tree is created using a greedy search procedure to select split points that best minimize an objective function. This can result in trees that use the same attributes and even the same split points again and again.

Bagging is a technique where a collection of decision trees is created, each from a different random subset of rows from the training data. The effect is that better performance is achieved from the ensemble of trees because the randomness in the sample allows slightly different trees to be created, adding diversity to the ensemble's predictions.

Random forest takes this one step further by allowing the features (columns) to be subsampled when choosing split points, adding further diversity to the ensemble of trees.

These same techniques can be used in the construction of decision trees in gradient boosting in a variation called stochastic gradient boosting.

It is common to use aggressive sub-samples of the training data such as 40% to 80%.

Tutorial Overview

In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.

We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

  1. Subsampling of rows in the dataset when creating each tree.
  2. Subsampling of columns in the dataset when creating each tree.
  3. Subsampling of columns for each split in the dataset when creating each tree.

Problem Description: Otto Dataset

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file into your working directory.

This dataset describes 93 obfuscated features of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).
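As a minimal sketch, loading the data and encoding the class labels might look like the following, assuming the Kaggle train.csv (with an id column, 93 feature columns, and a target column of string class labels) sits in the working directory:

# Load the Otto dataset and encode the string class labels as integers.
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder

data = read_csv('train.csv')

# Separate inputs and outputs, dropping the id column which carries no signal.
X = data.drop(columns=['id', 'target']).values
y = data['target'].values

# Encode class labels (e.g. 'Class_1' ... 'Class_9') as integers 0 ... 8.
label_encoded_y = LabelEncoder().fit_transform(y)

print(X.shape, label_encoded_y.shape)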

Tuning Row Subsampling in XGBoost

Row subsampling involves selecting a random sample of the training dataset without replacement.

Row subsampling can be specified in the scikit-learn wrapper for XGBoost (the XGBClassifier and XGBRegressor classes) via the subsample parameter. The default is 1.0, which means no subsampling.
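For example, a model that trains each boosted tree on an 80% row sample could be configured as follows (a minimal sketch; the 0.8 value is only an illustration):

from xgboost import XGBClassifier

# Each boosted tree is fit on a random 80% sample of the training rows.
model = XGBClassifier(subsample=0.8)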

We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

There are 10 values of subsample to evaluate and each model will be evaluated using 10-fold cross validation, meaning that 10×10 or 100 models need to be trained and tested.

The complete code listing is provided below.
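Here is a minimal sketch of such a grid search (the exact listing may differ), assuming X and label_encoded_y have been prepared as in the loading sketch above:

# Grid search the subsample parameter for XGBoost on the Otto dataset.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = XGBClassifier()

# Evaluate subsample ratios from 0.1 to 1.0 in steps of 0.1.
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(subsample=subsample)

# 10-fold stratified cross validation, scored with (negated) multiclass log loss.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss",
                           n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)

# Summarize results.
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Note that scikit-learn's neg_log_loss scorer reports the negated log loss, which is why the printed scores appear as negative values.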

Running this example prints the best configuration as well as the log loss for each tested configuration.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best result was achieved with a subsample of 0.3, that is, training each tree on a 30% sample of the training dataset.

We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.
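One way to draw this plot, assuming the subsample, means and stds variables from the grid search sketch above:

# Plot mean log loss with one-standard-deviation error bars per subsample value.
import matplotlib
matplotlib.use('Agg')  # render to a file rather than a window
from matplotlib import pyplot

pyplot.errorbar(subsample, means, yerr=stds)
pyplot.title('XGBoost subsample vs Log Loss')
pyplot.xlabel('subsample')
pyplot.ylabel('Log Loss')
pyplot.savefig('subsample.png')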

Plot of Tuning Row Sample Rate in XGBoost

We can see that indeed 30% has the best mean performance, but we can also see that as the ratio increases, the variance in performance grows quite markedly.

It is interesting to note that the mean performance of every subsampled configuration outperforms the mean performance without subsampling (subsample=1.0).

Tuning Column Subsampling in XGBoost By Tree

We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.

In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter.

The default value is 1.0, meaning that all columns are used in each decision tree. We can evaluate values for colsample_bytree from 0.1 to 1.0, incrementing by 0.1.

The full code listing is provided below.
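As a minimal sketch, only the parameter grid changes relative to the row-subsampling example (again assuming X and label_encoded_y from the loading sketch above):

# Grid search the colsample_bytree parameter on the Otto dataset.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = XGBClassifier()

# Fraction of columns sampled once per tree, from 0.1 to 1.0.
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(colsample_bytree=colsample_bytree)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss",
                           n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))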

Running this example prints the best configuration as well as the log loss for each tested configuration.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best performance for the model was achieved with colsample_bytree=1.0. This suggests that subsampling columns on this problem does not add value.

Plotting the results, we can see that the performance of the model plateaus (at least at this scale) for values between 0.5 and 1.0.

Plot of Tuning Per-Tree Column Sampling in XGBoost

Tuning Column Subsampling in XGBoost By Split

Rather than subsampling the columns once for each tree, we can subsample them at each split (more precisely, at each new depth level) of the decision tree. In principle, this is the approach used in random forest.

We can set the size of the sample of columns used at each split in the colsample_bylevel parameter in the XGBoost wrapper classes for scikit-learn.

As before, we will vary the ratio from 10% to the default of 100%.

The full code listing is provided below.
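Again a minimal sketch, differing from the previous examples only in the parameter grid (assuming X and label_encoded_y from the loading sketch above):

# Grid search the colsample_bylevel parameter on the Otto dataset.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = XGBClassifier()

# Fraction of columns sampled at each new depth level, from 0.1 to 1.0.
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
param_grid = dict(colsample_bylevel=colsample_bylevel)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss",
                           n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)

print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))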

Running this example prints the best configuration as well as the log loss for each tested configuration.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best results were achieved by setting colsample_bylevel to 70%, resulting in an (inverted) log loss of -0.001062, which is better than -0.001239 seen when setting the per-tree column sampling to 100%.

This suggests not giving up on column subsampling if per-tree results suggest using 100% of the columns, and instead trying per-split column subsampling.

We can plot the performance of each colsample_bylevel variation. The results show relatively low variance and seemingly a plateau in performance after a value of 0.3 at this scale.

Plot of Tuning Per-Split Column Sampling in XGBoost

Summary

In this post you discovered stochastic gradient boosting with XGBoost in Python.

Specifically, you learned:

  • About stochastic boosting and how you can subsample your training data to improve the generalization of your model
  • How to tune row subsampling with XGBoost in Python and scikit-learn.
  • How to tune column subsampling with XGBoost both per-tree and per-split.

Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.


21 Responses to Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

  1. Omogbhein azeez May 26, 2017 at 3:03 am #

    Hello Dr,

    Good day sir. How can I plot the tuning of two variables simultaneously, e.g. gamma and learning_rate? Kind regards.

    • Jason Brownlee June 2, 2017 at 11:48 am #

      Consider saving the values to file for later plotting.

  2. Niranjan December 28, 2017 at 1:35 am #

    what is the accuracy of the model for different levels?

  3. ben hoyle April 22, 2018 at 5:56 pm #

    Do you know how to obtain a probability distribution function “pdf” from gradient boosted trees. e.g. for Random Forests, we can use the distribution of predictions from each tree as a proxy for the pdf, and for AdaBoost one may weight the prediction from each tree by the weight of the tree, to get a “pdf”.

    I’ve not found anything about this for gradient boosted trees…

    • Jason Brownlee April 23, 2018 at 6:14 am #

      Interesting idea. No sorry, you might have to write some custom code.

  4. Aimee July 4, 2018 at 5:45 am #

    Another great article! Thanks. 🙂

    You mentioned that Stochastic Gradient Boosting which implements column subsampling by split is very similar to how Random Forests operate. It seems that the difference is this:

    For Random Forests, the split is based on selection of the column which results in the most homogenous split outcomes (a greedy algorithm).

    For Stochastic Gradient Boosting implementing column subsampling by split, the split is based upon random selection of a column to be split.

    Is this a correct understanding of the distinction between these two methods?

    Thanks!

  5. Sinan Ozdemir December 4, 2018 at 9:00 am #

    Hi Jason,

    Do you have a plan to write another book or a tutorial about using XGBoost in time series problems/predictions?

    • Jason Brownlee December 4, 2018 at 2:32 pm #

      Great suggestion. Not at this stage, but I like the idea!

      • Sinan Ozdemir December 5, 2018 at 1:08 am #

        It will be awesome.

        Thank you so much.

  6. Ted August 25, 2019 at 2:21 pm #

    Man, your work is awesome. You deserve a medal. My M.L. skills have greatly improved thanks to you.

    Question: What do I stand to lose if I tune multiple parameters at the same time in the same cell?

    E.g. tuning subsample, colsample_bytree & colsample_bylevel in the same cell using the same method of selecting parameter samples, saving them in a dictionary, then passing them to KFold and GridSearchCV.

    • Jason Brownlee August 26, 2019 at 6:09 am #

      Thanks Ted! I’m glad the posts help.

      You can tune multiple parameters at the same time – it’s a great idea, it can just be slow – computationally expensive to run.

  7. Matthias Luthi February 5, 2020 at 11:39 pm #

    Very nice article. My question is, why did you not “keep” the optimal parameters found (row subsampling of 0.3) for the next steps?
    In the end, we want to find the best combination of the subsampling parameters for rows by tree, columns by tree, and columns by level. Testing everything at once in a GridSearch is very time-consuming. However, if you do it iteratively, wouldn’t it make sense to keep the previously found optimal parameter while testing for other kinds of subsampling?

    • Jason Brownlee February 6, 2020 at 8:27 am #

      Yes, but in this tutorial we are demonstrating the effect of the hyperparameters, not trying to best solve the prediction problem.

      • Sarah June 21, 2021 at 10:18 pm #

        Hi Jason,

        I have a question. When tuning multiple hyperparameters to try to best solve the prediction problem, tuning all the hyperparameters together will be accurate but slow. Can tuning a subset of the hyperparameters each time still find the best solution?
        Thank you in advance!

        • Jason Brownlee June 22, 2021 at 6:30 am #

          Yes, all at once is slow, one by one might miss a combination. It is a trade-off.

  8. Sarah July 13, 2021 at 11:09 am #

    Hi Jason,

    What difference in AUC score between the training set and the validation or test set is acceptable? I mean, I want the model to predict other samples with a high score. I trained a model where the AUC on the training set is 0.87, the AUC on the validation set is 0.81, and the AUC on the test set is 0.83, but the AUC when predicting another dataset is only 0.6. Could you give me some advice? Thank you in advance.

    • Jason Brownlee July 14, 2021 at 5:24 am #

      If you are performing model selection, then you only need to consider the performance of the model on the out of sample (test) datasets.

  9. Sarah July 14, 2021 at 9:41 pm #

    Thank you for your reply. The performance on the test dataset when building the model is good, but using the model to predict the practical dataset proved quite bad! What should I do?

    • Jason Brownlee July 15, 2021 at 5:29 am #

      Perhaps confirm that your test set is representative of the problem?
      Perhaps try repeated k-fold cross-validation to estimate performance?
      Perhaps the new data is very different to the data you used?
      Perhaps ensure that you’re preparing the new data in an identical manner to the training data?
