The post Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost appeared first on Machine Learning Mastery.

Gradient boosting is popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm, or one of the main algorithms, used in winning solutions to machine learning competitions, like those on Kaggle.

There are many implementations of gradient boosting available, including standard implementations in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm.

In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.

Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.

After completing this tutorial, you will know:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.

Let’s get started.

This tutorial is divided into five parts; they are:

- Gradient Boosting Overview
- Gradient Boosting With Scikit-Learn
    - Library Installation
    - Test Problems
    - Gradient Boosting
    - Histogram-Based Gradient Boosting
- Gradient Boosting With XGBoost
    - Library Installation
    - XGBoost for Classification
    - XGBoost for Regression
- Gradient Boosting With LightGBM
    - Library Installation
    - LightGBM for Classification
    - LightGBM for Regression
- Gradient Boosting With CatBoost
    - Library Installation
    - CatBoost for Classification
    - CatBoost for Regression

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss is minimized by following its gradient as the model is fit, much like a neural network.
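To make the idea of adding trees that correct prior errors concrete, here is a minimal sketch of boosting for regression with squared error loss, where the negative gradient of the loss is proportional to the residual. It is an illustration under simplifying assumptions (a fixed learning rate, shallow scikit-learn trees, no row or column sampling), not how any of the libraries in this tutorial are implemented. The function names `fit_boosted_trees` and `predict_boosted_trees` are made up for this example.

```python
# minimal sketch of gradient boosting for regression with squared error loss
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    # start from a constant prediction: the mean of the target
    baseline = float(np.mean(y))
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        # for squared error, the negative gradient is proportional to the residual
        residual = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=1)
        tree.fit(X, residual)
        # take a small step in the direction that reduces the loss
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees

def predict_boosted_trees(X, baseline, trees, learning_rate=0.1):
    # sum the baseline and the scaled contribution of every tree
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# fit the ensemble and compare training error before and after boosting
baseline, trees = fit_boosted_trees(X, y)
mse_start = np.mean((y - baseline) ** 2)
mse_end = np.mean((y - predict_boosted_trees(X, baseline, trees)) ** 2)
print('Training MSE before boosting: %.3f' % mse_start)
print('Training MSE after boosting: %.3f' % mse_end)
```

The learning rate passed to the prediction function must match the one used when fitting, which is why real implementations store it on the model.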

Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.

**Note**: We will not be going into the theory behind how the gradient boosting algorithm works in this tutorial.

For more on the gradient boosting algorithm, see the tutorial “A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning.”

The algorithm provides hyperparameters that should, and perhaps must, be tuned for a specific dataset. Although there are many hyperparameters to tune, perhaps the most important are as follows:

- The number of trees or estimators in the model.
- The learning rate of the model.
- The row and column sampling rate for stochastic models.
- The maximum tree depth.
- The minimum tree weight.
- The regularization terms alpha and lambda.
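As a rough guide, the sketch below shows how several of these hyperparameters appear as arguments of scikit-learn's GradientBoostingClassifier. The values are arbitrary placeholders chosen for illustration, not recommendations. The alpha and lambda regularization terms are not arguments of this class; in XGBoost's scikit-learn wrappers they are named reg_alpha and reg_lambda, and the minimum tree weight is min_child_weight.

```python
# how common gradient boosting hyperparameters map to scikit-learn arguments
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# placeholder values for illustration only
model = GradientBoostingClassifier(
    n_estimators=200,    # number of trees in the model
    learning_rate=0.05,  # learning rate
    subsample=0.8,       # row sampling rate (below 1.0 gives stochastic gradient boosting)
    max_features=0.8,    # fraction of columns considered at each split
    max_depth=3,         # maximum tree depth
    random_state=1)
model.fit(X, y)
print('Trees fit: %d' % model.n_estimators_)
```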

**Note**: We will not be exploring how to configure or tune the configuration of gradient boosting algorithms in this tutorial.

For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial “How to Configure the Gradient Boosting Algorithm.”

There are many implementations of the gradient boosting algorithm available in Python. Perhaps the most used implementation is the version provided with the scikit-learn library.

Additional third-party libraries are available that provide computationally efficient alternate implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.

**Do you have a different favorite gradient boosting implementation?**

Let me know in the comments below.

When using gradient boosting on your predictive modeling project, you may want to test each implementation of the algorithm.

This tutorial provides examples of each implementation of the gradient boosting algorithm on classification and regression predictive modeling problems that you can copy-paste into your project.

Let’s take a look at each in turn.

**Note**: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each different implementation. As such, we are using synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.

This tutorial assumes you have Python and SciPy installed. If you need help, see the tutorial “How to Setup Your Python Environment for Machine Learning with Anaconda.”

In this section, we will review how to use the gradient boosting algorithm implementation in the scikit-learn library.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see the installation guide in the scikit-learn documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.22.1

We will demonstrate the gradient boosting algorithm for classification and regression.

As such, we will use synthetic test problems from the scikit-learn library.

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 10 input features, five of which will be informative and the remaining five redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative. The remaining five will be uninformative, as make_regression() does not generate redundant features.

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)
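One way to check how many inputs actually drive the regression target is to ask make_regression() to also return the coefficients of the underlying linear model via the coef argument. The short sketch below counts the non-zero coefficients, which should match the number of informative features.

```python
# confirm how many input features drive the regression target
from numpy import count_nonzero
from sklearn.datasets import make_regression

# define dataset and also return the coefficients of the underlying linear model
X, y, coef = make_regression(n_samples=1000, n_features=10, n_informative=5, coef=True, random_state=1)
# features with a zero coefficient have no effect on the target
print(coef.round(3))
print('Informative features: %d' % count_nonzero(coef))
```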

Next, let’s look at how we can develop gradient boosting models in scikit-learn.

The scikit-learn library provides the GBM algorithm for regression and classification via the *GradientBoostingClassifier* and *GradientBoostingRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a GradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = GradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.915 (0.025)
    Prediction: 1

The example below first evaluates a GradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = GradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    MAE: -11.854 (1.121)
    Prediction: -80.661
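The reported MAE is negative because scikit-learn negates error metrics such as 'neg_mean_absolute_error' so that larger scores are always better. If you prefer to report the error as a positive number, one option is to take the absolute value of the scores before summarizing them, as in this small variation of the evaluation above.

```python
# report cross-validated MAE as a positive number
from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model
model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
# scikit-learn returns negated errors, so flip the sign before reporting
mae_scores = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(mae_scores), std(mae_scores)))
```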

The scikit-learn library provides an alternate implementation of the gradient boosting algorithm, referred to as histogram-based gradient boosting.

This is an alternate approach to implement gradient tree boosting inspired by the LightGBM library (described more later). This implementation is provided via the *HistGradientBoostingClassifier* and *HistGradientBoostingRegressor* classes.

The primary benefit of the histogram-based approach to gradient boosting is speed. These implementations are designed to be much faster to fit on training data.
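The speed comes from bucketing each continuous input feature into a small number of integer bins, so the search for a split only has to consider bin boundaries rather than every unique value. The sketch below illustrates that binning idea with NumPy on a single feature. It is a simplified illustration, not scikit-learn's actual implementation, and the bin count of 255 is an assumption made for this example.

```python
# sketch of the binning idea behind histogram-based gradient boosting
import numpy as np
from sklearn.datasets import make_regression

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# assumed bin count for this illustration
n_bins = 255
feature = X[:, 0]
# choose interior bin edges at evenly spaced quantiles of the feature
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
# replace each continuous value with the integer index of its bin
binned = np.searchsorted(edges, feature)
print('Unique raw values: %d' % len(np.unique(feature)))
print('Unique binned values: %d' % len(np.unique(binned)))
```

With far fewer distinct values per feature, there are far fewer candidate split points to evaluate when growing each tree.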

At the time of writing, this is an experimental implementation and requires that you add the following line to your code to enable access to these classes.

from sklearn.experimental import enable_hist_gradient_boosting

Without this line, you will see an error like:

ImportError: cannot import name 'HistGradientBoostingClassifier'

or

ImportError: cannot import name 'HistGradientBoostingRegressor'
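Since scikit-learn 1.0, these classes are no longer experimental and can be imported directly from sklearn.ensemble. If your code needs to run on both older and newer releases, one possible approach is to try the direct import first and fall back to the experimental flag only when it fails, as sketched below.

```python
# import HistGradientBoostingClassifier across scikit-learn versions
try:
    # works directly in scikit-learn 1.0 and later
    from sklearn.ensemble import HistGradientBoostingClassifier
except ImportError:
    # older releases need the experimental flag to be imported first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import make_classification

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# fit the model to confirm the class is usable
model = HistGradientBoostingClassifier()
model.fit(X, y)
print('Training accuracy: %.3f' % model.score(X, y))
```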

Let’s take a close look at how to use this implementation.

The example below first evaluates a HistGradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.935 (0.024)
    Prediction: 1

The example below first evaluates a HistGradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.723 (1.540)
    Prediction: -77.837

XGBoost, which is short for “*Extreme Gradient Boosting*,” is a library that provides an efficient implementation of the gradient boosting algorithm.

The main benefit of the XGBoost implementation is computational efficiency and often better model performance.

For more on the benefits and capability of XGBoost, see the tutorial “A Gentle Introduction to XGBoost for Applied Machine Learning.”

You can install the XGBoost library using the pip Python installer, as follows:

sudo pip install xgboost

For additional installation instructions specific to your platform, see the installation guide in the XGBoost documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check xgboost version
    import xgboost
    print(xgboost.__version__)

Running the example, you should see the following version number or higher.

1.0.1

The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *XGBClassifier* and *XGBRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an XGBClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for classification
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = XGBClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.936 (0.019)
    Prediction: 1

The example below first evaluates an XGBRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for regression
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = XGBRegressor(objective='reg:squarederror')
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBRegressor(objective='reg:squarederror')
    model.fit(X, y)
    # make a single prediction
    row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -15.048 (1.316)
    Prediction: -93.434

LightGBM, short for Light Gradient Boosting Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of LightGBM is a set of changes to the training algorithm that make the process dramatically faster and, in many cases, result in a more effective model.

For more technical details on the LightGBM algorithm, see the paper “LightGBM: A Highly Efficient Gradient Boosting Decision Tree,” 2017.

You can install the LightGBM library using the pip Python installer, as follows:

sudo pip install lightgbm

For additional installation instructions specific to your platform, see the installation guide in the LightGBM documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check lightgbm version
    import lightgbm
    print(lightgbm.__version__)

Running the example, you should see the following version number or higher.

2.3.1

The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *LGBMClassifier* and *LGBMRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an LGBMClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = LGBMClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.934 (0.021)
    Prediction: 1

The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from lightgbm import LGBMRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = LGBMRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.739 (1.408)
    Prediction: -82.040

CatBoost is a third-party library developed at Yandex that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of CatBoost (in addition to computational speed improvements) is support for categorical input variables. This gives the library its name, CatBoost, for “*Category Gradient Boosting*.”

For more technical details on the CatBoost algorithm, see the paper “CatBoost: gradient boosting with categorical features support,” 2017.

You can install the CatBoost library using the pip Python installer, as follows:

sudo pip install catboost

For additional installation instructions specific to your platform, see the installation guide in the CatBoost documentation.

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check catboost version
    import catboost
    print(catboost.__version__)

Running the example, you should see the following version number or higher.

0.21

The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *CatBoostClassifier* and *CatBoostRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a CatBoostClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from catboost import CatBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.931 (0.026)
    Prediction: 1

The example below first evaluates a CatBoostRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from catboost import CatBoostRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -9.281 (0.951)
    Prediction: -74.212

This section provides more resources on the topic if you are looking to go deeper.

- How to Setup Your Python Environment for Machine Learning with Anaconda
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- How to Configure the Gradient Boosting Algorithm
- A Gentle Introduction to XGBoost for Applied Machine Learning

- Stochastic Gradient Boosting, 2002.
- XGBoost: A Scalable Tree Boosting System, 2016.
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
- CatBoost: gradient boosting with categorical features support, 2017.

- Scikit-Learn Homepage.
- sklearn.ensemble API.
- XGBoost Homepage.
- XGBoost Python API.
- LightGBM Project.
- LightGBM Python API.
- CatBoost Homepage.
- CatBoost API.

In this tutorial, you discovered how to use gradient boosting models for classification and regression in Python.

Specifically, you learned:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms including XGBoost, LightGBM and CatBoost.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Comparing 13 Algorithms on 165 Datasets (hint: use Gradient Boosting) appeared first on Machine Learning Mastery.

Which machine learning algorithm should you use? It is a central question in applied machine learning.

A recent paper by Randal Olson and others attempts to answer it and gives you a guide to the algorithms and parameters to try on your problem first, before spot checking a broader suite of algorithms.

In this post, you will discover a study and findings from evaluating many machine learning algorithms across a large number of machine learning datasets and the recommendations made from this study.

After reading this post, you will know:

- That ensemble tree algorithms perform well across a wide range of datasets.
- That it is critical to test a suite of algorithms on a problem as there is no silver bullet algorithm.
- That it is critical to test a suite of configurations for a given algorithm as it can result in as much as a 50% improvement on some problems.

Let’s get started.

In 2017, Randal Olson, et al. released a preprint of a paper with the intriguing title “Data-driven Advice for Applying Machine Learning to Bioinformatics Problems.”

The goal of their work was to address the question that every practitioner faces when getting started on their predictive modeling problem; namely:

**What algorithm should I use?**

The authors describe this problem as choice overload, as follows:

Although having several readily-available ML algorithm implementations is advantageous to bioinformatics researchers seeking to move beyond simple statistics, many researchers experience “choice overload” and find difficulty in selecting the right ML algorithm for their problem at hand.

They approach the problem by running a decent sample of algorithms across a large sample of standard machine learning datasets to see what algorithms and parameters work best in general.

They describe their paper as:

… a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers

It is very similar, both in methodology and findings, to the paper “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” covered in the post “Use Random Forest: Testing 179 Classifiers on 121 Datasets.”

A total of 13 different algorithms were chosen for the study.

Algorithms were chosen to provide a mix of types or underlying assumptions.

The goal was to represent the most common classes of algorithms used in the literature, as well as recent state-of-the-art algorithms

The complete list of algorithms is provided below.

- Gaussian Naive Bayes (GNB)
- Bernoulli Naive Bayes (BNB)
- Multinomial Naive Bayes (MNB)
- Logistic Regression (LR)
- Stochastic Gradient Descent (SGD)
- Passive Aggressive Classifier (PAC)
- Support Vector Classifier (SVC)
- K-Nearest Neighbor (KNN)
- Decision Tree (DT)
- Random Forest (RF)
- Extra Trees Classifier (ERF)
- AdaBoost (AB)
- Gradient Tree Boosting (GTB)

The scikit-learn library was used for the implementations of these algorithms.

Each algorithm has zero or more parameters, and a grid search across sensible parameter values was performed for each algorithm.

For each algorithm, the hyperparameters were tuned using a fixed grid search.
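To illustrate what a fixed grid search looks like in scikit-learn, here is a minimal sketch using GridSearchCV with gradient boosting on a small synthetic dataset. The grid values below are hypothetical placeholders chosen to keep the run short; they are not the grid used in the paper.

```python
# sketch of a fixed grid search over gradient boosting hyperparameters
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic dataset to keep the search quick
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# hypothetical grid for illustration only; not the grid used in the paper
grid = {
    'n_estimators': [10, 50],
    'learning_rate': [0.01, 0.1],
    'max_depth': [1, 3],
}
# every combination in the grid is evaluated with cross-validation
search = GridSearchCV(GradientBoostingClassifier(random_state=1), grid, scoring='balanced_accuracy', cv=3)
search.fit(X, y)
print('Best config: %s' % search.best_params_)
print('Best balanced accuracy: %.3f' % search.best_score_)
```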

A table of algorithms and the hyperparameters evaluated is listed below, taken from the paper.

Algorithms were evaluated using 10-fold cross-validation and the balanced accuracy measure.

Cross-validation was not repeated, perhaps introducing some statistical noise into the results.
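Balanced accuracy is the average of the recall achieved on each class, which makes it less flattering than plain accuracy when classes are imbalanced. The sketch below compares the two under a single run of 10-fold cross-validation on a synthetic imbalanced dataset. The use of stratified folds and the 90/10 class split are assumptions for this example, not details taken from the paper.

```python
# compare accuracy and balanced accuracy under 10-fold cross-validation
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

# synthetic dataset with roughly a 90/10 class split (an assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, weights=[0.9, 0.1], random_state=1)
# a single, non-repeated run of 10-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
model = GradientBoostingClassifier()
acc = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
bal = cross_val_score(model, X, y, scoring='balanced_accuracy', cv=cv)
print('Accuracy: %.3f' % mean(acc))
print('Balanced accuracy: %.3f' % mean(bal))
```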

A total of 165 standard machine learning problems were selected for the study.

Many of the problems were drawn from the field of bioinformatics, although not all datasets belong to this field of study.

All prediction problems were classification type problems with two or more classes.

The algorithms were compared on 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB). […] PMLB is a collection of publicly available classification problems that have been standardized to the same format and collected in a central location with easy access via Python.

The datasets were drawn from the Penn Machine Learning Benchmark (PMLB) collection, which is a project that provides standard machine learning datasets in a uniform format and made available by a simple Python API. You can learn more about this dataset catalog on the GitHub Project:

All datasets were standardized prior to fitting models.

Prior to evaluating each ML algorithm, we scaled the features of every dataset by subtracting the mean and scaling the features to unit variance.

Other data preparation was not performed, nor feature selection or feature engineering.
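The standardization described in the quote can be sketched for a single feature as follows. This is only an illustration using the standard library; the `standardize` helper and the values are hypothetical, and it assumes the feature is not constant (a zero standard deviation would cause a division by zero).

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumed to be non-zero
    return [(v - mu) / sigma for v in values]


# hypothetical feature column
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(feature)
print(scaled)
print("mean:", mean(scaled), "std:", pstdev(scaled))
```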

The large number of experiments performed resulted in a lot of skill scores to analyze.

The analysis of the results was handled well, asking interesting questions and providing findings in the form of easy-to-understand charts.

The entire experimental design consisted of over 5.5 million ML algorithm and parameter evaluations in total, resulting in a rich set of data that is analyzed from several viewpoints…

Algorithm performance was ranked for each dataset, then the average rank of each algorithm was calculated.

This provided a rough and easy to understand idea of which algorithms performed well or not, on average.
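The ranking procedure can be sketched in a few lines. The scores below are invented purely for illustration, the helper names are hypothetical, and the sketch assumes every algorithm was evaluated on every dataset; tied algorithms share the mean of the positions they occupy.

```python
def rank_algorithms(scores):
    """Rank algorithms on one dataset: 1 is best (highest score), ties share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # extend j over all algorithms tied with the one at position i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def average_ranks(results):
    """Average each algorithm's rank over all datasets."""
    totals = {}
    for scores in results.values():
        for name, rank in rank_algorithms(scores).items():
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(results) for name, total in totals.items()}


# made-up balanced accuracy scores for three algorithms on three datasets
results = {
    "dataset_a": {"GTB": 0.90, "RF": 0.88, "MNB": 0.60},
    "dataset_b": {"GTB": 0.80, "RF": 0.85, "MNB": 0.70},
    "dataset_c": {"GTB": 0.75, "RF": 0.75, "MNB": 0.76},
}
avg = average_ranks(results)
for name in sorted(avg, key=avg.get):
    print(name, round(avg[name], 3))
```

Note that the algorithm with the worst average rank still wins on one of the toy datasets, which mirrors the point the paper makes about no single algorithm being best everywhere.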

The results showed that gradient boosting and random forest had the lowest average ranks (performed best) and that the Naive Bayes approaches had the highest average ranks (performed worst).

The post-hoc test underlines the impressive performance of Gradient Tree Boosting, which significantly outperforms every algorithm except Random Forest at the p < 0.01 level.

This is demonstrated with a nice chart, taken from the paper.

No single algorithm performs best or worst.

This is wisdom known to machine learning practitioners, but difficult to grasp for beginners in the field.

There is no silver bullet and you must test a suite of algorithms on a given dataset to see what works best.

… it is worth noting that no one ML algorithm performs best across all 165 datasets. For example, there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting, despite being the overall worst- and best-ranked algorithms, respectively. Therefore, it is still important to consider different ML algorithms when applying ML to new datasets.

Further, picking the right algorithm is not enough. You must also pick the right configuration of the algorithm for your dataset.

… both selecting the right ML algorithm and tuning its parameters is vitally important for most problems.

The results found that tuning lifted the accuracy of a method by anywhere from 3% to 50%, depending on the algorithm and the dataset.

The results demonstrate why it is unwise to use default ML algorithm hyperparameters: tuning often improves an algorithm’s accuracy by 3-5%, depending on the algorithm. In some cases, parameter tuning led to CV accuracy improvements of 50%.

This is demonstrated with a chart from the paper that shows the spread of improvement offered by parameter tuning on each algorithm.

Not all algorithms are required.

The results found that five algorithms and specific parameters achieved top 1% in performance across 106 of the 165 tested datasets.

These five algorithms are recommended as a starting point for spot checking algorithms on a given dataset in bioinformatics, but I would suggest also more generally:

- Gradient Boosting
- Random Forest
- Support Vector Classifier
- Extra Trees
- Logistic Regression

The paper provides a table of these algorithms, including the recommended parameter settings and the number of datasets covered, that is, where the algorithm and configuration achieved top 1% performance.

There are two big findings from this paper that are valuable for practitioners, especially those starting out or those who are under pressure to get a result on their own predictive modeling problem.

If in doubt or under time pressure, use ensemble tree algorithms such as gradient boosting and random forest on your dataset.

The analysis demonstrates the strength of state-of-the-art, tree-based ensemble algorithms, while also showing the problem-dependent nature of ML algorithm performance.

No one can look at your problem and tell you what algorithm to use, and there is no silver bullet algorithm.

You must test a suite of algorithms and a suite of parameters for each algorithm to see what works best for your specific problem.

In addition, the analysis shows that selecting the right ML algorithm and thoroughly tuning its parameters can lead to a significant improvement in predictive accuracy on most problems, and is a critical step in every ML application.

I talk about this all the time; for example, see the post:

This section provides more resources on the topic if you are looking to go deeper.

- Data-driven Advice for Applying Machine Learning to Bioinformatics Problems
- scikit-learn benchmarks on GitHub
- Penn Machine Learning Benchmarks
- Quantitative comparison of scikit-learn’s predictive models on a large number of machine learning datasets: A good start
- Use Random Forest: Testing 179 Classifiers on 121 Datasets

In this post, you discovered a study and findings from evaluating many machine learning algorithms across a large number of machine learning datasets.

Specifically, you learned:

- That ensemble tree algorithms perform well across a wide range of datasets.
- That it is critical to test a suite of algorithms on a problem as there is no silver bullet algorithm.
- That it is critical to test a suite of configurations for a given algorithm as it can result in as much as a 50% improvement on some problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Comparing 13 Algorithms on 165 Datasets (hint: use Gradient Boosting) appeared first on Machine Learning Mastery.

The post How to Install XGBoost for Python on macOS appeared first on Machine Learning Mastery.

XGBoost is a library at the center of many winning solutions in Kaggle data science competitions.

In this tutorial, you will discover how to install the XGBoost library for Python on macOS.

Discover how to configure, fit, tune and evaluate gradient boosting models with XGBoost in my new book, with 15 step-by-step tutorial lessons, and full Python code.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Install MacPorts
- Build XGBoost
- Install XGBoost

**Note**: I have used this procedure for years on a range of different macOS versions and it has not changed. This tutorial was written and tested on macOS High Sierra (10.13.1).

You need GCC and a Python environment installed in order to build and install XGBoost for Python.

I recommend GCC 7 and Python 3.6 and I recommend installing these prerequisites using MacPorts.

- 1. For help installing MacPorts and a Python environment step-by-step, see this tutorial:

>> How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning

- 2. After MacPorts and a working Python environment are installed, you can install and select GCC 7 as follows:

    sudo port install gcc7
    sudo port select --set gcc mp-gcc7

- 3. Confirm your GCC installation was successful as follows:

gcc -v

You should see the version of GCC printed; for example:

    ...
    gcc version 7.2.0 (MacPorts gcc7 7.2.0_0)

What version did you see?

Let me know in the comments below.

The next step is to download and compile XGBoost for your system.

- 1. First, check out the code repository from GitHub:

git clone --recursive https://github.com/dmlc/xgboost

- 2. Change into the xgboost directory.

cd xgboost/

- 3. Copy the configuration we intend to use to compile XGBoost into position.

cp make/config.mk ./config.mk

- 4. Compile XGBoost; the **-j** flag sets the number of parallel build jobs, typically the number of cores on your system (e.g. 8, change as needed).

make -j8

The build process may take a minute and should not produce any error messages, although you may see some warnings that you can safely ignore.

For example, the last snippet of the compilation might look as follows:

    ...
    a - build/learner.o
    a - build/logging.o
    a - build/c_api/c_api.o
    a - build/c_api/c_api_error.o
    a - build/common/common.o
    a - build/common/hist_util.o
    a - build/data/data.o
    a - build/data/simple_csr_source.o
    a - build/data/simple_dmatrix.o
    a - build/data/sparse_page_dmatrix.o
    a - build/data/sparse_page_raw_format.o
    a - build/data/sparse_page_source.o
    a - build/data/sparse_page_writer.o
    a - build/gbm/gblinear.o
    a - build/gbm/gbm.o
    a - build/gbm/gbtree.o
    a - build/metric/elementwise_metric.o
    a - build/metric/metric.o
    a - build/metric/multiclass_metric.o
    a - build/metric/rank_metric.o
    a - build/objective/multiclass_obj.o
    a - build/objective/objective.o
    a - build/objective/rank_obj.o
    a - build/objective/regression_obj.o
    a - build/predictor/cpu_predictor.o
    a - build/predictor/predictor.o
    a - build/tree/tree_model.o
    a - build/tree/tree_updater.o
    a - build/tree/updater_colmaker.o
    a - build/tree/updater_fast_hist.o
    a - build/tree/updater_histmaker.o
    a - build/tree/updater_prune.o
    a - build/tree/updater_refresh.o
    a - build/tree/updater_skmaker.o
    a - build/tree/updater_sync.o
    c++ -std=c++11 -Wall -Wno-unknown-pragmas -Iinclude -Idmlc-core/include -Irabit/include -I/include -O3 -funroll-loops -msse2 -fPIC -fopenmp -o xgboost build/cli_main.o build/learner.o build/logging.o build/c_api/c_api.o build/c_api/c_api_error.o build/common/common.o build/common/hist_util.o build/data/data.o build/data/simple_csr_source.o build/data/simple_dmatrix.o build/data/sparse_page_dmatrix.o build/data/sparse_page_raw_format.o build/data/sparse_page_source.o build/data/sparse_page_writer.o build/gbm/gblinear.o build/gbm/gbm.o build/gbm/gbtree.o build/metric/elementwise_metric.o build/metric/metric.o build/metric/multiclass_metric.o build/metric/rank_metric.o build/objective/multiclass_obj.o build/objective/objective.o build/objective/rank_obj.o build/objective/regression_obj.o build/predictor/cpu_predictor.o build/predictor/predictor.o build/tree/tree_model.o build/tree/tree_updater.o build/tree/updater_colmaker.o build/tree/updater_fast_hist.o build/tree/updater_histmaker.o build/tree/updater_prune.o build/tree/updater_refresh.o build/tree/updater_skmaker.o build/tree/updater_sync.o dmlc-core/libdmlc.a rabit/lib/librabit.a -pthread -lm -fopenmp

Did this step work for you?

Let me know in the comments below.

You are now ready to install XGBoost on your system.

- 1. Change directory into the Python package of the xgboost project.

cd python-package

- 2. Install the Python XGBoost package.

sudo python setup.py install

The installation is very fast.

For example, at the end of the installation, you may see messages like the following:

    ...
    Installed /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/xgboost-0.6-py3.6.egg
    Processing dependencies for xgboost==0.6
    Searching for scipy==1.0.0
    Best match: scipy 1.0.0
    Adding scipy 1.0.0 to easy-install.pth file
    Using /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
    Searching for numpy==1.13.3
    Best match: numpy 1.13.3
    Adding numpy 1.13.3 to easy-install.pth file
    Using /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
    Finished processing dependencies for xgboost==0.6

- 3. Confirm that the installation was successful by printing the xgboost version, which requires the library to be loaded.

Save the following code to a file called *version.py.*

    import xgboost
    print("xgboost", xgboost.__version__)

Run the script from the command line:

python version.py

You should see the XGBoost version printed to screen:

xgboost 0.6

How did you do?

Post your results in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning
- MacPorts Installation Guide
- XGBoost Installation Guide

In this tutorial, you discovered how to install XGBoost for Python on macOS step-by-step.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post 7 Step Mini-Course to Get Started with XGBoost in Python appeared first on Machine Learning Mastery.

XGBoost is an implementation of gradient boosting that is being used to win machine learning competitions.

It is powerful but it can be hard to get started.

In this post, you will discover a 7-part crash course on XGBoost with Python.

This mini-course is designed for Python machine learning practitioners that are already comfortable with scikit-learn and the SciPy ecosystem.

Discover how to configure, fit, tune and evaluate gradient boosting models with XGBoost in my new book, with 15 step-by-step tutorial lessons, and full Python code.

Let’s get started.

- **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.
- **Update Mar/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

(**Tip**: *you might want to print or bookmark this page so that you can refer back to it later*.)

Before we get started, let’s make sure you are in the right place. The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly, you might just need to brush up in one area or another to keep up.

- **Developers that know how to write a little code**. This means that it is not a big deal for you to get things done with Python and you know how to set up the SciPy ecosystem on your workstation (a prerequisite). It does not mean you’re a wizard coder, but it does mean you’re not afraid to install packages and write scripts.
- **Developers that know a little machine learning**. This means you know about the basics of machine learning like cross validation, some algorithms and the bias-variance trade-off. It does not mean that you are a machine learning PhD, just that you know the landmarks or know where to look them up.

This mini-course is not a textbook on XGBoost. There will be no equations.

It will take you from a developer that knows a little machine learning in Python to a developer who can get results and bring the power of XGBoost to your own projects.


This mini-course is divided into 7 parts.

Each lesson was designed to take the average developer about 30 minutes. You might finish some much sooner, and for others you may choose to go deeper and spend more time.

You can complete each part as quickly or as slowly as you like. A comfortable schedule may be to complete one lesson per day over a one week period. Highly recommended.

The topics you will cover over the next 7 lessons are as follows:

- **Lesson 01**: Introduction to Gradient Boosting.
- **Lesson 02**: Introduction to XGBoost.
- **Lesson 03**: Develop Your First XGBoost Model.
- **Lesson 04**: Monitor Performance and Early Stopping.
- **Lesson 05**: Feature Importance with XGBoost.
- **Lesson 06**: How to Configure Gradient Boosting.
- **Lesson 07**: XGBoost Hyperparameter Tuning.

This is going to be a lot of fun.

You’re going to have to do some work though: a little reading, a little research and a little programming. You want to learn about XGBoost, right?

(**Tip**: *Help with these lessons can be found on this blog; use the search feature*.)

Any questions at all, please post in the comments below.

Share your results in the comments.

Hang in there, don’t give up!

Gradient boosting is one of the most powerful techniques for building predictive models.

Boosting grew out of the question of whether a weak learner can be modified to become better. The first realization of boosting that saw great success in application was Adaptive Boosting, or AdaBoost for short. The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.

AdaBoost and related algorithms were recast in a statistical framework and became known as Gradient Boosting Machines. The statistical framework cast boosting as a numerical optimization problem where the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure, hence the name.

The Gradient Boosting algorithm involves three elements:

- **A loss function to be optimized**, such as cross entropy for classification or mean squared error for regression problems.
- **A weak learner to make predictions**, such as a greedily constructed decision tree.
- **An additive model**, used to add weak learners to minimize the loss function.

New weak learners are added to the model in an effort to correct the residual errors of all previous trees. The result is a powerful predictive modeling algorithm, perhaps more powerful than random forest.
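The three elements can be seen working together in a small from-scratch sketch. It is a toy illustration, not how XGBoost is implemented: it assumes squared error loss (whose negative gradient is the residual), uses one-split decision stumps on a single input as the weak learner, and all names and data are made up.

```python
def fit_stump(xs, residuals):
    """Weak learner: greedily pick the one threshold that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on each side
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosted_stumps(xs, ys, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then add stumps fit to the residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # for squared error loss the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict


def mean_squared_error(predict, xs, ys):
    """Loss function being minimized."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# made-up training data for a single input variable
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

mean_y = sum(ys) / len(ys)
baseline_mse = mean_squared_error(lambda x: mean_y, xs, ys)
boosted_mse = mean_squared_error(fit_boosted_stumps(xs, ys), xs, ys)
print("predict-the-mean MSE:", baseline_mse)
print("boosted stumps MSE:  ", boosted_mse)
```

Each added stump only has to correct what the current ensemble still gets wrong, which is why many very weak learners can add up to a strong model.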

In the next lesson we will take a closer look at the XGBoost implementation of gradient boosting.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

XGBoost stands for e**X**treme **G**radient **Boosti**ng.

It was developed by Tianqi Chen and is laser focused on computational speed and model performance; as such, there are few frills.

In addition to supporting all key variations of the technique, the real interest is the speed provided by the careful engineering of the implementation, including:

- **Parallelization** of tree construction using all of your CPU cores during training.
- **Distributed Computing** for training very large models using a cluster of machines.
- **Out-of-Core Computing** for very large datasets that don’t fit into memory.
- **Cache Optimization** of data structures and algorithms to make best use of hardware.

Traditionally, gradient boosting implementations are slow because of the sequential nature in which each tree must be constructed and added to the model.

The focus on performance in the development of XGBoost has resulted in one of the best predictive modeling algorithms that can now harness the full capability of your hardware platform, or very large computers you might rent in the cloud.

As such, XGBoost has been a cornerstone in competitive machine learning, being the technique used to win and recommended by winners. For example, here is what some recent Kaggle competition winners have said:

As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.

When in doubt, use xgboost.

In the next lesson, we will develop our first XGBoost model in Python.

Assuming you have a working SciPy environment, XGBoost can be installed easily using pip.

For example:

sudo pip install xgboost

You can learn more about installing and building XGBoost on your platform in the XGBoost Installation Instructions.

XGBoost models can be used directly in the scikit-learn framework using the wrapper classes, **XGBClassifier** for classification and **XGBRegressor** for regression problems.

This is the recommended way to use XGBoost in Python.

Download the Pima Indians onset of diabetes dataset.

It is a good test dataset for binary classification as all input variables are numeric, meaning the problem can be modeled directly with no data preparation.

We can train an XGBoost model for classification by constructing it and calling the **model.fit()** function:

    model = XGBClassifier()
    model.fit(X_train, y_train)

This model can then be used to make predictions by calling the **model.predict()** function on new data.

y_pred = model.predict(X_test)

We can tie this all together as follows:

    # First XGBoost model for Pima Indians dataset
    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # split data into train and test sets
    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    # fit model on training data
    model = XGBClassifier()
    model.fit(X_train, y_train)
    # make predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]
    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

In the next lesson we will look at how we can use early stopping to limit overfitting.

The XGBoost model can evaluate and report on the performance on a test set for the model during training.

It supports this capability by specifying both a test dataset and an evaluation metric on the call to **model.fit()** when training the model and specifying verbose output (**verbose=True**).

For example, we can report on the binary classification error rate (**error**) on a standalone test set (**eval_set**) while training an XGBoost model as follows:

    eval_set = [(X_test, y_test)]
    model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)

Running a model with this configuration will report the performance of the model after each tree is added. For example:

    ...
    [89] validation_0-error:0.204724
    [90] validation_0-error:0.208661

We can use this evaluation to stop training once no further improvements have been made to the model.

We can do this by setting the **early_stopping_rounds** parameter when calling **model.fit()** to the number of consecutive iterations without improvement on the validation dataset after which training is stopped.
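The stopping rule itself is simple enough to sketch without any library. The function below is a hypothetical illustration of the general idea, not XGBoost's internal bookkeeping: it takes a made-up sequence of per-round validation losses (lower is better) and a patience value playing the role of **early_stopping_rounds**.

```python
def early_stopping_round(validation_losses, patience):
    """Return (best_round, stop_round) for per-round validation losses, lower is better."""
    best_round = 0
    best_loss = validation_losses[0]
    for current_round, loss in enumerate(validation_losses):
        if loss < best_loss:
            # new best score, so the patience counter effectively resets
            best_loss = loss
            best_round = current_round
        elif current_round - best_round >= patience:
            # no improvement for `patience` rounds in a row: stop here
            return best_round, current_round
    # ran out of rounds before the patience was exhausted
    return best_round, len(validation_losses) - 1


# made-up validation losses, one per boosting round
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47, 0.48]
best_round, stop_round = early_stopping_round(losses, patience=3)
print("best round:", best_round, "stopped at round:", stop_round)
```

With a patience of 3 the small bump at rounds 3 and 4 is tolerated and the better score at round 5 is found; with a patience of 2 training would stop at round 4 and miss it.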

The full example using the Pima Indians Onset of Diabetes dataset is provided below.

    # example of early stopping
    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # split data into train and test sets
    seed = 7
    test_size = 0.33
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
    # fit model on training data
    model = XGBClassifier()
    eval_set = [(X_test, y_test)]
    model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
    # make predictions for test data
    y_pred = model.predict(X_test)
    predictions = [round(value) for value in y_pred]
    # evaluate predictions
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

In the next lesson, we will look at how we calculate the importance of features using XGBoost.

A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the **feature_importances_** member variable of the trained model. For example, they can be printed directly as follows:

print(model.feature_importances_)

The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called **plot_importance()** and can be used as follows:

    plot_importance(model)
    pyplot.show()

These importance scores can help you decide what input variables to keep or discard. They can also be used as the basis for automatic feature selection techniques.

The full example of plotting feature importance scores using the Pima Indians Onset of Diabetes dataset is provided below.

    # plot feature importance using built-in function
    from numpy import loadtxt
    from xgboost import XGBClassifier
    from xgboost import plot_importance
    from matplotlib import pyplot
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    y = dataset[:,8]
    # fit model on training data
    model = XGBClassifier()
    model.fit(X, y)
    # plot feature importance
    plot_importance(model)
    pyplot.show()

In the next lesson we will look at heuristics for best configuring the gradient boosting algorithm.

Gradient boosting is one of the most powerful techniques for applied machine learning and as such is quickly becoming one of the most popular.

But how do you configure gradient boosting on your problem?

A number of configuration heuristics were published in the original gradient boosting papers. They can be summarized as:

- Learning rate or shrinkage (**learning_rate** in XGBoost) should be set to 0.1 or lower, and smaller values will require the addition of more trees.
- The depth of trees (**max_depth** in XGBoost) should be configured in the range of 2-to-8, where not much benefit is seen with deeper trees.
- Row sampling (**subsample** in XGBoost) should be configured in the range of 30% to 80% of the training dataset, and compared to a value of 100% for no sampling.

These are good starting points when configuring your model.

A good general configuration strategy is as follows:

- Run the default configuration and review plots of the learning curves on the training and validation datasets.
- If the system is overlearning, decrease the learning rate and/or increase the number of trees.
- If the system is underlearning, speed the learning up to be more aggressive by increasing the learning rate and/or decreasing the number of trees.

Owen Zhang, the former #1 ranked competitor on Kaggle and now CTO at DataRobot, proposes an interesting strategy to configure XGBoost.

He suggests setting the number of trees to a target value such as 100 or 1000, then tuning the learning rate to find the best model. This is an efficient strategy for quickly finding a good model.

In the next and final lesson, we will look at an example of tuning the XGBoost hyperparameters.

The scikit-learn framework provides the capability to search combinations of parameters.

This capability is provided in the **GridSearchCV** class and can be used to discover the best way to configure the model for top performance on your problem.

For example, we can define a grid of the number of trees (**n_estimators**) and tree sizes (**max_depth**) to evaluate by defining a grid as:

    n_estimators = [50, 100, 150, 200]
    max_depth = [2, 4, 6, 8]
    param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

And then evaluate each combination of parameters using 10-fold cross validation as:

    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
    result = grid_search.fit(X, label_encoded_y)

We can then review the results to determine the best combination and the general trends in varying the combinations of parameters.
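It helps to know how much work a grid implies before running it. The sketch below expands the same grid by hand with the standard library to count the combinations; `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from itertools import product

n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)


def expand_grid(grid):
    """List every combination of parameter values as a dict."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
n_splits = 10  # folds used in cross validation

print("combinations:", len(candidates))
print("model fits during cross validation:", len(candidates) * n_splits)
print("first candidate:", candidates[0])
```

The cost grows multiplicatively with each parameter added to the grid, so it is usually worth tuning a small number of related parameters at a time.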

This is the best practice when applying XGBoost to your own problems. The parameters to consider tuning are:

- The number and size of trees (**n_estimators** and **max_depth**).
- The learning rate and number of trees (**learning_rate** and **n_estimators**).
- The row and column subsampling rates (**subsample**, **colsample_bytree** and **colsample_bylevel**).

Below is a full example of tuning just the **learning_rate** on the Pima Indians Onset of Diabetes dataset.

    # Tune learning_rate
    from numpy import loadtxt
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    # load data
    dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
    # split data into X and y
    X = dataset[:,0:8]
    Y = dataset[:,8]
    # grid search
    model = XGBClassifier()
    learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
    param_grid = dict(learning_rate=learning_rate)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, Y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Congratulations, you made it. Well done!

Take a moment and look back at how far you have come:

- You learned about the gradient boosting algorithm and the XGBoost library.
- You developed your first XGBoost model.
- You learned how to use advanced features like early stopping and feature importance.
- You learned how to configure gradient boosted models and how to design controlled experiments to tune XGBoost hyperparameters.

Don’t make light of this; you have come a long way in a short amount of time. This is just the beginning of your journey with XGBoost in Python. Keep practicing and developing your skills.

Did you enjoy this mini-course? Do you have any questions or sticking points?

Leave a comment and let me know.


The post Stochastic Gradient Boosting with XGBoost and scikit-learn in Python appeared first on Machine Learning Mastery.

Taking subsets of the rows in the training data to train individual trees is called bagging. When subsets of the columns (features) are also taken when calculating each split point, this is called random forest.

These techniques can also be used in the gradient tree boosting model in a technique called stochastic gradient boosting.

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.

After reading this post you will know:

- The rationale behind training trees on subsamples of data and how this can be used in gradient boosting.
- How to tune row-based subsampling in XGBoost using scikit-learn.
- How to tune column-based subsampling by both tree and split-point in XGBoost.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.


Gradient boosting is a greedy procedure.

New decision trees are added to the model to correct the residual error of the existing model.

Each decision tree is created using a greedy search procedure to select split points that best minimize an objective function. This can result in trees that use the same attributes and even the same split points again and again.

Bagging is a technique where a collection of decision trees is created, each from a different random subset of rows from the training data. The effect is that better performance is achieved from the ensemble of trees because the randomness in the sample allows slightly different trees to be created, adding variance to the ensembled predictions.

Random forest takes this one step further, by allowing the features (columns) to be subsampled when choosing split points, adding further variance to the ensemble of trees.

These same techniques can be used in the construction of decision trees in gradient boosting in a variation called stochastic gradient boosting.

It is common to use aggressive sub-samples of the training data such as 40% to 80%.

In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.

We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

- Subsampling of rows in the dataset when creating each tree.
- Subsampling of columns in the dataset when creating each tree.
- Subsampling of columns for each split in the dataset when creating each tree.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).
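To make the evaluation metric concrete, here is a minimal pure-Python sketch of multiclass logarithmic loss. The function name, the clipping constant `eps` and the toy probabilities are assumptions made for illustration; they are not taken from the competition or from any library.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # clip the probability so that log(0) can never occur
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(y_true)

# three toy examples with three classes (made-up numbers, not Otto data)
y_true = [0, 2, 1]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
loss = multiclass_log_loss(y_true, y_prob)
print(round(loss, 4))
```

A confident, correct prediction contributes almost nothing to the loss, while a confident, wrong prediction is penalized heavily.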

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).
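The one preparation step mentioned above, encoding string class values as integers, can be sketched without any libraries. This is only an illustration of the idea (sorted unique labels mapped to 0, 1, 2, ...); the helper name and the example labels are assumptions.

```python
def encode_labels(labels):
    # sorted unique class names get consecutive integer codes
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

# example labels (assumed format, for illustration only)
y = ["Class_2", "Class_1", "Class_9", "Class_1"]
encoded, classes = encode_labels(y)
print(encoded)
print(classes)
```

Keeping the `classes` list around makes it possible to map integer predictions back to the original class names.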

Row subsampling involves selecting a random sample of the training dataset without replacement.

Row subsampling can be specified in the scikit-learn wrapper class for XGBoost via the **subsample** parameter. The default is 1.0, which means no subsampling.
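As a rough sketch of the idea (not XGBoost's actual implementation), row subsampling without replacement can be pictured as drawing a fresh set of row indices for every tree. The helper name `sample_rows` and the tiny dataset size are assumptions for illustration.

```python
import random

def sample_rows(n_rows, subsample, rng):
    # number of rows to keep, at least one
    k = max(1, int(n_rows * subsample))
    # random.sample draws without replacement, so no row repeats
    return sorted(rng.sample(range(n_rows), k))

rng = random.Random(7)
for tree in range(3):
    rows = sample_rows(10, 0.4, rng)
    print(tree, rows)
```

Each tree sees a different 40% of the rows, which is what lets successive trees differ from one another.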

We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

There are 9 variations of subsample and each model will be evaluated using 10-fold cross validation, meaning that 9×10 or 90 models need to be trained and tested.

The complete code listing is provided below.

    # XGBoost on Otto dataset, tune subsample
    from pandas import read_csv
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot
    # load data
    data = read_csv('train.csv')
    dataset = data.values
    # split data into X and y
    # note: this slice keeps column 0 (the id column) alongside the 93 features
    X = dataset[:,0:94]
    y = dataset[:,94]
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier()
    subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
    param_grid = dict(subsample=subsample)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(subsample, means, yerr=stds)
    pyplot.title("XGBoost subsample vs Log Loss")
    pyplot.xlabel('subsample')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('subsample.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best result was achieved with a **subsample** value of 0.3, that is, training each tree on a 30% sample of the training dataset.

    Best: -0.000647 using {'subsample': 0.3}
    -0.001156 (0.000286) with: {'subsample': 0.1}
    -0.000765 (0.000430) with: {'subsample': 0.2}
    -0.000647 (0.000471) with: {'subsample': 0.3}
    -0.000659 (0.000635) with: {'subsample': 0.4}
    -0.000717 (0.000849) with: {'subsample': 0.5}
    -0.000773 (0.000998) with: {'subsample': 0.6}
    -0.000877 (0.001179) with: {'subsample': 0.7}
    -0.001007 (0.001371) with: {'subsample': 0.8}
    -0.001239 (0.001730) with: {'subsample': 1.0}
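The scores are negative because the `neg_log_loss` scorer reports the negated log loss so that larger values are better. A small sketch of how the best configuration falls out of the mean scores printed above (the dictionary is simply those values retyped by hand):

```python
# mean neg_log_loss per subsample value, copied from the output above
results = {
    0.1: -0.001156, 0.2: -0.000765, 0.3: -0.000647,
    0.4: -0.000659, 0.5: -0.000717, 0.6: -0.000773,
    0.7: -0.000877, 0.8: -0.001007, 1.0: -0.001239,
}

# the best score is the maximum, i.e. the value closest to zero
best = max(results, key=results.get)
# flip the sign to recover the ordinary (positive) log loss
best_log_loss = -results[best]
print(best, best_log_loss)
```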

We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.

We can see that 30% does indeed have the best mean performance, but we can also see that as the ratio increases, the variance in performance grows quite markedly.

It is interesting to note that the mean performance of all **subsample** values outperforms the mean performance without subsampling (**subsample=1.0**).

We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.

In the XGBoost wrapper for scikit-learn, this is controlled by the **colsample_bytree** parameter.

The default value is 1.0, meaning that all columns are used in each decision tree. We can evaluate values for **colsample_bytree** between 0.1 and 1.0 in steps of 0.1 (the list below omits 0.9).

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

The full code listing is provided below.

    # XGBoost on Otto dataset, tune colsample_bytree
    from pandas import read_csv
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot
    # load data
    data = read_csv('train.csv')
    dataset = data.values
    # split data into X and y
    # note: this slice keeps column 0 (the id column) alongside the 93 features
    X = dataset[:,0:94]
    y = dataset[:,94]
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier()
    colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
    param_grid = dict(colsample_bytree=colsample_bytree)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(colsample_bytree, means, yerr=stds)
    pyplot.title("XGBoost colsample_bytree vs Log Loss")
    pyplot.xlabel('colsample_bytree')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('colsample_bytree.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best performance for the model was **colsample_bytree=1.0**. This suggests that subsampling columns on this problem does not add value.

    Best: -0.001239 using {'colsample_bytree': 1.0}
    -0.298955 (0.002177) with: {'colsample_bytree': 0.1}
    -0.092441 (0.000798) with: {'colsample_bytree': 0.2}
    -0.029993 (0.000459) with: {'colsample_bytree': 0.3}
    -0.010435 (0.000669) with: {'colsample_bytree': 0.4}
    -0.004176 (0.000916) with: {'colsample_bytree': 0.5}
    -0.002614 (0.001062) with: {'colsample_bytree': 0.6}
    -0.001694 (0.001221) with: {'colsample_bytree': 0.7}
    -0.001306 (0.001435) with: {'colsample_bytree': 0.8}
    -0.001239 (0.001730) with: {'colsample_bytree': 1.0}

Plotting the results, we can see the performance of the model plateau (at least at this scale) for values between 0.5 and 1.0.

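Rather than subsample the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest.

We can set the size of the sample of columns used at each split with the **colsample_bylevel** parameter in the XGBoost wrapper classes for scikit-learn. Strictly speaking, XGBoost draws this sample once for each new depth level of a tree, so splits at the same level share it; recent versions also provide **colsample_bynode** to resample at every split.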

As before, we will vary the ratio from 10% to the default of 100%.
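The difference between per-tree and per-level column sampling can be sketched in plain Python. This is an illustration under stated assumptions, not XGBoost's code: the helper `sample_columns`, the 10 toy columns and the ratios are made up, and the per-level sample is drawn from the columns already chosen for the tree, which is how the library's documentation describes the parameters combining.

```python
import random

def sample_columns(columns, ratio, rng):
    # keep a fraction of the given columns, at least one
    k = max(1, int(len(columns) * ratio))
    return sorted(rng.sample(columns, k))

rng = random.Random(7)
all_columns = list(range(10))

# per-tree: one sample, fixed for the whole tree
tree_columns = sample_columns(all_columns, 0.5, rng)

# per-level: a fresh sample at each depth, drawn from the tree's columns
level_columns = [sample_columns(tree_columns, 0.6, rng) for depth in range(3)]

print(tree_columns)
for depth, cols in enumerate(level_columns):
    print(depth, cols)
```

Resampling at each level means a feature excluded near the root can still be used deeper in the same tree.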

The full code listing is provided below.

    # XGBoost on Otto dataset, tune colsample_bylevel
    from pandas import read_csv
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot
    # load data
    data = read_csv('train.csv')
    dataset = data.values
    # split data into X and y
    # note: this slice keeps column 0 (the id column) alongside the 93 features
    X = dataset[:,0:94]
    y = dataset[:,94]
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier()
    colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
    param_grid = dict(colsample_bylevel=colsample_bylevel)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(colsample_bylevel, means, yerr=stds)
    pyplot.title("XGBoost colsample_bylevel vs Log Loss")
    pyplot.xlabel('colsample_bylevel')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('colsample_bylevel.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

We can see that the best results were achieved by setting **colsample_bylevel** to 70%, resulting in an (inverted) log loss of -0.001062, which is better than -0.001239 seen when setting the per-tree column sampling to 100%.

This suggests not giving up on column subsampling if the per-tree results favor using 100% of columns, and instead trying per-split column subsampling.

    Best: -0.001062 using {'colsample_bylevel': 0.7}
    -0.159455 (0.007028) with: {'colsample_bylevel': 0.1}
    -0.034391 (0.003533) with: {'colsample_bylevel': 0.2}
    -0.007619 (0.000451) with: {'colsample_bylevel': 0.3}
    -0.002982 (0.000726) with: {'colsample_bylevel': 0.4}
    -0.001410 (0.000946) with: {'colsample_bylevel': 0.5}
    -0.001182 (0.001144) with: {'colsample_bylevel': 0.6}
    -0.001062 (0.001221) with: {'colsample_bylevel': 0.7}
    -0.001071 (0.001427) with: {'colsample_bylevel': 0.8}
    -0.001239 (0.001730) with: {'colsample_bylevel': 1.0}

We can plot the performance of each **colsample_bylevel** variation. The results show relatively low variance and what appears to be a plateau in performance beyond a value of 0.3 at this scale.

In this post you discovered stochastic gradient boosting with XGBoost in Python.

Specifically, you learned:

- About stochastic boosting and how you can subsample your training data to improve the generalization of your model.
- How to tune row subsampling with XGBoost in Python and scikit-learn.
- How to tune column subsampling with XGBoost both per-tree and per-split.

Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.

The post Stochastic Gradient Boosting with XGBoost and scikit-learn in Python appeared first on Machine Learning Mastery.

]]>The post Tune Learning Rate for Gradient Boosting with XGBoost in Python appeared first on Machine Learning Mastery.

One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost documentation).

In this post you will discover the effect of the learning rate in gradient boosting and how to tune it on your machine learning problem using the XGBoost library in Python.

After reading this post you will know:

- The effect learning rate has on the gradient boosting model.
- How to tune the learning rate on your machine learning problem.
- How to tune the trade-off between the number of boosted trees and learning rate on your problem.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.


Gradient boosting involves creating and adding trees to the model sequentially.

New trees are created to correct the residual errors in the predictions from the existing sequence of trees.

The effect is that the model can quickly fit, then overfit the training dataset.

A technique to slow down the learning in the gradient boosting model is to apply a weighting factor for the corrections by new trees when added to the model.

This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool.

Naive gradient boosting is the same as gradient boosting with shrinkage where the shrinkage factor is set to 1.0. Setting values less than 1.0 has the effect of making smaller corrections for each tree added to the model. This in turn means that more trees must be added to the model.
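The trade-off between shrinkage and the number of trees can be seen in a deliberately idealized sketch. It assumes every new tree predicts the current residual perfectly, which real trees do not, so the counts are only illustrative; the function name and the tolerance are assumptions.

```python
def trees_needed(learning_rate, tolerance=0.01):
    # start with a residual error of 1.0 on a single toy target
    residual = 1.0
    trees = 0
    while abs(residual) > tolerance:
        # an idealized tree predicts the residual exactly;
        # only a learning_rate fraction of that correction is applied
        residual -= learning_rate * residual
        trees += 1
    return trees

for eta in [1.0, 0.3, 0.1]:
    print(eta, trees_needed(eta))
```

With a factor of 1.0 the toy model is fit in a single step, while smaller factors need progressively more trees to reach the same error.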

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Let’s investigate the effect of the learning rate on a standard machine learning dataset.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

When creating gradient boosting models with XGBoost using the scikit-learn wrapper, the **learning_rate** parameter can be set to control the weighting of new trees added to the model.

We can use the grid search capability in scikit-learn to evaluate the effect on logarithmic loss of training a gradient boosting model with different learning rate values.

We will hold the number of trees constant at the default of 100 and evaluate a suite of standard values for the learning rate on the Otto dataset.

learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]

There are 6 variations of learning rate to be tested and each variation will be evaluated using 10-fold cross validation, meaning that there is a total of 6×10 or 60 XGBoost models to be trained and evaluated.

The log loss for each learning rate will be printed as well as the value that resulted in the best performance.

    # XGBoost on Otto dataset, Tune learning_rate
    from pandas import read_csv
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot
    # load data
    data = read_csv('train.csv')
    dataset = data.values
    # split data into X and y
    # note: this slice keeps column 0 (the id column) alongside the 93 features
    X = dataset[:,0:94]
    y = dataset[:,94]
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier()
    learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
    param_grid = dict(learning_rate=learning_rate)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot
    pyplot.errorbar(learning_rate, means, yerr=stds)
    pyplot.title("XGBoost learning_rate vs Log Loss")
    pyplot.xlabel('learning_rate')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('learning_rate.png')

Running this example prints the best result as well as the log loss for each of the evaluated learning rates.

    Best: -0.001156 using {'learning_rate': 0.2}
    -2.155497 (0.000081) with: {'learning_rate': 0.0001}
    -1.841069 (0.000716) with: {'learning_rate': 0.001}
    -0.597299 (0.000822) with: {'learning_rate': 0.01}
    -0.001239 (0.001730) with: {'learning_rate': 0.1}
    -0.001156 (0.001684) with: {'learning_rate': 0.2}
    -0.001158 (0.001666) with: {'learning_rate': 0.3}

Interestingly, we can see that the best learning rate was 0.2.

This is a high learning rate, and it suggests that perhaps the default number of trees of 100 is too low and needs to be increased.

We can also plot the effect of the learning rate on the (inverted) log loss scores, although the log10-like spread of the chosen learning_rate values means that most are squashed against the left-hand side of the plot near zero.
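To see why the points bunch up, the sketch below computes where each learning rate would sit on a linear axis versus a log10 axis, both scaled to the range 0 to 1. The two helper functions are hypothetical names introduced only for this illustration.

```python
import math

learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
lo, hi = min(learning_rate), max(learning_rate)

def linear_position(value):
    # fraction of the way along a linear axis from lo to hi
    return (value - lo) / (hi - lo)

def log_position(value):
    # fraction of the way along a log10 axis from lo to hi
    return (math.log10(value) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))

for value in learning_rate:
    print(value, round(linear_position(value), 3), round(log_position(value), 3))
```

On the linear axis the three smallest values all fall within a few percent of the left edge, whereas a log-scaled axis spreads them out.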

Next, we will look at varying the number of trees whilst varying the learning rate.

Smaller learning rates generally require more trees to be added to the model.

We can explore this relationship by evaluating a grid of parameter pairs. The number of decision trees will be varied from 100 to 500 and the learning rate varied on a log10 scale from 0.0001 to 0.1.

    n_estimators = [100, 200, 300, 400, 500]
    learning_rate = [0.0001, 0.001, 0.01, 0.1]

There are 5 variations of **n_estimators** and 4 variations of **learning_rate**. Each combination will be evaluated using 10-fold cross validation, so that is a total of 4×5×10 or 200 XGBoost models that must be trained and evaluated.
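The arithmetic behind that count can be checked by enumerating the parameter pairs directly. This is a small standard-library sketch; the variable `n_splits` stands in for the 10 cross-validation folds.

```python
from itertools import product

n_estimators = [100, 200, 300, 400, 500]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
n_splits = 10  # folds in the cross validation

# every (learning_rate, n_estimators) combination in the grid
pairs = list(product(learning_rate, n_estimators))
# each combination is fit once per fold
total_fits = len(pairs) * n_splits
print(len(pairs), total_fits)
```

Note that the count excludes the final refit of the best configuration on the full dataset, which scikit-learn's grid search performs by default.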

The expectation is that for a given learning rate, performance will improve and then plateau as the number of trees is increased. The full code listing is provided below.

    # XGBoost on Otto dataset, Tune learning_rate and n_estimators
    from pandas import read_csv
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import LabelEncoder
    import matplotlib
    matplotlib.use('Agg')
    from matplotlib import pyplot
    import numpy
    # load data
    data = read_csv('train.csv')
    dataset = data.values
    # split data into X and y
    # note: this slice keeps column 0 (the id column) alongside the 93 features
    X = dataset[:,0:94]
    y = dataset[:,94]
    # encode string class values as integers
    label_encoded_y = LabelEncoder().fit_transform(y)
    # grid search
    model = XGBClassifier()
    n_estimators = [100, 200, 300, 400, 500]
    learning_rate = [0.0001, 0.001, 0.01, 0.1]
    param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators)
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
    grid_result = grid_search.fit(X, label_encoded_y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))
    # plot results: one row of scores per learning rate
    scores = numpy.array(means).reshape(len(learning_rate), len(n_estimators))
    for i, value in enumerate(learning_rate):
        pyplot.plot(n_estimators, scores[i], label='learning_rate: ' + str(value))
    pyplot.legend()
    pyplot.xlabel('n_estimators')
    pyplot.ylabel('Log Loss')
    pyplot.savefig('n_estimators_vs_learning_rate.png')

Running the example prints the best combination as well as the log loss for each evaluated pair.

    Best: -0.001152 using {'n_estimators': 300, 'learning_rate': 0.1}
    -2.155497 (0.000081) with: {'n_estimators': 100, 'learning_rate': 0.0001}
    -2.115540 (0.000159) with: {'n_estimators': 200, 'learning_rate': 0.0001}
    -2.077211 (0.000233) with: {'n_estimators': 300, 'learning_rate': 0.0001}
    -2.040386 (0.000304) with: {'n_estimators': 400, 'learning_rate': 0.0001}
    -2.004955 (0.000373) with: {'n_estimators': 500, 'learning_rate': 0.0001}
    -1.841069 (0.000716) with: {'n_estimators': 100, 'learning_rate': 0.001}
    -1.572384 (0.000692) with: {'n_estimators': 200, 'learning_rate': 0.001}
    -1.364543 (0.000699) with: {'n_estimators': 300, 'learning_rate': 0.001}
    -1.196490 (0.000713) with: {'n_estimators': 400, 'learning_rate': 0.001}
    -1.056687 (0.000728) with: {'n_estimators': 500, 'learning_rate': 0.001}
    -0.597299 (0.000822) with: {'n_estimators': 100, 'learning_rate': 0.01}
    -0.214311 (0.000929) with: {'n_estimators': 200, 'learning_rate': 0.01}
    -0.080729 (0.000982) with: {'n_estimators': 300, 'learning_rate': 0.01}
    -0.030533 (0.000949) with: {'n_estimators': 400, 'learning_rate': 0.01}
    -0.011769 (0.001071) with: {'n_estimators': 500, 'learning_rate': 0.01}
    -0.001239 (0.001730) with: {'n_estimators': 100, 'learning_rate': 0.1}
    -0.001153 (0.001702) with: {'n_estimators': 200, 'learning_rate': 0.1}
    -0.001152 (0.001704) with: {'n_estimators': 300, 'learning_rate': 0.1}
    -0.001153 (0.001708) with: {'n_estimators': 400, 'learning_rate': 0.1}
    -0.001153 (0.001708) with: {'n_estimators': 500, 'learning_rate': 0.1}

We can see that the best result observed was a learning rate of 0.1 with 300 trees.

It is hard to pick out trends from the raw data and small negative log loss results. Below is a plot of each learning rate as a series showing log loss performance as the number of trees is varied.

We can see that the expected general trend holds, where the performance (inverted log loss) improves as the number of trees is increased.

Performance is generally poor for the smaller learning rates, suggesting that a much larger number of trees may be required. We may need to increase the number of trees to many thousands which may be quite computationally expensive.

The results for **learning_rate=0.1** are obscured due to the large y-axis scale of the graph. We can extract the performance measures for just **learning_rate=0.1** and plot them directly.

    # Plot performance for learning_rate=0.1
    from matplotlib import pyplot
    n_estimators = [100, 200, 300, 400, 500]
    loss = [-0.001239, -0.001153, -0.001152, -0.001153, -0.001153]
    pyplot.plot(n_estimators, loss)
    pyplot.xlabel('n_estimators')
    pyplot.ylabel('Log Loss')
    pyplot.title('XGBoost learning_rate=0.1 n_estimators vs Log Loss')
    pyplot.show()

Running this code shows performance improving as trees are added, with most of the gain arriving by 200 trees, followed by a plateau through 400 and 500 trees.
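The plateau can also be read off numerically. The sketch below reuses the loss values listed in the plotting snippet above and prints the change in score for each additional 100 trees.

```python
n_estimators = [100, 200, 300, 400, 500]
loss = [-0.001239, -0.001153, -0.001152, -0.001153, -0.001153]

# change in (negated) log loss for each extra 100 trees; positive is better
gains = [round(after - before, 6) for before, after in zip(loss, loss[1:])]
for n, gain in zip(n_estimators[1:], gains):
    print(n, gain)
```

Nearly all of the improvement comes from the step from 100 to 200 trees; the later steps change the score only in the sixth decimal place.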

In this post you discovered the effect of weighting the addition of new trees to a gradient boosting model, called shrinkage or the learning rate.

Specifically, you learned:

- That adding a learning rate is intended to slow down the adaptation of the model to the training data.
- How to evaluate a range of learning rate values on your machine learning problem.
- How to evaluate the relationship of varying both the number of trees and the learning rate on your problem.

Do you have any questions regarding shrinkage in gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer them.


]]>The post How to Train XGBoost Models in the Cloud with Amazon Web Services appeared first on Machine Learning Mastery.

XGBoost is implemented to make best use of your computing resources, including all CPU cores and memory.

In this post you will discover how you can set up a server on Amazon’s cloud service to quickly and cheaply create very large models.

After reading this post you will know:

- How to set up and configure an Amazon EC2 server instance for use with XGBoost.
- How to confirm the parallel capabilities of XGBoost are working on your server.
- How to transfer data and code to your server and train a very large model.

Let’s get started.

**Update May/2020**: Updated instructions.


The process is quite simple. Below is an overview of the steps we are going to complete in this tutorial.

- Set Up Your AWS Account (if needed).
- Launch Your AWS Instance.
- Login and Run Your Code.
- Train an XGBoost Model.
- Close Your AWS Instance.

**Note, it costs money to use a virtual server instance on Amazon**. The cost is very low for ad hoc model development (e.g. less than one US dollar per hour), which is why this is so attractive, but it is not free.

The server instance runs Linux. It is desirable although not required that you know how to navigate Linux or a Unix-like environment. We’re just running our Python scripts, so no advanced skills are needed.

You need an account on Amazon Web Services.

- 1. You can create an account using the Amazon Web Services portal by clicking “Sign in to the Console”. From there you can sign in using an existing Amazon account or create a new account.

- 2. If creating an account, you will need to provide your details as well as a valid credit card that Amazon can charge. The process is a lot quicker if you are already an Amazon customer and have your credit card on file.

**Note**: If you have created a new account, you may need to ask Amazon support to approve your use of the larger (non-free) server instance used in the rest of this tutorial.

Now that you have an AWS account, you want to launch an EC2 virtual server instance on which you can run XGBoost.

Launching an instance is as easy as selecting the image to load and starting the virtual server.

We will use an existing Fedora Linux image and install Python and XGBoost manually.

- 1. Login to your AWS console if you have not already.

- 2. Click on EC2 for launching a new virtual server.
- 3. Select “N. California” from the drop-down in the top right hand corner. This is important otherwise you may not be able to find the image (called an AMI) that we plan to use.

- 4. Click the “Launch Instance” button.
- 5. Click “Community AMIs”. An AMI is an Amazon Machine Image. It is a frozen instance of a server that you can select and instantiate on a new virtual server.

- 6. Enter AMI: “**Fedora-Cloud-Base-24**” in the “Search community AMIs” search box and press enter. You should be presented with a single result.

This is an image for a base installation of Fedora Linux version 24, a very easy to use Linux distribution.

- 7. Click “Select” to choose the AMI in the search result.
- 8. Now you need to select the hardware on which to run the image. Scroll down and select the “c3.8xlarge” hardware.

This is a large instance that includes 32 CPU cores, 60 Gigabytes of RAM and 2 large SSD disks.

- 9. Click “Review and Launch” to finalize the configuration of your server instance.

You will see a warning like “Your instance configuration is not eligible for the free usage tier”. This just indicates that you will be charged for your time on this server. We know this, so you can ignore the warning.

- 10. Click the “Launch” button.
- 11. Select your SSH key pair.
- If you have a key pair because you have used EC2 before, select “Choose an existing key pair” and choose your key pair from the list. Then check “I acknowledge…”.
- If you do not have a key pair, select the option “Create a new key pair” and enter a “Key pair name” such as “xgboost-keypair”. Click the “Download Key Pair” button.

- 12. Open a Terminal and change directory to where you downloaded your key pair.
- 13. If you have not already done so, restrict the access permissions on your key pair file. This is required as part of the SSH access to your server. For example, on your console you can type:

    cd Downloads
    chmod 600 xgboost-keypair.pem

- 14. Click “Launch Instances”.

**Note**: If this is your first time using AWS, Amazon may have to validate your request and this could take up to 2 hours (often just a few minutes).

- 15. Click “View Instances” to review the status of your instance.

Your server is now running and ready for you to log in.

Now that you have launched your server instance, it is time to login and configure it for use.

You will need to configure your server each time you launch it. Therefore, it is a good idea to batch all work so you can make good use of your configured server.

Configuring the server will not take long, perhaps 10 minutes total.

- 1. Click “View Instances” in your Amazon EC2 console if you have not already.
- 2. Copy “Public IP” (down the bottom of the screen in Description) to clipboard.

In this example my IP address is 52.53.185.166.

**Do not use this IP address, your IP address will be different**.

- 3. Open a Terminal and change directory to where you downloaded your key pair. Log in to your server using SSH, for example you can type:

ssh -i xgboost-keypair.pem fedora@52.53.185.166

- 4. You may be prompted with a warning the first time you log into your server instance. You can ignore this warning, just type “yes” and press enter.

You are now logged into your server.

Double check the number of CPU cores on your instance by typing:

cat /proc/cpuinfo | grep processor | wc -l

You should see:

32
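Since the rest of the work is done in Python, the same check can also be made from a Python prompt. This is a minimal standard-library sketch, offered as an alternative to the shell command rather than a required step.

```python
import multiprocessing

# number of logical CPU cores visible to Python
cores = multiprocessing.cpu_count()
print(cores)
```

On the instance described here this should report the same number as the shell command.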

The first step is to install all of the packages to support XGBoost.

This includes GCC, Python and the SciPy stack. We will use the Fedora package manager dnf (the new yum).

**Note**: We will be using Python 2 and some older versions of the libraries. This is intentional as these instructions do not appear to work for the very latest versions of the libraries and Python 3.

This is a single line:

sudo dnf install gcc gcc-c++ make git unzip python python2-numpy python2-scipy python2-scikit-learn python2-pandas python2-matplotlib

Type “y” and press Enter when prompted to confirm the packages to install.

This will take a few minutes to download and install all of the required packages.

Once completed, we can confirm the environment was installed successfully.

Type:

gcc --version

You should see:

    gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)
    Copyright (C) 2016 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Type:

python --version

You should see:

Python 2.7.13

Type:

    python -c "import scipy;print(scipy.__version__)"
    python -c "import numpy;print(numpy.__version__)"
    python -c "import pandas;print(pandas.__version__)"
    python -c "import sklearn;print(sklearn.__version__)"

You should see something like:

    0.16.1
    1.11.0
    0.18.0
    0.17.1

**Note**: If any of these checks failed, stop and correct any errors. You must have a complete working environment before moving on.
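If you prefer a single check over four separate commands, the version lookups can be gathered into one small script. This is a sketch using only the standard library; the helper name `report_versions` is an assumption, and a missing package is reported as `None` instead of raising an error.

```python
import importlib

def report_versions(names):
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # most packages expose __version__; fall back if they do not
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None  # package is not installed
    return versions

versions = report_versions(["scipy", "numpy", "pandas", "sklearn"])
for name, version in versions.items():
    print(name, version)

# any entry listed here needs to be installed before moving on
missing = [name for name, version in versions.items() if version is None]
print(missing)
```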

We are now ready to install XGBoost.

The installation instructions for XGBoost are complete and we can follow them directly.

First we need to download the project on the server.

    git clone --recursive https://github.com/dmlc/xgboost
    cd xgboost
    git checkout tags/v0.90

Next we need to compile it. The -j argument sets the number of build jobs that make runs in parallel. We can set this to 32 to match the 32 cores on our AWS instance.

If you chose different AWS hardware, you can set this appropriately.

make -j32

The XGBoost project should build successfully (e.g. no errors).

We are now ready to install the Python version of the library.

    cd python-package
    sudo python setup.py install

That is it.

We can confirm the installation was successful by typing:

python -c "import xgboost;print(xgboost.__version__)"

This should print something like:

0.90

Let’s test out your large AWS instance by running XGBoost with a lot of cores.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). It describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input variables are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

Create a new directory called **work/** on your workstation.

You can download the training dataset **train.csv.zip** from the Data page and place it in your **work/** directory on your workstation.

We will evaluate the time taken to train an XGBoost on this dataset using different numbers of cores.

We will try 1 core, half of the cores (16) and all 32 cores. We can specify the number of cores used by the XGBoost algorithm by setting the **nthread** parameter in the **XGBClassifier** class (the scikit-learn wrapper for XGBoost).

The complete example is listed below. Save it in a file with the name **work/script.py**.

# Otto multi-core test
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
import time
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# evaluate the effect of the number of threads
results = []
num_threads = [1, 16, 32]
for n in num_threads:
    start = time.time()
    model = XGBClassifier(nthread=n)
    model.fit(X, label_encoded_y)
    elapsed = time.time() - start
    print(n, elapsed)
    results.append(elapsed)

Now, we can copy your **work/** directory with the data and script to your AWS server.

From your workstation in the current directory where the **work/** directory is located, type:

scp -r -i xgboost-keypair.pem work fedora@52.53.185.166:/home/fedora/

Of course, you will need to use your key file and the IP address of your server.

This will create a new **work/** directory in your home directory on your server.

Log back onto your server instance (if needed):

ssh -i xgboost-keypair.pem fedora@52.53.185.166

Change directory to your work directory and unzip the training data.

cd work
unzip ./train.csv.zip

Now we can run the script and train our XGBoost models and calculate how long it takes using different numbers of cores:

python script.py

You should see output like:

(1, 96.34455895423889)
(16, 8.31994891166687)
(32, 7.604229927062988)

You can see little difference between 16 and 32 cores.

I believe the reason for this is that AWS is giving access to 16 physical cores with hyperthreading, offering an additional 16 virtual cores. Nevertheless, building a large XGBoost model in 7 seconds is great.
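As a quick check on these timings, the speedup and parallel efficiency can be computed directly. This is a minimal sketch: the run times are copied from the sample output above, and the variable names are chosen only for illustration.

```python
# Run times in seconds, copied from the sample output above.
timings = {1: 96.34455895423889, 16: 8.31994891166687, 32: 7.604229927062988}

baseline = timings[1]
summary = {}
for threads, seconds in timings.items():
    speedup = baseline / seconds    # how many times faster than one thread
    efficiency = speedup / threads  # fraction of ideal linear scaling
    summary[threads] = (speedup, efficiency)
    print("%2d threads: speedup %.1fx, efficiency %.0f%%" % (threads, speedup, 100 * efficiency))
```

Doubling the thread count from 16 to 32 adds very little speedup, which is consistent with the hyperthreading explanation, although other bottlenecks could contribute as well.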

You can use this as a template for copying your own data and scripts to your AWS instance.

A good tip is to run your scripts as a background process and forward any output to a file. This is just in case your connection to the server is interrupted or you want to close it down and let the server run your code all night.

You can run your code as a background process and redirect output to a file by typing:

nohup python script.py >script.py.out 2>&1 &

Now that we are done, we can shut down the AWS instance.

When you are finished with your work you must close your instance.

Remember you are charged by the amount of time that you use the instance. It is cheap, but you do not want to leave an instance on if you are not using it.

- 1. Log out of your instance at the terminal, for example you can type:

exit

- 2. Log in to your AWS account with your web browser.
- 3. Click EC2.
- 4. Click “Instances” from the left-hand side menu.
- 5. Select your running instance from the list (it may already be selected if you only have one running instance).
- 6. Click the “Actions” button and select “Instance State” and choose “Terminate”. Confirm that you want to terminate your running instance.

It may take a number of seconds for the instance to close and to be removed from your list of instances.

That’s it.

In this post you discovered how to train large XGBoost models on Amazon cloud infrastructure.

Specifically, you learned:

- How to start and configure a Linux server instance on Amazon EC2 for XGBoost.
- How to install all of the required software needed to run the XGBoost library in Python.
- How to transfer data and code to your server and train a large model making use of all of the cores on the server.

Do you have any questions about training XGBoost models on Amazon Web Services or about this post? Ask your questions in the comments and I will do my best to answer.

The post How to Train XGBoost Models in the Cloud with Amazon Web Services appeared first on Machine Learning Mastery.

The post How to Configure the Gradient Boosting Algorithm appeared first on Machine Learning Mastery.

But how do you configure gradient boosting on your problem?

In this post you will discover how you can configure gradient boosting on your machine learning problem by looking at configurations reported in books, papers and as a result of competitions.

After reading this post, you will know:

- How to configure gradient boosting according to the original sources.
- Ideas for configuring the algorithm from defaults and suggestions in standard implementations.
- Rules of thumb for configuring gradient boosting and XGBoost from top Kaggle competitors.

Let’s get started.

In the 1999 paper “Greedy Function Approximation: A Gradient Boosting Machine“, Jerome Friedman comments on the trade-off between the number of trees (M) and the learning rate (v):

The v-M trade-off is clearly evident; smaller values of v give rise to larger optimal M-values. They also provide higher accuracy, with a diminishing return for v < 0.125. The misclassification error rate is very flat for M > 200, so that optimal M-values for it are unstable. … the qualitative nature of these results is fairly universal.

He suggests first setting a large value for the number of trees, then tuning the shrinkage parameter to achieve the best results. Studies in the paper preferred a shrinkage value of 0.1, a number of trees in the range 100 to 500 and the number of terminal nodes in a tree between 2 and 8.

In the 1999 paper “Stochastic Gradient Boosting“, Friedman reiterated the preference for the shrinkage parameter:

The “shrinkage” parameter 0 < v < 1 controls the learning rate of the procedure. Empirically …, it was found that small values (v <= 0.1) lead to much better generalization error.

In the paper, Friedman introduces and empirically investigates stochastic gradient boosting (row-based sub-sampling). He finds that almost all subsampling percentages are better than so-called deterministic boosting and perhaps 30%-to-50% is a good value to choose on some problems and 50%-to-80% on others.

… the best value of the sampling fraction … is approximately 40% (f=0.4) … However, sampling only 30% or even 20% of the data at each iteration gives considerable improvement over no sampling at all, with a corresponding computational speed-up by factors of 3 and 5 respectively.

He also studied the effect of the number of terminal nodes in trees finding values like 3 and 6 better than larger values like 11, 21 and 41.

In both cases the optimal tree size as averaged over 100 targets is L = 6. Increasing the capacity of the base learner by using larger trees degrades performance through “over-fitting”.

In his talk titled “Gradient Boosting Machine Learning” at H2O, Trevor Hastie made the comment that in general gradient boosting performs better than random forest, which in turn performs better than individual decision trees.

Gradient Boosting > Random Forest > Bagging > Single Trees

Chapter 10 titled “Boosting and Additive Trees” of the book “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” is dedicated to boosting. In it they provide both heuristics for configuring gradient boosting as well as some empirical studies.

They comment that a good value for the number of nodes in the tree (J) is about 6, with generally good values in the range of 4-to-8.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 <= J <= 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range.

They suggest monitoring the performance on a validation dataset in order to calibrate the number of trees and to use an early stopping procedure once performance on the validation dataset begins to degrade.

As in Friedman’s first gradient boosting paper, they comment on the trade-off between the number of trees (M) and the learning rate (v) and recommend a small value for the learning rate < 0.1.

Smaller values of v lead to larger values of M for the same training risk, so that there is a tradeoff between them. … In fact, the best strategy appears to be to set v to be very small (v < 0.1) and then choose M by early stopping.
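The strategy of fixing a small learning rate and choosing M by early stopping can be sketched with scikit-learn. This is an illustrative example only: the synthetic dataset, the learning rate of 0.05 and the cap of 300 trees are assumptions, not values taken from the book.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real problem (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=7)

# Small learning rate (v) with a generous upper bound on the number of trees (M).
model = GradientBoostingClassifier(learning_rate=0.05, n_estimators=300, random_state=7)
model.fit(X_train, y_train)

# Validation loss after each added tree; the chosen M is where it is lowest.
val_loss = [log_loss(y_val, proba) for proba in model.staged_predict_proba(X_val)]
best_m = val_loss.index(min(val_loss)) + 1
print("best number of trees: %d (validation log loss %.4f)" % (best_m, min(val_loss)))
```

If the best value sits at the cap, the cap is probably too low for the chosen learning rate and is worth raising.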

Also, as in Friedman’s stochastic gradient boosting paper, they recommend a subsampling percentage (n) without replacement with a value of about 50%.

A typical value for n can be 1/2, although for large N, n can be substantially smaller than 1/2.

The gradient boosting algorithm is implemented in R as the gbm package.

Reviewing the package documentation, the gbm() function specifies sensible defaults:

- n.trees = 100 (number of trees).
- interaction.depth = 1 (maximum depth of variable interactions; 1 gives stumps).
- n.minobsinnode = 10 (minimum number of samples in tree terminal nodes).
- shrinkage = 0.001 (learning rate).

It is interesting to note that a smaller shrinkage factor is used and that stumps are the default. The small shrinkage is explained by Ridgeway next.

In the vignette for using the gbm package in R titled “Generalized Boosted Models: A guide to the gbm package“, Greg Ridgeway provides some usage heuristics. He suggests first setting the learning rate (lambda) as small as possible, then tuning the number of trees (iterations or T) using cross validation.

In practice I set lambda to be as small as possible and then select T by cross-validation. Performance is best when lambda is as small as possible with decreasing marginal utility for smaller and smaller lambda.

He comments on his rationale for setting the default shrinkage to the small value of 0.001 rather than 0.1.

It is important to know that smaller values of shrinkage (almost) always give improved predictive performance. That is, setting shrinkage=0.001 will almost certainly result in a model with better out-of-sample predictive performance than setting shrinkage=0.01. … The model with shrinkage=0.001 will likely require ten times as many iterations as the model with shrinkage=0.01

Ridgeway also uses quite large numbers of trees (called iterations here), thousands rather than hundreds:

I usually aim for 3,000 to 10,000 iterations with shrinkage rates between 0.01 and 0.001.

The scikit-learn Python library provides an implementation of gradient boosting for classification called the GradientBoostingClassifier class and for regression called the GradientBoostingRegressor class.

It is useful to review the default configuration for the algorithm in this library.

There are many parameters, but below are a few key defaults.

- learning_rate=0.1 (shrinkage).
- n_estimators=100 (number of trees).
- max_depth=3.
- min_samples_split=2.
- min_samples_leaf=1.
- subsample=1.0.

It is interesting to note that the default shrinkage does match Friedman and that the tree depth is not set to stumps like the R package. A tree depth of 3 (if the created tree was symmetrical) will have 8 leaf nodes, matching the upper bound of the preferred number of terminal nodes in Friedman’s studies (alternately max_leaf_nodes can be set).
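The defaults listed above can be checked against your installed version of scikit-learn with a short script. This is a small sketch using the standard get_params() method; the defaults may differ in versions other than the one the post was written against.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Read the default configuration from an unconfigured model.
defaults = GradientBoostingClassifier().get_params()
keys = ["learning_rate", "n_estimators", "max_depth",
        "min_samples_split", "min_samples_leaf", "subsample"]
for key in keys:
    print("%s=%r" % (key, defaults[key]))

# A fully grown, symmetrical binary tree of depth d has 2**d leaf nodes.
max_leaves = 2 ** defaults["max_depth"]
print("maximum leaves at default depth: %d" % max_leaves)
```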

In the scikit-learn user guide under the section titled “Gradient Tree Boosting” the authors comment that setting the maximum leaf nodes to k has a similar effect to setting the max depth to k minus one, but trains significantly faster at the cost of a slightly higher training error.

We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train at the expense of a slightly higher training error.

In a small study demonstrating regularization methods for gradient boosting titled “Gradient Boosting regularization“, the results show the benefit of using both shrinkage and sub-sampling.

The XGBoost library is dedicated to the gradient boosting algorithm.

It too specifies default parameters that are interesting to note, first from the XGBoost Parameters page:

- eta=0.3 (shrinkage or learning rate).
- max_depth=6.
- subsample=1.

This shows a higher learning rate and a larger max depth than we see in most studies and other libraries. Similarly, we can summarize the default parameters for XGBoost in the Python API reference.

- max_depth=3.
- learning_rate=0.1.
- n_estimators=100.
- subsample=1.

These defaults are generally more in-line with scikit-learn defaults and recommendations from the papers.

In a talk to TechEd Europe titled “xgboost: An R package for Fast and Accurate Gradient Boosting“, when asked how to configure XGBoost, Tong He suggested the three most important parameters to tune are:

- Number of trees.
- Tree depth.
- Step Size (learning rate).

He also provides a terse configuration strategy for new problems:

- Run the default configuration (and presumably review learning curves?).
- If the system is overlearning, slow the learning down (using shrinkage?).
- If the system is underlearning, speed the learning up to be more aggressive (using shrinkage?).

In Owen Zhang’s talk to the NYC Data Science Academy in 2015 titled “Winning Data Science Competitions“, he provides some general tips for configuring gradient boosting with XGBoost. Owen is a heavy user of gradient boosting.

My confession: I (over)use GBM. When in doubt, use GBM.

He provides some tips for configuring gradient boosting:

- learning rate + number of trees: Target 500-to-1000 trees and tune learning rate.
- number of samples in leaf: the number of observations needed to get a good mean estimate.
- interaction depth: 10+.

In an updated slide deck for the same talk, he gives a summary of common parameters he uses for XGBoost:

We can see a few interesting things in this table.

- Simplified the relationship of learning rate and the number of trees as an approximate ratio: learning rate = [2-10]/trees.
- Explores values for both row and column sampling for stochastic gradient boosting.
- Explores close to the same range for max depth as reported by Friedman (4-10).
- Tunes minimum leaf weight as an approximate ratio of 3 over the percentage of the number of rare events.
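The learning rate heuristic in the first point is simple arithmetic, sketched below. The function name learning_rate_range is hypothetical and the tree counts are example inputs; only the [2-10]/trees ratio comes from the slides.

```python
def learning_rate_range(n_trees, low=2.0, high=10.0):
    """Return (min, max) learning rate for the heuristic: rate = [2-10] / trees."""
    if n_trees <= 0:
        raise ValueError("n_trees must be positive")
    return low / n_trees, high / n_trees

# Example tree counts, chosen only for illustration.
for n_trees in (100, 500, 1000):
    lowest, highest = learning_rate_range(n_trees)
    print("%4d trees: learning rate between %.3f and %.3f" % (n_trees, lowest, highest))
```

For 1,000 trees this gives a learning rate between 0.002 and 0.01, which is a starting range to search rather than a fixed setting.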

In a similar talk by Owen at ODSC Boston 2015 titled “Open Source Tools and Data Science Competitions“, he again summarized common parameters he uses:

We can see some minor differences that may be relevant.

- Target 100 rather than 1000 trees and tune learning rate.
- Min child weight as 1 over the square root of the event rate.
- No sub sampling of rows.

Finally, Abhishek Thakur, in his post titled “Approaching (Almost) Any Machine Learning Problem” provided a similar table listing out key XGBoost parameters and suggestions for tuning.

The spreads do cover the general defaults suggested above and more.

It is interesting to note that Abhishek does provide some suggestions for tuning the alpha and lambda model penalization terms as well as row sampling.

In this post, you got insight into how to configure gradient boosting for your own machine learning problems.

Specifically you learned:

- About the trade-off in the number of trees and the shrinkage and good defaults for sub-sampling.
- Different ideas on limiting tree size and good defaults for tree depth and the number of terminal nodes.
- Grid search strategies used by a top Kaggle competition winner.

Do you have any questions about configuring gradient boosting or about this post? Ask your questions in the comments.

The post A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning appeared first on Machine Learning Mastery.

In this post you will discover the gradient boosting machine learning algorithm and get a gentle introduction into where it came from and how it works.

After reading this post, you will know:

- The origin of boosting from learning theory and AdaBoost.
- How gradient boosting works including the loss function, weak learners and the additive model.
- How to improve performance over the base algorithm with various regularization schemes.

Let’s get started.

The idea of boosting came out of the idea of whether a weak learner can be modified to become better.

Michael Kearns articulated the goal as the “*Hypothesis Boosting Problem*” stating the goal from a practical standpoint as:

… an efficient algorithm for converting relatively poor hypotheses into very good hypotheses

— Thoughts on Hypothesis Boosting [PDF], 1988

A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance.

These ideas built upon Leslie Valiant’s work on distribution free or Probably Approximately Correct (PAC) learning, a framework for investigating the complexity of machine learning problems.

Hypothesis boosting was the idea of filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learners to handle the remaining difficult observations.

The idea is to use the weak learning method several times to get a succession of hypotheses, each one refocused on the examples that the previous ones found difficult and misclassified. … Note, however, it is not obvious at all how this can be done

— Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World, page 152, 2013

The first realization of boosting that saw great success in application was Adaptive Boosting or AdaBoost for short.

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.

— A decision-theoretic generalization of on-line learning and an application to boosting [PDF], 1995

The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.

AdaBoost works by weighting the observations, putting more weight on difficult to classify instances and less on those already handled well. New weak learners are added sequentially that focus their training on the more difficult patterns.

This means that samples that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies these samples

— Applied Predictive Modeling, 2013

Predictions are made by majority vote of the weak learners’ predictions, weighted by their individual accuracy. The most successful form of the AdaBoost algorithm was for binary classification problems and was called AdaBoost.M1.
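A minimal sketch of AdaBoost with decision stumps is shown below using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree. The synthetic dataset and the choice of 50 estimators are assumptions for illustration, and this is the library's implementation rather than a reproduction of the original AdaBoost.M1 paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# The default weak learner is a depth-1 decision tree (a decision stump).
model = AdaBoostClassifier(n_estimators=50, random_state=1)
model.fit(X_train, y_train)

stump_depth = model.estimators_[0].get_depth()
accuracy = model.score(X_test, y_test)
print("depth of first weak learner: %d" % stump_depth)
print("test accuracy: %.3f" % accuracy)
# Per-learner weights used when combining the votes.
print("first five learner weights:", model.estimator_weights_[:5])
```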

You can learn more about the AdaBoost algorithm in the post:

AdaBoost and related algorithms were recast in a statistical framework first by Breiman calling them ARCing algorithms.

Arcing is an acronym for Adaptive Reweighting and Combining. Each step in an arcing algorithm consists of a weighted minimization followed by a recomputation of [the classifiers] and [weighted input].

— Prediction Games and Arcing Algorithms [PDF], 1997

This framework was further developed by Friedman and called Gradient Boosting Machines, later called just gradient boosting or gradient tree boosting.

The statistical framework cast boosting as a numerical optimization problem where the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure.

This class of algorithms was described as a stage-wise additive model. This is because one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged.

Note that this stagewise strategy is different from stepwise approaches that readjust previously entered terms when new ones are added.

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

The generalization allowed arbitrary differentiable loss functions to be used, expanding the technique beyond binary classification problems to support regression, multi-class classification and more.

Gradient boosting involves three elements:

- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.

The loss function used depends on the type of problem being solved.

It must be differentiable, but many standard loss functions are supported and you can define your own.

For example, regression may use a squared error and classification may use logarithmic loss.

A benefit of the gradient boosting framework is that a new boosting algorithm does not have to be derived for each loss function you may want to use; the framework is generic enough that any differentiable loss function can be used.

Decision trees are used as the weak learner in gradient boosting.

Specifically, regression trees are used that output real values for splits and whose output can be added together, allowing subsequent models’ outputs to be added and “correct” the residuals in the predictions.

Trees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini or to minimize the loss.

Initially, such as in the case of AdaBoost, very short decision trees were used that only had a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.

It is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.

This is to ensure that the learners remain weak, but can still be constructed in a greedy manner.

Trees are added one at a time, and existing trees in the model are not changed.

A gradient descent procedure is used to minimize the loss when adding trees.

Traditionally, gradient descent is used to minimize a loss over a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating error or loss, the weights are updated to minimize that error.

Instead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss (i.e. follow the gradient). We do this by parameterizing the tree, then modifying the parameters of the tree to move in the right direction (reducing the residual loss).

Generally this approach is called functional gradient descent or gradient descent with functions.

One way to produce a weighted combination of classifiers which optimizes [the cost] is by gradient descent in function space

— Boosting Algorithms as Gradient Descent in Function Space [PDF], 1999

The output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.

A fixed number of trees are added or training stops once loss reaches an acceptable level or no longer improves on an external validation dataset.
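The additive procedure described in this section can be sketched for the squared error loss, where the negative gradient is simply the residual. This is a simplified illustration and not the implementation used by any particular library; the synthetic data, tree depth, learning rate and number of trees are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a noisy sine wave (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Start from a constant model: the mean of the targets.
prediction = np.full_like(y, y.mean())
trees, losses = [], []
for _ in range(n_trees):
    # For 0.5 * (y - F)^2 the negative gradient is the residual y - F.
    residual = y - prediction
    # Fit a small regression tree to the residuals (the weak learner).
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    # Add the new tree's scaled output; earlier trees are left unchanged.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(np.mean((y - prediction) ** 2))

print("training MSE after 1 tree: %.4f" % losses[0])
print("training MSE after %d trees: %.4f" % (n_trees, losses[-1]))
```

Predictions for new data would be the initial mean plus the sum of the scaled outputs of every tree in the list.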

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.

It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

In this section we will look at 4 enhancements to basic gradient boosting:

- Tree Constraints
- Shrinkage
- Random sampling
- Penalized Learning

It is important that the weak learners have skill but remain weak.

There are a number of ways that the trees can be constrained.

A good general heuristic is that the more constrained tree creation is, the more trees you will need in the model, and conversely, the less constrained the individual trees, the fewer trees will be required.

Below are some constraints that can be imposed on the construction of decision trees:

- **Number of trees**: generally, adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.
- **Tree depth**: deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
- **Number of nodes or number of leaves**: like depth, this can constrain the size of the tree, but the tree is not constrained to a symmetrical structure if other constraints are used.
- **Number of observations per split**: imposes a minimum constraint on the amount of training data at a training node before a split can be considered.
- **Minimum improvement to loss**: a constraint on the improvement of any split added to a tree.
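As a rough guide, these constraints map onto arguments of scikit-learn's GradientBoostingRegressor as sketched below. The mapping is my reading of the parameter names and the specific values are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=3)

model = GradientBoostingRegressor(
    n_estimators=200,            # number of trees
    max_depth=4,                 # tree depth
    max_leaf_nodes=8,            # number of leaves
    min_samples_split=20,        # observations required before a split is considered
    min_impurity_decrease=0.01,  # minimum improvement required by a split
    random_state=3,
)
model.fit(X, y)

# Inspect the fitted trees to confirm the constraints were respected.
depths = [tree[0].get_depth() for tree in model.estimators_]
leaves = [tree[0].get_n_leaves() for tree in model.estimators_]
print("trees: %d, deepest: %d, most leaves: %d" % (len(depths), max(depths), max(leaves)))
```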

The predictions of each tree are added together sequentially.

The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called a shrinkage or a learning rate.

Each update is simply scaled by the value of the “learning rate parameter v”

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

The effect is that learning is slowed down, in turn requiring more trees to be added to the model and taking longer to train. This provides a configuration trade-off between the number of trees and the learning rate.

Decreasing the value of v [the learning rate] increases the best value for M [the number of trees].

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.

— Stochastic Gradient Boosting [PDF], 1999

A big insight into bagging ensembles and random forest was allowing trees to be greedily created from subsamples of the training dataset.

This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models.

This variation of boosting is called stochastic gradient boosting.

at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.

— Stochastic Gradient Boosting [PDF], 1999

A few variants of stochastic boosting that can be used:

- Subsample rows before creating each tree.
- Subsample columns before creating each tree.
- Subsample columns before considering each split.

Generally, aggressive sub-sampling such as selecting only 50% of the data has shown to be beneficial.

According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling

— XGBoost: A Scalable Tree Boosting System, 2016
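Row and column sub-sampling can be tried with scikit-learn's GradientBoostingClassifier, as in the sketch below. Note that in this class the subsample argument samples rows per tree while max_features samples columns per split; it has no per-tree column sampling. The dataset and the 50% fractions are assumptions for illustration, and which configuration scores best will vary by problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=5)

configs = {
    "no sampling": dict(subsample=1.0, max_features=None),
    "rows 50%": dict(subsample=0.5, max_features=None),
    "rows 50% + columns 50% per split": dict(subsample=0.5, max_features=0.5),
}
scores = {}
for name, params in configs.items():
    model = GradientBoostingClassifier(n_estimators=100, random_state=5, **params)
    # Mean accuracy over 3-fold cross validation.
    scores[name] = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    print("%-34s accuracy %.3f" % (name, scores[name]))
```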

Additional constraints can be imposed on the parameterized trees in addition to their structure.

Classical decision trees like CART are not used as weak learners, instead a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights in some literature.

As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as:

- L1 regularization of weights.
- L2 regularization of weights.

The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.

— XGBoost: A Scalable Tree Boosting System, 2016
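For reference, the regularized objective from the XGBoost paper can be written as below, where l is a differentiable loss, each f_k is a tree, T is the number of leaves in a tree and w is its vector of leaf weights. The paper's penalty uses the L2 term only; the L1 term shown with alpha is an addition here, reflecting the alpha parameter exposed by the library.

```latex
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
```

Larger values of gamma, lambda or alpha penalize trees with more leaves or larger leaf weights, pushing the model toward simpler trees.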

Gradient boosting is a fascinating algorithm and I am sure you want to go deeper.

This section lists various resources that you can use to learn more about the gradient boosting algorithm.

- Gradient Boosting Machine Learning, Trevor Hastie, 2014
- Gradient Boosting, Alexander Ihler, 2012
- GBM, John Mount, 2015
- Learning: Boosting, MIT 6.034 Artificial Intelligence, 2010
- xgboost: An R package for Fast and Accurate Gradient Boosting, 2016
- XGBoost: A Scalable Tree Boosting System, Tianqi Chen, 2016

- Section 8.2.3 Boosting, page 321, An Introduction to Statistical Learning: with Applications in R.
- Section 8.6 Boosting, page 203, Applied Predictive Modeling.
- Section 14.5 Stochastic Gradient Boosting, page 390, Applied Predictive Modeling.
- Section 16.4 Boosting, page 556, Machine Learning: A Probabilistic Perspective
- Chapter 10 Boosting and Additive Trees, page 337, The Elements of Statistical Learning: Data Mining, Inference, and Prediction

- Thoughts on Hypothesis Boosting [PDF], Michael Kearns, 1988
- A decision-theoretic generalization of on-line learning and an application to boosting [PDF], 1995
- Arcing the edge [PDF], 1998
- Stochastic Gradient Boosting [PDF], 1999
- Boosting Algorithms as Gradient Descent in Function Space [PDF], 1999

In this post you discovered the gradient boosting algorithm for predictive modeling in machine learning.

Specifically, you learned:

- The history of boosting in learning theory and AdaBoost.
- How the gradient boosting algorithm works with a loss function, weak learners and an additive model.
- How to improve the performance of gradient boosting with regularization.

Do you have any questions about the gradient boosting algorithm or about this post? Ask your questions in the comments and I will do my best to answer.

The post How to Tune the Number and Size of Decision Trees with XGBoost in Python appeared first on Machine Learning Mastery.

This raises the question as to how many trees (weak learners or estimators) to configure in your gradient boosting model and how big each tree should be.

In this post you will discover how to design a systematic experiment to select the number and size of decision trees to use on your problem.

After reading this post you will know:

- How to evaluate the effect of adding more decision trees to your XGBoost model.
- How to evaluate the effect of creating larger decision trees in your XGBoost model.
- How to investigate the relationship between the number and depth of trees on your problem.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands.

The general reason is that on most problems, adding more trees beyond a limit does not improve the performance of the model.

This follows from the way the boosted tree model is constructed: sequentially, with each new tree attempting to model and correct the errors made by the sequence of previous trees. The model quickly reaches a point of diminishing returns.
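The diminishing returns can be seen even in a toy setting. Below is a minimal pure-Python sketch, not XGBoost itself, that boosts decision stumps on a small made-up regression problem under squared error. The data, the `learning_rate` of 0.3, and the helper names `fit_stump` and `boost` are assumptions chosen for illustration.

```python
# Toy boosting with decision stumps (illustrative only, not the XGBoost algorithm).
import math


def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate=0.3):
    """Add stumps one at a time; return the training MSE after each stump."""
    predictions = [0.0] * len(xs)
    history = []
    for _ in range(n_trees):
        # each new stump is fit to the errors left by the previous stumps
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        history.append(mse)
    return history


# made-up data: 40 points on a sine curve
xs = [i / 10.0 for i in range(40)]
ys = [math.sin(x) for x in xs]
history = boost(xs, ys, n_trees=50)
for count in (1, 10, 25, 50):
    print(count, round(history[count - 1], 5))
```

The printed training error should drop sharply over the first few stumps and then flatten out. Note that this tracks training error only; the experiments in this post use cross validation instead.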

We can demonstrate this point of diminishing returns easily on the Otto dataset.

The number of trees (or rounds) in an XGBoost model is specified to the XGBClassifier or XGBRegressor class in the n_estimators argument. The default in the XGBoost library is 100.

Using scikit-learn we can perform a grid search of the **n_estimators** model parameter, evaluating a series of values from 50 to 350 with a step size of 50 (50, 100, 150, 200, 250, 300, 350).

```python
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
result = grid_search.fit(X, label_encoded_y)
```

We can perform this grid search on the Otto dataset, using 10-fold cross validation, requiring 70 models to be trained (7 configurations * 10 folds).
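A quick check of the grid size, using only the standard library. This assumes the same `range(50, 400, 50)` passed to the grid search and 10 folds; the variable names `n_folds` and `total_fits` are illustrative.

```python
# Expand the range used for the grid and count the cross validation fits.
n_estimators = list(range(50, 400, 50))
n_folds = 10
total_fits = len(n_estimators) * n_folds
print(n_estimators)
print(total_fits)
```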

The full code listing is provided below for completeness.

```python
# XGBoost on Otto dataset, Tune n_estimators
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = range(50, 400, 50)
param_grid = dict(n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(n_estimators, means, yerr=stds)
pyplot.title("XGBoost n_estimators vs Log Loss")
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators.png')
```

Running this example prints the following results.

```
Best: -0.001152 using {'n_estimators': 250}
-0.010970 (0.001083) with: {'n_estimators': 50}
-0.001239 (0.001730) with: {'n_estimators': 100}
-0.001163 (0.001715) with: {'n_estimators': 150}
-0.001153 (0.001702) with: {'n_estimators': 200}
-0.001152 (0.001702) with: {'n_estimators': 250}
-0.001152 (0.001704) with: {'n_estimators': 300}
-0.001153 (0.001706) with: {'n_estimators': 350}
```

We can see that the cross validation log loss scores are negative. This is because the scikit-learn cross validation framework inverted them. The reason is that internally, the framework requires that all metrics that are being optimized are to be maximized, whereas log loss is a minimization metric. It can easily be made maximizing by inverting the scores.
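To make the sign flip concrete, here is a small standard-library sketch. The `log_loss` helper and the two sets of made-up predicted probabilities are assumptions for illustration, not the scikit-learn implementation. It shows that picking the largest negated score selects the same candidate as picking the smallest log loss.

```python
# Negating a loss turns "smaller is better" into "larger is better".
import math


def log_loss(y_true, probabilities):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        total += -math.log(row[label])
    return total / len(y_true)


# made-up labels and predicted class probabilities for two hypothetical models
y_true = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
hesitant = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

losses = {
    "confident": log_loss(y_true, confident),
    "hesitant": log_loss(y_true, hesitant),
}
# the negated scores are what a maximizing framework would compare
scores = {name: -loss for name, loss in losses.items()}

best_by_loss = min(losses, key=losses.get)
best_by_score = max(scores, key=scores.get)
print(losses)
print(scores)
print(best_by_loss, best_by_score)
```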

The best number of trees was **n_estimators=250**, resulting in a log loss of 0.001152, but this is not a significant difference from **n_estimators=200**. In fact, if we plot the results, there is little relative difference in log loss anywhere between 100 and 350 trees.
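One rough way to read these numbers is to compare each configuration's gap to the best score against its standard deviation across folds. The sketch below copies the means and standard deviations from the printed results. Treating "gap smaller than one standard deviation" as "within noise" is an informal heuristic assumed here, not a formal significance test.

```python
# Mean (negated) log loss and standard deviation per n_estimators,
# copied from the printed grid search results.
results = {
    50: (-0.010970, 0.001083),
    100: (-0.001239, 0.001730),
    150: (-0.001163, 0.001715),
    200: (-0.001153, 0.001702),
    250: (-0.001152, 0.001702),
    300: (-0.001152, 0.001704),
    350: (-0.001153, 0.001706),
}

# the highest negated score is the best configuration
best_n = max(results, key=lambda n: results[n][0])
best_mean = results[best_n][0]

within_noise = {}
for n, (mean, std) in results.items():
    gap = best_mean - mean  # how much worse than the best, in log loss units
    within_noise[n] = gap < std
    print(n, round(gap, 6), within_noise[n])
```

Under this heuristic only the 50-tree configuration stands apart from the best result.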

Below is a line graph showing the relationship between the number of trees and mean (inverted) logarithmic loss, with the standard deviation shown as error bars.

In gradient boosting, we can control the size of decision trees, also called the number of layers or the depth.

Shallow trees are expected to have poor performance because they capture few details of the problem and are generally referred to as weak learners. Deeper trees generally capture too many details of the problem and overfit the training dataset, limiting the ability to make good predictions on new data.

Generally, boosting algorithms are configured with weak learners: decision trees with few layers, sometimes as simple as a single split at the root node, which is called a decision stump rather than a decision tree.
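As a rough guide to how quickly tree capacity grows with depth, a binary tree of depth d has at most 2^d leaves and 2^d - 1 splits. The short sketch below tabulates this for a few depths chosen for illustration; real trees are often smaller, because splitting stops early when a split does not help.

```python
# Upper bounds on splits and leaves for a binary tree of a given depth.
capacity = {}
for depth in (1, 3, 5, 7, 9):
    max_leaves = 2 ** depth
    max_splits = max_leaves - 1
    capacity[depth] = (max_splits, max_leaves)
    print(depth, max_splits, max_leaves)
```

A depth of 1 corresponds to a decision stump: one split and two leaves.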

The maximum depth can be specified in the **XGBClassifier** and **XGBRegressor** wrapper classes for XGBoost in the **max_depth** parameter. This parameter takes an integer value and defaulted to 3 in the version of XGBoost used for this post (more recent releases default to 6).

model = XGBClassifier(max_depth=3)

We can tune this hyperparameter of XGBoost using the grid search infrastructure in scikit-learn on the Otto dataset. Below we evaluate odd values for **max_depth** between 1 and 9 (1, 3, 5, 7, 9).

Each of the 5 configurations is evaluated using 10-fold cross validation, resulting in 50 models being constructed. The full code listing is provided below for completeness.

```python
# XGBoost on Otto dataset, Tune max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
max_depth = range(1, 11, 2)
print(max_depth)
param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(max_depth, means, yerr=stds)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
pyplot.savefig('max_depth.png')
```

Running this example prints the log loss for each **max_depth**.

The optimal configuration was **max_depth=5** resulting in a log loss of 0.001236.

```
Best: -0.001236 using {'max_depth': 5}
-0.026235 (0.000898) with: {'max_depth': 1}
-0.001239 (0.001730) with: {'max_depth': 3}
-0.001236 (0.001701) with: {'max_depth': 5}
-0.001237 (0.001701) with: {'max_depth': 7}
-0.001237 (0.001701) with: {'max_depth': 9}
```

Reviewing the plot of log loss scores, we can see a marked jump from **max_depth=1** to **max_depth=3**, then fairly even performance for the rest of the values of **max_depth**.

Although the best score was observed for **max_depth=5**, it is interesting to note that there was little practical difference between using **max_depth=3** and **max_depth=7**.

This suggests a point of diminishing returns in **max_depth** on a problem that you can tease out using grid search. A graph of **max_depth** values is plotted against (inverted) logarithmic loss below.

There is a relationship between the number of trees in the model and the depth of each tree.

We would expect that deeper trees would result in fewer trees being required in the model, and the inverse where simpler trees (such as decision stumps) require many more trees to achieve similar results.

We can investigate this relationship by evaluating a grid of **n_estimators** and **max_depth** configuration values. To avoid the evaluation taking too long, we will limit the total number of configuration values evaluated. Parameters were chosen to tease out the relationship rather than optimize the model.

We will create a grid of 4 different n_estimators values (50, 100, 150, 200) and 4 different max_depth values (2, 4, 6, 8) and each combination will be evaluated using 10-fold cross validation. A total of 4*4*10 or 160 models will be trained and evaluated.

The full code listing is provided below.

```python
# XGBoost on Otto dataset, Tune n_estimators and max_depth
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results
scores = numpy.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_max_depth.png')
```

Running the code produces a listing of the log loss for each parameter pair.

```
Best: -0.001141 using {'n_estimators': 200, 'max_depth': 4}
-0.012127 (0.001130) with: {'n_estimators': 50, 'max_depth': 2}
-0.001351 (0.001825) with: {'n_estimators': 100, 'max_depth': 2}
-0.001278 (0.001812) with: {'n_estimators': 150, 'max_depth': 2}
-0.001266 (0.001796) with: {'n_estimators': 200, 'max_depth': 2}
-0.010545 (0.001083) with: {'n_estimators': 50, 'max_depth': 4}
-0.001226 (0.001721) with: {'n_estimators': 100, 'max_depth': 4}
-0.001150 (0.001704) with: {'n_estimators': 150, 'max_depth': 4}
-0.001141 (0.001693) with: {'n_estimators': 200, 'max_depth': 4}
-0.010341 (0.001059) with: {'n_estimators': 50, 'max_depth': 6}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 6}
-0.001163 (0.001688) with: {'n_estimators': 150, 'max_depth': 6}
-0.001154 (0.001679) with: {'n_estimators': 200, 'max_depth': 6}
-0.010342 (0.001059) with: {'n_estimators': 50, 'max_depth': 8}
-0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 8}
-0.001161 (0.001688) with: {'n_estimators': 150, 'max_depth': 8}
-0.001153 (0.001679) with: {'n_estimators': 200, 'max_depth': 8}
```

We can see that the best result was achieved with **n_estimators=200** and **max_depth=4**, similar to the best values found from the previous two rounds of standalone parameter tuning (**n_estimators=250**, **max_depth=5**).

We can plot a series of log loss scores across the **n_estimators** values for each value of **max_depth**.

The lines overlap making it hard to see the relationship, but generally we can see the interaction we expect. Fewer boosted trees are required with increased tree depth.
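Because the lines overlap in the plot, it can help to read the interaction straight from the numbers. The sketch below copies the mean scores from the printed results, groups them into one row per **max_depth**, and reports the smallest number of trees at which each depth reaches a cutoff. The cutoff of -0.00127 is a hypothetical value picked for illustration, and the differences involved are small compared with the reported standard deviations.

```python
# Group the printed mean scores by max_depth and find how many trees
# each depth needs to reach a chosen (hypothetical) score cutoff.
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]

# mean negated log loss in printed order: all n_estimators for depth 2, then 4, 6, 8
means = [
    -0.012127, -0.001351, -0.001278, -0.001266,
    -0.010545, -0.001226, -0.001150, -0.001141,
    -0.010341, -0.001237, -0.001163, -0.001154,
    -0.010342, -0.001237, -0.001161, -0.001153,
]

width = len(n_estimators)
rows = {
    depth: means[i * width:(i + 1) * width]
    for i, depth in enumerate(max_depth)
}

target = -0.00127  # hypothetical cutoff, not a value from the post
trees_needed = {}
for depth, scores in rows.items():
    reached = [n for n, score in zip(n_estimators, scores) if score >= target]
    trees_needed[depth] = reached[0] if reached else None

print(trees_needed)
```

With this cutoff, the shallowest trees need the full 200 rounds while the deeper settings get there by 100.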

Further, we would expect the increased complexity provided by deeper individual trees to result in greater overfitting of the training data, which would be exacerbated by having more trees, in turn resulting in a lower cross validation score. We don’t see this here as our trees are not that deep, nor do we have too many of them. Exploring this expectation is left as an exercise for you.

In this post, you discovered how to tune the number and depth of decision trees when using gradient boosting with XGBoost in Python.

Specifically, you learned:

- How to tune the number of decision trees in an XGBoost model.
- How to tune the depth of decision trees in an XGBoost model.
- How to jointly tune the number of trees and tree depth in an XGBoost model.

Do you have any questions about the number or size of decision trees in your gradient boosting model or about this post? Ask your questions in the comments and I will do my best to answer.
