A problem with gradient boosted decision trees is that they are quick to learn and overfit training data.

One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost documentation).

In this post you will discover the effect of the learning rate in gradient boosting and how to tune it on your machine learning problem using the XGBoost library in Python.

After reading this post you will know:

- The effect learning rate has on the gradient boosting model.
- How to tune the learning rate on your machine learning problem.
- How to tune the trade-off between the number of boosted trees and learning rate on your problem.


Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.


## Slow Learning in Gradient Boosting with a Learning Rate

Gradient boosting involves creating and adding trees to the model sequentially.

New trees are created to correct the residual errors in the predictions from the existing sequence of trees.

The effect is that the model can quickly fit, then overfit the training dataset.

A technique to slow down the learning in the gradient boosting model is to apply a weighting factor for the corrections by new trees when added to the model.

This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool.

Naive gradient boosting is the same as gradient boosting with shrinkage where the shrinkage factor is set to 1.0. Setting values less than 1.0 makes a smaller correction for each tree added to the model, which in turn means that more trees must be added to reach the same fit.
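To make the effect of shrinkage concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not XGBoost's implementation: it boosts depth-1 regression trees with squared error on a tiny made-up dataset, and the helper names `fit_stump` and `boost` are hypothetical. Each new tree is fit to the current residuals, and its correction is multiplied by the learning rate before being added.

```python
# Minimal gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: this is not how XGBoost is implemented internally.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # exclude the largest value so the right-hand side is never empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees, learning_rate):
    """Fit n_trees stumps sequentially; return training MSE after each tree."""
    prediction = [sum(y) / len(y)] * len(y)  # start from the mean of y
    history = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        # shrinkage: only a fraction of each tree's correction is added
        prediction = [
            pi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, pi in zip(x, prediction)
        ]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y))
    return history

# made-up toy data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

for rate in (1.0, 0.3, 0.1):
    history = boost(x, y, n_trees=20, learning_rate=rate)
    print("learning_rate=%.1f  MSE after 1 tree: %.4f  after 20 trees: %.4f"
          % (rate, history[0], history[-1]))
```

The first tree reduces the training error most when the learning rate is 1.0. Smaller rates apply only a fraction of each correction, so they generally need more trees to reach a comparable training error.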

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Let’s investigate the effect of the learning rate on a standard machine learning dataset.

## Problem Description: Otto Dataset

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign up to Kaggle to be able to download it). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes 93 obfuscated features of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to predict, for each new product, an array of probabilities over the 9 categories. Models are evaluated using multiclass logarithmic loss (also called cross entropy).
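As a rough sketch of how multiclass logarithmic loss is computed, the snippet below averages the negative log of the probability assigned to the true class. The function name `multiclass_log_loss` and the 3-class toy data are hypothetical and are not taken from the Otto dataset; the clipping constant is also an assumption made to avoid taking the log of zero.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

# hypothetical 3-class toy data, not taken from the Otto dataset
y_true = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
uniform = [[1 / 3] * 3 for _ in y_true]

print(multiclass_log_loss(y_true, confident))
print(multiclass_log_loss(y_true, uniform))  # equals ln(3) for 3 classes
```

A model that spreads its probability evenly over K classes scores ln(K), which is a useful baseline when reading the log loss values reported later in this post.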

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

## Tuning Learning Rate in XGBoost

When creating gradient boosting models with XGBoost using the scikit-learn wrapper, the **learning_rate** parameter can be set to control the weighting of new trees added to the model.

We can use the grid search capability in scikit-learn to evaluate the effect on logarithmic loss of training a gradient boosting model with different learning rate values.

We will hold the number of trees constant at the default of 100 and evaluate a suite of standard values for the learning rate on the Otto dataset.

```python
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
```

There are 6 variations of learning rate to be tested and each variation will be evaluated using 10-fold cross validation, meaning that there is a total of 6×10 or 60 XGBoost models to be trained and evaluated.

The log loss for each learning rate will be printed as well as the value that resulted in the best performance.

```python
# XGBoost on Otto dataset, Tune learning_rate
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X (columns 0-93: id plus the 93 features) and y (column 94: class label)
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot
pyplot.errorbar(learning_rate, means, yerr=stds)
pyplot.title("XGBoost learning_rate vs Log Loss")
pyplot.xlabel('learning_rate')
pyplot.ylabel('Log Loss')
pyplot.savefig('learning_rate.png')
```

Running this example prints the best result as well as the log loss for each of the evaluated learning rates.

```
Best: -0.001156 using {'learning_rate': 0.2}
-2.155497 (0.000081) with: {'learning_rate': 0.0001}
-1.841069 (0.000716) with: {'learning_rate': 0.001}
-0.597299 (0.000822) with: {'learning_rate': 0.01}
-0.001239 (0.001730) with: {'learning_rate': 0.1}
-0.001156 (0.001684) with: {'learning_rate': 0.2}
-0.001158 (0.001666) with: {'learning_rate': 0.3}
```

Interestingly, we can see that the best learning rate was 0.2.

This is a high learning rate and it suggests that perhaps the default number of trees of 100 is too low and needs to be increased.
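The scores are negative because scikit-learn's `neg_log_loss` scorer reports the negated log loss, so that a larger score is always better. As a small sketch of how to read them, the snippet below copies the mean scores reported above into a dictionary (the variable names are illustrative), picks the largest score, and flips the sign to recover the conventional log loss.

```python
# mean_test_score values reported in the run above (negated log loss)
neg_log_loss = {
    0.0001: -2.155497,
    0.001: -1.841069,
    0.01: -0.597299,
    0.1: -0.001239,
    0.2: -0.001156,
    0.3: -0.001158,
}

# the grid search maximises the score, so the best setting has the largest value
best_rate = max(neg_log_loss, key=neg_log_loss.get)

# flip the sign to recover the conventional (lower is better) log loss
log_loss = {rate: -score for rate, score in neg_log_loss.items()}

print("best learning_rate:", best_rate)
for rate, loss in sorted(log_loss.items()):
    print("learning_rate=%g  log loss=%.6f" % (rate, loss))
```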

We can also plot the effect of the learning rate on the (inverted) log loss scores, although the log10-like spread of the chosen learning_rate values means that most are squashed down the left-hand side of the plot near zero.
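One possible way to avoid the squashing, sketched here as a suggestion rather than as part of the original listing, is to draw the x-axis on a log scale. The values are copied from the results reported above, and the output file name `learning_rate_logx.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use('Agg')  # render to a file without needing a display
from matplotlib import pyplot

# mean and standard deviation of the scores, copied from the run above
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
means = [-2.155497, -1.841069, -0.597299, -0.001239, -0.001156, -0.001158]
stds = [0.000081, 0.000716, 0.000822, 0.001730, 0.001684, 0.001666]

figure, axes = pyplot.subplots()
axes.errorbar(learning_rate, means, yerr=stds, marker='o')
axes.set_xscale('log')  # space the values by order of magnitude
axes.set_title('XGBoost learning_rate vs Log Loss')
axes.set_xlabel('learning_rate (log scale)')
axes.set_ylabel('Negated Log Loss')
figure.savefig('learning_rate_logx.png')
```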

Next, we will look at varying the number of trees whilst varying the learning rate.

## Tuning Learning Rate and the Number of Trees in XGBoost

Smaller learning rates generally require more trees to be added to the model.

We can explore this relationship by evaluating a grid of parameter pairs. The number of decision trees will be varied from 100 to 500 and the learning rate varied on a log10 scale from 0.0001 to 0.1.

```python
n_estimators = [100, 200, 300, 400, 500]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
```

There are 5 variations of **n_estimators** and 4 variations of **learning_rate**. Each combination will be evaluated using 10-fold cross validation, so that is a total of 4x5x10 or 200 XGBoost models that must be trained and evaluated.
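The count can be checked with a few lines of plain Python. This sketch uses `itertools.product` as a stand-in for the grid that scikit-learn builds internally. It assumes that parameter names are iterated in sorted order, so `learning_rate` forms the outer loop and `n_estimators` varies fastest, which is consistent with the order of the results printed later in this section and is why the flat list of scores can be reshaped into one row per learning rate.

```python
from itertools import product

n_estimators = [100, 200, 300, 400, 500]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
n_folds = 10

# learning_rate is the outer loop and n_estimators the inner loop
grid = list(product(learning_rate, n_estimators))

print("parameter combinations:", len(grid))
print("models to fit:", len(grid) * n_folds)

# group the flat list into one row per learning rate
width = len(n_estimators)
rows = [grid[i:i + width] for i in range(0, len(grid), width)]
for rate, row in zip(learning_rate, rows):
    print(rate, [trees for _, trees in row])
```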

The expectation is that for a given learning rate, performance will improve and then plateau as the number of trees is increased. The full code listing is provided below.

```python
# XGBoost on Otto dataset, Tune learning_rate and n_estimators
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
import numpy
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X (columns 0-93: id plus the 93 features) and y (column 94: class label)
X = dataset[:,0:94]
y = dataset[:,94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search
model = XGBClassifier()
n_estimators = [100, 200, 300, 400, 500]
learning_rate = [0.0001, 0.001, 0.01, 0.1]
param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot results: one series per learning rate
scores = numpy.array(means).reshape(len(learning_rate), len(n_estimators))
for i, value in enumerate(learning_rate):
    pyplot.plot(n_estimators, scores[i], label='learning_rate: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.savefig('n_estimators_vs_learning_rate.png')
```

Running the example prints the best combination as well as the log loss for each evaluated pair.

```
Best: -0.001152 using {'n_estimators': 300, 'learning_rate': 0.1}
-2.155497 (0.000081) with: {'n_estimators': 100, 'learning_rate': 0.0001}
-2.115540 (0.000159) with: {'n_estimators': 200, 'learning_rate': 0.0001}
-2.077211 (0.000233) with: {'n_estimators': 300, 'learning_rate': 0.0001}
-2.040386 (0.000304) with: {'n_estimators': 400, 'learning_rate': 0.0001}
-2.004955 (0.000373) with: {'n_estimators': 500, 'learning_rate': 0.0001}
-1.841069 (0.000716) with: {'n_estimators': 100, 'learning_rate': 0.001}
-1.572384 (0.000692) with: {'n_estimators': 200, 'learning_rate': 0.001}
-1.364543 (0.000699) with: {'n_estimators': 300, 'learning_rate': 0.001}
-1.196490 (0.000713) with: {'n_estimators': 400, 'learning_rate': 0.001}
-1.056687 (0.000728) with: {'n_estimators': 500, 'learning_rate': 0.001}
-0.597299 (0.000822) with: {'n_estimators': 100, 'learning_rate': 0.01}
-0.214311 (0.000929) with: {'n_estimators': 200, 'learning_rate': 0.01}
-0.080729 (0.000982) with: {'n_estimators': 300, 'learning_rate': 0.01}
-0.030533 (0.000949) with: {'n_estimators': 400, 'learning_rate': 0.01}
-0.011769 (0.001071) with: {'n_estimators': 500, 'learning_rate': 0.01}
-0.001239 (0.001730) with: {'n_estimators': 100, 'learning_rate': 0.1}
-0.001153 (0.001702) with: {'n_estimators': 200, 'learning_rate': 0.1}
-0.001152 (0.001704) with: {'n_estimators': 300, 'learning_rate': 0.1}
-0.001153 (0.001708) with: {'n_estimators': 400, 'learning_rate': 0.1}
-0.001153 (0.001708) with: {'n_estimators': 500, 'learning_rate': 0.1}
```

We can see that the best result observed was a learning rate of 0.1 with 300 trees.

It is hard to pick out trends from the raw data and small negative log loss results. Below is a plot of each learning rate as a series showing log loss performance as the number of trees is varied.

We can see that the expected general trend holds, where the performance (inverted log loss) improves as the number of trees is increased.

Performance is generally poor for the smaller learning rates, suggesting that a much larger number of trees may be required. We may need to increase the number of trees to many thousands which may be quite computationally expensive.

The results for **learning_rate=0.1** are obscured due to the large y-axis scale of the graph. We can extract the performance measures for just **learning_rate=0.1** and plot them directly.

```python
# Plot performance for learning_rate=0.1
from matplotlib import pyplot
n_estimators = [100, 200, 300, 400, 500]
loss = [-0.001239, -0.001153, -0.001152, -0.001153, -0.001153]
pyplot.plot(n_estimators, loss)
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost learning_rate=0.1 n_estimators vs Log Loss')
pyplot.show()
```

Running this code shows performance improving as trees are added, followed by a plateau in performance across 400 and 500 trees.

## Summary

In this post you discovered the effect of weighting the addition of new trees to a gradient boosting model, called shrinkage or the learning rate.

Specifically, you learned:

- That adding a learning rate is intended to slow down the adaptation of the model to the training data.
- How to evaluate a range of learning rate values on your machine learning problem.
- How to evaluate the relationship of varying both the number of trees and the learning rate on your problem.

Do you have any questions regarding shrinkage in gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer them.

Hi! How long does it take to run the first part “Tuning the Learning Rate” and what system are you running it on? Thanks.

I ran the examples on a large AWS instance, for example:

http://machinelearningmastery.com/train-xgboost-models-cloud-amazon-web-services/

Sorry, I do not recall how long it took. I believe no example took longer than a few hours.

Great! Thanks for the info!

“Tuning Learning Rate and the Number of Trees in XGBoost” Running this part is taking more time for me (completed 6 hours but still running).

Ouch, I think I may have run it on a large AWS instance with 32 cores.

What is the use of the learning rate and what does it represent? Can you please give an intuitive explanation?

The learning rate makes the boosting process more or less conservative, e.g. to correct or boost more or less based on the results of the previously added tree.

Excellent and useful article. I applied it to my data and it helped me choose the learning rate and n_estimators perfectly, which improved my results a lot.

Thanks,

Thanks, well done!

The first column of the CSV is the ID. Isn’t this feature useless?

Thank you

Yes, it often is.

How do you make predictions with the tuned XGBoost model? Do you just pass the learning rate and number of trees as parameters to XGBClassifier? Could you add that code to the article as well?

After the model is fit, you can save it and use it to start making predictions.

This is called creating a final model, more here:

https://machinelearningmastery.com/train-final-machine-learning-model/

Hi. Is it possible to show how a tree is built in gradient boosting for a binary classification problem? I am curious how exactly the tree is built: which function is used to determine the split, and how the results from each tree are added to compute the predicted class. If possible, use a simple example with a learning rate of 0.1. I am familiar with a single CART tree, but so far I have not been able to understand gradient boosted trees. Only recently I realized that in gradient boosting each tree is built as a regression tree and the classification is derived from the probability value. I would appreciate it a lot if you could show one example as requested. Thanks

Thanks for the suggestion.

Hey, can you please tell me why we need a learning rate or shrinkage parameter in GB when we already have the number of base learners as a hyperparameter and the pre-computed γ_m from the gradient minimization step? How does the learning rate add more value here?

This is a shrinkage factor, it is explained in the post. Perhaps a re-read is in order?

Instead of having discrete values for learning rate, an approach with lower and upper bounds can be tried as well. So, something similar to learning_rate = scipy.stats.uniform(lower_bound, upper_bound)

Thanks.
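One detail worth noting for anyone trying this: `scipy.stats.uniform` takes a location and a width rather than a lower and an upper bound. The sketch below, with hypothetical bounds chosen only for illustration, builds a uniform distribution over a range and, as an alternative that may suit learning rates spread over orders of magnitude, a log-uniform one.

```python
from scipy.stats import loguniform, uniform

lower_bound, upper_bound = 0.01, 0.3  # hypothetical bounds for illustration

# uniform(loc, scale) covers [loc, loc + scale], so scale is the width
linear = uniform(loc=lower_bound, scale=upper_bound - lower_bound)

# loguniform(a, b) spreads samples evenly across orders of magnitude
log_scaled = loguniform(lower_bound, upper_bound)

linear_samples = linear.rvs(size=1000, random_state=7)
log_samples = log_scaled.rvs(size=1000, random_state=7)

print("uniform range:    %.4f to %.4f" % (linear_samples.min(), linear_samples.max()))
print("loguniform range: %.4f to %.4f" % (log_samples.min(), log_samples.max()))

# either distribution could be passed to scikit-learn's RandomizedSearchCV
# in place of a list, e.g. param_distributions={'learning_rate': log_scaled}
```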

Hi, I have a question here.

I read a paper and it seems that we need to find the best learning rate (the one that minimizes our loss function) after we fit each base learner. Does this mean that in the Python library we don’t need to find the best learning rate and simply set it as a constant?

never mind. I abused notation here.

No problem.