Avoid Overfitting By Early Stopping With XGBoost In Python

Overfitting is a problem with sophisticated non-linear learning algorithms like gradient boosting.

In this post you will discover how you can use early stopping to limit overfitting with XGBoost in Python.

After reading this post, you will know:

  • About early stopping as an approach to reducing overfitting of training data.
  • How to monitor the performance of an XGBoost model during training and plot the learning curve.
  • How to use early stopping to prematurely stop the training of an XGBoost model at an optimal epoch.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
Avoid Overfitting By Early Stopping With XGBoost In Python

Avoid Overfitting By Early Stopping With XGBoost In Python
Photo by Michael Hamann, some rights reserved.

Don’t Waste Another Day Learning XGBoost
“The Slow Way”

XGBoost With Python Mini-CourseDevelop your first XGBoost TODAY with my free XGBoost-With-Python mini-course.

...with an XG-Boost-With-Python “Cheat Sheet” you can download right now, so you can start learning this award-winning algorithm
“The Fast Way”

Download Your FREE Mini-Course


Early Stopping to Avoid Overfitting

Early stopping is an approach to training complex machine learning models to avoid overfitting.

It works by monitoring the performance of the model that is being trained on a separate test dataset and stopping the training procedure once the performance on the test dataset has not improved after a fixed number of training iterations.

It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.

The performance measure may be the loss function that is being optimized to train the model (such as logarithmic loss), or an external metric of interest to the problem in general (such as classification accuracy).

Monitoring Training Performance With XGBoost

The XGBoost model can evaluate and report on the performance on a test set for the the model during training.

It supports this capability by specifying both an test dataset and an evaluation metric on the call to model.fit() when training the model and specifying verbose output.

For example, we can report on the binary classification error rate (“error“) on a standalone test set (eval_set) while training an XGBoost model as follows:

XGBoost supports a suite of evaluation metrics not limited to:

  • rmse” for root mean squared error.
  • mae” for mean absolute error.
  • logloss” for binary logarithmic loss and “mlogloss” for multi-class log loss (cross entropy).
  • error” for classification error.
  • auc” for area under ROC curve.

The full list is provided in the “Learning Task Parameters” section of the XGBoost Parameters webpage.

For example, we can demonstrate how to track the performance of the training of an XGBoost model on the Pima Indians onset of diabetes dataset, available from the UCI Machine Learning Repository.

The full example is provided below:

Running this example trains the model on 67% of the data and evaluates the model every training epoch on a 33% test dataset.

The classification error is reported each iteration and finally the classification accuracy is reported at the end.

The output is provided below, truncated for brevity. We can see that the classification error is reported each training iteration (after each boosted tree is added to the model).

Reviewing all of the output, we can see that the model performance on the test set sits flat and even gets worse towards the end of training.

Evaluate XGBoost Models With Learning Curves

We can retrieve the performance of the model on the evaluation dataset and plot it to get insight into how learning unfolded while training.

We provide an array of X and y pairs to the eval_metric argument when fitting our XGBoost model. In addition to a test set, we can also provide the training dataset. This will provide a report on how well the model is performing on both training and test sets during training.

For example:

In addition, the performance of the model on each evaluation set is stored and made available by the model after training by calling the model.evals_result() function. This returns a dictionary of evaluation datasets and scores, for example:

This will print results like the following (truncated for brevity):

Each of ‘validation_0‘ and ‘validation_1‘ correspond to the order that datasets were provided to the eval_set argument in the call to fit().

A specific array of results, such as for the first dataset and the error metric can be accessed as follows:

Additionally, we can specify more evaluation metrics to evaluate and collect by providing an array of metrics to the eval_metric argument of the fit() function.

We can then use these collected performance measures to create a line plot and gain further insight into how the model behaved on train and test datasets over training epochs.

Below is the complete code example showing how the collected results can be visualized on a line plot.

Running this code reports the classification error on both the train and test datasets each epoch. We can turn this off by setting verbose=False (the default) in the call to the fit() function.

Two plots are created. The first shows the logarithmic loss of the XGBoost model for each epoch on the training and test datasets.

XGBoost Learning Curve Log Loss

XGBoost Learning Curve Log Loss

The second plot shows the classification error of the XGBoost model for each epoch on the training and test datasets.

XGBoost Learning Curve Classification Error

XGBoost Learning Curve Classification Error

From reviewing the logloss plot, it looks like there is an opportunity to stop the learning early, perhaps somewhere around epoch 20 to epoch 40.

We see a similar story for classification error, where error appears to go back up at around epoch 40.

Early Stopping With XGBoost

XGBoost supports early stopping after a fixed number of iterations.

In addition to specifying a metric and test dataset for evaluation each epoch, you must specify a window of the number of epochs over which no improvement is observed. This is specified in the early_stopping_rounds parameter.

For example, we can check for no improvement in logarithmic loss over the 10 epochs as follows:

If multiple evaluation datasets or multiple evaluation metrics are provided, then early stopping will use the last in the list.

Below provides a full example for completeness with early stopping.

Running the example provides the following output, truncated for brevity:

We can see that the model stopped training at epoch 42 (close to what we expected by our manual judgment of learning curves) and that the model with the best loss was observed at epoch 32.

It is generally a good idea to select the early_stopping_rounds as a reasonable function of the total number of training epochs (10% in this case) or attempt to correspond to the period of inflection points as might be observed on plots of learning curves.


In this post you discovered about monitoring performance and early stopping.

You learned:

  • About the early stopping technique to stop model training before the model overfits the training data.
  • How to monitor the performance of XGBoost models during training and to plot learning curves.
  • How to configure early stopping when training XGBoost models.

Do you have any questions about overfitting or about this post? Ask your questions in the comments and I will do my best to answer.

Want To Learn The Algorithm Winning Competitions?

Develop Your Own XGBoost Models in Minutes

...with just a few lines of Python

Discover how in my new Ebook: XGBoost With Python

It covers self-study tutorials on topics like:
Algorithm Fundamentals, Scaling, Hyperparameter Tuning, and much more...

Bring The Power of XGBoost To
Your Own Projects

Skip the Academics. Just Results.

Click to learn more.

12 Responses to Avoid Overfitting By Early Stopping With XGBoost In Python

  1. shivam October 19, 2016 at 2:43 am #

    Hi Jason,

    Thanks for neat and nice explanations.

    Can you please elaborate on below things –

    1. What should we do if the error on train is higher as compared to error on test.
    2. Apart for ealry stopping how can we tune regularization parameters effectively ?

  2. Andy November 10, 2016 at 10:30 am #

    Hi Jason,

    Thank you so much for the all your posts. Your site really helped to get me started.

    Quick question: Is the eval_metric and eval_set arguments available in .fit() for other models besides XGBoost? Say KNN, LogReg or SVM?

    Also, Can those arguments be used in grid/random search? e.g. grid_search.fit(X, y, eval_metric “error”, eval_set= […. ….] )

  3. VSP January 18, 2017 at 3:50 am #

    Hi Jason,
    Thank you for this post, it is very handy and clear.
    In the case that I have a task that is measured by another metric, as F-score, will we find the optimal epoch in the loss learning curve or in this new metric? Are there proportional, even with Accuracy?

    • Jason Brownlee January 18, 2017 at 10:17 am #

      I would suggest using the new metric, but try both approaches and compare the results.

      Make decisions based on data.

  4. Jimmy March 9, 2017 at 8:51 pm #

    Hi Jason
    Thanks for your sharing!
    I have a question that since the python API document mention that

    Early stopping returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit.

    So the model we get when early stopping occur may not be the best model, right?
    how can we get that best model?

    • Jason Brownlee March 10, 2017 at 9:26 am #

      Hi Jimmy,

      Early stopping may not be the best method to capture the “best” model, however you define that (train or test performance and the metric).

      You might need to write a custom callback function to save the model if it has a lower score than the best seen so far.

      Sorry, I do not have an example, but I’d expect you will need to use the native xgboost API rather than sklearn wrappers.

  5. Shud March 13, 2017 at 5:02 pm #

    You’ve selected early stopping rounds = 10, but why did the total epochs reached 42. Since you said the best may not be the best, then how do i get to control the number of epochs in my final model?

    • Jason Brownlee March 14, 2017 at 8:14 am #

      Great question shud,

      The early stopping does not trigger unless there is no improvement for 10 epochs. It is not a limit on the total number of epochs.

      • Shud March 14, 2017 at 3:08 pm #

        Since the model stopped at epoch 32, my model is trained till that and my predictions are based out of 32 epochs?

Leave a Reply