Avoid Overfitting By Early Stopping With XGBoost In Python

Overfitting is a problem with sophisticated non-linear learning algorithms like gradient boosting.

In this post you will discover how you can use early stopping to limit overfitting with XGBoost in Python.

After reading this post, you will know:

  • About early stopping as an approach to reducing overfitting of training data.
  • How to monitor the performance of an XGBoost model during training and plot the learning curve.
  • How to use early stopping to prematurely stop the training of an XGBoost model at an optimal epoch.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.

Early Stopping to Avoid Overfitting

Early stopping is an approach to training complex machine learning models to avoid overfitting.

It works by monitoring the performance of the model that is being trained on a separate test dataset and stopping the training procedure once the performance on the test dataset has not improved after a fixed number of training iterations.

It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.

The performance measure may be the loss function that is being optimized to train the model (such as logarithmic loss), or an external metric of interest to the problem in general (such as classification accuracy).
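The stopping rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not XGBoost's own implementation: the function name early_stopping_point, the patience argument and the loss values are all made up for this example, and lower scores are assumed to be better.

```python
def early_stopping_point(scores, patience):
    """Return (stop_index, best_index) for a sequence of losses.

    Training stops once `patience` consecutive iterations have passed
    without beating the best (lowest) score seen so far.
    """
    best_index = 0
    for i, score in enumerate(scores):
        if score < scores[best_index]:
            best_index = i  # new best score on the held-out data
        elif i - best_index >= patience:
            return i, best_index  # no improvement for `patience` iterations
    # Never triggered: training ran for every iteration
    return len(scores) - 1, best_index


# Made-up validation losses, one per training iteration
losses = [0.60, 0.55, 0.52, 0.53, 0.54, 0.51, 0.52, 0.53, 0.54, 0.55]
stop, best = early_stopping_point(losses, patience=3)
print("Stopped at iteration %d, best iteration was %d" % (stop, best))
```

Note that the iteration where training stops is always later than the best iteration, by the size of the patience window.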

Monitoring Training Performance With XGBoost

The XGBoost model can evaluate and report on its performance on a test set during training.

You enable this by specifying both a test dataset and an evaluation metric on the call to model.fit() when training the model, and by requesting verbose output.

For example, we can report on the binary classification error rate (“error”) on a standalone test set (eval_set) while training an XGBoost model as follows:

XGBoost supports a suite of evaluation metrics, including but not limited to:

  • “rmse” for root mean squared error.
  • “mae” for mean absolute error.
  • “logloss” for binary logarithmic loss and “mlogloss” for multi-class log loss (cross entropy).
  • “error” for classification error.
  • “auc” for area under ROC curve.

The full list is provided in the “Learning Task Parameters” section of the XGBoost Parameters webpage.

For example, we can demonstrate how to track the performance of the training of an XGBoost model on the Pima Indians onset of diabetes dataset, available from the UCI Machine Learning Repository.

The full example is provided below:

Running this example trains the model on 67% of the data and evaluates the model every training epoch on a 33% test dataset.

The classification error is reported each iteration and finally the classification accuracy is reported at the end.

The output is provided below, truncated for brevity. We can see that the classification error is reported each training iteration (after each boosted tree is added to the model).

Reviewing all of the output, we can see that the model performance on the test set sits flat and even gets worse towards the end of training.

Evaluate XGBoost Models With Learning Curves

We can retrieve the performance of the model on the evaluation dataset and plot it to get insight into how learning unfolded while training.

We provide an array of X and y pairs to the eval_set argument when fitting our XGBoost model. In addition to a test set, we can also provide the training dataset. This will provide a report on how well the model is performing on both training and test sets during training.

For example:

In addition, the performance of the model on each evaluation set is stored and made available by the model after training by calling the model.evals_result() function. This returns a dictionary of evaluation datasets and scores, for example:

This will print results like the following (truncated for brevity):

The keys ‘validation_0’ and ‘validation_1’ correspond to the order in which the datasets were provided to the eval_set argument in the call to fit().

A specific array of results, such as for the first dataset and the error metric can be accessed as follows:

Additionally, we can specify more evaluation metrics to evaluate and collect by providing an array of metrics to the eval_metric argument of the fit() function.

We can then use these collected performance measures to create a line plot and gain further insight into how the model behaved on train and test datasets over training epochs.

Below is the complete code example showing how the collected results can be visualized on a line plot.

Running this code reports the classification error on both the train and test datasets each epoch. We can turn this off by setting verbose=False in the call to the fit() function.

Two plots are created. The first shows the logarithmic loss of the XGBoost model for each epoch on the training and test datasets.

Figure: XGBoost Learning Curve Log Loss

The second plot shows the classification error of the XGBoost model for each epoch on the training and test datasets.

Figure: XGBoost Learning Curve Classification Error

From reviewing the logloss plot, it looks like there is an opportunity to stop the learning early, perhaps somewhere around epoch 20 to epoch 40.

We see a similar story for classification error, where error appears to go back up at around epoch 40.

Early Stopping With XGBoost

XGBoost supports early stopping after a fixed number of iterations without improvement.

In addition to specifying a metric and test dataset for evaluation each epoch, you must specify a window of the number of epochs over which no improvement is observed. This is specified in the early_stopping_rounds parameter.

For example, we can check for no improvement in logarithmic loss over 10 epochs as follows:

If multiple evaluation datasets or multiple evaluation metrics are provided, then early stopping will use the last in the list.

A full example with early stopping is provided below for completeness.

Running the example provides the following output, truncated for brevity:

We can see that the model stopped training at epoch 42 (close to what we expected by our manual judgment of learning curves) and that the model with the best loss was observed at epoch 32.

It is generally a good idea to select early_stopping_rounds as a reasonable function of the total number of training epochs (10% in this case), or to match it to the period of inflection points observed on plots of learning curves.


In this post you discovered how to monitor performance and use early stopping with XGBoost in Python.

You learned:

  • About the early stopping technique to stop model training before the model overfits the training data.
  • How to monitor the performance of XGBoost models during training and to plot learning curves.
  • How to configure early stopping when training XGBoost models.

Do you have any questions about overfitting or about this post? Ask your questions in the comments and I will do my best to answer.

32 Responses to Avoid Overfitting By Early Stopping With XGBoost In Python

  1. shivam October 19, 2016 at 2:43 am #

    Hi Jason,

    Thanks for neat and nice explanations.

    Can you please elaborate on below things –

    1. What should we do if the error on train is higher than the error on test?
    2. Apart from early stopping, how can we tune regularization parameters effectively?

  2. Andy November 10, 2016 at 10:30 am #

    Hi Jason,

    Thank you so much for the all your posts. Your site really helped to get me started.

    Quick question: Is the eval_metric and eval_set arguments available in .fit() for other models besides XGBoost? Say KNN, LogReg or SVM?

    Also, can those arguments be used in grid/random search? e.g. grid_search.fit(X, y, eval_metric=“error”, eval_set= […. ….] )

  3. VSP January 18, 2017 at 3:50 am #

    Hi Jason,
    Thank you for this post, it is very handy and clear.
    In the case that I have a task that is measured by another metric, such as F-score, will we find the optimal epoch in the loss learning curve or in this new metric? Are they proportional, even with Accuracy?

    • Jason Brownlee January 18, 2017 at 10:17 am #

      I would suggest using the new metric, but try both approaches and compare the results.

      Make decisions based on data.

  4. Jimmy March 9, 2017 at 8:51 pm #

    Hi Jason
    Thanks for your sharing!
    I have a question that since the python API document mention that

    Early stopping returns the model from the last iteration (not the best one). If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit.

    So the model we get when early stopping occur may not be the best model, right?
    how can we get that best model?

    • Jason Brownlee March 10, 2017 at 9:26 am #

      Hi Jimmy,

      Early stopping may not be the best method to capture the “best” model, however you define that (train or test performance and the metric).

      You might need to write a custom callback function to save the model if it has a lower score than the best seen so far.

      Sorry, I do not have an example, but I’d expect you will need to use the native xgboost API rather than sklearn wrappers.

  5. Shud March 13, 2017 at 5:02 pm #

    You’ve selected early stopping rounds = 10, but why did the total epochs reached 42. Since you said the best may not be the best, then how do i get to control the number of epochs in my final model?

    • Jason Brownlee March 14, 2017 at 8:14 am #

      Great question shud,

      The early stopping does not trigger unless there is no improvement for 10 epochs. It is not a limit on the total number of epochs.

      • Shud March 14, 2017 at 3:08 pm #

        Since the model stopped at epoch 32, my model is trained till that and my predictions are based out of 32 epochs?

  6. G April 1, 2017 at 12:23 am #

    Hi Jason,

    I have a question regarding the use of the test set for early stopping to avoid overfitting…
    Shouldn’t you use the train set? Shouldn’t we use the test set only for testing the model and not for optimizing it? (I see early stopping as model optimization)


    • Jason Brownlee April 1, 2017 at 5:56 am #

      Early stopping uses a separate dataset like a test or validation dataset to avoid overfitting.

      If we used the training dataset alone, we would not get the benefits of early stopping. How would we know when to stop?

      • G April 1, 2017 at 8:18 pm #

        I thought we would stop when the performances on the training set don’t improve in xx rounds to avoid to create a lot of not useful trees. Then use the selected number of estimator to compute the performances on the test set. Otherwise we might risk to evaluate our model using overoptimistic results. ie. we might get very high AUC because we select the best model, but in a real world experiment where we do not have labels our performances will decrease a lot. The use of the earlystopping on the evaluation set is legitim.. Could you please elaborate and give your opinion?

        Thank you

        PS I really like your posts..

        • G April 1, 2017 at 8:23 pm #

          In short my point is: how can we use the early stopping on the test set if (in principle) we should use the labels of the test set only to evaluate the results of our model and not to “train/optimize” further the model…

          • Jason Brownlee April 2, 2017 at 6:27 am #

            Often we split data into train/test/validation to avoid optimistic results.

  7. G. April 3, 2017 at 4:42 am #

    Hi Jason, I agree. However in your post you wrote:

    “It works by monitoring the performance of the model that is being trained on a separate test dataset and stopping the training procedure once the performance on the test dataset has not improved after a fixed number of training iterations.

    It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.”

    This could lead to the error of using the early stopping on the final test set while it should be used on the validation set or directly on the training to don’t create too many trees.

    Could you confirm this?


    • Jason Brownlee April 4, 2017 at 9:09 am #

      Early stopping requires two datasets, a training and a validation or test set.

  8. Ogunleye May 11, 2017 at 8:27 pm #

    Hello sir,

    Thank you for the good work. I adapted your code to my dataset sir, my ‘validation_0’ error stays at zero only ‘validation_1’ error changes. What does that imply sir? Thank you and kind regards sir.

    • Jason Brownlee May 12, 2017 at 7:40 am #

      Sorry, I’m not sure I understand.

      Perhaps you could give more details or an example?

      • omoggbeyin May 14, 2017 at 12:01 pm #

        I am sorry for not making much sense initially. I used your XGBoost code and validation_0 stayed at value 0 while validation_1 also stayed at a constant value of 0.0123 throughout the training. I just want your expert advice on why it is constant sir. Kind regards.

        • Jason Brownlee May 15, 2017 at 5:51 am #

          That is odd.

          Try different configuration, try different data. See if things change. You may have to explore a little to debug what is going on.

  9. Omogbehin May 13, 2017 at 9:49 am #

    [56] validation_0-error:0 validation_0-logloss:0.02046 validation_1-error:0 validation_1-logloss:0.028423
    [57] validation_0-error:0 validation_0-logloss:0.020461 validation_1-error:0 validation_1-logloss:0.028407
    [58] validation_0-error:0 validation_0-logloss:0.020013 validation_1-error:0 validation_1-logloss:0.027592
    Stopping. Best iteration:
    [43] validation_0-error:0 validation_0-logloss:0.020612 validation_1-error:0 validation_1-logloss:0.027545

    Accuracy: 100.00%

  10. omogbeyin May 14, 2017 at 12:15 pm #

    Thank you for the good work sir.

  11. davalo May 23, 2017 at 8:38 am #

    Thanks for your post!

  12. Markos Flavio August 11, 2017 at 11:10 am #

    Hi Jason, I have a question about early-stopping.

    After saving the model that achieves the best validation error (say on epoch 50), how can I retrain it (to achieve better results) using this knowledge?

    Is it valid to retrain it on a mix of training and validation sets considering those 50 epochs and expect to get the best result again? I know that some variance may occur after adding some more examples, but considering standard proportion values of dataset cardinalities (train=0.6, cv= 0.2, test=0.2), retraining the model using validation data is sufficient to ruin my previous result of 50 epochs? What’s the best practical in, say, a ML competition?

    Another quick question: how do you manage validation sets for hyperparameterization and early stopping? Do you use the same set?

    Thank you very much, Markos.

    • Jason Brownlee August 12, 2017 at 6:43 am #

      Good question.

      There’s no clear answer, you must experiment.

      Perhaps you could train 5-10 models for 50 epochs and ensemble them. Perhaps compare the ensemble results to one-best model found via early stopping.

      I split the training set into training and validation, see this post:

      • Markos Flavio August 12, 2017 at 6:49 am #

        Thank you for the answer. It’s awesome having someone with great knowledge in the field answering our questions.
