How to Evaluate Gradient Boosting Models with XGBoost in Python

The goal of predictive modeling is to develop a model that is accurate on unseen data.

This can be achieved using statistical techniques that carefully use the training dataset to estimate the performance of the model on new, unseen data.

In this tutorial you will discover how you can evaluate the performance of your gradient boosting models with XGBoost in Python.

After completing this tutorial, you will know:

  • How to evaluate the performance of your XGBoost models using train and test datasets.
  • How to evaluate the performance of your XGBoost models using k-fold cross validation.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
  • Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Photo by Timitrius, some rights reserved.

Evaluate XGBoost Models With Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.

We can take our original dataset and split it into two parts. Train the algorithm on the first part, then make predictions on the second part and evaluate the predictions against the expected results.

The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.

A downside of this technique is that it can have a high variance. This means that differences between the training and test datasets can result in meaningful differences in the estimate of model accuracy.

We can split the dataset into a train and test set using the train_test_split() function from the scikit-learn library. For example, we can split the dataset into a 67% and 33% split for training and test sets as follows:
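A minimal sketch of such a split is shown here. The arrays X and y are synthetic stand-ins rather than the tutorial dataset, and the seed value of 7 is an arbitrary assumption used only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 100 rows, 8 input columns, binary labels.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)

# Hold back 33% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

print(X_train.shape, X_test.shape)
```

Changing random_state produces a different partition of the same rows, which is one way to see the variance mentioned above.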

The full code listing is provided below using the Pima Indians onset of diabetes dataset, assumed to be in the current working directory.

Download the dataset and place it in your current working directory.

An XGBoost model with default configuration is fit on the training dataset and evaluated on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example summarizes the performance of the model on the test set.

Evaluate XGBoost Models With k-Fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.

It works by splitting the dataset into k parts (e.g. k=5 or k=10). Each part of the data is called a fold. The algorithm is trained on k-1 folds and tested on the one fold that was held back. This is repeated so that each fold of the dataset is given a chance to be the held-back test set.
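The fold mechanics can be illustrated with a small sketch. The ten dummy rows, k=5 and the seed are assumptions chosen so that the fold membership is easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows (an assumption) so the fold membership is easy to read.
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # Train on k-1 folds, test on the one fold that was held back.
    print("fold %d: train=%s test=%s" % (fold, train_idx.tolist(), test_idx.tolist()))
    held_out.extend(test_idx.tolist())
```

After the loop, held_out contains every row index exactly once, which is the property that lets each row contribute to the test score one time.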

After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more reliable because the algorithm is trained and evaluated multiple times on different data.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm’s performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.

We can use the k-fold cross validation support provided in scikit-learn. First we must create the KFold object, specifying the number of folds (since scikit-learn 0.18 the size of the dataset is no longer passed to the constructor). We can then use this scheme with the specific dataset. The cross_val_score() function from scikit-learn allows us to evaluate a model using the cross validation scheme and returns an array with one score for each model trained on each fold.

The full code listing for evaluating an XGBoost model with k-fold cross validation is provided below for completeness.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example summarizes the performance of the default model configuration on the dataset, including both the mean and standard deviation of the classification accuracy.

If you have many classes for a classification type predictive modeling problem or the classes are imbalanced (there are a lot more instances for one class than another), it can be a good idea to create stratified folds when performing cross validation.

This has the effect of enforcing the same distribution of classes in each fold as in the whole training dataset when performing the cross validation evaluation. The scikit-learn library provides this capability in the StratifiedKFold class.

Below is the same example modified to use stratified cross validation to evaluate an XGBoost model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example produces the following output.

What Techniques to Use When

  • Generally, k-fold cross validation with k set to 3, 5, or 10 is the gold standard for evaluating the performance of a machine learning algorithm on unseen data.
  • Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
  • Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions.

If in doubt, use 10-fold cross validation for regression problems and stratified 10-fold cross validation on classification problems.


Summary

In this tutorial, you discovered how you can evaluate your XGBoost models by estimating how well they are likely to perform on unseen data.

Specifically, you learned:

  • How to split your dataset into train and test subsets for training and evaluating the performance of your model.
  • How you can create k XGBoost models on different subsets of the dataset and average the scores to get a more robust estimate of model performance.
  • Heuristics to help choose between train-test split and k-fold cross validation for your problem.

Do you have any questions on how to evaluate the performance of XGBoost models or about this post? Ask your questions in the comments below and I will do my best to answer.

32 Responses to How to Evaluate Gradient Boosting Models with XGBoost in Python

  1. Agnes, January 27, 2017 at 9:07 pm

    Hi Jason,
    thank you for this article. You didn’t mention the Leave-One-Out cross-validation method.
    Is it the same logic as k-fold cross validation (except that the size of the test set is 1)?
    Would you recommend using Leave-One-Out or k-fold cross validation for a small dataset (approximately 2000 rows)?

    • Jason Brownlee, January 28, 2017 at 7:36 am

      Hi Agnes,

      Yes, it is like k-fold cross validation with k set to the number of patterns in the dataset, so each pattern is held out once.

      From my reading, you are better off using k-fold cross validation.

  2. Siva, May 3, 2017 at 8:15 pm

    Hi Jason, how do I find the accuracy for an XGBRegressor model?

  3. Whakeem, February 8, 2018 at 7:24 am

    Hi Jason,
    Does using cross_val_score already fit the model so it is ready to provide predictions?

  4. Aminatta, April 28, 2018 at 10:57 am

    Thanks for the tutorial. I’m still working on it, but I can say it is very understandable compared to others out there.

  5. satish, September 1, 2018 at 5:13 pm

    Hi Jason,

    Thanks for this tutorial, it’s simple and clear.
    I was working on an imbalanced dataset (1:9) classification problem. It worked well with XGBClassifier() and evaluated well with k-fold validation.
    Thanks a lot!

  6. ervin, November 22, 2018 at 10:54 am

    Hi Jason,

    in your examples — where would you implement early stopping?

  7. Aryan, March 27, 2019 at 9:37 pm

    Thanks Jason for the very elaborate explanation of the process.

  8. Gledson, May 11, 2019 at 9:08 pm

    Hello Jason Brownlee ,
    How are you?
    After you’ve done cross-validation, how do I get the best model to perform classification on my test data?

    • Jason Brownlee, May 12, 2019 at 6:42 am

      Choose the configuration that gave the best results, then fit a final model on all available data.

  9. Clare, July 30, 2019 at 6:09 pm

    Thanks, Jason, the tutorial helps a lot.
    However, I got stuck when working on an imbalanced dataset (1:15) classification problem. The model worked well with XGBClassifier() initially, with an AUC of 0.911 for the train set and 0.949 for the test set. Then, after I tuned the hyperparameters (max_depth, min_child_weight, gamma) using GridSearchCV, the AUC of the train and test sets dropped noticeably (0.892 and 0.917). I feel really confused. Are there any clues why this would happen?

    • Jason Brownlee, July 31, 2019 at 6:47 am

      Perhaps tuning the parameter reduced the capacity of the model. Perhaps continue the tuning project?

  10. GOPAL Behera, September 6, 2019 at 1:59 am

    I used the Big Mart dataset and split the data into train and test sets. After that I execute ,y_train); where my model is XGBClassifier(), and it runs successfully.

    But when I execute y_pred = model.predict(X_test) it gives a feature name mismatch error, as given below:

    ValueError Traceback (most recent call last)
    in ()
    1 # make predictions for test data
    ----> 2 y_pred = model.predict(X_test)
    3 predictions = [round(value) for value in y_pred]
    4 # evaluate predictions
    5 accuracy = accuracy_score(y_test, predictions)

    /home/gopal/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict(self, data, output_margin, ntree_limit, validate_features)
    770 output_margin=output_margin,
    771 ntree_limit=ntree_limit,
    --> 772 validate_features=validate_features)
    773 if output_margin:
    774 # If output_margin is active, simply return the scores

    /home/gopal/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
    1284 if validate_features:
    -> 1285 self._validate_features(data)
    1287 length = c_bst_ulong()

    /home/gopal/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
    1691 raise ValueError(msg.format(self.feature_names,
    -> 1692 data.feature_names))
    1694 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

    ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11'] [u'Item_Fat_Content', u'Item_Visibility', u'Item_Type', u'Item_MRP', u'Outlet_Size', u'Outlet_Location_Type', u'Outlet_Type', u'Outlet_Years', u'Item_Visibility_MeanRatio', u'Outlet', u'Identifier', u'Item_Weight']
    expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11 in input data
    training data did not have the following fields: Outlet_Years, Outlet_Size, Item_Visibility, Item_MRP, Item_Visibility_MeanRatio, Outlet_Location_Type, Item_Weight, Item_Type, Outlet, Identifier, Outlet_Type, Item_Fat_Content

    • Jason Brownlee, September 6, 2019 at 5:06 am

      Perhaps confirm that the two datasets have identical columns?

  11. GOPAL Behera, September 7, 2019 at 10:35 pm

    My train set and test set contain float values, but when I predict using the classifier it says continuous is not supported.

    • Jason Brownlee, September 8, 2019 at 5:18 am

      That is odd. Perhaps double check your data was loaded correctly?

  12. GOPAL Behera, September 7, 2019 at 10:43 pm

    Hi Jason, for XGBRegressor I got RMSE = 1043 for the Big Mart dataset, and the best score I got was 0.59974. Can I use the best score as my accuracy, as the RMSE value looks very large? Please suggest.

  13. Danny, December 12, 2019 at 4:01 pm

    Hi Jason,

    I just found this wonderful blog. I still have some questions about using XGBoost. I don’t know if I can ask for help from you.
    I am new to using XGBoost. I used XGBClassifier to build the model. I used GridSearchCV to create a tune-grid to find the optimal hyperparameters and I have gotten my final model. I used ‘auc’ as my classification metric. My question is that I use

    yPred = model.predict(Xtest),
    but the results (yPred) are float values ranging from 0 to 1. How do I decide the threshold value to map those values to 0 and 1?

    I saw you used round(value), which is equivalent to setting the threshold to 0.5, I think. Is there any rule that I need to follow to find the threshold value for my model? I am looking forward to your reply. Thank you so much.


    Yilin Wang

    • Jason Brownlee, December 13, 2019 at 5:53 am


      If you are using ROC AUC, you can use the threshold that achieves the best F-measure or J-metric directly.

      If unsure, test each threshold from the ROC curve against the F-measure score.

      I hope that helps.

  14. Johnny Lu, April 16, 2020 at 1:56 pm

    Hi Jason:

    Thanks for your tutorial.
    This tutorial is based on the sklearn API. Do you have any example of doing StratifiedKFold in XGBoost’s native API?


    • Jason Brownlee, April 17, 2020 at 6:14 am

      Sorry, I don’t have tutorials using the native APIs.

  15. Marcos, May 5, 2021 at 1:11 am

    Hello Jason,
    thanks for this tutorial. Is there a way to view the confusion matrix of every validation?

    • Jason Brownlee, May 5, 2021 at 6:13 am

      No, typically a confusion matrix is calculated for a single hold-out dataset.
