Last Updated on August 27, 2020

The goal of developing a predictive model is to develop a model that is accurate on unseen data.

This can be achieved using statistical techniques where the training dataset is carefully used to estimate the performance of the model on new and unseen data.

In this tutorial you will discover how you can evaluate the performance of your gradient boosting models with XGBoost in Python.

After completing this tutorial, you will know:

- How to evaluate the performance of your XGBoost models using train and test datasets.
- How to evaluate the performance of your XGBoost models using k-fold cross validation.


Let’s get started.

- **Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.
- **Update Mar/2018**: Added alternate link to download the dataset as the original appears to have been taken down.


## Evaluate XGBoost Models With Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets.

We can take our original dataset and split it into two parts. Train the algorithm on the first part, then make predictions on the second part and evaluate the predictions against the expected results.

The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of the speed, it is useful to use this approach when the algorithm you are investigating is slow to train.

A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of model accuracy.
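To make the variance concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical model whose per-row correctness is already fixed (the flags are made up, not real results) and only changes which 33% of rows land in the test set. The helper `holdout_accuracy` is an illustrative name, not part of any library. A real experiment would also retrain the model for each split, which adds further variation.

```python
import random

def holdout_accuracy(correct_flags, test_fraction, seed):
    """Accuracy measured on a randomly chosen test subset of the rows."""
    rng = random.Random(seed)
    indices = list(range(len(correct_flags)))
    rng.shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx = indices[:n_test]
    return sum(correct_flags[i] for i in test_idx) / n_test

# Hypothetical model: right on roughly 77% of 768 rows (made-up flags).
flag_rng = random.Random(0)
correct_flags = [1 if flag_rng.random() < 0.77 else 0 for _ in range(768)]

# Same model, ten different random test sets.
estimates = [holdout_accuracy(correct_flags, 0.33, seed) for seed in range(10)]
for seed, acc in enumerate(estimates):
    print("seed=%d accuracy=%.2f%%" % (seed, acc * 100.0))
print("spread: %.2f percentage points" % ((max(estimates) - min(estimates)) * 100.0))
```

The printed spread is the gap between the most and least optimistic estimate, which is the kind of difference a single split can hide.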

We can split the dataset into a train and test set using the **train_test_split()** function from the scikit-learn library. For example, we can split the dataset into a 67% and 33% split for training and test sets as follows:

```python
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
```

The full code listing is provided below using the Pima Indians onset of diabetes dataset, assumed to be in the current working directory.

Download the dataset and place it in your current working directory.

An XGBoost model with default configuration is fit on the training dataset and evaluated on the test dataset.

```python
# train-test split evaluation of xgboost model
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example summarizes the performance of the model on the test set.

```
Accuracy: 77.95%
```

## Evaluate XGBoost Models With k-Fold Cross Validation

Cross validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test set split.

It works by splitting the dataset into k-parts (e.g. k=5 or k=10). Each split of the data is called a fold. The algorithm is trained on k-1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
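The splitting procedure can be sketched in plain Python. This is a simplified illustration of contiguous, unshuffled folds, not scikit-learn's implementation, and the function name `kfold_indices` is made up for this example.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

# Ten rows split into three folds: every row is held back exactly once.
folds = list(kfold_indices(n_rows=10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row index appears in exactly one test fold and in the training set of every other fold.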

After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
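As a small sketch of that summary step, the snippet below uses the standard library on a list of hypothetical per-fold accuracies (made up for illustration only). It uses the population standard deviation, which is also what NumPy's `std()` computes by default.

```python
import statistics

# Hypothetical per-fold accuracies, made up for illustration only.
scores = [0.74, 0.79, 0.71, 0.81, 0.77]

mean_score = statistics.mean(scores)
std_score = statistics.pstdev(scores)  # population standard deviation
print("Accuracy: %.2f%% (%.2f%%)" % (mean_score * 100, std_score * 100))
```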

The result is a more reliable estimate of the performance of the algorithm on new data. It is more reliable because the algorithm is trained and evaluated multiple times on different data.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm’s performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of observations, k values of 3, 5 and 10 are common.

We can use the k-fold cross validation support provided in scikit-learn. First we must create the KFold object specifying the number of folds. We can then use this scheme with the specific dataset. The **cross_val_score()** function from scikit-learn allows us to evaluate a model using the cross validation scheme and returns an array of scores, one for each model trained on each fold.

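```python
# random_state is omitted: it only has an effect when shuffle=True,
# and recent scikit-learn versions raise an error if it is set without it
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
```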

The full code listing for evaluating an XGBoost model with k-fold cross validation is provided below for completeness.

```python
# k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
import xgboost
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = xgboost.XGBClassifier()
# random_state is omitted: it only has an effect when shuffle=True,
# and recent scikit-learn versions raise an error if it is set without it
kfold = KFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example summarizes the performance of the default model configuration on the dataset including both the mean and standard deviation classification accuracy.

```
Accuracy: 76.69% (7.11%)
```

If you have many classes for a classification type predictive modeling problem or the classes are imbalanced (there are a lot more instances for one class than another), it can be a good idea to create stratified folds when performing cross validation.

This has the effect of enforcing the same distribution of classes in each fold as in the whole training dataset when performing the cross validation evaluation. The scikit-learn library provides this capability in the StratifiedKFold class.
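The idea behind stratification can be sketched without any libraries. The function below is a simplified illustration (not scikit-learn's algorithm, and `stratified_fold_ids` is a made-up name): it deals the rows of each class out to the folds in turn, so each fold ends up with roughly the same class mix as the whole dataset. The labels are made up to show a 9:1 imbalance.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, k):
    """Assign each row a fold id so every fold gets a similar class mix."""
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)
    fold_ids = [None] * len(labels)
    next_fold = 0
    for label in sorted(rows_by_class):
        # Deal the rows of each class out to the folds in turn.
        for row in rows_by_class[label]:
            fold_ids[row] = next_fold
            next_fold = (next_fold + 1) % k
    return fold_ids

# Made-up imbalanced labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
fold_ids = stratified_fold_ids(labels, k=5)
for fold in range(5):
    counts = Counter(label for label, f in zip(labels, fold_ids) if f == fold)
    print("fold %d: %s" % (fold, dict(counts)))
```

With plain contiguous folds on the same labels, the first four folds would contain no positives at all, which is the situation stratification avoids.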

Below is the same example modified to use stratified cross validation to evaluate an XGBoost model.

```python
# stratified k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
import xgboost
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = xgboost.XGBClassifier()
# random_state is omitted: it only has an effect when shuffle=True,
# and recent scikit-learn versions raise an error if it is set without it
kfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example produces the following output.

```
Accuracy: 76.95% (5.88%)
```

## What Techniques to Use When

- Generally, k-fold cross validation is the gold-standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.
- Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.
- Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions.

If in doubt, use 10-fold cross validation for regression problems and stratified 10-fold cross validation on classification problems.

## Summary

In this tutorial, you discovered how you can evaluate your XGBoost models by estimating how well they are likely to perform on unseen data.

Specifically, you learned:

- How to split your dataset into train and test subsets for training and evaluating the performance of your model.
- How you can create k XGBoost models on different subsets of the dataset and average the scores to get a more robust estimate of model performance.
- Heuristics to help choose between train-test split and k-fold cross validation for your problem.

Do you have any questions on how to evaluate the performance of XGBoost models or about this post? Ask your questions in the comments below and I will do my best to answer.

Hi Jason,

Thank you for this article. You didn’t mention the Leave-One-Out cross-validator method.

Is it the same logic as k-fold cross validation (except that the size of the test set is 1)?

Would you recommend using the Leave-One-Out cross-validator or k-fold cross validation for a small dataset (approximately 2000 rows)?

Regards,

Agnes

Hi Agnes,

Yes, it is k-fold cross validation with k set to the number of rows in the dataset, so each pattern is held back as the test set exactly once.

From my reading, you are better off using k-fold cross validation.

Hi Jason, how do I find the accuracy for an XGBRegressor model?

You cannot calculate accuracy for regression algorithms. There are no classes. You must calculate an error like mean squared error.

Can you please show the actual line of code to do that?

Thank you

Hi Nader,

You can use the XGBRegressor instead of the XGBClassifier for regression problems:

http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBRegressor
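For the error calculation itself, here is a minimal sketch in plain Python with made-up targets and predictions (illustration only, not real model output). The helper name `mse` is hypothetical. In practice, scikit-learn's `mean_squared_error()` function in `sklearn.metrics` computes the same quantity, and `cross_val_score()` accepts `scoring="neg_mean_squared_error"` for regression models.

```python
import math

def mse(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions, for illustration only.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

error = mse(y_true, y_pred)
rmse = math.sqrt(error)  # back in the units of the target variable
print("MSE: %.3f, RMSE: %.3f" % (error, rmse))
```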

Hi Jason,

Does cross_val_score already fit the model, so that it is ready to provide predictions?

Thanks,

Whakeem

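No, cross_val_score fits copies of the model on each fold and discards them, so the model you pass in is left unfitted. I recommend fitting a final model on all data and using it to make predictions. See this post for the general idea: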

https://machinelearningmastery.com/train-final-machine-learning-model/

Thanks for the tutorial. I’m still working on it, but I can say it is very understandable compared to others out there.

Thanks, I’m glad to hear that.

Hi Jason,

Thanks for this tutorial, it’s simple and clear.

I was working on an imbalanced dataset (1:9) classification problem. It worked well with XGBClassifier() and evaluated well with k-fold validation.

Thanks a lot!

Well done!

Hi Jason,

in your examples — where would you implement early stopping?

This post may help:

https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

Thanks Jason for the very detailed explanation of the process.

I’m happy it helped.

Hello Jason Brownlee ,

How are you?

After you’ve done cross-validation, how do I get the best model to perform classification on my test data?

Choose the configuration that gave the best results, then fit a final model on all available data.

Thanks, Jason, the tutorial helps a lot.

However, I got stuck when working on an imbalanced dataset (1:15) classification problem. The model worked well with XGBClassifier() initially, with an AUC of 0.911 for the train set and 0.949 for the test set. Then, after I tuned the hyperparameters (max_depth, min_child_weight, gamma) using GridSearchCV, the AUC of the train and test sets dropped noticeably (0.892 and 0.917). I feel really confused. Are there any clues why this would happen?

Perhaps tuning the parameter reduced the capacity of the model. Perhaps continue the tuning project?

I have used the Big Mart dataset and split the data into train and test sets. After that I executed model.fit(x_train, y_train), where my model is XGBClassifier(), and it ran successfully.

But when I execute y_pred = model.predict(X_test), it gives a feature name mismatch error, as given below:

```
ValueError Traceback (most recent call last)
in ()
1 # make predictions for test data
----> 2 y_pred = model.predict(X_test)
3 predictions = [round(value) for value in y_pred]
4 # evaluate predictions
5 accuracy = accuracy_score(y_test, predictions)

/home/gopal/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict(self, data, output_margin, ntree_limit, validate_features)
770 output_margin=output_margin,
771 ntree_limit=ntree_limit,
--> 772 validate_features=validate_features)
773 if output_margin:
774 # If output_margin is active, simply return the scores

/home/gopal/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf, pred_contribs, approx_contribs, pred_interactions, validate_features)
1283
1284 if validate_features:
-> 1285 self._validate_features(data)
1286
1287 length = c_bst_ulong()

/home/gopal/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
1690
1691 raise ValueError(msg.format(self.feature_names,
-> 1692 data.feature_names))
1693
1694 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11'] [u'Item_Fat_Content', u'Item_Visibility', u'Item_Type', u'Item_MRP', u'Outlet_Size', u'Outlet_Location_Type', u'Outlet_Type', u'Outlet_Years', u'Item_Visibility_MeanRatio', u'Outlet', u'Identifier', u'Item_Weight']
expected f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11 in input data
training data did not have the following fields: Outlet_Years, Outlet_Size, Item_Visibility, Item_MRP, Item_Visibility_MeanRatio, Outlet_Location_Type, Item_Weight, Item_Type, Outlet, Identifier, Outlet_Type, Item_Fat_Content
```

Perhaps confirm that the two datasets have identical columns?

My train set and test set contain float values, but when I predict using the classifier it says continuous is not supported.

That is odd. Perhaps double check your data was loaded correctly?

Hi Jason, for XGBRegressor I got RMSE = 1043 for the Big Mart dataset and the best score I got was 0.59974. Can I use the best score as my accuracy, as the RMSE value looks very large? Please suggest.

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance

Hi Jason,

I just found this wonderful blog. I still have some questions about using XGBoost. I don’t know if I can ask for help from you.

I am new to XGBoost and am using XGBClassifier to build the model. I have used GridSearchCV to create a tune-grid to find the optimal hyperparameters and I have gotten my final model. I used ‘auc’ as my classification metric. My question is that I use

yPred = model.predict(Xtest),

but the results (yPred) are float values ranging from 0 to 1. How do I decide the threshold value for mapping those values to 0 and 1?

I saw you used round(value), which is equivalent to setting the threshold to 0.5, I think. Is there any rule that I need to follow to find the threshold value for my model? I am looking forward to your reply. Thank you so much.

Sincerely,

Danny

Sincerely,

Yilin Wang

Thanks!

If you are using ROC AUC, you can use the threshold that achieves the best F-measure or J-metric directly.

If unsure, test each threshold from the ROC curve against the F-measure score.

I hope that helps.
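As a rough sketch of that search, the snippet below tries each observed probability as a threshold and keeps the one with the best F-measure. It is plain Python with made-up labels and probabilities (illustration only), and `f_measure` is a hypothetical helper name. It assumes you have predicted probabilities for the positive class, for example from the model’s `predict_proba()`, and that you pick the threshold on validation data rather than on the final test set.

```python
def f_measure(y_true, y_prob, threshold):
    """F1 score when probabilities at or above the threshold are labelled 1."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p < threshold and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90]

# Every distinct predicted probability is a candidate threshold.
candidates = sorted(set(y_prob))
best = max(candidates, key=lambda th: f_measure(y_true, y_prob, th))
print("best threshold: %.2f, F-measure: %.3f" % (best, f_measure(y_true, y_prob, best)))
```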

Hi Jason:

Thanks for your tutorial.

This tutorial is based on the scikit-learn API. Do you have an example of using StratifiedKFold with XGBoost’s native API?

Thanks

Sorry, I don’t have tutorials using the native apis.

Hello Jason,

Thanks for this tutorial. Is there a way to view the confusion matrix of every validation?

thanks.

No, typically a confusion matrix is calculated for a single hold-out dataset.