Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. **XGBoost** can be used directly for **regression predictive modeling**.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.

Let’s get started.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Extreme Gradient Boosting
- XGBoost Regression API
- XGBoost Regression Example

## Extreme Gradient Boosting

**Gradient boosting** refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss gradient is minimized as the model is fit, much like a neural network.
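
To make the idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, written with NumPy only. It is not how XGBoost is implemented; the helper names (`fit_stump`, `predict_stump`), the synthetic data, and the learning rate are assumptions chosen for illustration. Each round fits a one-split tree (a stump) to the residuals, which are the negative gradient of the squared-error loss, and adds a scaled copy of it to the ensemble.

```python
# Minimal gradient boosting sketch for squared-error loss (illustrative only).
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Predict the left or right leaf value depending on the threshold."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# hypothetical one-dimensional regression problem
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]
for _ in range(100):
    residual = y - prediction            # negative gradient of 0.5 * (y - f)^2
    stump = fit_stump(x, residual)       # fit a weak learner to the gradient
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print('MSE before boosting: %.4f' % mse_history[0])
print('MSE after boosting:  %.4f' % mse_history[-1])
```

The training error falls as trees are added, which is the behavior the learning rate and number of trees trade off in a real library.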


Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.

## XGBoost Regression API

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:

```shell
sudo pip install xgboost
```

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

```python
# check xgboost version
import xgboost
print(xgboost.__version__)
```

Running the script prints the version of the XGBoost library that you have installed.

Your version should be the same as the one shown below or higher. If it is not, you must upgrade the XGBoost library.

```
1.1.1
```
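
If you want to check the version in code rather than by eye, a small comparison like the following can help. This is only a sketch: `version_tuple` is a hypothetical helper, not part of XGBoost, and it only looks at the first three numeric components. Comparing tuples of integers avoids the trap of comparing strings, where '1.10.0' would sort before '1.9.0'.

```python
# Compare version strings numerically (illustrative helper, not an XGBoost API).
import re

def version_tuple(version):
    """Convert a version string such as '1.1.1' into a tuple of integers."""
    return tuple(int(number) for number in re.findall(r'\d+', version)[:3])

minimum = '1.1.1'
# hypothetical installed versions; in practice use xgboost.__version__
for installed in ['1.0.1', '1.1.1', '2.0.3']:
    ok = version_tuple(installed) >= version_tuple(minimum)
    print(installed, 'is fine' if ok else 'is older than ' + minimum)
```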

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

```shell
sudo pip install xgboost==1.0.1
```


The XGBoost library has its own custom API, although we will use the method via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be defined by creating an instance of the *XGBRegressor* class; for example:

```python
...
# create an xgboost regression model
model = XGBRegressor()
```

You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

- **n_estimators**: The number of trees in the ensemble, often increased until no further improvements are seen.
- **max_depth**: The maximum depth of each tree; values are often between 1 and 10.
- **eta**: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller. The scikit-learn wrapper also exposes it as *learning_rate*.
- **subsample**: The fraction of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
- **colsample_bytree**: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.

For example:

```python
...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)
```

Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.

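Randomness can be used in the construction of the model, for example when rows or columns are subsampled (*subsample* or *colsample_bytree* below 1.0). In that case, each time the algorithm is run on the same data with a different random seed, it may produce a slightly different model.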

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop an XGBoost ensemble for regression.

## XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let’s introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
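
The text does not say which naive model produced the baseline figure. One common choice, assumed here, is a model that always predicts the median of the training targets, which scikit-learn provides as `DummyRegressor`. The sketch below evaluates such a baseline with repeated 10-fold cross-validation on synthetic data, so the number it prints is not comparable to the housing results.

```python
# Evaluate a naive baseline with repeated k-fold cross-validation (sketch).
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# hypothetical dataset with 13 inputs, standing in for the housing data
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1)

# assumption: the naive model predicts the median of the training targets
model = DummyRegressor(strategy='median')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = absolute(cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv))
print('Naive MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Any model that cannot beat this score on the same harness has no skill on the problem.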

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

```
(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
```

Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.

```python
...
# retrieve the values as a NumPy array
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
```
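
The slicing can be easier to follow on a tiny example. The sketch below uses a made-up three-row DataFrame (an assumption, not the housing data) and shows two equivalent ways to separate the last column: NumPy slicing on the underlying array, and `iloc` directly on the DataFrame. Note that the NumPy-style `[:, :-1]` indexing only works on the array, not on the DataFrame itself.

```python
# Split a table into input columns and a final output column (sketch).
import numpy as np
import pandas as pd

# hypothetical data: two input columns and one target column
dataframe = pd.DataFrame([[1.0, 2.0, 10.0], [3.0, 4.0, 20.0], [5.0, 6.0, 30.0]])

# option 1: slice the underlying NumPy array
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

# option 2: slice the DataFrame by position with iloc
X_df, y_series = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]
print(X_df.shape, y_series.shape)
```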

Next, we can create an instance of the model with a default configuration.

```python
...
# define model
model = XGBRegressor()
```

We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() function to evaluate the model using that procedure and collect the scores.
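
To see what the procedure does before any model is involved, the sketch below enumerates the splits that `RepeatedKFold` generates for a dataset of 506 rows. The placeholder array is an assumption; only its length matters. Within one repeat, every row appears in a test fold exactly once, so every row is held out at some point.

```python
# Inspect the train/test splits produced by RepeatedKFold (sketch).
import numpy as np
from sklearn.model_selection import RepeatedKFold

# placeholder inputs with the same number of rows as the housing dataset
X = np.arange(506).reshape(-1, 1)

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X))
print('Number of train/test splits:', len(splits))

train_idx, test_idx = splits[0]
print('First split sizes:', len(train_idx), len(test_idx))

# the ten test folds of the first repeat cover every row exactly once
first_repeat_test = np.sort(np.concatenate([test for _, test in splits[:10]]))
print('First repeat covers all rows:', np.array_equal(first_repeat_test, np.arange(506)))
```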

Model performance will be evaluated using mean absolute error (MAE). Note, MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.

```python
...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
```
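
The sign convention is easy to check on a handful of numbers. In the sketch below, the target and predicted values are made up, and a `DummyRegressor` that always predicts 25.0 stands in for a fitted model; both are assumptions for illustration. The MAE is computed by hand and with scikit-learn, and then the `neg_mean_absolute_error` scorer is shown to return the same quantity with the sign flipped.

```python
# Show how MAE is computed and why scikit-learn reports it as a negative score (sketch).
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer, mean_absolute_error

# hypothetical targets and predictions
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 30.7, 33.4])

# MAE is the mean of the absolute differences
manual_mae = np.mean(np.abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)
print('Manual MAE: %.3f, library MAE: %.3f' % (manual_mae, library_mae))

# a stand-in model that always predicts 25.0
X = np.zeros((len(y_true), 1))
model = DummyRegressor(strategy='constant', constant=25.0).fit(X, y_true)

# scorers follow a "higher is better" rule, so the error is negated
scorer = get_scorer('neg_mean_absolute_error')
score = scorer(model, X, y_true)
print('Scorer output: %.3f' % score)
print('Reported MAE:  %.3f' % abs(score))
```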

Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.

```python
...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.

```python
# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a MAE of about 2.1.

This is a good score: it is better than the baseline, meaning the model has skill, and it is close to the best score of 1.9.

```
Mean MAE: 2.109 (0.320)
```

We may decide to use the XGBoost Regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

For example:

```python
...
# make a prediction
yhat = model.predict(new_data)
```
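
One detail that often trips people up is the shape of the new data. Models that follow the scikit-learn API expect a two-dimensional array of rows by columns, even when predicting for a single row. The sketch below shows two ways to get that shape from a plain list of 13 values; no model is fit here, it only illustrates the array shapes.

```python
# Shape a single row of inputs as a 2D array for predict() (sketch).
from numpy import asarray

row = [0.00632, 18.00, 2.310, 0, 0.5380, 6.5750, 65.20, 4.0900, 1, 296.0, 15.30, 396.90, 4.98]

# a flat array has one dimension, which is not what predict() expects
flat = asarray(row)
print(flat.shape)

# wrapping the row in a list gives one row with 13 columns
new_data = asarray([row])
print(new_data.shape)

# reshaping an existing flat array gives the same result
reshaped = flat.reshape(1, -1)
print(reshaped.shape)
```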

We can demonstrate this with a complete example, listed below.

```python
# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction (predict() returns an array, so take the first value)
print('Predicted: %.3f' % yhat[0])
```

Running the example fits the model and makes a prediction for the new row of data.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model predicted a value of about 24.

```
Predicted: 24.019
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Tutorials

- Extreme Gradient Boosting (XGBoost) Ensemble in Python
- Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost
- Best Results for Standard Machine Learning Datasets
- How to Use XGBoost for Time Series Forecasting

### Papers

- XGBoost: A Scalable Tree Boosting System, 2016.

### APIs

- XGBoost Installation Guide
- xgboost.XGBRegressor API
- sklearn.model_selection.RepeatedKFold API
- sklearn.model_selection.cross_val_score API

## Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

Dear Dr Jason,

xgboost’s current version

Upgrading or installing

Thank you,

Anthony of Sydney

Nice work!

Dear Dr Jason,

Can XGBoost be used in conjunction with SVM and random forest classification?

Thank you,

Anthony of Sydney

I don’t see why not.

Dear Dr Jason,

There are two ways of implementing random forest ensembles by using XGBoost’s XGBRFClassifier and using sklearn.ensemble ‘s RandomForestClassifier based on the following tutorials at:

The program:

Results:

Comments:

* sklearn’s RandomForestClassifier produced the highest accuracy at 0.917, compared to XGBoost’s XGBRFClassifier, whose accuracy was at most 0.896.

– To maximise the accuracy of XGBRFClassifier, I had to adjust the parameters colsample and subsample.

– subsample optimal at 0.9. Adjusting subsample 0-.9 reduced accuracy.

– Adjusting colsample between 0.25 and 0.29 increased accuracy from 0.894 to 0.896.

Conclusion: when implementing a random forest classifier, sklearn’s version was more accurate than XGBoost’s version.

Other remark – which I cannot explain:

* When implementing XGboost’s random forest classifier model when fitting the model.fit(X,y), in order to predict the yhat, program ‘spewed’. Please see my comment at https://machinelearningmastery.com/random-forest-ensemble-in-python/ as at 13-03-2021 at 1600 (approx).

The error when I implement model.fit(X,y) for XGBoost’s XGBRFClassifier is:

Thank you,

Anthony of Sydney

Nice experiments!

Note, RandomForestClassifier does not use xgboost.

Dear Dr Jason,

While my experiments don’t prove XGBoost’s random forest classifier (‘rfc’) is worse than sklearn’s random forest classifier, it happens for a particular set of data and features that sklearn’s random forest classifier (‘rfc’) performed marginally better than XGBoost’s random forest classifier.

In other words, there may well be other conditions that produce the opposite result, with XGBoost’s rfc being better than sklearn’s rfc.

Conclusion: if modelling with rfc, use both XGBoost and sklearn and pick the best performing one.

Thank you,

Anthony of Sydney

Good advice.

Dear Dr Jason,

In your reply “Note, RandomForestClassifier does not use xgboost.”, are there any packages outside xgboost which utilize xgboost’s “…implementation of gradient boosted decision trees designed for speed and performance…” for “… structured or tabular data…”?

Ref: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/

For example can I:

* use sklearn.svm.SVR with xgboost to use xgboost’s gradient boosted decision trees?

* use sklearn.neighbors.KNeighborsRegressor with xgboost to use xgboost’s gradient boosted decision trees?

* use sklearn.tree.DecisionTreeRegressor with xgboost to use xgboost’s gradient boosted decision trees?

Thank you

Anthony of Sydney

No, as far as I know xgboost is specific to decision trees.

Dear Dr Jason,

Thank you for your reply.

Where you said “…xgboost is specific to decision trees…” did you mean the specific decision trees found in the xgboost module?

Thank you,

Anthony of Sydney

No, but sure that fits too.

Dear Jason,

Let me put it more clearly:

Is there a way to use xgboost’s gradient boosting algorithm with sklearn’s sklearn.tree.DecisionTreeClassifier?

Thank you,

Anthony of Sydney

No, not as far as I know.

Dear Dr Jason,

Thank you for your reply and patience,

Anthony of Sydney

You’re welcome.

Hello Dr. Brownlee,

For a long time I have been trying to find a suitable model for a regression problem with many inputs. I have now also tested with XGBoost. The results for the training data are very good. The results of the separated test data are worse. For validation data (real data) that does not differ very much from the training data, the results are pretty bad. I think I see overfitting here. The results for the RandomForestRegressor were so similar. If it’s overfitting, do you have a tip to avoid it?

Many greetings

Matthias

Perhaps the test set is too small or not representative? Perhaps you can try repeated k-fold cross-validation to estimate model performance?

You are probably right, even if I believe that the validation data differs very little from the training data and there is actually a lot of test data. But there must be some reason. I will repeat cv again.

Many Thanks!

Hi Jason and thank you for this and other tutorials.

In the final code of…

`# evaluate an xgboost regression model on the housing dataset`

I do understand that sklearn is used to EVALUATE => model = XGBRegressor() where XGBRegressor() has default parameter values.

However in the 2nd final code of…

`# fit a final xgboost model on the housing dataset and make a prediction`

I do not understand how a FINAL XGBOOST MODEL has been arrived at.

OK so I’m assuming the word ‘final’ maybe should be replaced by ‘default’?

If I am correct then how is a FINAL model arrived at in the real world?

Is it about parameter tuning?

Thanks

Final here means the model fit on all data and used to make predictions on new data.

Indeed, you will want to tune the hyperparameters in most cases.

I don’t think it makes sense to do cross validation on the entire data here with no held out test set. I guess if we’re operating under the assumption of building a final production model per se, but that isn’t the assumption we use when comparing models. The housing data set is particularly sensitive to this because it has outliers and having them in only train or test makes a pretty big difference vs. being able to have them in both your train and “test” as you do CV. Maybe I missed the part of the code where the test is held out or I don’t understand everything done within RepeatedKFold?

I’m curious about the following: “Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

Are these numbers derived from your own experiments without a held out test set? What sort of model is a “naive” model here? I have not seen a 1.9 achieved on a held out test set elsewhere, so if you have a reference that would be great (I haven’t followed the housing data set competitions much, etc… but am trying to see how a method I am using now stacks up, I guess a pretty average run of the method I’m using has an MAE around 3, while an exceptional run can be as low as 2.3408, there is sampling involved that gives the randomness). So it is possible for it to sometimes do better than less tuned xgboost results with a held out test set, e.g. https://www.kaggle.com/shreayan98c/boston-house-price-prediction/notebook that had an MAE of 2.45 on the test set, but that didn’t use any CV in the training set (i.e. no validation set).

Hello to everyone!! 🙂

I have a question!

Can we implement also the XGBoost Ranker with your code?

Thanks in advance!

Sofia

Should be possible. Can you try?

Hi Jason,

I have two questions on your statement from above:

“Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

1. I understood from your post on the Zero Rule Algorithm how to find the MAE of a naive model with a train-test split. How do you do that with cross-validation?

2. How did you arrive at the MAE of a top-performing model which gives us the upper bound for the expected performance on a dataset?

Is this a stupid question? I am sorry, just in case.

Hi Jason, I am trying to use XGBRegressor on a project, but it keeps returning the same value for a given input, even after re-fitting.

So, as a test, I came to this post and used your code above (Boston Housing dataset), and it is ALSO returning the same value (which is also identical to the value you got).

X shape: (506, 13)

y shape: (506,)

input row: [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98]

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

Predicted: 24.0193386078

(ps – on each of the runs above, the model is refitted to (X,y))

Do you get different predictions on each run with this code?

I’m using Python 3.10.3 and my libraries are all recent … I was hoping you or anyone else in the community could help pointing me in a direction to solve this issue?

Thank You !!!

Hi Alex…Have you tried to implement your model in Google Colaboratory?

Hi James, I appreciate your reply and thank you for pointing me to that resource.

As an experiment I wrote a simple code on my computer, and then ran it on Google Colab too.

This is the code (same on my computer and Google Colab):

```python
from pandas import read_csv
import xgboost as xgb

path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
ds = read_csv(path, header=None).values
ds_train = xgb.DMatrix(ds[:500,:-1], label=ds[:500,-1:])
ds_test = xgb.DMatrix(ds[500:,:-1], label=ds[500:,-1:])
params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'reg:squarederror',
    'subsample': 0.8,
}
num_round = 100
for _ in range(5):
    bst = xgb.train(params, ds_train, num_round)
    preds = bst.predict(ds_test)
    print(preds)
```

***********************************************************

These are the predictions on my computer:

[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

And these are the predictions on Google Colab:

[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

So, the results differ when I run the same code on different environments … but in either case it is still generating the same predictions every time I fit the model to the dataset … I have already tried different combinations of parameters, different wrappers (Sklearn, and XGB as above), different datasets, and the outcome is always the same … equal predictions every time the model is fit and run … is this how XGBoost is supposed to behave?

Again, I truly appreciate your help.

This particular line:

```python
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
```

Causes the following error: pandas.errors.InvalidIndexError: (slice(None, None, None), slice(None, -1, None)). In the example shown, ‘data’ is not defined, however ‘dataframe’ is.

The following fixed this error so the example worked:

```python
# split data into input and output columns
X, y = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]
```

Thank you for the feedback Emerson!

Is there any reason why you didn’t split the dataset into train and test, like you do with other regression projects?

Hi Lee…There is no reason and we agree that you should do so as best practice. The tutorial is showing an example of another concept, however your understanding is correct. Keep up the great work!