XGBoost for Regression

By Jason Brownlee on March 7, 2021 in XGBoost

Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. XGBoost can be used directly for regression predictive modeling.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
How to fit a final model and use it to make a prediction on new data.

Let’s get started.

XGBoost for Regression
Photo by chas B, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Extreme Gradient Boosting
XGBoost Regression API
XGBoost Regression Example

Extreme Gradient Boosting

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and a gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss is minimized by following its gradient as the model is fit, much like a neural network.
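To make the idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, where the negative gradient is simply the residual. It is an illustration only and not how XGBoost is implemented internally; the function names, the synthetic dataset from make_regression(), and the settings (50 trees, depth 3, learning rate 0.1) are assumptions chosen for the example.

```python
# minimal gradient boosting for squared-error regression (illustrative sketch)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Fit an additive ensemble of shallow trees, each one on the current residuals."""
    base = float(np.mean(y))          # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred           # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # take a small step in the direction suggested by the new tree
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, learning_rate, trees

def predict_gradient_boosting(model, X):
    """Sum the constant prediction and the scaled contribution of every tree."""
    base, learning_rate, trees = model
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred

# synthetic data (an assumption for illustration, not the housing dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
model = fit_gradient_boosting(X, y)
baseline_mae = np.mean(np.abs(y - np.mean(y)))
boosted_mae = np.mean(np.abs(y - predict_gradient_boosting(model, X)))
print('Baseline MAE: %.3f, boosted MAE: %.3f' % (baseline_mae, boosted_mae))
```

Both errors are measured on the training data, so the comparison only shows that each added tree reduces the remaining error; it says nothing about performance on new data.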

For more on gradient boosting, see the tutorial:

A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.

XGBoost Regression API

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:

sudo pip install xgboost


You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

# check xgboost version
import xgboost
print(xgboost.__version__)


Running the script will print the version of the XGBoost library you have installed.

Your version should be the same as or higher than the one shown below. If not, you must upgrade your version of the XGBoost library.

1.1.1


It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

sudo pip install xgboost==1.0.1


If you require specific instructions for your development environment, see the tutorial:

XGBoost Installation Guide

The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be defined by creating an instance of the XGBRegressor class; for example:

...
# create an xgboost regression model
model = XGBRegressor()


You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
max_depth: The maximum depth of each tree, often set to a value between 1 and 10.
eta: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller (the scikit-learn wrapper also exposes this as learning_rate).
subsample: The fraction of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
colsample_bytree: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.

For example:

...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)


Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.
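As a rough sketch of the systematic approach, the example below runs a small grid search with scikit-learn's GridSearchCV. The synthetic dataset from make_regression(), the grid values, and the 3-fold split are assumptions chosen to keep the run short, not recommendations. The import falls back to scikit-learn's GradientBoostingRegressor if xgboost is not installed, purely so that the sketch still runs; both classes accept the three parameter names used here.

```python
# grid search over a few gradient boosting hyperparameters (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # assumption: stand-in model so the sketch runs without xgboost installed
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# synthetic data (an assumption for illustration, not the housing dataset)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=1)

# candidate values to try; every combination is evaluated
grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 4],
    'learning_rate': [0.1, 0.3],
}

# score each combination with cross-validated (negated) mean absolute error
cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(Regressor(), grid, scoring='neg_mean_absolute_error', cv=cv)
search.fit(X, y)

# best_score_ is negated MAE, so flip the sign when reporting
print('Best MAE: %.3f' % -search.best_score_)
print('Best config: %s' % search.best_params_)
```

The number of model fits grows with the product of the grid sizes, so in practice the grid is usually kept coarse at first and refined around the best values.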

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.
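The second option, fitting several final models and averaging their predictions, might look something like the sketch below. The synthetic dataset, the choice of five seeds, and subsample=0.7 (used so that the random seed actually changes the fitted model) are assumptions for illustration. As a convenience, the import falls back to scikit-learn's GradientBoostingRegressor if xgboost is not installed.

```python
# average the predictions of several final models fit with different seeds (sketch)
import numpy as np
from sklearn.datasets import make_regression

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # assumption: stand-in model so the sketch runs without xgboost installed
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# synthetic data (an assumption for illustration, not the housing dataset)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=1)
new_data = X[:3]  # pretend the first three rows are new, unseen rows

predictions = []
for seed in range(5):
    # row subsampling makes each fit depend on the random seed
    model = Regressor(n_estimators=50, subsample=0.7, random_state=seed)
    model.fit(X, y)
    predictions.append(model.predict(new_data))

predictions = np.asarray(predictions)  # shape: (n_models, n_rows)
yhat = predictions.mean(axis=0)        # one averaged prediction per row
print('Spread across seeds:', predictions.std(axis=0))
print('Averaged prediction:', yhat)
```

The spread across seeds gives a rough feel for how much of the prediction is down to randomness in model construction.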

Let’s take a look at how to develop an XGBoost ensemble for regression.

XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let’s introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.
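One plausible way to obtain a naive baseline like the one mentioned above is to always predict the median of the training targets. The sketch below does this with scikit-learn's DummyRegressor on a synthetic stand-in dataset, which is an assumption, so the printed value will not match the 6.6 quoted for the housing data. How the quoted baseline was actually computed is not stated here.

```python
# evaluate a naive "always predict the median" baseline with repeated k-fold CV (sketch)
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# synthetic stand-in data with 13 inputs (an assumption, not the housing dataset)
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1)

# the naive model ignores the inputs and predicts the median training target
naive = DummyRegressor(strategy='median')

# same style of test harness: 10 folds, 3 repeats
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv)

# scores are negated MAE values, so make them positive before summarizing
scores = absolute(scores)
print('Naive MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Any model worth using should achieve a lower MAE than this baseline on the same harness.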

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())


Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]


Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.

...
# retrieve the NumPy array from the dataframe
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
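The slicing above assumes that data is a NumPy array, such as the one returned by dataframe.values. A small sketch on a toy DataFrame (the 3x4 values are made up for illustration) shows the array slicing next to the equivalent .iloc indexing, which is what you would need if you kept the data as a DataFrame:

```python
# split a table into inputs and target: NumPy slicing vs. DataFrame .iloc (sketch)
import numpy as np
import pandas as pd

# toy table: 3 rows, last column plays the role of the target (made-up values)
dataframe = pd.DataFrame(np.arange(12.0).reshape(3, 4))

# option 1: convert to a NumPy array first, then slice with [rows, columns]
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

# option 2: stay in pandas and use positional indexing via .iloc
X_df, y_df = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]

print(X.shape, y.shape)        # array shapes
print(X_df.shape, y_df.shape)  # same shapes, but DataFrame / Series objects
```

Indexing a DataFrame directly with [:, :-1], without .iloc or .values, raises an error, which is a common stumbling block when adapting the snippet.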


Next, we can create an instance of the model with a default configuration.

...
# define model
model = XGBRegressor()


We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() function to evaluate the model using the procedure and collect the scores.
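To see what RepeatedKFold actually does, the sketch below enumerates its splits on a toy array of 20 rows (the array is an assumption for illustration). Each split holds out one fold for testing and trains on the rest, and within a single repeat every row is held out exactly once.

```python
# inspect the train/test splits produced by RepeatedKFold (sketch)
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(40).reshape(20, 2)  # 20 toy rows, 2 columns

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X))
print('Total train/test splits: %d' % len(splits))

# each split is a pair of row-index arrays that never overlap
train_ix, test_ix = splits[0]
print('First split: %d train rows, %d test rows' % (len(train_ix), len(test_ix)))

# the first 10 splits form one repeat; together their test folds cover every row once
held_out = np.sort(np.concatenate([test for _, test in splits[:10]]))
print('Rows held out in first repeat:', held_out)
```

This means cross_val_score() always scores the model on rows it was not fit on, even though no separate test set is set aside by hand.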

Model performance will be evaluated using mean absolute error (MAE). Note, MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.
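As a quick illustration of the metric and its sign, the sketch below computes MAE by hand and with scikit-learn, then shows that the 'neg_mean_absolute_error' scorer returns the same quantity with a minus sign. The target and prediction values are made up for the example, and the DummyRegressor is only there because a scorer needs a fitted model to call.

```python
# mean absolute error by hand, via scikit-learn, and via the negated scorer (sketch)
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import get_scorer, mean_absolute_error

# made-up targets and predictions for illustration
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 30.7, 35.4])

# MAE is the average of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print('Manual MAE: %.3f, scikit-learn MAE: %.3f' % (mae_manual, mae_sklearn))

# scorers follow a "higher is better" convention, so the error is negated
X = np.zeros((4, 1))  # placeholder inputs; the dummy model ignores them
model = DummyRegressor(strategy='mean').fit(X, y_true)
score = get_scorer('neg_mean_absolute_error')(model, X, y_true)
print('Scorer output: %.3f' % score)
print('Same value as a positive error: %.3f' % abs(score))
```

Because every score carries the same sign flip, taking the absolute value is enough to recover ordinary MAE values.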

...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)


Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.

...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )


Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.

# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )


Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model achieved a MAE of about 2.1.

This is a good score: it is better than the baseline of about 6.6, meaning the model has skill, and it is close to the best score of about 1.9.

Mean MAE: 2.109 (0.320)


We may decide to use the XGBoost Regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

For example:

...
# make a prediction
yhat = model.predict(new_data)


We can demonstrate this with a complete example, listed below.

# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction
print('Predicted: %.3f' % yhat[0])


Running the example fits the model and makes a prediction for the new row of data.

In this case, we can see that the model predicted a value of about 24.

Predicted: 24.019


Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
How to fit a final model and use it to make a prediction on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

35 Responses to XGBoost for Regression

Anthony The Koala March 12, 2021 at 5:45 am #

Dear Dr Jason,
xgboost’s current version

import xgboost
xgboost.__version__
'1.3.3'

Upgrading or installing

pip install -U xgboost --upgrade


Thank you,
Anthony of Sydney

- Jason Brownlee March 12, 2021 at 6:07 am #
  
  Nice work!
  
- Nicholas Roth January 28, 2023 at 8:26 am #
  
  This:
  # split data into input and output columns
  X, y = data[:, :-1], data[:, -1]
  
  Should be this:
  # split data into input and output columns
  X, y = data.iloc[:, :-1], data.iloc[:, -1]
  

Anthony The Koala March 12, 2021 at 5:20 pm #

Dear Dr Jason,
Can XGBoost be used in conjunction with SVM and random forest classification?
Thank you,
Anthony of Sydney

Jason Brownlee March 13, 2021 at 5:25 am #

I don’t see why not.

Anthony The Koala March 13, 2021 at 5:03 pm #

Dear Dr Jason,
There are two ways of implementing random forest ensembles: using XGBoost’s XGBRFClassifier and using sklearn.ensemble’s RandomForestClassifier, based on the following tutorials at:

https://machinelearningmastery.com/random-forest-ensembles-with-xgboost 
https://machinelearningmastery.com/random-forest-ensemble-in-python/


The program:

# evaluate xgboost random forest algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBRFClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
#model = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.2)
#experimenting 
#increasing n_estimators does not improve the accuracy. Same as n_estimators=100
#model = XGBRFClassifier(n_estimators=200, subsample=0.9, colsample_bynode=0.2)
#Changing subsample away from 0.9 decreases accuracy
#Changing colsample_bynode between 0.25 to 0.29 improves accuracy to 0.896
model = XGBRFClassifier(n_estimators=100, subsample=0.9, colsample_bynode=0.28)

# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print("using xgboost's randomforest classifer XGBRFClassifier")
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
#model.fit(X, y)
## make a single prediction
row = [[-8.52381793,5.24451077,-12.14967704,-2.92949242,0.99314133,0.67326595,-0.38657932,1.27955683,-0.60712621,3.20807316,0.60504151,-1.38706415,8.92444588,-7.43027595,-2.33653219,1.10358169,0.21547782,1.05057966,0.6975331,0.26076035]]
from numpy import asarray
row = asarray(row)
#yhat = model.predict(row)
#print('Predicted Class: %d' % yhat[0])
print()
print()
#Now doing the same with sklearn's
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print("using sklearn's  randomforest classifer RandomForestClassifier")
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
# fit the model on the whole dataset
model.fit(X, y)
## make a single prediction
#row = already define
yhat = model.predict(row)
print('Predicted Class: %d' % yhat[0])


Results:

using xgboost's randomforest classifer XGBRFClassifier
Mean Accuracy: 0.896 (0.037)


using sklearn's  randomforest classifer RandomForestClassifier
Mean Accuracy: 0.917 (0.031)
Predicted Class: 1


Comments:
* sklearn’s RandomForestClassifier produced the highest accuracy, 0.917, compared with at most 0.896 for XGBoost’s XGBRFClassifier.
– To maximise the accuracy of XGBRFClassifier, I had to adjust the colsample_bynode and subsample parameters.
– subsample was optimal at 0.9; moving it away from 0.9 reduced accuracy.
– Adjusting colsample_bynode between 0.25 and 0.29 increased accuracy from 0.894 to 0.896.

Conclusion: when implementing a random forest classifier, sklearn’s version was more accurate than XGBoost’s version.

Other remark – which I cannot explain:
* When I fit XGBoost’s random forest classifier with model.fit(X,y) in order to predict yhat, the program ‘spewed’. Please see my comment at https://machinelearningmastery.com/random-forest-ensemble-in-python/ as at 13-03-2021 at 1600 (approx).
The message that I get when copying the identical code and calling model.fit(X,y) for XGBoost’s XGBRFClassifier is:
***
[16:58:45] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.3.0/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
***

Other notes:
numpy.__version__; sklearn.__version__; xgboost.__version__;"....respectively"
'1.20.1'
'0.23.2'
'1.3.3'
'....respectively'


Thank you,
Anthony of Sydney

Jason Brownlee March 14, 2021 at 5:24 am #

Nice experiments!

Note, RandomForestClassifier does not use xgboost.

- Anthony The Koala March 14, 2021 at 1:01 pm #
  
  Dear Dr Jason,
  
  While my experiments don’t prove XGBoost’s random forest classifier (‘rfc’) is worse than sklearn’s random forest classifier, it happens for a particular set of data and features that sklearn’s random forest classifier (‘rfc’) performed marginally better than XGBoost’s random forest classifier.
  
  In other words there may well be other conditions that produce the opposite result, with XGBoost’s rfc being better than sklearn’s rfc.
  
  Conclusion: if modelling with rfc, use both XGBoost and sklearn and pick the best performing one.
  
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 15, 2021 at 5:52 am #
  
  Good advice.
- Anthony The Koala March 18, 2021 at 4:13 pm #
  
  Dear Dr Jason,
  In your reply “Note, RandomForestClassifier does not use xgboost.”, are there any packages outside xgboost which utilize xgboost’s “…implementation of gradient boosted decision trees designed for speed and performance…” for “…structured or tabular data…”?
  
  Ref: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
  
  For example can I:
  * use sklearn.svm.SVR with xgboost to use xgboost’s gradient boosted decision trees?
  * use sklearn.neighbors.KNeighborsRegressor with xgboost to use xgboost’s gradient boosted decision trees?
  * use sklearn.tree.DecisionTreeRegressor with xgboost to use xgboost’s gradient boosted decision trees?
  
  Thank you
  Anthony of Sydney
- Jason Brownlee March 19, 2021 at 6:16 am #
  
  No, as far as I know xgboost is specific to decision trees.
- Anthony The Koala March 19, 2021 at 7:24 am #
  
  Dear Dr Jason,
  Thank you for your reply.
  Where you said “…xgboost is specific to decision trees…” did you mean the specific decision trees found in the xgboost module?
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 19, 2021 at 7:51 am #
  
  No, but sure that fits too.
- Anthony The Koala March 19, 2021 at 9:50 pm #
  
  Dear Jason,
  Let me put it more clearly:
  is there a way to use sklearn’s sklearn.tree.DecisionTreeClassifier with xgboost’s gradient boosting algorithm?
  Thank you,
  Anthony of Sydney
- Jason Brownlee March 20, 2021 at 5:21 am #
  
  No, not as far as I know.
- Anthony The Koala March 20, 2021 at 5:42 am #
  
  Dear Dr Jason,
  Thank you for your reply and patience,
  Anthony of Sydney
- Jason Brownlee March 21, 2021 at 6:00 am #
  
  You’re welcome.

Matthias March 22, 2021 at 8:21 pm #

Hello Dr. Brownlee,

For a long time I have been trying to find a suitable model for a regression problem with many inputs. I have now also tested with XGBoost. The results for the training data are very good. The results for the separate test data are worse. For validation data (real data) that does not differ very much from the training data, the results are pretty bad. I think I see overfitting here. The results for the RandomForestRegressor were similar. If it’s overfitting, do you have a tip to avoid it?
Many greetings

Matthias

- Jason Brownlee March 23, 2021 at 4:56 am #
  
  Perhaps the test set is too small or not representative? Perhaps you can try repeated k-fold cross-validation to estimate model performance?
  
Matthias March 24, 2021 at 7:16 pm #

You are probably right, even if I believe that the validation data differs very little from the training data and there is actually a lot of test data. But there must be some reason. I will repeat cv again.
Many Thanks!

Tom April 19, 2021 at 7:44 pm #

Hi Jason and thank you for this and other tutorials.

In the final code of…
# evaluate an xgboost regression model on the housing dataset
I do understand that sklearn is used to EVALUATE => model = XGBRegressor() where XGBRegressor() has default parameter values.

However in the 2nd final code of…
# fit a final xgboost model on the housing dataset and make a prediction
I do not understand how a FINAL XGBOOST MODEL has been arrived at.

OK so I’m assuming the word ‘final’ maybe should be replaced by ‘default’?

If I am correct then how is a FINAL model arrived at in the real world?
Is it about parameter tuning?

Thanks

- Jason Brownlee April 20, 2021 at 5:56 am #
  
  Final here means the model fit on all data and used to make predictions on new data.
  
  Indeed, you will want to tune the hyperparameters in most cases.
  
ttbek November 1, 2021 at 2:50 am #

I don’t think it makes sense to do cross validation on the entire data here with no held out test set. I guess if we’re operating under the assumption of building a final production model per se, but that isn’t the assumption we use when comparing models. The housing data set is particularly sensitive to this because it has outliers and having them in only train or test makes a pretty big difference vs. being able to have them in both your train and “test” as you do CV. Maybe I missed the part of the code where the test is held out or I don’t understand everything done within RepeatedKFold?

I’m curious about the following: “Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

Are these numbers derived from your own experiments without a held out test set? What sort of model is a “naive” model here? I have not seen a 1.9 achieved on a held out test set elsewhere, so if you have a reference that would be great (I haven’t followed the housing data set competitions much, etc… but am trying to see how a method I am using now stacks up, I guess a pretty average run of the method I’m using has an MAE around 3, while an exceptional run can be as low as 2.3408, there is sampling involved that gives the randomness). So it is possible for it to sometimes do better than less tuned xgboost results with a held out test set, e.g. https://www.kaggle.com/shreayan98c/boston-house-price-prediction/notebook that had an MAE of 2.45 on the test set, but that didn’t use any CV in the training set (i.e. no validation set).

Reply
Sofia V. December 9, 2021 at 2:35 am #

Hello to everyone!! 🙂

I have a question!

Can we implement also the XGBoost Ranker with your code?

Thanks in advance!

Sofia

Reply
- Adrian Tam December 10, 2021 at 4:16 am #
  
  Should be possible. Can you try?
  
  Reply
Medlien December 30, 2021 at 10:16 pm #

Hi Jason,

I have two questions on your statement from above:

“Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.”

1. I understood from your post on the Zero Rule Algorithm how to find the MAE of a naive model with a train-test split. How do you do that with cross-validation?

2. How did you arrive at the MAE of a top-performing model which gives us the upper bound for the expected performance on a dataset?

Reply
- Medlien January 18, 2022 at 9:29 pm #
  
  Is this a stupid question? I am sorry, just in case.
  
  Reply
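
On the question of scoring a naive model with cross-validation, here is a minimal sketch. It assumes the naive model is one that always predicts the mean of the training targets (the Zero Rule idea for regression). The tutorial does not say which baseline produced the 6.6 figure, so that choice is an assumption. The synthetic data from make_regression is only a placeholder so the snippet runs offline, and the numbers it prints will not match those quoted above.

```python
# Sketch: MAE of a naive (mean-predicting) baseline under repeated k-fold CV.
# Assumption: synthetic placeholder data, not the housing dataset.
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# placeholder data; swap in your own X and y
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

# naive model: ignores the inputs and predicts the mean of the training-fold targets
naive = DummyRegressor(strategy='mean')

# 10 folds repeated 3 times gives 30 scores
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(naive, X, y, scoring='neg_mean_absolute_error', cv=cv)

# scikit-learn reports negated MAE, so flip the sign
scores = absolute(scores)
print('Naive MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Because the baseline is refit inside each fold, the mean is computed only from that fold's training rows, the same way any other model would be evaluated by cross_val_score.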
Alex Fontes May 16, 2022 at 9:50 am #

Hi Jason, I am trying to use XGBRegressor on a project, but it keeps returning the same value for a given input, even after re-fitting.
So, as a test, I came to this post and used your code above (Boston Housing dataset), and it is ALSO returning the same value (which is also identical to the value you got).

X shape: (506, 13)
y shape: (506,)
input row: [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2, 4.09, 1, 296.0, 15.3, 396.9, 4.98]
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078
Predicted: 24.0193386078

(PS: on each of the runs above, the model is refitted to (X, y).)

Do you get different predictions on each run with this code?
I’m using Python 3.10.3 and my libraries are all recent. I was hoping you or anyone else in the community could help point me in a direction to solve this issue?

Thank You !!!

Reply
- James Carmichael May 17, 2022 at 9:55 am #
  
  Hi Alex…Have you tried to implement your model in Google Colaboratory?
  
  Reply
Alex Fontes May 18, 2022 at 10:16 am #

Hi James, I appreciate your reply and thank you for pointing me to that resource.
As an experiment I wrote a simple code on my computer, and then ran it on Google Colab too.

This is the code (same on my computer and Google Colab):

from pandas import read_csv
import xgboost as xgb

path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
ds = read_csv(path, header=None).values

ds_train = xgb.DMatrix(ds[:500,:-1], label=ds[:500,-1:])
ds_test = xgb.DMatrix(ds[500:,:-1], label=ds[500:,-1:])

params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'reg:squarederror',
    'subsample': 0.8,
}
num_round = 100

for _ in range(5):
    bst = xgb.train(params, ds_train, num_round)
    preds = bst.predict(ds_test)
    print(preds)

***********************************************************
These are the predictions on my computer:
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]
[20.235838 23.819088 21.035912 28.117573 26.266716 21.39746 ]

And these are the predictions on Google Colab:
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]
[20.380007 23.985199 21.223272 28.555704 26.747416 21.575823]

So, the results differ when I run the same code in different environments, but in either case it still generates the same predictions every time I fit the model to the dataset. I have already tried different combinations of parameters, different wrappers (Sklearn, and XGB as above), and different datasets, and the outcome is always the same: equal predictions every time the model is fit and run. Is this how XGBoost is supposed to behave?

Again, I truly appreciate your help.

Reply
Emerson de Lemmus July 13, 2022 at 1:16 am #

This particular line:

# split data into input and output columns
X, y = data[:, :-1], data[:, -1]

Causes the following error: pandas.errors.InvalidIndexError: (slice(None, None, None), slice(None, -1, None)). In the example shown, ‘data’ is not defined, however ‘dataframe’ is.

The following fixed this error so the example worked:

# split data into input and output columns
X, y = dataframe.iloc[:, :-1], dataframe.iloc[:, -1]

Reply
- James Carmichael July 13, 2022 at 7:46 am #
  
  Thank you for the feedback Emerson!
  
  Reply
Lee September 22, 2022 at 7:14 pm #

Is there any reason why you didn’t split the dataset into train and test, like you do with other regression projects?

Reply
- James Carmichael September 23, 2022 at 5:55 am #
  
  Hi Lee…There is no reason and we agree that you should do so as best practice. The tutorial is showing an example of another concept, however your understanding is correct. Keep up the great work!
  
  Reply
Atena December 8, 2022 at 8:21 am #

Dear Dr Jason,
Can XGBoost be used on a small dataset with 5 features and 40 samples?

Reply
