Last Updated on August 27, 2020
XGBoost can be used to create some of the most performant models for tabular data using the gradient boosting algorithm.
Once trained, it is often a good practice to save your model to file for later use in making predictions on new test and validation datasets and on entirely new data.
In this post, you will discover how to save your XGBoost models to file using the standard Python pickle API and the Joblib library.
After completing this tutorial, you will know:
- How to save and later load your trained XGBoost model using pickle.
- How to save and later load your trained XGBoost model using joblib.
Kick-start your project with my new book XGBoost With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
- Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.
- Update Oct/2019: Updated to use Joblib API directly.

How to Save Gradient Boosting Models with XGBoost in Python
Photo by Keoni Cabral, some rights reserved.
Need help with XGBoost in Python?
Take my free 7-day email course and discover xgboost (with sample code).
Click to sign-up now and also get a free PDF Ebook version of the course.
Serialize Your XGBoost Model with Pickle
Pickle is the standard way of serializing objects in Python.
You can use the Python pickle API to serialize your machine learning algorithms and save the serialized format to a file, for example:
```python
# save model to file
pickle.dump(model, open("pima.pickle.dat", "wb"))
```
Later you can load this file to deserialize your model and use it to make new predictions, for example:
```python
# load model from file
loaded_model = pickle.load(open("pima.pickle.dat", "rb"))
```
The example below demonstrates how you can train an XGBoost model on the Pima Indians onset of diabetes dataset, save the model to file, and later load it to make predictions.
Download the dataset and save it to your current working directory.
The full code listing is provided below for completeness.
```python
# Train XGBoost model, save to file using pickle, load and make predictions
from numpy import loadtxt
import xgboost
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)
# save model to file
pickle.dump(model, open("pima.pickle.dat", "wb"))
# some time later...
# load model from file
loaded_model = pickle.load(open("pima.pickle.dat", "rb"))
# make predictions for test data
y_pred = loaded_model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
Running this example saves your trained XGBoost model to the pima.pickle.dat pickle file in the current working directory.
```
pima.pickle.dat
```
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
After loading the model and making predictions on the test dataset, the accuracy of the model is printed.
```
Accuracy: 77.95%
```
Serialize Your XGBoost Model with Joblib
Joblib is part of the SciPy ecosystem and provides utilities for pipelining Python jobs.
The Joblib API provides utilities for efficiently saving and loading Python objects that make use of NumPy data structures. It may be a faster approach for very large models.
The API looks a lot like the pickle API; for example, you can save your trained model as follows:
```python
# save model to file
joblib.dump(model, "pima.joblib.dat")
```
You can later load the model from file and use it to make predictions as follows:
```python
# load model from file
loaded_model = joblib.load("pima.joblib.dat")
```
The example below demonstrates how you can train an XGBoost model for classification on the Pima Indians onset of diabetes dataset, save the model to file using Joblib and load it at a later time in order to make predictions.
```python
# Train XGBoost model, save to file using joblib, load and make predictions
from numpy import loadtxt
from xgboost import XGBClassifier
from joblib import dump
from joblib import load
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# save model to file
dump(model, "pima.joblib.dat")
print("Saved model to: pima.joblib.dat")
# some time later...
# load model from file
loaded_model = load("pima.joblib.dat")
print("Loaded model from: pima.joblib.dat")
# make predictions for test data
predictions = loaded_model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```
Running the example saves the model to file as pima.joblib.dat in the current working directory and also creates one file for each NumPy array within the model (in this case two additional files).
```
pima.joblib.dat
pima.joblib.dat_01.npy
pima.joblib.dat_02.npy
```
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
After the model is loaded, it is evaluated on the test dataset and the accuracy of the predictions is printed.
```
Accuracy: 77.95%
```
Summary
In this post, you discovered how to serialize your trained XGBoost models and later load them in order to make predictions.
Specifically, you learned:
- How to serialize and later load your trained XGBoost model using the pickle API.
- How to serialize and later load your trained XGBoost model using the joblib API.
Do you have any questions about serializing your XGBoost models or about this post? Ask your questions in the comments and I will do my best to answer.
Hi, Jason. Thank you for sharing your knowledge and I enjoy reading your posts.
By the way, is there any point to pickling an XGBoost model instead of using something like xgb.Booster(model_file='model.model')?
Here is my experiment.
```python
%timeit model = xgb.Booster(model_file='model.model')
# 118 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

pickle.dump(model, open("model.pickle", "wb"))
%timeit loaded_model = pickle.load(open("model.pickle", "rb"))
# 139 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
I am currently looking for a better way to use the XGBoost model in production. My concern is that reading a file might be slow if there are many requests from the client side.
I don’t think so, it really depends on your project/code.
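One common pattern is to load the model once when the process starts and then reuse it for every request, so the file read is paid only once. A minimal sketch (the file name and function here are illustrative, not from this post):
```python
# Sketch: load the pickled model once at import time and reuse it per request,
# so file I/O is not paid on every prediction. Names are illustrative.
import pickle

MODEL = pickle.load(open("model.pickle", "rb"))  # loaded once at startup

def predict_request(features):
    # reuse the in-memory model for each incoming request
    return MODEL.predict(features)
```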
Maybe due to the sklearn version, running the code as-is results in an error: 'cross_validation' is not found. Revising the relevant import statement to 'from sklearn.model_selection import train_test_split' (while keeping 'train_test_split'), just like in your other XGBoost tutorial, solves this.
You must use scikit-learn version 0.18 or higher.
Hi Great Post,
It would be great if you could write a tutorial detailing how to convert an xgboost model to PMML. Some explanations on PMMLPipeline and how to properly use it to generate PMML using sklearn2pmml would be really helpful.
Thanks for the suggestion.
Hi, Jason. I enjoy reading your posts. Per my understanding, there is no cross_validation object in sklearn. You might remove it from cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed).
XGBClassifier is a black box to me. After the model has been built I can see the feature importances and the accuracy. What else do I need to know?
What information has been saved via pickle.dump? Is it possible to see the content of pima.pickle.dat in a meaningful way?
Thanks,
Sophia
You can enumerate the folds manually if needed. Or use the model wrapper.
If you want to know how it works, see this:
https://machinelearningmastery.com/start-here/#xgboost
No.
Hey Jason, I got an error when I tried to dump a model trained by the XGBoost algorithm:
AttributeError: function 'XGBoosterSerializeToBuffer' not found
Neither pickle nor joblib dump works, unfortunately, but dump and load work for a regression algorithm, for example.
Do you have any suggestions ?
Ouch. I have not seen that before sorry.
Perhaps try posting code/error to stackoverflow?
Hey, I solved the issue by moving the xgboost-related files to the right path so that it can call the right library. But it still does not always work when the virtual environment changes.
Perhaps avoid virtual environments?
Hi,
I have a dataset and in particular I’m dealing with a multi-class classification problem. The target variable consists of 22 unique classes. Can I know if it’s necessary to encode the target variable with one hot encoding or is it just fine to leave it with label encoding?
Because if I encode the target variable with one hot encoding, it will result in many new columns.
Label encoding is fine for xgboost.
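For example, here is a minimal sketch of label encoding a string target before fitting; the toy data and variable names are illustrative only:
```python
# Sketch: label-encode a string target as integers for XGBClassifier,
# rather than one hot encoding it into many new columns.
from numpy import array
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

X = array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # toy features
y = array(["cat", "dog", "cat", "bird"])                     # toy string labels

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # e.g. bird->0, cat->1, dog->2
model = XGBClassifier()
model.fit(X, y_encoded)
# map integer predictions back to the original class names
print(le.inverse_transform(model.predict(X)))
```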
Hi
I want to break this original multi-class classification problem into a set of multi-class sub problems. Is it okay to decide the class hierarchies manually, if I know how to categorize the classes in the original dataset to smaller sub problems, in advance?
Because I found some research papers proposing ways to deduce the class hierarchies automatically for multi-class classification problems by various methods, such as a similarity matrix.
Sure.
Does anyone know if this method has been deprecated?
cross_validation.train_test_split
I’m using sklearn version 0.22.1
The method appears in the documentation for version 0.16.1 but I can't find any information on whether or not it's been deprecated.
It has moved:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
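For example, the updated import and call look like this (toy data used for illustration):
```python
# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from numpy import arange
from sklearn.model_selection import train_test_split

X = arange(20).reshape(10, 2)  # toy features
y = arange(10)                 # toy target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
print(X_train.shape, X_test.shape)
```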
Hi Jason,
This tutorial is very helpful; your tutorials are awesome. I just have a quick question: I would like to know if we can print the logistic regression coefficients from an XGBoost model. I am training a model that eventually I will deploy into a SQL Server, but I need the actual coefficient estimates.
Thank you so much…
Thanks!
You can print logistic regression coefficients from a logistic regression model:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
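For example, a minimal sketch using scikit-learn's LogisticRegression on generated data; note that an XGBoost tree ensemble has no such coefficients of its own:
```python
# Sketch: print coefficients from a scikit-learn logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=7)
model = LogisticRegression().fit(X, y)
print(model.coef_)       # one weight per input feature
print(model.intercept_)  # bias term
```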
Hi Jason,
I trained an XGB model in a Kaggle notebook and dumped the model using pickle, the way you mentioned.
But when I try loading this pickle model from my laptop, I get the error:
AttributeError: Can't get attribute 'XGBoostLabelEncoder' on
The code I used for loading the model:
```python
import pickle
from xgboost import XGBClassifier
import xgboost
model = pickle.load(open('./filename.pkl', 'rb'))
```
I can’t seem to understand what did I miss.
Sorry to hear that.
Perhaps confirm that you are using an identical version of the library on your workstation as you did on Kaggle?
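For example, you could print and compare the versions in both environments:
```python
# check the installed library versions on Kaggle and on your laptop;
# pickles are only reliable across identical versions
import xgboost
import sklearn
print("xgboost:", xgboost.__version__)
print("sklearn:", sklearn.__version__)
```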
Hey, can anyone tell me how to save the model from the best iteration, which I got from early stopping, using pickle? Normally it saves the last iteration, not the best iteration.
Thank you.
Are you sure? It does not sound right to me.
Early stopping will stop training given your criteria – the model at the time the training is stopped is the “best”.
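As a sketch, assuming an xgboost version whose fit() accepts early_stopping_rounds directly, the fitted model records the best iteration and can be pickled as usual:
```python
# Sketch: early stopping with the sklearn wrapper, then pickle the result.
# Assumes an xgboost version where fit() takes early_stopping_rounds.
import pickle
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X, y = dataset[:, 0:8], dataset[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

model = XGBClassifier()
model.fit(X_train, y_train, early_stopping_rounds=10,
          eval_set=[(X_test, y_test)], verbose=False)
print("best iteration:", model.best_iteration)  # recorded by early stopping
pickle.dump(model, open("pima.best.pickle.dat", "wb"))
```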
Hi Jason
The documentation now says that pickle and joblib should be avoided, because models won’t be loadable if you upgrade xgboost and they’ve changed the binary format.
Instead they say use save_model and load_model.
Could you update your examples and your book (which I bought)?
See https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
Thanks William, I’ll update the examples ASAP.
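In the meantime, here is a minimal sketch of the recommended approach, assuming xgboost 1.0 or later (where the portable JSON format is available):
```python
# Sketch: use xgboost's own save_model/load_model instead of pickle or joblib;
# the .json format is designed to survive xgboost version upgrades.
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X, y = dataset[:, 0:8], dataset[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)

model = XGBClassifier()
model.fit(X_train, y_train)
model.save_model("pima.model.json")  # version-stable, unlike pickle

loaded = XGBClassifier()
loaded.load_model("pima.model.json")
print(loaded.predict(X_test)[:5])
```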