How to Develop Your First XGBoost Model in Python with scikit-learn

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance that dominates competitive machine learning.

In this post, you will discover how to install XGBoost and develop your first XGBoost model in Python.

After reading this post you will know:

  • How to install XGBoost on your system for use in Python.
  • How to prepare data and train your first XGBoost model.
  • How to make predictions using your XGBoost model.

Let’s get started.

  • Update Jan/2017: Updated to reflect changes in scikit-learn API version 0.18.1.
  • Update Mar/2017: Added the missing import and made the imports clearer.
  • Update Mar/2018: Added an alternate link to download the dataset, as the original appears to have been taken down.

Tutorial Overview

This tutorial is broken down into the following 6 sections:

  1. Install XGBoost for use with Python.
  2. Problem definition and download dataset.
  3. Load and prepare data.
  4. Train XGBoost model.
  5. Make predictions and evaluate model.
  6. Tie it all together and run the example.

1. Install XGBoost for Use in Python

Assuming you have a working SciPy environment, XGBoost can be installed easily using pip.

For example:
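
# you may need pip3 instead of pip, or to drop sudo, depending on your setup
sudo pip install xgboost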

To update your installation of XGBoost you can type:
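
sudo pip install --upgrade xgboost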

If you cannot use pip, or you want to run the latest code from GitHub, an alternate way to install XGBoost is to clone the XGBoost project and perform a manual build and installation.

For example to build XGBoost without multithreading on Mac OS X (with GCC already installed via macports or homebrew), you can type:
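
The exact steps change between XGBoost releases, so treat this as a sketch of the process at the time of writing and check the installation guide below for current instructions:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/minimum.mk ./config.mk
make -j4
cd python-package
sudo python setup.py install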

You can learn more about how to install XGBoost for different platforms on the XGBoost Installation Guide. For up-to-date instructions for installing XGBoost for Python see the XGBoost Python Package.

For reference, you can review the XGBoost Python API reference.

2. Problem Description: Predict Onset of Diabetes

In this tutorial we are going to use the Pima Indians onset of diabetes dataset.

The dataset comprises 8 input variables that describe medical details of patients and one output variable that indicates whether the patient will have an onset of diabetes within 5 years.

You can learn more about this dataset on the UCI Machine Learning Repository website.

This is a good dataset for a first XGBoost model because all of the input variables are numeric and the problem is a simple binary classification problem. It is not necessarily a good problem for the XGBoost algorithm because it is a relatively small dataset and an easy problem to model.

Download this dataset and place it into your current working directory with the file name “pima-indians-diabetes.csv” (update: download from here).

3. Load and Prepare Data

In this section we will load the data from file and prepare it for training and evaluating an XGBoost model.

We will start off by importing the classes and functions we intend to use in this tutorial.
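
Assuming scikit-learn 0.18 or later (earlier versions provided train_test_split in the sklearn.cross_validation module instead):

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score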

Next, we can load the CSV file as a NumPy array using the NumPy function loadtxt().
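
For example, assuming the file is in the current working directory under the name given above:

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")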

We must separate the columns (attributes or features) of the dataset into input patterns (X) and output patterns (Y). We can do this easily by specifying the column indices in the NumPy array format.
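
The first 8 columns are the inputs and the 9th is the class label:

# split columns into input features (X) and the output variable (Y)
X = dataset[:,0:8]
Y = dataset[:,8]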

Finally, we must split the X and Y data into a training and test dataset. The training set will be used to prepare the XGBoost model and the test set will be used to make new predictions, from which we can evaluate the performance of the model.

For this we will use the train_test_split() function from the scikit-learn library. We also specify a seed for the random number generator so that we always get the same split of data each time this example is executed.
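
The 33% test size and the seed of 7 below are arbitrary but reproducible choices:

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)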

We are now ready to train our model.

4. Train the XGBoost Model

XGBoost provides a wrapper class to allow models to be treated like classifiers or regressors in the scikit-learn framework.

This means we can use the full scikit-learn library with XGBoost models.

The XGBoost model for classification is called XGBClassifier. We can create it and fit it to our training dataset. Models are fit using the scikit-learn API and the model.fit() function.

Parameters for training the model can be passed to the model in the constructor. Here, we use sensible defaults.
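
A minimal fit using the constructor defaults looks like this:

# create the model and fit it on the training data
model = XGBClassifier()
model.fit(X_train, y_train)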

You can see the parameters used in a trained model by printing the model, for example:
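
print(model)

This prints the constructor arguments, e.g. XGBClassifier(base_score=0.5, …); the exact list of parameters and their defaults depends on your XGBoost version.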

You can learn more about the defaults for the XGBClassifier and XGBRegressor classes in the XGBoost Python scikit-learn API.

You can learn more about the meaning of each parameter and how to configure them on the XGBoost parameters page.

We are now ready to use the trained model to make predictions.

5. Make Predictions with XGBoost Model

We can make predictions using the fit model on the test dataset.

To make predictions we use the scikit-learn function model.predict().

In the version of the XGBoost wrapper used when this post was written, the predictions made by model.predict() are probabilities. Because this is a binary classification problem, each prediction is the probability of the input pattern belonging to the first class. We can easily convert them to binary class values by rounding them to 0 or 1 (in recent versions, predict() returns class labels directly, and the rounding is a harmless no-op).
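
# make predictions for test data
y_pred = model.predict(X_test)
# round to binary class values (a no-op if predict() already returns labels)
predictions = [round(value) for value in y_pred]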

Now that we have used the fit model to make predictions on new data, we can evaluate the performance of the predictions by comparing them to the expected values. For this we will use the built-in accuracy_score() function in scikit-learn.
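
# evaluate predictions against the expected values
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))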

6. Tie it All Together

We can tie all of these pieces together; the full code listing is below.
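
This listing is assembled from the snippets above; the file name, seed, and test size are the choices made earlier in the tutorial.

# First XGBoost model for the Pima Indians onset of diabetes dataset
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split columns into input features (X) and the output variable (Y)
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))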

Running this example produces the following output.
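
The exact figure can vary slightly with your XGBoost and scikit-learn versions, but it should be close to:

Accuracy: 77.95%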

This is a good accuracy score on this problem, which we would expect, given the capabilities of the model and the modest complexity of the problem.

Summary

In this post you discovered how to develop your first XGBoost model in Python.

Specifically, you learned:

  • How to install XGBoost on your system ready for use with Python.
  • How to prepare data and train your first XGBoost model on a standard machine learning dataset.
  • How to make predictions and evaluate the performance of a trained XGBoost model using scikit-learn.

Do you have any questions about XGBoost or about this post? Ask your questions in the comments and I will do my best to answer.


73 Responses to How to Develop Your First XGBoost Model in Python with scikit-learn

  1. Qichang Feng August 26, 2016 at 8:21 pm #

    Hi Jason,

    First of all thanks for all your great posts. I have learned a lot from them.

    I have a question regarding the code separating the input features X and the response variable Y. It seems you include the last column in the features as well, which should not be the case.

    X = dataset[:,0:8]

    The correct one should be X = dataset[:, 0:7] to match 8 input variables for the medical details of patients.

    The error happened in your mini-course handbook as well.

    • Jason Brownlee August 27, 2016 at 11:32 am #

      You’re welcome Qichang.

      Perhaps you are getting different results based on the version of Python or Numpy you are using.

      I can confirm that the code in the post is correct: there are 9 columns, and only the first 8 are stored in X, with the 9th stored in Y. Checking X.shape and Y.shape confirms the split.

      Does that help?

      Tested on Python 2.7.11 and numpy 1.11.1.

      • Qichang August 28, 2016 at 10:27 am #

        Hi Jason,

        Thanks a lot for your quick reply. It was my mistake, as I was confused by 0:8; I have also been learning R recently. In R the last number of 0:8 is included, while in Python it is excluded. I should have checked the shape.

        Thanks again.

  2. Joao Pires September 21, 2016 at 6:42 am #

    Hi
    I ran the code and I got this error:
    model = xgboost.XGBClassifier()
    AttributeError: ‘module’ object has no attribute ‘XGBClassifier’

    Do you know why?

    Thanks

  3. SG Huang September 29, 2016 at 7:40 pm #

    Thanks Jason for the clear guide.

    What are the normal ways to improve accuracy in practice? Should we do some feature engineering, or change to a different model?

    I have learned the basics of machine learning through online courses, but there is still a gap between what I learned in the courses and the practical problems such as the competitions on Kaggle. Can you share some insights?

    • Jason Brownlee September 30, 2016 at 7:51 am #

      I would recommend trying some feature engineering first.

      Try some new framings of the problem.

      Then later try algorithm tuning and ensemble methods.

      I have a list of things to try in the following post; it talks about deep learning, but the techniques are general enough for most methods:
      http://machinelearningmastery.com/improve-deep-learning-performance/

      I hope that helps as a start.

  4. Jessica November 11, 2016 at 4:39 am #

    Thank you for this, it’s extremely helpful.

    I wrote a model for my data last night, and it performed very well.
    I tried to re-run it today, and it gave me an error trying to import xgboost.

    I typed in “import xgboost”
    And I got: “ImportError: No module named xgboost”

    • Jason Brownlee November 11, 2016 at 10:06 am #

      Sorry to hear that Jessica.

      I wonder if something changed with your environment.

      Perhaps try running everything from the command line.
      Confirm you’re using the same user.
      Confirm xgboost is still installed on the system (pip show or something…)

  5. Trupti November 21, 2016 at 5:26 pm #

    Hello, thanks for the fantastic explanation!
    I have a query. Can we get the list of significant variables that entered the model? How do we read feature_importances_?
    Also, how do we fine-tune the xgboost model?
    Thanks again!

  6. Trupti November 21, 2016 at 7:55 pm #

    Hello. Thanks for the explanation!
    Can you tell me if I can see the list of variables entering the model? Also, how do we fine-tune the model further?
    Once we have the xgboost model, how do we productionise it? In logistic regression we get an equation which can be automated to run in real-time production; what do we get in xgboost?

  7. Peter Tan December 8, 2016 at 8:26 am #

    Hi Jason, I am running into the same issue as some of the readers here:

    AttributeError: ‘module’ object has no attribute ‘XGBClassifier’

    To ensure I did not have any typo, I have created a complete copy of your sample code and I still get the same issue.

    (I do have import xgboost in my code).

    I am using xgboost 0.6a2 with anaconda2-4.2.0. Just wondering if you have run into similar issues.

  8. Hector December 30, 2016 at 1:29 pm #

    Hello Jason, I ran the example code here and one error returned as:

    File “./test.py”, line 21
    model = xgboost.XGBClassifier()
    ^
    SyntaxError: invalid syntax

    Can you tell me what I did wrong? I can successfully import the packages.

    I am using python 3.5 and xgboost 0.6.

    • Jason Brownlee December 31, 2016 at 7:02 am #

      Perhaps a copy paste error? Check for extra white space in your copy of the code.

  9. Trupti January 7, 2017 at 5:31 pm #

    I am using predict_proba to create predicted probabilities from the xgboost model. Can I save these probabilities in the same training data on which the model is built, so that I can create reports to show management for validation of the scorecard?

    • Jason Brownlee January 8, 2017 at 5:20 am #

      Sorry, I don’t think I understand.

      Predicted probabilities on the training dataset will be biased. You may want to report on the probabilities for a hold-out dataset.

  10. Niranjan March 14, 2017 at 3:23 am #

    Hi, it was a very nice intro to xgboost. Please add an import for the train_test_split function.

  11. Keren March 27, 2017 at 12:15 am #

    Hi Jason,
    I didn’t manage to find a clear explanation for the way the probabilities given as output by predict_proba() are computed.

    In random forest for example, I understand it reflects the mean of proportions of the samples belonging to the class among the relevant leaves of all the trees.

    However in XGBoost I couldn’t understand the computation from the documentation or the code. Shouldn’t it give different weights for each tree?

    • Jason Brownlee March 27, 2017 at 7:56 am #

      Good question Keren, I’m not sure off hand.

      You could check some of the original stochastic gradient boosting papers or even reach out to the xgboost authors.

  12. Niranjan April 20, 2017 at 8:31 pm #

    Hi Jason, thank you for such a nice explanation. Would you help me out with how to print the training accuracy when we call the fit function in xgboost?

  13. sumi May 25, 2017 at 3:52 pm #

    Hi,

    Thank you for your post. It was really helpful. But can you tell me why I get ‘ImportError: cannot import name XGBClassifier’ when I run this code? I have installed XGBoost successfully and I still have this error. Please help me.

  14. vishwas May 25, 2017 at 10:20 pm #

    How do I combine an XGBoost classifier and deep learning to create an ensemble (voting classifier)? Can you please elaborate more on ensemble techniques?

  15. joao June 10, 2017 at 6:29 pm #

    In your step by step explanation you have: “from xgboost import XGBClassifier” and then you use: “model = xgboost.XGBClassifier()”. This will give an error.
    In the full code you have it right though.

  16. Mahmoud July 18, 2017 at 6:56 pm #

    Hello Dr Jason, thanks for the quick cool tutorial. It is fundamental and very beneficial.
    One question: how do I use the GPU for training and prediction in XGBoost? I am working on a large dataset. Thanks a lot in advance.

  17. Bhupendra singh October 6, 2017 at 5:54 am #

    Hey! This performed very well, but how will I know which features are selected?

  18. Bhupendra singh October 6, 2017 at 5:55 am #

    sorry I asked a wrong question …

  19. xuyuewei October 25, 2017 at 7:39 pm #

    Thanks a lot

  20. Eric Wu November 11, 2017 at 4:56 am #

    Gee, the 20 or so lines of code are the basic recipe for almost all supervised learning tasks, and XGBoost is like the default algorithm. I wish there were a way I could “double” bookmark this page. Well done!

  21. kono November 14, 2017 at 8:52 am #

    Hi Jason,

    XGBClassifier’s default objective is binary:logistic. For binary:logistic, is the objective function the summation of logloss? If so, why does XGBoost use “error” (accuracy score) as the default evaluation metric instead of “logloss”?

    https://github.com/dmlc/xgboost/blob/master/doc/parameter.md#learning-task-parameters

    Kono

  22. mit December 12, 2017 at 6:41 pm #

    Could you please give me an example of how a model should be developed using training data and then tested on the test data?

    I mean, How I can do the following:

    1. Use training data to develop model and use test data to predict;
    2. Use the combined data set (Train and test dataset) and apply Cross-validation.

  23. Frankli December 13, 2017 at 2:01 pm #

    Hi, Jason

    How do I adjust the parameters in this model?

    It seems that this black box can do everything, but we don’t know the details inside it.

    • Jason Brownlee December 13, 2017 at 4:14 pm #

      You can use the hyperparameters to change the way the model is trained.

  24. frankli December 13, 2017 at 5:31 pm #

    Thanks, but what are hyperparameters? A package in xgboost?
    Any sample code?

  25. Nasir January 13, 2018 at 8:25 am #

    Hi Jason

    Thanks for the very nice tutorial. I would appreciate it if you could give me some advice.
    I have vibration data (structured format). I am using deep learning with Keras on TensorFlow. But I read that “Specifically, gradient boosting is used for problems where structured data is available, whereas deep learning is used for perceptual problems such as image classification. Practitioners of the former almost always use the excellent XGBoost library, which offers support for the two most popular languages of data science: Python and R.”

    I am very confused and would like to know your expert opinion: do I have to switch and use gradient boosting? I am interested in using it for regression.

    Hope to hear from you.

    Nasir

  26. Pratip January 30, 2018 at 7:51 pm #

    Hello sir ,

    I am trying to use this:
    from xgboost import XGBClassifier

    but it gives me the error: cannot import name ‘XGBClassifier’.

    But when I import xgboost, it works.
    Can you tell me why it’s not working?

  27. Matthew March 10, 2018 at 12:01 am #

    The diabetes dataset link is returning a 404. Any idea where it has gone?

  28. Gary March 19, 2018 at 8:31 am #

    I’m getting an error: XGBoostError: sklearn needs to be installed in order to use this module. However, I _do_ have sklearn installed in the active environment (and in all the others, I think).

  29. Deep March 25, 2018 at 9:31 pm #

    I am new to ML concepts, and your examples are very helpful and simple to understand.

    I have recreated the same example based on my data.

    My code below:

    model = XGBClassifier()
    model.fit(X_test,Y_test)

    Q = vectorizer.transform([“I want to play online game”]).toarray()
    pred_data = model.predict(Q)

    I am getting correct predictions, but how can I get the score of a prediction?
    I used predict_proba from xgboost and am getting all the scores, but is this the right way to get the score of my prediction, or is there some other way?

  30. File April 8, 2018 at 6:55 pm #

    Thank you Jason, this blog really helps a lot.
    And I noticed that the dataset you referred to is not available anymore. Could you recommend another binary classification dataset, please? Thanks.

  31. IrriAnalytics May 3, 2018 at 7:16 pm #

    How can I obtain the set of decision rules (cuts on the features) once I have built the model?

    • Jason Brownlee May 4, 2018 at 7:41 am #

      Good question. Generally this is not feasible, given that there may be hundreds or thousands of trees in the model.

  32. charliew May 9, 2018 at 3:12 am #

    Thanks for the work. I ran into an error when trying to do:

    model = XGBClassifier(objective=’multi:softprob’)
    model.fit(X_train, Y_train)

    the error is: b’value 0for Parameter num_class should be greater equal to 1′

    It works fine if I don’t specify objective=’multi:softprob’. Just wondering if you have any experience with XGBClassifier(objective=’multi:softprob’)?

    Thanks

    • Jason Brownlee May 9, 2018 at 6:26 am #

      Sorry, I have not seen this error.

      Perhaps post to stackoverflow?

  33. Kate May 20, 2018 at 9:52 pm #

    Hi!
    I’m currently experimenting with XGBoost for an important project and have uploaded a question on StackOverflow. I just read this post and it is clearer to me now, but you do not use the xgboost.train method. Is this included in the XGBRegressor wrapper? I did use xgboost.train, which gave me an error, while xgboost.fit does not produce this error. Could you maybe take a look at it?
    https://stackoverflow.com/questions/50426680/xgboost-gives-keyerror-best-msg

    Thanks in advance!
    Kind regards

    • Jason Brownlee May 21, 2018 at 6:30 am #

      Perhaps you can summarize your problem for me in one or two lines?

  34. Michael June 7, 2018 at 3:36 pm #

    Hi,

    I am using the XGBRegressor wrapper to predict the sales of a product. There are 50 products, and I want to know the coefficients, as in linear regression, to see how much each product’s sales affects the dependent sales variable. Say Y = B1X1 + B2X2 + … + BnXn + C; I want the values of B1, B2, …, Bn from the tree regressor (XGBRegressor).

    • Jason Brownlee June 8, 2018 at 6:05 am #

      An xgboost model is different from a linear regression. There is no list of coefficients, just a ton of trees.

      • Michael June 8, 2018 at 11:38 pm #

        Thanks, but is there a way I can determine what percentage of other products’ sales is affecting the sales of my dependent (sales) variable?

        • Jason Brownlee June 9, 2018 at 6:54 am #

          Yes, but this might be a question of statistical methods, not predictive modeling.

  35. Todd June 8, 2018 at 6:38 am #

    Jason, thanks for the great article (and site).
    I have a text classification problem that I normally use Logistic Regression to solve, so I’m used to transforming the features in order to fit a model, but I normally don’t have to do anything to the text labels. The labels are text categories (e.g. labels = [‘cancel’, ‘change’, ‘contact support’, etc.]). I am now receiving the error:

    dtrain = xgb.DMatrix(X_train_dtm, label=y_train)

    TypeError: must be real number, not str

    y_train is text data. How should I start solving this? Any pointers? Do I need to apply some sort of transformation to the labels?
