The post How to Use XGBoost for Time Series Forecasting appeared first on Machine Learning Mastery.

]]>It is both fast and efficient, performing well, if not the best, on a wide range of predictive modeling tasks and is a favorite among data science competition winners, such as those on Kaggle.

XGBoost can also be used for time series forecasting, although it requires that the time series dataset be transformed into a supervised learning problem first. It also requires the use of a specialized technique for evaluating the model called walk-forward validation, as evaluating the model using k-fold cross validation would result in optimistically biased results.

In this tutorial, you will discover how to develop an XGBoost model for time series forecasting.

After completing this tutorial, you will know:

- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Aug/2020**: Fixed bug in the calculation of MAE, updated model config to make better predictions (thanks Kaustav!)

This tutorial is divided into three parts; they are:

- XGBoost Ensemble
- Time Series Data Preparation
- XGBoost for Time Series Forecasting

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.

— XGBoost: A Scalable Tree Boosting System, 2016.

It is an ensemble of decision trees algorithm where new trees fix errors of those trees that are already part of the model. Trees are added until no further improvements can be made to the model.

XGBoost provides a highly efficient implementation of the stochastic gradient boosting algorithm and access to a suite of model hyperparameters designed to provide control over the model training process.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.

— XGBoost: A Scalable Tree Boosting System, 2016.

XGBoost is designed for classification and regression on tabular datasets, although it can be used for time series forecasting.

For more on the gradient boosting and XGBoost implementation, see the tutorial:

First, the XGBoost library must be installed.

You can install it using pip, as follows:

sudo pip install xgboost

Once installed, you can confirm that it was installed successfully and that you are using a modern version by running the following code:

# xgboost import xgboost print("xgboost", xgboost.__version__)

Running the code, you should see the following version number or higher.

xgboost 1.0.1

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBRegressor wrapper class.

An instance of the model can be instantiated and used just like any other scikit-learn class for model evaluation. For example:

... # define model model = XGBRegressor()

Now that we are familiar with XGBoost, let’s look at how we can prepare a time series dataset for supervised learning.

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and use the next time step as the output variable.

Let’s make this concrete with an example. Imagine we have a time series as follows:

time, measure 1, 100 2, 110 3, 108 4, 115 5, 120

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step.

Reorganizing the time series dataset this way, the data would look as follows:

X, y ?, 100 100, 110 110, 108 108, 115 115, 120 120, ?

Note that the time column is dropped and some rows of data are unusable for training a model, such as the first and the last.

This representation is called a sliding window, as the window of inputs and expected outputs is shifted forward through time to create new “*samples*” for a supervised learning model.

For more on the sliding window approach to preparing time series forecasting data, see the tutorial:

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better-performing models.

The function below will take a time series as a NumPy array time series with one or more columns and transform it into a supervised learning problem with the specified number of inputs and outputs.

# transform a time series dataset into a supervised learning dataset def series_to_supervised(data, n_in=1, n_out=1, dropnan=True): n_vars = 1 if type(data) is list else data.shape[1] df = DataFrame(data) cols = list() # input sequence (t-n, ... t-1) for i in range(n_in, 0, -1): cols.append(df.shift(i)) # forecast sequence (t, t+1, ... t+n) for i in range(0, n_out): cols.append(df.shift(-i)) # put it all together agg = concat(cols, axis=1) # drop rows with NaN values if dropnan: agg.dropna(inplace=True) return agg.values

We can use this function to prepare a time series dataset for XGBoost.

For more on the step-by-step development of this function, see the tutorial:

Once the dataset is prepared, we must be careful in how it is used to fit and evaluate a model.

For example, it would not be valid to fit the model on data from the future and have it predict the past. The model must be trained on the past and predict the future.

This means that methods that randomize the dataset during evaluation, like k-fold cross-validation, cannot be used. Instead, we must use a technique called walk-forward validation.

In walk-forward validation, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in making a one-step forecast, e.g. one month, then we can evaluate the model by training on the training dataset and predicting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, then have the model predict the second step in the test dataset.

Repeating this process for the entire test dataset will give a one-step prediction for the entire test dataset from which an error measure can be calculated to evaluate the skill of the model.

For more on walk-forward validation, see the tutorial:

The function below performs walk-forward validation.

It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments.

It then steps through the test set, calling the *xgboost_forecast()* function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.

# walk-forward validation for univariate data def walk_forward_validation(data, n_test): predictions = list() # split dataset train, test = train_test_split(data, n_test) # seed history with training dataset history = [x for x in train] # step over each time-step in the test set for i in range(len(test)): # split test row into input and output columns testX, testy = test[i, :-1], test[i, -1] # fit model on history and make a prediction yhat = xgboost_forecast(history, testX) # store forecast in list of predictions predictions.append(yhat) # add actual observation to history for the next loop history.append(test[i]) # summarize progress print('>expected=%.1f, predicted=%.1f' % (testy, yhat)) # estimate prediction error error = mean_absolute_error(test[:, -1], predictions) return error, test[:, 1], predictions

The *train_test_split()* function is called to split the dataset into train and test sets.

We can define this function below.

# split a univariate dataset into train/test sets def train_test_split(data, n_test): return data[:-n_test, :], data[-n_test:, :]

We can use the *XGBRegressor* class to make a one-step forecast.

The *xgboost_forecast()* function below implements this, taking the training dataset and test input row as input, fitting a model, and making a one-step prediction.

# fit an xgboost model and make a one step prediction def xgboost_forecast(train, testX): # transform list into array train = asarray(train) # split into input and output columns trainX, trainy = train[:, :-1], train[:, -1] # fit model model = XGBRegressor(objective='reg:squarederror', n_estimators=1000) model.fit(trainX, trainy) # make a one-step prediction yhat = model.predict([testX]) return yhat[0]

Now that we know how to prepare time series data for forecasting and evaluate an XGBoost model, next we can look at using XGBoost on a real dataset.

In this section, we will explore how to use XGBoost for time series forecasting.

We will use a standard univariate time series dataset with the intent of using the model to make a one-step forecast.

You can use the code in this section as the starting point in your own project and easily adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts.

We will use the daily female births dataset, that is the monthly births across three years.

You can download the dataset from here, place it in your current working directory with the filename “*daily-total-female-births.csv*“.

The first few lines of the dataset look as follows:

"Date","Births" "1959-01-01",35 "1959-01-02",32 "1959-01-03",30 "1959-01-04",31 "1959-01-05",44 ...

First, let’s load and plot the dataset.

The complete example is listed below.

# load and plot the time series dataset from pandas import read_csv from matplotlib import pyplot # load dataset series = read_csv('daily-total-female-births.csv', header=0, index_col=0) values = series.values # plot dataset pyplot.plot(values) pyplot.show()

Running the example creates a line plot of the dataset.

We can see there is no obvious trend or seasonality.

A persistence model can achieve a MAE of about 6.7 births when predicting the last 12 months. This provides a baseline in performance above which a model may be considered skillful.

Next, we can evaluate the XGBoost model on the dataset when making one-step forecasts for the last 12 months of data.

We will use only the previous 6 time steps as input to the model and default model hyperparameters, except we will change the loss to ‘*reg:squarederror*‘ (to avoid a warning message) and use a 1,000 trees in the ensemble (to avoid underlearning).

The complete example is listed below.

# forecast monthly births with xgboost from numpy import asarray from pandas import read_csv from pandas import DataFrame from pandas import concat from sklearn.metrics import mean_absolute_error from xgboost import XGBRegressor from matplotlib import pyplot # transform a time series dataset into a supervised learning dataset def series_to_supervised(data, n_in=1, n_out=1, dropnan=True): n_vars = 1 if type(data) is list else data.shape[1] df = DataFrame(data) cols = list() # input sequence (t-n, ... t-1) for i in range(n_in, 0, -1): cols.append(df.shift(i)) # forecast sequence (t, t+1, ... t+n) for i in range(0, n_out): cols.append(df.shift(-i)) # put it all together agg = concat(cols, axis=1) # drop rows with NaN values if dropnan: agg.dropna(inplace=True) return agg.values # split a univariate dataset into train/test sets def train_test_split(data, n_test): return data[:-n_test, :], data[-n_test:, :] # fit an xgboost model and make a one step prediction def xgboost_forecast(train, testX): # transform list into array train = asarray(train) # split into input and output columns trainX, trainy = train[:, :-1], train[:, -1] # fit model model = XGBRegressor(objective='reg:squarederror', n_estimators=1000) model.fit(trainX, trainy) # make a one-step prediction yhat = model.predict(asarray([testX])) return yhat[0] # walk-forward validation for univariate data def walk_forward_validation(data, n_test): predictions = list() # split dataset train, test = train_test_split(data, n_test) # seed history with training dataset history = [x for x in train] # step over each time-step in the test set for i in range(len(test)): # split test row into input and output columns testX, testy = test[i, :-1], test[i, -1] # fit model on history and make a prediction yhat = xgboost_forecast(history, testX) # store forecast in list of predictions predictions.append(yhat) # add actual observation to history for the next loop history.append(test[i]) # summarize progress print('>expected=%.1f, predicted=%.1f' % (testy, yhat)) # estimate prediction error error = mean_absolute_error(test[:, -1], predictions) return error, test[:, -1], predictions # load the dataset series = read_csv('daily-total-female-births.csv', header=0, index_col=0) values = series.values # transform the time series data into supervised learning data = series_to_supervised(values, n_in=6) # evaluate mae, y, yhat = walk_forward_validation(data, 12) print('MAE: %.3f' % mae) # plot expected vs preducted pyplot.plot(y, label='Expected') pyplot.plot(yhat, label='Predicted') pyplot.legend() pyplot.show()

Running the example reports the expected and predicted values for each step in the test set, then the MAE for all predicted values.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the model performs better than a persistence model, achieving a MAE of about 5.9 births, compared to 6.7 births.

**Can you do better?**

You can test different XGBoost hyperparameters and numbers of time steps as input to see if you can achieve better performance. Share your results in the comments below.

>expected=42.0, predicted=44.5 >expected=53.0, predicted=42.5 >expected=39.0, predicted=40.3 >expected=40.0, predicted=32.5 >expected=38.0, predicted=41.1 >expected=44.0, predicted=45.3 >expected=34.0, predicted=40.2 >expected=37.0, predicted=35.0 >expected=52.0, predicted=32.5 >expected=48.0, predicted=41.4 >expected=55.0, predicted=46.6 >expected=50.0, predicted=47.2 MAE: 5.957

A line plot is created comparing the series of expected values and predicted values for the last 12 months of the dataset.

This gives a geometric interpretation of how well the model performed on the test set.

Once a final XGBoost model configuration is chosen, a model can be finalized and used to make a prediction on new data.

This is called an **out-of-sample forecast**, e.g. predicting beyond the training dataset. This is identical to making a prediction during the evaluation of the model: as we always want to evaluate a model using the same procedure that we expect to use when the model is used to make prediction on new data.

The example below demonstrates fitting a final XGBoost model on all available data and making a one-step prediction beyond the end of the dataset.

# finalize model and make a prediction for monthly births with xgboost from numpy import asarray from pandas import read_csv from pandas import DataFrame from pandas import concat from xgboost import XGBRegressor # transform a time series dataset into a supervised learning dataset def series_to_supervised(data, n_in=1, n_out=1, dropnan=True): n_vars = 1 if type(data) is list else data.shape[1] df = DataFrame(data) cols = list() # input sequence (t-n, ... t-1) for i in range(n_in, 0, -1): cols.append(df.shift(i)) # forecast sequence (t, t+1, ... t+n) for i in range(0, n_out): cols.append(df.shift(-i)) # put it all together agg = concat(cols, axis=1) # drop rows with NaN values if dropnan: agg.dropna(inplace=True) return agg.values # load the dataset series = read_csv('daily-total-female-births.csv', header=0, index_col=0) values = series.values # transform the time series data into supervised learning train = series_to_supervised(values, n_in=6) # split into input and output columns trainX, trainy = train[:, :-1], train[:, -1] # fit model model = XGBRegressor(objective='reg:squarederror', n_estimators=1000) model.fit(trainX, trainy) # construct an input for a new preduction row = values[-6:].flatten() # make a one-step prediction yhat = model.predict(asarray([row])) print('Input: %s, Predicted: %.3f' % (row, yhat[0]))

Running the example fits an XGBoost model on all available data.

A new row of input is prepared using the last 6 months of known data and the next month beyond the end of the dataset is predicted.

Input: [34 37 52 48 55 50], Predicted: 42.708

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Time Series Forecasting as Supervised Learning
- How to Convert a Time Series to a Supervised Learning Problem in Python
- How To Backtest Machine Learning Models for Time Series Forecasting

In this tutorial, you discovered how to develop an XGBoost model for time series forecasting.

Specifically, you learned:

- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use XGBoost for Time Series Forecasting appeared first on Machine Learning Mastery.

]]>The post Comparing 13 Algorithms on 165 Datasets (hint: use Gradient Boosting) appeared first on Machine Learning Mastery.

]]>It is a central question in applied machine learning.

In a recent paper by Randal Olson and others, they attempt to answer it and give you a guide for algorithms and parameters to try on your problem first, before spot checking a broader suite of algorithms.

In this post, you will discover a study and findings from evaluating many machine learning algorithms across a large number of machine learning datasets and the recommendations made from this study.

After reading this post, you will know:

- That ensemble tree algorithms perform well across a wide range of datasets.
- That it is critical to test a suite of algorithms on a problem as there is no silver bullet algorithm.
- That it is critical to test a suite of configurations for a given algorithm as it can result in as much as a 50% improvement on some problems.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

In 2017, Randal Olson, et al. released a pre-print for a paper with the intriguing title “Data-driven Advice for Applying Machine Learning to Bioinformatics Problems“.

The goal of their work was to address the question that every practitioner faces when getting started on their predictive modeling problem; namely:

**What algorithm should I use?**

The authors describe this problem as choice overload, as follows:

Although having several readily-available ML algorithm implementations is advantageous to bioinformatics researchers seeking to move beyond simple statistics, many researchers experience “choice overload” and find difficulty in selecting the right ML algorithm for their problem at hand.

They approach the problem by running a decent sample of algorithms across a large sample of standard machine learning datasets to see what algorithms and parameters work best in general.

They describe their paper as:

… a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers

It is very much similar to the paper “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” covered in the post “Use Random Forest: Testing 179 Classifiers on 121 Datasets” both in methodology and findings.

A total of 13 different algorithms were chosen for the study.

Algorithms were chosen to provide a mix of types or underlying assumptions.

The goal was to represent the most common classes of algorithms used in the literature, as well as recent state-of-the-art algorithms

The complete list of algorithms is provided below.

- Gaussian Naive Bayes (GNB)
- Bernoulli Naive Bayes (BNB)
- Multinomial Naive Bayes (MNB)
- Logistic Regression (LR)
- Stochastic Gradient Descent (SGD)
- Passive Aggressive Classifier (PAC)
- Support Vector Classifier (SVC)
- K-Nearest Neighbor (KNN)
- Decision Tree (DT)
- Random Forest (RF)
- Extra Trees Classifier (ERF)
- AdaBoost (AB)
- Gradient Tree Boosting (GTB)

The scikit-learn library was used for the implementations of these algorithms.

Each algorithm has zero or more parameters, and a grid search across sensible parameter values was performed for each algorithm.

For each algorithm, the hyperparameters were tuned using a fixed grid search.

A table of algorithms and the hyperparameters evaluated is listed below, taken from the paper.

Algorithms were evaluated using 10-fold cross-validation and the balanced accuracy measure.

Cross-validation was not repeated, perhaps introducing some statistical noise into the results.

A selection of 165 standard machine learning problems were selected for the study.

Many of the problems were drawn from the field of bioinformatics, although not all datasets belong to this field of study.

All prediction problems were classification type problems with two or more classes.

The algorithms were compared on 165 supervised classification datasets from the Penn Machine Learning Benchmark (PMLB). […] PMLB is a collection of publicly available classification problems that have been standardized to the same format and collected in a central location with easy access via Python.

The datasets were drawn from the Penn Machine Learning Benchmark (PMLB) collection, which is a project that provides standard machine learning datasets in a uniform format and made available by a simple Python API. You can learn more about this dataset catalog on the GitHub Project:

All datasets were standardized prior to fitting models.

Prior to evaluating each ML algorithm, we scaled the features of every dataset by subtracting the mean and scaling the features to unit variance.

Other data preparation was not performed, nor feature selection or feature engineering.

The large number of experiments performed resulted in a lot of skill scores to analyze.

The analysis of the results was handled well, asking interesting questions and providing findings in the form of easy-to-understand charts.

The entire experimental design consisted of over 5.5 million ML algorithm and parameter evaluations in total, resulting in a rich set of data that is analyzed from several viewpoints…

Algorithm performance was ranked for each dataset, then the average rank of each algorithm was calculated.

This provided a rough and easy to understand idea of which algorithms performed well or not, on average.

The results showed that both Gradient boosting and random forest had the lowest rank (performed best) and that the Naive Bayes approaches had the highest rank (performed worst) on average.

The post-hoc test underlines the impressive performance of Gradient Tree Boosting, which significantly outperforms every algorithm except Random Forest at the p < 0.01 level.

This is demonstrated with a nice chart, taken from the paper.

No single algorithm performs best or worst.

This is wisdom known to machine learning practitioners, but difficult to grasp for beginners in the field.

There is no silver bullet and you must test a suite of algorithms on a given dataset to see what works best.

… it is worth noting that no one ML algorithm performs best across all 165 datasets. For example, there are 9 datasets for which Multinomial NB performs as well as or better than Gradient Tree Boosting, despite being the overall worst- and best-ranked algorithms, respectively. Therefore, it is still important to consider different ML algorithms when applying ML to new datasets.

Further, picking the right algorithm is not enough. You must also pick the right configuration of the algorithm for your dataset.

… both selecting the right ML algorithm and tuning its parameters is vitally important for most problems.

The results found that tuning an algorithm lifted skill of a method anywhere from 3% to 50%, depending on the algorithm and the dataset.

The results demonstrate why it is unwise to use default ML algorithm hyperparameters: tuning often improves an algorithm’s accuracy by 3-5%, depending on the algorithm. In some cases, parameter tuning led to CV accuracy improvements of 50%.

This is demonstrated with a chart from the paper that shows the spread of improvement offered by parameter tuning on each algorithm.

Not all algorithms are required.

The results found that five algorithms and specific parameters achieved top 1% in performance across 106 of the 165 tested datasets.

These five algorithms are recommended as a starting point for spot checking algorithms on a given dataset in bioinformatics, but I would suggest also more generally:

- Gradient Boosting
- Random Forest
- Support Vector Classifier
- Extra Trees
- Logistic Regression

The paper provides a table of these algorithms, including the recommend parameter settings and the number of datasets covered, e.g. where the algorithm and configuration achieved top 1% performance.

There are two big findings from this paper that are valuable for practitioners, especially those starting out or those who are under pressure to get a result on their own predictive modeling problem.

If in doubt or under time pressure, use ensemble tree algorithms such as gradient boosting and random forest on your dataset.

The analysis demonstrates the strength of state-of-the-art, tree-based ensemble algorithms, while also showing the problem-dependent nature of ML algorithm performance.

No one can look at your problem and tell you what algorithm to use, and there is no silver bullet algorithm.

You must test a suite of algorithms and a suite of parameters for each algorithm to see what works best for your specific problem.

In addition, the analysis shows that selecting the right ML algorithm and thoroughly tuning its parameters can lead to a significant improvement in predictive accuracy on most problems, and is a critical step in every ML application.

I talk about this all the time; for example, see the post:

This section provides more resources on the topic if you are looking to go deeper.

- Data-driven Advice for Applying Machine Learning to Bioinformatics Problems
- scikit-learn benchmarks on GitHub
- Penn Machine Learning Benchmarks
- Quantitative comparison of scikit-learn’s predictive models on a large number of machine learning datasets: A good start
- Use Random Forest: Testing 179 Classifiers on 121 Datasets

In this post, you discovered a study and findings from evaluating many machine learning algorithms across a large number of machine learning datasets.

Specifically, you learned:

- That ensemble tree algorithms perform well across a wide range of datasets.
- That it is critical to test a suite of algorithms on a problem as there is no silver bullet algorithm.
- That it is critical to test a suite of configurations for a given algorithm as it can result in as much as a 50% improvement on some problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Comparing 13 Algorithms on 165 Datasets (hint: use Gradient Boosting) appeared first on Machine Learning Mastery.

]]>The post How to Install XGBoost for Python on macOS appeared first on Machine Learning Mastery.

]]>It is a library at the center of many winning solutions in Kaggle data science competitions.

In this tutorial, you will discover how to install the XGBoost library for Python on macOS.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Install MacPorts
- Build XGBoost
- Install XGBoost

**Note**: I have used this procedure for years on a range of different macOS versions and it has not changed. This tutorial was written and tested on macOS High Sierra (10.13.1).

You need GCC and a Python environment installed in order to build and install XGBoost for Python.

I recommend GCC 7 and Python 3.6 and I recommend installing these prerequisites using MacPorts.

- 1. For help installing MacPorts and a Python environment step-by-step, see this tutorial:

>> How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning

- 2. After MacPorts and a working Python environment are installed, you can install and select GCC 7 as follows:

sudo port install gcc7 sudo port select --set gcc mp-gcc7

- 3. Confirm your GCC installation was successful as follows:

gcc -v

You should see the version of GCC printed; for example:

.. gcc version 7.2.0 (MacPorts gcc7 7.2.0_0)

What version did you see?

Let me know in the comments below.

The next step is to download and compile XGBoost for your system.

- 1. First, check out the code repository from GitHub:

git clone --recursive https://github.com/dmlc/xgboost

- 2. Change into the xgboost directory.

cd xgboost/

- 3. Copy the configuration we intend to use to compile XGBoost into position.

cp make/config.mk ./config.mk

- 4. Compile XGBoost; this requires that you specify the number of cores on your system (e.g. 8, change as needed).

make -j8

The build process may take a minute and should not produce any error messages, although you may see some warnings that you can safely ignore.

For example, the last snippet of the compilation might look as follows:

... a - build/learner.o a - build/logging.o a - build/c_api/c_api.o a - build/c_api/c_api_error.o a - build/common/common.o a - build/common/hist_util.o a - build/data/data.o a - build/data/simple_csr_source.o a - build/data/simple_dmatrix.o a - build/data/sparse_page_dmatrix.o a - build/data/sparse_page_raw_format.o a - build/data/sparse_page_source.o a - build/data/sparse_page_writer.o a - build/gbm/gblinear.o a - build/gbm/gbm.o a - build/gbm/gbtree.o a - build/metric/elementwise_metric.o a - build/metric/metric.o a - build/metric/multiclass_metric.o a - build/metric/rank_metric.o a - build/objective/multiclass_obj.o a - build/objective/objective.o a - build/objective/rank_obj.o a - build/objective/regression_obj.o a - build/predictor/cpu_predictor.o a - build/predictor/predictor.o a - build/tree/tree_model.o a - build/tree/tree_updater.o a - build/tree/updater_colmaker.o a - build/tree/updater_fast_hist.o a - build/tree/updater_histmaker.o a - build/tree/updater_prune.o a - build/tree/updater_refresh.o a - build/tree/updater_skmaker.o a - build/tree/updater_sync.o c++ -std=c++11 -Wall -Wno-unknown-pragmas -Iinclude -Idmlc-core/include -Irabit/include -I/include -O3 -funroll-loops -msse2 -fPIC -fopenmp -o xgboost build/cli_main.o build/learner.o build/logging.o build/c_api/c_api.o build/c_api/c_api_error.o build/common/common.o build/common/hist_util.o build/data/data.o build/data/simple_csr_source.o build/data/simple_dmatrix.o build/data/sparse_page_dmatrix.o build/data/sparse_page_raw_format.o build/data/sparse_page_source.o build/data/sparse_page_writer.o build/gbm/gblinear.o build/gbm/gbm.o build/gbm/gbtree.o build/metric/elementwise_metric.o build/metric/metric.o build/metric/multiclass_metric.o build/metric/rank_metric.o build/objective/multiclass_obj.o build/objective/objective.o build/objective/rank_obj.o build/objective/regression_obj.o build/predictor/cpu_predictor.o build/predictor/predictor.o build/tree/tree_model.o build/tree/tree_updater.o build/tree/updater_colmaker.o build/tree/updater_fast_hist.o build/tree/updater_histmaker.o build/tree/updater_prune.o build/tree/updater_refresh.o build/tree/updater_skmaker.o build/tree/updater_sync.o dmlc-core/libdmlc.a rabit/lib/librabit.a -pthread -lm -fopenmp

Did this step work for you?

Let me know in the comments below.

You are now ready to install XGBoost on your system.

- 1. Change directory into the Python package of the xgboost project.

cd python-package

- 2. Install the Python XGBoost package.

sudo python setup.py install

The installation is very fast.

For example, at the end of the installation, you may see messages like the following:

... Installed /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/xgboost-0.6-py3.6.egg Processing dependencies for xgboost==0.6 Searching for scipy==1.0.0 Best match: scipy 1.0.0 Adding scipy 1.0.0 to easy-install.pth file Using /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages Searching for numpy==1.13.3 Best match: numpy 1.13.3 Adding numpy 1.13.3 to easy-install.pth file Using /opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages Finished processing dependencies for xgboost==0.6

- 3. Confirm that the installation was successful by printing the xgboost version, which requires the library to be loaded.

Save the following code to a file called *version.py.*

import xgboost print("xgboost", xgboost.__version__)

Run the script from the command line:

python version.py

You should see the XGBoost version printed to screen:

xgboost 0.6

How did you do?

Post your results in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning
- MacPorts Installation Guide
- XGBoost Installation Guide

In this tutorial, you discovered how to install XGBoost for Python on macOS step-by-step.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Install XGBoost for Python on macOS appeared first on Machine Learning Mastery.

]]>The post 7 Step Mini-Course to Get Started with XGBoost in Python appeared first on Machine Learning Mastery.

]]>XGBoost is an implementation of gradient boosting that is being used to win machine learning competitions.

It is powerful but it can be hard to get started.

In this post, you will discover a 7-part crash course on XGBoost with Python.

This mini-course is designed for Python machine learning practitioners that are already comfortable with scikit-learn and the SciPy ecosystem.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.**Update Mar/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

(**Tip**: *you might want to print or bookmark this page so that you can refer back to it later*.)

Before we get started, let’s make sure you are in the right place. The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly, you might just need to brush up in one area or another to keep up.

**Developers that know how to write a little code**. This means that it is not a big deal for you to get things done with Python and know how to setup the SciPy ecosystem on your workstation (a prerequisite). It does not mean your a wizard coder, but it does mean you’re not afraid to install packages and write scripts.**Developers that know a little machine learning**. This means you know about the basics of machine learning like cross validation, some algorithms and the bias-variance trade-off. It does not mean that you are a machine learning PhD, just that you know the landmarks or know where to look them up.

This mini-course is not a textbook on XGBoost. There will be no equations.

It will take you from a developer that knows a little machine learning in Python to a developer who can get results and bring the power of XGBoost to your own projects.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

This mini-course is divided into 7 parts.

Each lesson was designed to take the average developer about 30 minutes. You might finish some much sooner and others you may choose to go deeper and spend more time.

You can complete each part as quickly or as slowly as you like. A comfortable schedule may be to complete one lesson per day over a one week period. Highly recommended.

The topics you will cover over the next 7 lessons are as follows:

**Lesson 01**: Introduction to Gradient Boosting.**Lesson 02**: Introduction to XGBoost.**Lesson 03**: Develop Your First XGBoost Model.**Lesson 04**: Monitor Performance and Early Stopping.**Lesson 05**: Feature Importance with XGBoost.**Lesson 06**: How to Configure Gradient Boosting.**Lesson 07**: XGBoost Hyperparameter Tuning.

This is going to be a lot of fun.

You’re going to have to do some work though, a little reading, a little research and a little programming. You want to learn about XGBoost right?

(**Tip**: *Help for with these lessons can be found on this blog, use the search feature*.)

Any questions at all, please post in the comments below.

Share your results in the comments.

Hang in there, don’t give up!

Gradient boosting is one of the most powerful techniques for building predictive models.

The idea of boosting came out of the idea of whether a weak learner can be modified to become better. The first realization of boosting that saw great success in application was Adaptive Boosting or AdaBoost for short. The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.

AdaBoost and related algorithms were recast in a statistical framework and became known as Gradient Boosting Machines. The statistical framework cast boosting as a numerical optimization problem where the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure, hence the name.

The Gradient Boosting algorithm involves three elements:

**A loss function to be optimized**, such as cross entropy for classification or mean squared error for regression problems.**A weak learner to make predictions**, such as a greedily constructed decision tree.**An additive model,**used to add weak learners to minimize the loss function.

New weak learners are added to the model in an effort to correct the residual errors of all previous trees. The result is a powerful predictive modeling algorithm, perhaps more powerful than random forest.

In the next lesson we will take a closer look at the XGBoost implementation of gradient boosting.

XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.

XGBoost stands for e**X**treme **G**radient **Boosti**ng.

It was developed by Tianqi Chen and is laser focused on computational speed and model performance, as such there are few frills.

In addition to supporting all key variations of the technique, the real interest is the speed provided by the careful engineering of the implementation, including:

**Parallelization**of tree construction using all of your CPU cores during training.**Distributed Computing**for training very large models using a cluster of machines.**Out-of-Core Computing**for very large datasets that don’t fit into memory.**Cache Optimization**of data structures and algorithms to make best use of hardware.

Traditionally, gradient boosting implementations are slow because of the sequential nature in which each tree must be constructed and added to the model.

The on performance in the development of XGBoost has resulted in one of the best predictive modeling algorithms that can now harness the full capability of your hardware platform, or very large computers you might rent in the cloud.

As such, XGBoost has been a cornerstone in competitive machine learning, being the technique used to win and recommended by winners. For example, here is what some recent Kaggle competition winners have said:

As the winner of an increasing amount of Kaggle competitions, XGBoost showed us again to be a great all-round algorithm worth having in your toolbox.

When in doubt, use xgboost.

In the next lesson, we will develop our first XGBoost model in Python.

Assuming you have a working SciPy environment, XGBoost can be installed easily using pip.

For example:

sudo pip install xgboost

You can learn more about installing and building XGBoost on your platform in the XGBoost Installation Instructions.

XGBoost models can be used directly in the scikit-learn framework using the wrapper classes, **XGBClassifier** for classification and **XGBRegressor** for regression problems.

This is the recommended way to use XGBoost in Python.

Download the Pima Indians onset of diabetes dataset.

It is a good test dataset for binary classification as all input variables are numeric, meaning the problem can be modeled directly with no data preparation.

We can train an XGBoost model for classification by constructing it and calling the **model.fit()** function:

model = XGBClassifier() model.fit(X_train, y_train)

This model can then be used to make predictions by calling the **model.predict()** function on new data.

y_pred = model.predict(X_test)

We can tie this all together as follows:

# First XGBoost model for Pima Indians dataset from numpy import loadtxt from xgboost import XGBClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score # load data dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",") # split data into X and y X = dataset[:,0:8] Y = dataset[:,8] # split data into train and test sets seed = 7 test_size = 0.33 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed) # fit model on training data model = XGBClassifier() model.fit(X_train, y_train) # make predictions for test data y_pred = model.predict(X_test) predictions = [round(value) for value in y_pred] # evaluate predictions accuracy = accuracy_score(y_test, predictions) print("Accuracy: %.2f%%" % (accuracy * 100.0))

In the next lesson we will look at how we can use early stopping to limit overfitting.

The XGBoost model can evaluate and report on the performance on a test set for the model during training.

It supports this capability by specifying both a test dataset and an evaluation metric on the call to **model.fit()** when training the model and specifying verbose output (**verbose=True**).

For example, we can report on the binary classification error rate (**error**) on a standalone test set (**eval_set**) while training an XGBoost model as follows:

eval_set = [(X_test, y_test)] model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)

Running a model with this configuration will report the performance of the model after each tree is added. For example:

... [89] validation_0-error:0.204724 [90] validation_0-error:0.208661

We can use this evaluation to stop training once no further improvements have been made to the model.

We can do this by setting the **early_stopping_rounds** parameter when calling **model.fit()** to the number of iterations that no improvement is seen on the validation dataset before training is stopped.

The full example using the Pima Indians Onset of Diabetes dataset is provided below.

# exmaple of early stopping from numpy import loadtxt from xgboost import XGBClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score # load data dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",") # split data into X and y X = dataset[:,0:8] Y = dataset[:,8] # split data into train and test sets seed = 7 test_size = 0.33 X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed) # fit model on training data model = XGBClassifier() eval_set = [(X_test, y_test)] model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True) # make predictions for test data y_pred = model.predict(X_test) predictions = [round(value) for value in y_pred] # evaluate predictions accuracy = accuracy_score(y_test, predictions) print("Accuracy: %.2f%%" % (accuracy * 100.0))

In the next lesson, we will look at how we calculate the importance of features using XGBoost

A benefit of using ensembles of decision tree methods like gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the **feature_importances_** member variable of the trained model. For example, they can be printed directly as follows:

print(model.feature_importances_)

The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called **plot_importance()** and can be used as follows:

plot_importance(model) pyplot.show()

These importance scores can help you decide what input variables to keep or discard. They can also be used as the basis for automatic feature selection techniques.

The full example of plotting feature importance scores using the Pima Indians Onset of Diabetes dataset is provided below.

# plot feature importance using built-in function from numpy import loadtxt from xgboost import XGBClassifier from xgboost import plot_importance from matplotlib import pyplot # load data dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",") # split data into X and y X = dataset[:,0:8] y = dataset[:,8] # fit model on training data model = XGBClassifier() model.fit(X, y) # plot feature importance plot_importance(model) pyplot.show()

In the next lesson we will look at heuristics for best configuring the gradient boosting algorithm.

Gradient boosting is one of the most powerful techniques for applied machine learning and as such is quickly becoming one of the most popular.

But how do you configure gradient boosting on your problem?

A number of configuration heuristics were published in the original gradient boosting papers. They can be summarized as:

- Learning rate or shrinkage (
**learning_rate**in XGBoost) should be set to 0.1 or lower, and smaller values will require the addition of more trees. - The depth of trees (
**max_depth**in XGBoost) should be configured in the range of 2-to-8, where not much benefit is seen with deeper trees. - Row sampling (
**subsample**in XGBoost) should be configured in the range of 30% to 80% of the training dataset, and compared to a value of 100% for no sampling.

These are a good starting points when configuring your model.

A good general configuration strategy is as follows:

- Run the default configuration and review plots of the learning curves on the training and validation datasets.
- If the system is overlearning, decrease the learning rate and/or increase the number of trees.
- If the system is underlearning, speed the learning up to be more aggressive by increasing the learning rate and/or decreasing the number of trees.

Owen Zhang, the former #1 ranked competitor on Kaggle and now CTO at Data Robot proposes an interesting strategy to configure XGBoost.

He suggests to set the number of trees to a target value such as 100 or 1000, then tune the learning rate to find the best model. This is an efficient strategy for quickly finding a good model.

In the next and final lesson, we will look at an example of tuning the XGBoost hyperparameters.

The scikit-learn framework provides the capability to search combinations of parameters.

This capability is provided in the **GridSearchCV** class and can be used to discover the best way to configure the model for top performance on your problem.

For example, we can define a grid of the number of trees (**n_estimators**) and tree sizes (**max_depth**) to evaluate by defining a grid as:

n_estimators = [50, 100, 150, 200] max_depth = [2, 4, 6, 8] param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

And then evaluate each combination of parameters using 10-fold cross validation as:

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1) result = grid_search.fit(X, label_encoded_y)

We can then review the results to determine the best combination and the general trends in varying the combinations of parameters.

This is the best practice when applying XGBoost to your own problems. The parameters to consider tuning are:

- The number and size of trees (
**n_estimators**and**max_depth**). - The learning rate and number of trees (
**learning_rate**and**n_estimators**). - The row and column subsampling rates (
**subsample**,**colsample_bytree**and**colsample_bylevel**).

Below is a full example of tuning just the **learning_rate** on the Pima Indians Onset of Diabetes dataset.

# Tune learning_rate from numpy import loadtxt from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold # load data dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",") # split data into X and y X = dataset[:,0:8] Y = dataset[:,8] # grid search model = XGBClassifier() learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3] param_grid = dict(learning_rate=learning_rate) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, Y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param))

Congratulations, you made it. Well done!

Take a moment and look back at how far you have come:

- You learned about the gradient boosting algorithm and the XGBoost library.
- You developed your first XGBoost model.
- You learned how to use advanced features like early stopping and feature importance.
- You learned how to configure gradient boosted models and how to design controlled experiments to tune XGBoost hyperparameters.

Don’t make light of this, you have come a long way in a short amount of time. This is just the beginning of your journey with XGBoost in Python. Keep practicing and developing your skills.

Did you enjoy this mini-course? Do you have any questions or sticking points?

Leave a comment and let me know.

The post 7 Step Mini-Course to Get Started with XGBoost in Python appeared first on Machine Learning Mastery.

]]>The post Stochastic Gradient Boosting with XGBoost and scikit-learn in Python appeared first on Machine Learning Mastery.

]]>Subsets of the the rows in the training data can be taken to train individual trees called bagging. When subsets of rows of the training data are also taken when calculating each split point, this is called random forest.

These techniques can also be used in the gradient tree boosting model in a technique called stochastic gradient boosting.

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python.

After reading this post you will know:

- The rationale behind training trees on subsamples of data and how this can be used in gradient boosting.
- How to tune row-based subsampling in XGBoost using scikit-learn.
- How to tune column-based subsampling by both tree and split-point in XGBoost.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

Gradient boosting is a greedy procedure.

New decision trees are added to the model to correct the residual error of the existing model.

Each decision tree is created using a greedy search procedure to select split points that best minimize an objective function. This can result in trees that use the same attributes and even the same split points again and again.

Bagging is a technique where a collection of decision trees are created, each from a different random subset of rows from the training data. The effect is that better performance is achieved from the ensemble of trees because the randomness in the sample allows slightly different trees to be created, adding variance to the ensembled predictions.

Random forest takes this one step further, by allowing the features (columns) to be subsampled when choosing split points, adding further variance to the ensemble of trees.

These same techniques can be used in the construction of decision trees in gradient boosting in a variation called stochastic gradient boosting.

It is common to use aggressive sub-samples of the training data such as 40% to 80%.

In this tutorial we are going to look at the effect of different subsampling techniques in gradient boosting.

We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically:

- Subsampling of rows in the dataset when creating each tree.
- Subsampling of columns in the dataset when creating each tree.
- Subsampling of columns for each split in the dataset when creating each tree.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 10 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

Row subsampling involves selecting a random sample of the training dataset without replacement.

Row subsampling can be specified in the scikit-learn wrapper of the XGBoost class in the **subsample** parameter. The default is 1.0 which is no sub-sampling.

We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

There are 9 variations of subsample and each model will be evaluated using 10-fold cross validation, meaning that 9×10 or 90 models need to be trained and tested.

The complete code listing is provided below.

# XGBoost on Otto dataset, tune subsample from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0] param_grid = dict(subsample=subsample) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(subsample, means, yerr=stds) pyplot.title("XGBoost subsample vs Log Loss") pyplot.xlabel('subsample') pyplot.ylabel('Log Loss') pyplot.savefig('subsample.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best results achieved were 0.3, or training trees using a 30% sample of the training dataset.

Best: -0.000647 using {'subsample': 0.3} -0.001156 (0.000286) with: {'subsample': 0.1} -0.000765 (0.000430) with: {'subsample': 0.2} -0.000647 (0.000471) with: {'subsample': 0.3} -0.000659 (0.000635) with: {'subsample': 0.4} -0.000717 (0.000849) with: {'subsample': 0.5} -0.000773 (0.000998) with: {'subsample': 0.6} -0.000877 (0.001179) with: {'subsample': 0.7} -0.001007 (0.001371) with: {'subsample': 0.8} -0.001239 (0.001730) with: {'subsample': 1.0}

We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.

We can see that indeed 30% has the best mean performance, but we can also see that as the ratio increased, the variance in performance grows quite markedly.

It is interesting to note that the mean performance of all **subsample** values outperforms the mean performance without subsampling (**subsample=1.0**).

We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model.

In the XGBoost wrapper for scikit-learn, this is controlled by the **colsample_bytree** parameter.

The default value is 1.0 meaning that all columns are used in each decision tree. We can evaluate values for **colsample_bytree** between 0.1 and 1.0 incrementing by 0.1.

[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

The full code listing is provided below.

# XGBoost on Otto dataset, tune colsample_bytree from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0] param_grid = dict(colsample_bytree=colsample_bytree) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(colsample_bytree, means, yerr=stds) pyplot.title("XGBoost colsample_bytree vs Log Loss") pyplot.xlabel('colsample_bytree') pyplot.ylabel('Log Loss') pyplot.savefig('colsample_bytree.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best performance for the model was **colsample_bytree=1.0**. This suggests that subsampling columns on this problem does not add value.

Best: -0.001239 using {'colsample_bytree': 1.0} -0.298955 (0.002177) with: {'colsample_bytree': 0.1} -0.092441 (0.000798) with: {'colsample_bytree': 0.2} -0.029993 (0.000459) with: {'colsample_bytree': 0.3} -0.010435 (0.000669) with: {'colsample_bytree': 0.4} -0.004176 (0.000916) with: {'colsample_bytree': 0.5} -0.002614 (0.001062) with: {'colsample_bytree': 0.6} -0.001694 (0.001221) with: {'colsample_bytree': 0.7} -0.001306 (0.001435) with: {'colsample_bytree': 0.8} -0.001239 (0.001730) with: {'colsample_bytree': 1.0}

Plotting the results, we can see the performance of the model plateau (at least at this scale) with values between 0.5 to 1.0.

Rather than subsample the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest.

We can set the size of the sample of columns used at each split in the **colsample_bylevel** parameter in the XGBoost wrapper classes for scikit-learn.

As before, we will vary the ratio from 10% to the default of 100%.

The full code listing is provided below.

# XGBoost on Otto dataset, tune colsample_bylevel from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0] param_grid = dict(colsample_bylevel=colsample_bylevel) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(colsample_bylevel, means, yerr=stds) pyplot.title("XGBoost colsample_bylevel vs Log Loss") pyplot.xlabel('colsample_bylevel') pyplot.ylabel('Log Loss') pyplot.savefig('colsample_bylevel.png')

Running this example prints the best configuration as well as the log loss for each tested configuration.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the best results were achieved by setting **colsample_bylevel** to 70%, resulting in an (inverted) log loss of -0.001062, which is better than -0.001239 seen when setting the per-tree column sampling to 100%.

This suggest to not give up on column subsampling if per-tree results suggest using 100% of columns, and to instead try per-split column subsampling.

Best: -0.001062 using {'colsample_bylevel': 0.7} -0.159455 (0.007028) with: {'colsample_bylevel': 0.1} -0.034391 (0.003533) with: {'colsample_bylevel': 0.2} -0.007619 (0.000451) with: {'colsample_bylevel': 0.3} -0.002982 (0.000726) with: {'colsample_bylevel': 0.4} -0.001410 (0.000946) with: {'colsample_bylevel': 0.5} -0.001182 (0.001144) with: {'colsample_bylevel': 0.6} -0.001062 (0.001221) with: {'colsample_bylevel': 0.7} -0.001071 (0.001427) with: {'colsample_bylevel': 0.8} -0.001239 (0.001730) with: {'colsample_bylevel': 1.0}

We can plot the performance of each **colsample_bylevel** variation. The results show relatively low variance and seemingly a plateau in performance after a value of 0.3 at this scale.

In this post you discovered stochastic gradient boosting with XGBoost in Python.

Specifically, you learned:

- About stochastic boosting and how you can subsample your training data to improve the generalization of your model
- How to tune row subsampling with XGBoost in Python and scikit-learn.
- How to tune column subsampling with XGBoost both per-tree and per-split.

Do you have any questions about stochastic gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer.

The post Stochastic Gradient Boosting with XGBoost and scikit-learn in Python appeared first on Machine Learning Mastery.

]]>The post Tune Learning Rate for Gradient Boosting with XGBoost in Python appeared first on Machine Learning Mastery.

]]>One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost documentation).

In this post you will discover the effect of the learning rate in gradient boosting and how to tune it on your machine learning problem using the XGBoost library in Python.

After reading this post you will know:

- The effect learning rate has on the gradient boosting model.
- How to tune learning rate on your machine learning on your problem.
- How to tune the trade-off between the number of boosted trees and learning rate on your problem.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

Gradient boosting involves creating and adding trees to the model sequentially.

New trees are created to correct the residual errors in the predictions from the existing sequence of trees.

The effect is that the model can quickly fit, then overfit the training dataset.

A technique to slow down the learning in the gradient boosting model is to apply a weighting factor for the corrections by new trees when added to the model.

This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool.

Naive gradient boosting is the same as gradient boosting with shrinkage where the shrinkage factor is set to 1.0. Setting values less than 1.0 has the effect of making less corrections for each tree added to the model. This in turn results in more trees that must be added to the model.

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Let’s investigate the effect of the learning rate on a standard machine learning dataset.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 10 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

When creating gradient boosting models with XGBoost using the scikit-learn wrapper, the **learning_rate** parameter can be set to control the weighting of new trees added to the model.

We can use the grid search capability in scikit-learn to evaluate the effect on logarithmic loss of training a gradient boosting model with different learning rate values.

We will hold the number of trees constant at the default of 100 and evaluate of suite of standard values for the learning rate on the Otto dataset.

learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]

There are 6 variations of learning rate to be tested and each variation will be evaluated using 10-fold cross validation, meaning that there is a total of 6×10 or 60 XGBoost models to be trained and evaluated.

The log loss for each learning rate will be printed as well as the value that resulted in the best performance.

# XGBoost on Otto dataset, Tune learning_rate from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3] param_grid = dict(learning_rate=learning_rate) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(learning_rate, means, yerr=stds) pyplot.title("XGBoost learning_rate vs Log Loss") pyplot.xlabel('learning_rate') pyplot.ylabel('Log Loss') pyplot.savefig('learning_rate.png')

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example prints the best result as well as the log loss for each of the evaluated learning rates.

Best: -0.001156 using {'learning_rate': 0.2} -2.155497 (0.000081) with: {'learning_rate': 0.0001} -1.841069 (0.000716) with: {'learning_rate': 0.001} -0.597299 (0.000822) with: {'learning_rate': 0.01} -0.001239 (0.001730) with: {'learning_rate': 0.1} -0.001156 (0.001684) with: {'learning_rate': 0.2} -0.001158 (0.001666) with: {'learning_rate': 0.3}

Interestingly, we can see that the best learning rate was 0.2.

This is a high learning rate and it suggest that perhaps the default number of trees of 100 is too low and needs to be increased.

We can also plot the effect of the learning rate of the (inverted) log loss scores, although the log10-like spread of chosen learning_rate values means that most are squashed down the left-hand side of the plot near zero.

Next, we will look at varying the number of trees whilst varying the learning rate.

Smaller learning rates generally require more trees to be added to the model.

We can explore this relationship by evaluating a grid of parameter pairs. The number of decision trees will be varied from 100 to 500 and the learning rate varied on a log10 scale from 0.0001 to 0.1.

n_estimators = [100, 200, 300, 400, 500] learning_rate = [0.0001, 0.001, 0.01, 0.1]

There are 5 variations of **n_estimators** and 4 variations of **learning_rate**. Each combination will be evaluated using 10-fold cross validation, so that is a total of 4x5x10 or 200 XGBoost models that must be trained and evaluated.

The expectation is that for a given learning rate, performance will improve and then plateau as the number of trees is increased. The full code listing is provided below.

# XGBoost on Otto dataset, Tune learning_rate and n_estimators from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot import numpy # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() n_estimators = [100, 200, 300, 400, 500] learning_rate = [0.0001, 0.001, 0.01, 0.1] param_grid = dict(learning_rate=learning_rate, n_estimators=n_estimators) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot results scores = numpy.array(means).reshape(len(learning_rate), len(n_estimators)) for i, value in enumerate(learning_rate): pyplot.plot(n_estimators, scores[i], label='learning_rate: ' + str(value)) pyplot.legend() pyplot.xlabel('n_estimators') pyplot.ylabel('Log Loss') pyplot.savefig('n_estimators_vs_learning_rate.png')

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example prints the best combination as well as the log loss for each evaluated pair.

Best: -0.001152 using {'n_estimators': 300, 'learning_rate': 0.1} -2.155497 (0.000081) with: {'n_estimators': 100, 'learning_rate': 0.0001} -2.115540 (0.000159) with: {'n_estimators': 200, 'learning_rate': 0.0001} -2.077211 (0.000233) with: {'n_estimators': 300, 'learning_rate': 0.0001} -2.040386 (0.000304) with: {'n_estimators': 400, 'learning_rate': 0.0001} -2.004955 (0.000373) with: {'n_estimators': 500, 'learning_rate': 0.0001} -1.841069 (0.000716) with: {'n_estimators': 100, 'learning_rate': 0.001} -1.572384 (0.000692) with: {'n_estimators': 200, 'learning_rate': 0.001} -1.364543 (0.000699) with: {'n_estimators': 300, 'learning_rate': 0.001} -1.196490 (0.000713) with: {'n_estimators': 400, 'learning_rate': 0.001} -1.056687 (0.000728) with: {'n_estimators': 500, 'learning_rate': 0.001} -0.597299 (0.000822) with: {'n_estimators': 100, 'learning_rate': 0.01} -0.214311 (0.000929) with: {'n_estimators': 200, 'learning_rate': 0.01} -0.080729 (0.000982) with: {'n_estimators': 300, 'learning_rate': 0.01} -0.030533 (0.000949) with: {'n_estimators': 400, 'learning_rate': 0.01} -0.011769 (0.001071) with: {'n_estimators': 500, 'learning_rate': 0.01} -0.001239 (0.001730) with: {'n_estimators': 100, 'learning_rate': 0.1} -0.001153 (0.001702) with: {'n_estimators': 200, 'learning_rate': 0.1} -0.001152 (0.001704) with: {'n_estimators': 300, 'learning_rate': 0.1} -0.001153 (0.001708) with: {'n_estimators': 400, 'learning_rate': 0.1} -0.001153 (0.001708) with: {'n_estimators': 500, 'learning_rate': 0.1}

We can see that the best result observed was a learning rate of 0.1 with 300 trees.

It is hard to pick out trends from the raw data and small negative log loss results. Below is a plot of each learning rate as a series showing log loss performance as the number of trees is varied.

We can see that the expected general trend holds, where the performance (inverted log loss) improves as the number of trees is increased.

Performance is generally poor for the smaller learning rates, suggesting that a much larger number of trees may be required. We may need to increase the number of trees to many thousands which may be quite computationally expensive.

The results for **learning_rate=0.1** are obscured due the large y-axis scale of the graph. We can extract the performance measure for just **learning_rate=0.1** and plot them directly.

# Plot performance for learning_rate=0.1 from matplotlib import pyplot n_estimators = [100, 200, 300, 400, 500] loss = [-0.001239, -0.001153, -0.001152, -0.001153, -0.001153] pyplot.plot(n_estimators, loss) pyplot.xlabel('n_estimators') pyplot.ylabel('Log Loss') pyplot.title('XGBoost learning_rate=0.1 n_estimators vs Log Loss') pyplot.show()

Running this code shows the increased performance as the number of trees are added, followed by a plateau in performance across 400 and 500 trees.

In this post you discovered the effect of weighting the addition of new trees to a gradient boosting model, called shrinkage or the learning rate.

Specifically, you learned:

- That adding a learning rate is intended to slow down the adaptation of the model to the training data.
- How to evaluate a range of learning rate values on your machine learning problem.
- How to evaluate the relationship of varying both the number of trees and the learning rate on your problem.

Do you have any questions regarding shrinkage in gradient boosting or about this post? Ask your questions in the comments and I will do my best to answer them.

The post Tune Learning Rate for Gradient Boosting with XGBoost in Python appeared first on Machine Learning Mastery.

]]>The post How to Train XGBoost Models in the Cloud with Amazon Web Services appeared first on Machine Learning Mastery.

]]>It is implemented to make best use of your computing resources, including all CPU cores and memory.

In this post you will discover how you can setup a server on Amazon’s cloud service to quickly and cheaply create very large models.

After reading this post you will know:

- How to setup and configure an Amazon EC2 server instance for use with XGBoost.
- How to confirm the parallel capabilities of XGBoost are working on your server.
- How to transfer data and code to your server and train a very large model.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update May/2020**: Updated instructions.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

The process is quite simple. Below is an overview of the steps we are going to complete in this tutorial.

- Setup Your AWS Account (if needed).
- Launch Your AWS Instance.
- Login and Run Your Code.
- Train an XGBoost Model.
- Close Your AWS Instance.

**Note, it costs money to use a virtual server instance on Amazon**. The cost is very low for ad hoc model development (e.g. less than one US dollar per hour), which is why this is so attractive, but it is not free.

The server instance runs Linux. It is desirable although not required that you know how to navigate Linux or a Unix-like environment. We’re just running our Python scripts, so no advanced skills are needed.

You need an account on Amazon Web Services.

- 1. You can create an account using the Amazon Web Services portal by clicking “Sign in to the Console”. From there you can sign in using an existing Amazon account or create a new account.

- 2. If creating an account, you will need to provide your details as well as a valid credit card that Amazon can charge. The process is a lot quicker if you are already an Amazon customer and have your credit card on file.

**Note**: If you have created a new account, you may have to request to Amazon support in order to be approved to use larger (non-free) server instance in the rest of this tutorial.

Now that you have an AWS account, you want to launch an EC2 virtual server instance on which you can run XGBoost.

Launching an instance is as easy as selecting the image to load and starting the virtual server.

We will use an existing Fedora Linux image and install Python and XGBoost manually.

- 1. Login to your AWS console if you have not already.

- 2. Click on EC2 for launching a new virtual server.
- 3. Select “N. California” from the drop-down in the top right hand corner. This is important otherwise you may not be able to find the image (called an AMI) that we plan to use.

- 4. Click the “Launch Instance” button.
- 5. Click “Community AMIs”. An AMI is an Amazon Machine Image. It is a frozen instance of a server that you can select and instantiate on a new virtual server.

- 6. Enter AMI: “
**Fedora-Cloud-Base-24**” in the “Search community AMIs” search box and press enter. You should be presented with a single result.

This is an image for a base installation of Fedora Linux version 24. A very easy to use Linux distribution.

- 7. Click “Select” to choose the AMI in the search result.
- 8. Now you need to select the hardware on which to run the image. Scroll down and select the “c3.8xlarge” hardware.

This is a large instance that includes 32 CPU cores, 60 Gigabytes of RAM and a 2 large SSD disks.

- 9. Click “Review and Launch” to finalize the configuration of your server instance.

You will see a warning like “Your instance configuration is not eligible for the free usage tier”. This is just indicating that you will be charged for your time on this server. We know this, ignore this warning.

- 10. Click the “Launch” button.
- 11. Select your SSH key pair.
- If you have a key pair because you have used EC2 before, select “Choose an existing key pair” and choose your key pair from the list. Then check “I acknowledge…”.
- If you do not have a key pair, select the option “Create a new key pair” and enter a “Key pair name” such as “xgboost-keypair”. Click the “Download Key Pair” button.

- 12. Open a Terminal and change directory to where you downloaded your key pair.
- 13. If you have not already done so, restrict the access permissions on your key pair file. This is required as part of the SSH access to your server. For example, on your console you can type:

cd Downloads chmod 600 xgboost-keypair.pem

- 14. Click “Launch Instances.

**Note**: If this is your first time using AWS, Amazon may have to validate your request and this could take up to 2 hours (often just a few minutes).

- 15. Click “View Instances” to review the status of your instance.

Your server is now running and ready for you to log in.

Now that you have launched your server instance, it is time to login and configure it for use.

You will need to configure your server each time you launch it. Therefore, it is a good idea to batch all work so you can make good use of your configured server.

Configuring the server will not take long, perhaps 10 minutes total.

- 1. Click “View Instances” in your Amazon EC2 console if you have not already.
- 2. Copy “Public IP” (down the bottom of the screen in Description) to clipboard.

In this example my IP address is 52.53.185.166.

**Do not use this IP address, your IP address will be different**.

- 3. Open a Terminal and change directory to where you downloaded your key pair. Login in to your server using SSH, for example you can type:

ssh -i xgboost-keypair.pem fedora@52.53.185.166

- 4. You may be prompted with a warning the first time you log into your server instance. You can ignore this warning, just type “yes” and press enter.

You are now logged into your server.

Double check the number of CPU cores on your instance, by typing

cat /proc/cpuinfo | grep processor | wc -l

You should see:

32

The first step is to install all of the packages to support XGBoost.

This includes GCC, Python and the SciPy stack. We will use the Fedora package manager dnf (the new yum).

**Note**: We will be using Python 2 and some older versions of the libraries. This is intentional as these instructions do not appear to work for the very latest versions of the libraries and Python 3.

This is a single line:

sudo dnf install gcc gcc-c++ make git unzip python python2-numpy python2-scipy python2-scikit-learn python2-pandas python2-matplotlib

Type “y” and press Enter when prompted to confirm the packages to install.

This will take a few minutes to download and install all of the required packages.

Once completed, we can confirm the environment was installed successfully.

Type:

gcc --version

You should see:

gcc (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1) Copyright (C) 2016 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Type:

python --version

You should see:

Python 2.7.13

Type:

python -c "import scipy;print(scipy.__version__)" python -c "import numpy;print(numpy.__version__)" python -c "import pandas;print(pandas.__version__)" python -c "import sklearn;print(sklearn.__version__)"

You should see something like:

0.16.1 1.11.0 0.18.0 0.17.1

**Note**: If any of these checks failed, stop and correct any errors. You must have a complete working environment before moving on.

We are now ready to install XGBoost.

The installation instructions for XGBoost are complete and we can follow them directly.

First we need to download the project on the server.

git clone --recursive https://github.com/dmlc/xgboost cd xgboost git checkout tags/v0.90

Next we need to compile it. The -j argument can be used to specify the number of cores to expect. We can set this to 32 for the 32 cores on our AWS instance.

If you chose different AWS hardware, you can set this appropriately.

make -j32

The XGBoost project should build successfully (e.g. no errors).

We are now ready to install the Python version of the library.

cd python-package sudo python setup.py install

That is it.

We can confirm the installation was successful by typing:

python -c "import xgboost;print(xgboost.__version__)"

This should print something like:

0.90

Let’s test out your large AWS instance by running XGBoost with a lot of cores.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). It describes the 93 obfuscated details of more than 61,000 products grouped into 10 product categories (e.g. fashion, electronics, etc.). Input variables are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories and models are evaluated using multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).

Create a new directory called **work/** on your workstation.

You can download the training dataset **train.csv.zip** from the Data page and place it in your **work/** directory on your workstation.

We will evaluate the time taken to train an XGBoost on this dataset using different numbers of cores.

We will try 1 core, half the cores 16 and all of the 32 cores. We can specify the number of cores used by the XGBoost algorithm by setting the **nthread** parameter in the **XGBClassifier** class (the scikit-learn wrapper for XGBoost).

The complete example is listed below. Save it in a file with the name **work/script.py**.

# Otto multi-core test from pandas import read_csv from xgboost import XGBClassifier from sklearn.preprocessing import LabelEncoder import time # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # evaluate the effect of the number of threads results = [] num_threads = [1, 16, 32] for n in num_threads: start = time.time() model = XGBClassifier(nthread=n) model.fit(X, label_encoded_y) elapsed = time.time() - start print(n, elapsed) results.append(elapsed)

Now, we can copy your **work/** directory with the data and script to your AWS server.

From your workstation in the current directory where the **work/** directory is located, type:

scp -r -i xgboost-keypair.pem work fedora@52.53.185.166:/home/fedora/

Of course, you will need to use your key file and the IP address of your server.

This will create a new **work/** directory in your home directory on your server.

Log back onto your server instance (if needed):

ssh -i xgboost-keypair.pem fedora@52.53.185.166

Change directory to your work directory and unzip the training data.

cd work unzip ./train.csv.data

Now we can run the script and train our XGBoost models and calculate how long it takes using different numbers of cores:

python script.py

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

You should see output as follows.

(1, 96.34455895423889) (16, 8.31994891166687) (32, 7.604229927062988)

You can see little difference between 16 and 32 cores.

I believe the reason for this is that AWS is giving access to 16 physical cores with hyperthreading, offering an additional virtual cores. Nevertheless, building a large XGBoost model in 7 seconds is great.

You can use this as a template for your copying your own data and scripts to your AWS instance.

A good tip is to run your scripts as a background process and forward any output to a file. This is just in case your connection to the server is interrupted or you want to close it down and let the server run your code all night.

You can run your code as a background process and redirect output to a file by typing:

nohup python script.py >script.py.out 2>&1 &

Now that we are done, we can shut down the AWS instance.

When you are finished with your work you must close your instance.

Remember you are charged by the amount of time that you use the instance. It is cheap, but you do not want to leave an instance on if you are not using it.

- 1. Log out of your instance at the terminal, for example you can type:

exit

- 2. Log in to your AWS account with your web browser.
- 3. Click EC2.
- 4. Click “Instances” from the left-hand side menu.
- 5. Select your running instance from the list (it may already be selected if you only have one running instance).
- 6. Click the “Actions” button and select “Instance State” and choose “Terminate”. Confirm that you want to terminate your running instance.

It may take a number of seconds for the instance to close and to be removed from your list of instances.

That’s it.

In this post you discovered how to train large XGBoost models on Amazon cloud infrastructure.

Specifically, you learned:

- How to start and configure a Linux server instance on Amazon EC2 for XGBoost.
- How to install all of the required software needed to run the XGBoost library in Python.
- How to transfer data and code to your server and train a large model making use of all of the cores on the server.

Do you have any questions about training XGBoost models on Amazon Web Services or about this post? Ask your questions in the comments and I will do my best to answer.

The post How to Train XGBoost Models in the Cloud with Amazon Web Services appeared first on Machine Learning Mastery.

]]>The post How to Configure the Gradient Boosting Algorithm appeared first on Machine Learning Mastery.

]]>But how do you configure gradient boosting on your problem?

In this post you will discover how you can configure gradient boosting on your machine learning problem by looking at configurations reported in books, papers and as a result of competitions.

After reading this post, you will know:

- How to configure gradient boosting according to the original sources.
- Ideas for configuring the algorithm from defaults and suggestions in standard implementations.
- Rules of thumb for configuring gradient boosting and XGBoost from a top Kaggle competitors.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

In the 1999 paper “Greedy Function Approximation: A Gradient Boosting Machine“, Jerome Friedman comments on the trade-off between the number of trees (M) and the learning rate (v):

The v-M trade-off is clearly evident; smaller values of v give rise to larger optimal M-values. They also provide higher accuracy, with a diminishing return for v < 0.125. The misclassification error rate is very flat for M > 200, so that optimal M-values for it are unstable. … the qualitative nature of these results is fairly universal.

He suggests to first set a large value for the number of trees, then tune the shrinkage parameter to achieve the best results. Studies in the paper preferred a shrinkage value of 0.1, a number of trees in the range 100 to 500 and the number of terminal nodes in a tree between 2 and 8.

In the 1999 paper “Stochastic Gradient Boosting“, Friedman reiterated the preference for the shrinkage parameter:

The “shrinkage” parameter 0 < v < 1 controls the learning rate of the procedure. Empirically …, it was found that small values (v <= 0.1) lead to much better generalization error.

In the paper, Friedman introduces and empirically investigates stochastic gradient boosting (row-based sub-sampling). He finds that almost all subsampling percentages are better than so-called deterministic boosting and perhaps 30%-to-50% is a good value to choose on some problems and 50%-to-80% on others.

… the best value of the sampling fraction … is approximately 40% (f=0.4) … However, sampling only 30% or even 20% of the data at each iteration gives considerable improvement over no sampling at all, with a corresponding computational speed-up by factors of 3 and 5 respectively.

He also studied the effect of the number of terminal nodes in trees finding values like 3 and 6 better than larger values like 11, 21 and 41.

In both cases the optimal tree size as averaged over 100 targets is L = 6. Increasing the capacity of the base learner by using larger trees degrades performance through “over-fitting”.

In his talk titled “Gradient Boosting Machine Learning” at H2O, Trevor Hastie made the comment that in general gradient boosting performs better than random forest, which in turn performs better than individual decision trees.

Gradient Boosting > Random Forest > Bagging > Single Trees

Chapter 10 titled “Boosting and Additive Trees” of the book “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” is dedicated to boosting. In it they provide both heuristics for configuring gradient boosting as well as some empirical studies.

They comment that a good value the number of nodes in the tree (J) is about 6, with generally good values in the range of 4-to-8.

Although in many applications J = 2 will be insufficient, it is unlikely that J > 10 will be required. Experience so far indicates that 4 <= J <= 8 works well in the context of boosting, with results being fairly insensitive to particular choices in this range.

They suggest monitoring the performance on a validation dataset in order to calibrate the number of trees and to use an early stopping procedure once performance on the validation dataset begins to degrade.

As in Friedman’s first gradient boosting paper, they comment on the trade-off between the number of trees (M) and the learning rate (v) and recommend a small value for the learning rate < 0.1.

Smaller values of v lead to larger values of M for the same training risk, so that there is a tradeoff between them. … In fact, the best strategy appears to be to set v to be very small (v < 0.1) and then choose M by early stopping.

Also, as in Friedman’s stochastic gradient boosting paper, they recommend a subsampling percentage (n) without replacement with a value of about 50%.

A typical value for n can be 1/2, although for large N, n can be substantially smaller than 1/2.

The gradient boosting algorithm is implemented in R as the gbm package.

Reviewing the package documentation, the gbm() function specifies sensible defaults:

- n.trees = 100 (number of trees).
- interaction.depth = 1 (number of leaves).
- n.minobsinnode = 10 (minimum number of samples in tree terminal nodes).
- shrinkage = 0.001 (learning rate).

It is interesting to note that a smaller shrinkage factor is used and that stumps are the default. The small shrinkage is explained by Ridgeway next.

In the vignette for using the gbm package in R titled “Generalized Boosted Models: A guide to the gbm package“, Greg Ridgeway provides some usage heuristics. He suggest firs setting the learning rate (lambda) to as small as possible then tuning the number of trees (iterations or T) using cross validation.

In practice I set lambda to be as small as possible and then select T by cross-validation. Performance is best when lambda is as small as possible performance with decreasing marginal utility for smaller and smaller lambda.

He comments on his rationale for setting the default shrinkage to the small value of 0.001 rather than 0.1.

It is important to know that smaller values of shrinkage (almost) always give improved predictive performance. That is, setting shrinkage=0.001 will almost certainly result in a model with better out-of-sample predictive performance than setting shrinkage=0.01. … The model with shrinkage=0.001 will likely require ten times as many iterations as the model with shrinkage=0.01

Ridgeway also uses quite large numbers of trees (called iterations here), thousands rather than hundreds

I usually aim for 3,000 to 10,000 iterations with shrinkage rates between 0.01 and 0.001.

The Python library provides an implementation of gradient boosting for classification called the GradientBoostingClassifier class and regression called the GradientBoostingRegressor class.

It is useful to review the default configuration for the algorithm in this library.

There are many parameters, but below are a few key defaults.

- learning_rate=0.1 (shrinkage).
- n_estimators=100 (number of trees).
- max_depth=3.
- min_samples_split=2.
- min_samples_leaf=1.
- subsample=1.0.

It is interesting to note that the default shrinkage does match Friedman and that the tree depth is not set to stumps like the R package. A tree depth of 3 (if the created tree was symmetrical) will have 8 leaf nodes, matching the upper bound of the preferred number of terminal nodes in Friedman’s studies (alternately max_leaf_nodes can be set).

In the scikit-learn user guide under the section titled “Gradient Tree Boosting” the authors comment that setting the maximum leaf nodes has a similar effect to setting the max depth to the maximum leaf nodes minus one, but results in worse performance.

We found that max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train at the expense of a slightly higher training error.

In a small study demonstrating regularization methods for gradient boosting titled “Gradient Boosting regularization“, the results show the benefit of using both shrinkage and sub-sampling.

The XGBoost library is dedicated to the gradient boosting algorithm.

It too specifies default parameters that are interesting to note, firstly the XGBoost Parameters page:

- eta=0.3 (shrinkage or learning rate).
- max_depth=6.
- subsample=1.

This shows a higher learning rate and a larger max depth than we see in most studies and other libraries. Similarly, we can summarize the default parameters for XGBoost in the Python API reference.

- max_depth=3.
- learning_rate=0.1.
- n_estimators=100.
- subsample=1.

These defaults are generally more in-line with scikit-learn defaults and recommendations from the papers.

In a talk to TechEd Europe titled “xgboost: An R package for Fast and Accurate Gradient Boosting“, when asked how to configure XGBoost, Tong He suggested the three most important parameters to tune are:

- Number of trees.
- Tree depth.
- Step Size (learning rate).

He also provide a terse configuration strategy for new problems:

- Run the default configuration (and presumably review learning curves?).
- If the system is overlearning, slow the learning down (using shrinkage?).
- If the system is underlearning, speed the learning up to be more aggressive (using shrinkage?).

In Owen Zhang’s talk to the NYC Data Science Academy in 2015 titled “Winning Data Science Competitions“, he provides some general tips for configuring gradient boost with XGBoost. Owen is a heavy user of gradient boosting.

My confession: I (over)use GBM. When in doubt, use GBM.

He provides some tips for configuring gradient boosting:

- learning rate + number of trees: Target 500-to-1000 trees and tune learning rate.
- number of samples in leaf: the number of observations needed to get a good mean estimate.
- interaction depth: 10+.

In an updated slide deck for the same talk, he gives a summary of common parameters he uses for XGBoost:

We can see a few interesting things in this table.

- Simplified the relationship of learning rate and the number of trees as an approximate ratio: learning rate = [2-10]/trees.
- Explores values for both row and column sampling for stochastic gradient boosting.
- Explores close to the same range for max depth as reported by Friedman (4-10).
- Tunes minimum leaf weight as an approximate ratio of 3 over the percentage of the number of rare events.

In a similar talk by Owen at ODSC Boston 2015 titled “Open Source Tools and Data Science Competitions“, he again summarized common parameters he uses:

We can see some minor differences that may be relevant.

- Target 100 rather than 1000 trees and tune learning rate.
- Min child weight as 1 over the square root of the event rate.
- No sub sampling of rows.

Finally, Abhishek Thakur, in his post titled “Approaching (Almost) Any Machine Learning Problem” provided a similar table listing out key XGBoost parameters and suggestions for tuning.

The spreads do cover the general defaults suggested above and more.

It is interesting to note that Abhishek does provides some suggestions for tuning the alpha and beta model penalization terms as well as row sampling.

In this post, you got insight into how to configure gradient boosting for your own machine learning problems.

Specifically you learned:

- About the trade-off in the number of trees and the shrinkage and good defaults for sub-sampling.
- Different ideas on limiting tree size and good defaults for tree depth and the number of terminal nodes.
- Grid search strategies used by a top Kaggle competition winner.

Do you have any questions about configuring gradient boosting or about this post? Ask your questions in the comments.

The post How to Configure the Gradient Boosting Algorithm appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning appeared first on Machine Learning Mastery.

]]>In this post you will discover the gradient boosting machine learning algorithm and get a gentle introduction into where it came from and how it works.

After reading this post, you will know:

- The origin of boosting from learning theory and AdaBoost.
- How gradient boosting works including the loss function, weak learners and the additive model.
- How to improve performance over the base algorithm with various regularization schemes.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

The idea of boosting came out of the idea of whether a weak learner can be modified to become better.

Michael Kearns articulated the goal as the “*Hypothesis Boosting Problem*” stating the goal from a practical standpoint as:

… an efficient algorithm for converting relatively poor hypotheses into very good hypotheses

— Thoughts on Hypothesis Boosting [PDF], 1988

A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance.

These ideas built upon Leslie Valiant’s work on distribution free or Probably Approximately Correct (PAC) learning, a framework for investigating the complexity of machine learning problems.

Hypothesis boosting was the idea of filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learns to handle the remaining difficult observations.

The idea is to use the weak learning method several times to get a succession of hypotheses, each one refocused on the examples that the previous ones found difficult and misclassified. … Note, however, it is not obvious at all how this can be done

— Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World, page 152, 2013

The first realization of boosting that saw great success in application was Adaptive Boosting or AdaBoost for short.

Boosting refers to this general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.

— A decision-theoretic generalization of on-line learning and an application to boosting [PDF], 1995

The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.

AdaBoost works by weighting the observations, putting more weight on difficult to classify instances and less on those already handled well. New weak learners are added sequentially that focus their training on the more difficult patterns.

This means that samples that are difficult to classify receive increasing larger weights until the algorithm identifies a model that correctly classifies these samples

— Applied Predictive Modeling, 2013

Predictions are made by majority vote of the weak learners’ predictions, weighted by their individual accuracy. The most successful form of the AdaBoost algorithm was for binary classification problems and was called AdaBoost.M1.

You can learn more about the AdaBoost algorithm in the post:

AdaBoost and related algorithms were recast in a statistical framework first by Breiman calling them ARCing algorithms.

Arcing is an acronym for Adaptive Reweighting and Combining. Each step in an arcing algorithm consists of a weighted minimization followed by a recomputation of [the classifiers] and [weighted input].

— Prediction Games and Arching Algorithms [PDF], 1997

This framework was further developed by Friedman and called Gradient Boosting Machines. Later called just gradient boosting or gradient tree boosting.

The statistical framework cast boosting as a numerical optimization problem where the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure.

This class of algorithms were described as a stage-wise additive model. This is because one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged.

Note that this stagewise strategy is different from stepwise approaches that readjust previously entered terms when new ones are added.

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

The generalization allowed arbitrary differentiable loss functions to be used, expanding the technique beyond binary classification problems to support regression, multi-class classification and more.

Gradient boosting involves three elements:

- A loss function to be optimized.
- A weak learner to make predictions.
- An additive model to add weak learners to minimize the loss function.

The loss function used depends on the type of problem being solved.

It must be differentiable, but many standard loss functions are supported and you can define your own.

For example, regression may use a squared error and classification may use logarithmic loss.

A benefit of the gradient boosting framework is that a new boosting algorithm does not have to be derived for each loss function that may want to be used, instead, it is a generic enough framework that any differentiable loss function can be used.

Decision trees are used as the weak learner in gradient boosting.

Specifically regression trees are used that output real values for splits and whose output can be added together, allowing subsequent models outputs to be added and “correct” the residuals in the predictions.

Trees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini or to minimize the loss.

Initially, such as in the case of AdaBoost, very short decision trees were used that only had a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.

It is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.

This is to ensure that the learners remain weak, but can still be constructed in a greedy manner.

Trees are added one at a time, and existing trees in the model are not changed.

A gradient descent procedure is used to minimize the loss when adding trees.

Traditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating error or loss, the weights are updated to minimize that error.

Instead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss (i.e. follow the gradient). We do this by parameterizing the tree, then modify the parameters of the tree and move in the right direction by (reducing the residual loss.

Generally this approach is called functional gradient descent or gradient descent with functions.

One way to produce a weighted combination of classifiers which optimizes [the cost] is by gradient descent in function space

— Boosting Algorithms as Gradient Descent in Function Space [PDF], 1999

The output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.

A fixed number of trees are added or training stops once loss reaches an acceptable level or no longer improves on an external validation dataset.

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.

It can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.

In this this section we will look at 4 enhancements to basic gradient boosting:

- Tree Constraints
- Shrinkage
- Random sampling
- Penalized Learning

It is important that the weak learners have skill but remain weak.

There are a number of ways that the trees can be constrained.

A good general heuristic is that the more constrained tree creation is, the more trees you will need in the model, and the reverse, where less constrained individual trees, the fewer trees that will be required.

Below are some constraints that can be imposed on the construction of decision trees:

**Number of trees**, generally adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.**Tree depth**, deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.**Number of nodes or number of leaves**, like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.**Number of observations per split**imposes a minimum constraint on the amount of training data at a training node before a split can be considered**Minimim improvement to loss**is a constraint on the improvement of any split added to a tree.

The predictions of each tree are added together sequentially.

The contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. This weighting is called a shrinkage or a learning rate.

Each update is simply scaled by the value of the “learning rate parameter v”

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

The effect is that learning is slowed down, in turn require more trees to be added to the model, in turn taking longer to train, providing a configuration trade-off between the number of trees and learning rate.

Decreasing the value of v [the learning rate] increases the best value for M [the number of trees].

— Greedy Function Approximation: A Gradient Boosting Machine [PDF], 1999

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.

— Stochastic Gradient Boosting [PDF], 1999

A big insight into bagging ensembles and random forest was allowing trees to be greedily created from subsamples of the training dataset.

This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models.

This variation of boosting is called stochastic gradient boosting.

at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.

— Stochastic Gradient Boosting [PDF], 1999

A few variants of stochastic boosting that can be used:

- Subsample rows before creating each tree.
- Subsample columns before creating each tree
- Subsample columns before considering each split.

Generally, aggressive sub-sampling such as selecting only 50% of the data has shown to be beneficial.

According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling

— XGBoost: A Scalable Tree Boosting System, 2016

Additional constraints can be imposed on the parameterized trees in addition to their structure.

Classical decision trees like CART are not used as weak learners, instead a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights in some literature.

As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as:

- L1 regularization of weights.
- L2 regularization of weights.

The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.

— XGBoost: A Scalable Tree Boosting System, 2016

Gradient boosting is a fascinating algorithm and I am sure you want to go deeper.

This section lists various resources that you can use to learn more about the gradient boosting algorithm.

- Gradient Boosting Machine Learning, Trevor Hastie, 2014
- Gradient Boosting, Alexander Ihler, 2012
- GBM, John Mount, 2015
- Learning: Boosting, MIT 6.034 Artificial Intelligence, 2010
- xgboost: An R package for Fast and Accurate Gradient Boosting, 2016
- XGBoost: A Scalable Tree Boosting System, Tianqi Chen, 2016

- Section 8.2.3 Boosting, page 321, An Introduction to Statistical Learning: with Applications in R.
- Section 8.6 Boosting, page 203, Applied Predictive Modeling.
- Section 14.5 Stochastic Gradient Boosting, page 390, Applied Predictive Modeling.
- Section 16.4 Boosting, page 556, Machine Learning: A Probabilistic Perspective
- Chapter 10 Boosting and Additive Trees, page 337, The Elements of Statistical Learning: Data Mining, Inference, and Prediction

- Thoughts on Hypothesis Boosting [PDF], Michael Kearns, 1988
- A decision-theoretic generalization of on-line learning and an application to boosting [PDF], 1995
- Arcing the edge [PDF], 1998
- Stochastic Gradient Boosting [PDF], 1999
- Boosting Algorithms as Gradient Descent in Function Space [PDF], 1999

In this post you discovered the gradient boosting algorithm for predictive modeling in machine learning.

Specifically, you learned:

- The history of boosting in learning theory and AdaBoost.
- How the gradient boosting algorithm works with a loss function, weak learners and an additive model.
- How to improve the performance of gradient boosting with regularization.

Do you have any questions about the gradient boosting algorithm or about this post? Ask your questions in the comments and I will do my best to answer.

The post A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Tune the Number and Size of Decision Trees with XGBoost in Python appeared first on Machine Learning Mastery.

]]>This raises the question as to how many trees (weak learners or estimators) to configure in your gradient boosting model and how big each tree should be.

In this post you will discover how to design a systematic experiment to select the number and size of decision trees to use on your problem.

After reading this post you will know:

- How to evaluate the effect of adding more decision trees to your XGBoost model.
- How to evaluate the effect of creating larger decision trees to your XGBoost model.
- How to investigate the relationship between the number and depth of trees on your problem.

**Kick-start your project** with my new book XGBoost With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes in scikit-learn API version 0.18.1.

Take my free 7-day email course and discover xgboost (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign-up to Kaggle to be able to download this dataset). You can download the training dataset **train.csv.zip** from the Data page and place the unzipped **train.csv** file into your working directory.

This dataset describes the 93 obfuscated details of more than 61,000 products grouped into 10 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

Most implementations of gradient boosting are configured by default with a relatively small number of trees, such as hundreds or thousands.

The general reason is that on most problems, adding more trees beyond a limit does not improve the performance of the model.

The reason is in the way that the boosted tree model is constructed, sequentially where each new tree attempts to model and correct for the errors made by the sequence of previous trees. Quickly, the model reaches a point of diminishing returns.

We can demonstrate this point of diminishing returns easily on the Otto dataset.

The number of trees (or rounds) in an XGBoost model is specified to the XGBClassifier or XGBRegressor class in the n_estimators argument. The default in the XGBoost library is 100.

Using scikit-learn we can perform a grid search of the **n_estimators** model parameter, evaluating a series of values from 50 to 350 with a step size of 50 (50, 150, 200, 250, 300, 350).

# grid search model = XGBClassifier() n_estimators = range(50, 400, 50) param_grid = dict(n_estimators=n_estimators) kfold = StratifiedKFold(n_splits scoring="neg_log_loss", n_jobs=-1, cv=kfold) result = grid_search.fit(X, label_encoded_y)

We can perform this grid search on the Otto dataset, using 10-fold cross validation, requiring 60 models to be trained (6 configurations * 10 folds).

The full code listing is provided below for completeness.

# XGBoost on Otto dataset, Tune n_estimators from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() n_estimators = range(50, 400, 50) param_grid = dict(n_estimators=n_estimators) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(n_estimators, means, yerr=stds) pyplot.title("XGBoost n_estimators vs Log Loss") pyplot.xlabel('n_estimators') pyplot.ylabel('Log Loss') pyplot.savefig('n_estimators.png')

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example prints the following results.

Best: -0.001152 using {'n_estimators': 250} -0.010970 (0.001083) with: {'n_estimators': 50} -0.001239 (0.001730) with: {'n_estimators': 100} -0.001163 (0.001715) with: {'n_estimators': 150} -0.001153 (0.001702) with: {'n_estimators': 200} -0.001152 (0.001702) with: {'n_estimators': 250} -0.001152 (0.001704) with: {'n_estimators': 300} -0.001153 (0.001706) with: {'n_estimators': 350}

We can see that the cross validation log loss scores are negative. This is because the scikit-learn cross validation framework inverted them. The reason is that internally, the framework requires that all metrics that are being optimized are to be maximized, whereas log loss is a minimization metric. It can easily be made maximizing by inverting the scores.

The best number of trees was **n_estimators=250** resulting in a log loss of 0.001152, but really not a significant difference from **n_estimators=200**. In fact, there is not a large relative difference in the number of trees between 100 and 350 if we plot the results.

Below is line graph showing the relationship between the number of trees and mean (inverted) logarithmic loss, with the standard deviation shown as error bars.

In gradient boosting, we can control the size of decision trees, also called the number of layers or the depth.

Shallow trees are expected to have poor performance because they capture few details of the problem and are generally referred to as weak learners. Deeper trees generally capture too many details of the problem and overfit the training dataset, limiting the ability to make good predictions on new data.

Generally, boosting algorithms are configured with weak learners, decision trees with few layers, sometimes as simple as just a root node, also called a decision stump rather than a decision tree.

The maximum depth can be specified in the **XGBClassifier** and **XGBRegressor** wrapper classes for XGBoost in the **max_depth** parameter. This parameter takes an integer value and defaults to a value of 3.

model = XGBClassifier(max_depth=3)

We can tune this hyperparameter of XGBoost using the grid search infrastructure in scikit-learn on the Otto dataset. Below we evaluate odd values for **max_depth** between 1 and 9 (1, 3, 5, 7, 9).

Each of the 5 configurations is evaluated using 10-fold cross validation, resulting in 50 models being constructed. The full code listing is provided below for completeness.

# XGBoost on Otto dataset, Tune max_depth from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() max_depth = range(1, 11, 2) print(max_depth) param_grid = dict(max_depth=max_depth) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot pyplot.errorbar(max_depth, means, yerr=stds) pyplot.title("XGBoost max_depth vs Log Loss") pyplot.xlabel('max_depth') pyplot.ylabel('Log Loss') pyplot.savefig('max_depth.png')

Running this example prints the log loss for each **max_depth**.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The optimal configuration was **max_depth=5** resulting in a log loss of 0.001236.

Best: -0.001236 using {'max_depth': 5} -0.026235 (0.000898) with: {'max_depth': 1} -0.001239 (0.001730) with: {'max_depth': 3} -0.001236 (0.001701) with: {'max_depth': 5} -0.001237 (0.001701) with: {'max_depth': 7} -0.001237 (0.001701) with: {'max_depth': 9}

Reviewing the plot of log loss scores, we can see a marked jump from **max_depth=1** to **max_depth=3** then pretty even performance for the rest the values of **max_depth**.

Although the best score was observed for **max_depth=5**, it is interesting to note that there was practically little difference between using **max_depth=3** or **max_depth=7**.

This suggests a point of diminishing returns in **max_depth** on a problem that you can tease out using grid search. A graph of **max_depth** values is plotted against (inverted) logarithmic loss below.

There is a relationship between the number of trees in the model and the depth of each tree.

We would expect that deeper trees would result in fewer trees being required in the model, and the inverse where simpler trees (such as decision stumps) require many more trees to achieve similar results.

We can investigate this relationship by evaluating a grid of **n_estimators** and **max_depth** configuration values. To avoid the evaluation taking too long, we will limit the total number of configuration values evaluated. Parameters were chosen to tease out the relationship rather than optimize the model.

We will create a grid of 4 different n_estimators values (50, 100, 150, 200) and 4 different max_depth values (2, 4, 6, 8) and each combination will be evaluated using 10-fold cross validation. A total of 4*4*10 or 160 models will be trained and evaluated.

The full code listing is provided below.

# XGBoost on Otto dataset, Tune n_estimators and max_depth from pandas import read_csv from xgboost import XGBClassifier from sklearn.model_selection import GridSearchCV from sklearn.model_selection import StratifiedKFold from sklearn.preprocessing import LabelEncoder import matplotlib matplotlib.use('Agg') from matplotlib import pyplot import numpy # load data data = read_csv('train.csv') dataset = data.values # split data into X and y X = dataset[:,0:94] y = dataset[:,94] # encode string class values as integers label_encoded_y = LabelEncoder().fit_transform(y) # grid search model = XGBClassifier() n_estimators = [50, 100, 150, 200] max_depth = [2, 4, 6, 8] print(max_depth) param_grid = dict(max_depth=max_depth, n_estimators=n_estimators) kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7) grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1) grid_result = grid_search.fit(X, label_encoded_y) # summarize results print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_)) means = grid_result.cv_results_['mean_test_score'] stds = grid_result.cv_results_['std_test_score'] params = grid_result.cv_results_['params'] for mean, stdev, param in zip(means, stds, params): print("%f (%f) with: %r" % (mean, stdev, param)) # plot results scores = numpy.array(means).reshape(len(max_depth), len(n_estimators)) for i, value in enumerate(max_depth): pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value)) pyplot.legend() pyplot.xlabel('n_estimators') pyplot.ylabel('Log Loss') pyplot.savefig('n_estimators_vs_max_depth.png')

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the code produces a listing of the logloss for each parameter pair.

Best: -0.001141 using {'n_estimators': 200, 'max_depth': 4} -0.012127 (0.001130) with: {'n_estimators': 50, 'max_depth': 2} -0.001351 (0.001825) with: {'n_estimators': 100, 'max_depth': 2} -0.001278 (0.001812) with: {'n_estimators': 150, 'max_depth': 2} -0.001266 (0.001796) with: {'n_estimators': 200, 'max_depth': 2} -0.010545 (0.001083) with: {'n_estimators': 50, 'max_depth': 4} -0.001226 (0.001721) with: {'n_estimators': 100, 'max_depth': 4} -0.001150 (0.001704) with: {'n_estimators': 150, 'max_depth': 4} -0.001141 (0.001693) with: {'n_estimators': 200, 'max_depth': 4} -0.010341 (0.001059) with: {'n_estimators': 50, 'max_depth': 6} -0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 6} -0.001163 (0.001688) with: {'n_estimators': 150, 'max_depth': 6} -0.001154 (0.001679) with: {'n_estimators': 200, 'max_depth': 6} -0.010342 (0.001059) with: {'n_estimators': 50, 'max_depth': 8} -0.001237 (0.001701) with: {'n_estimators': 100, 'max_depth': 8} -0.001161 (0.001688) with: {'n_estimators': 150, 'max_depth': 8} -0.001153 (0.001679) with: {'n_estimators': 200, 'max_depth': 8}

We can see that the best result was achieved with a **n_estimators=200** and **max_depth=4**, similar to the best values found from the previous two rounds of standalone parameter tuning (**n_estimators=250**, **max_depth=5**).

We can plot the relationship between each series of **max_depth** values for a given **n_estimators**.

The lines overlap making it hard to see the relationship, but generally we can see the interaction we expect. Fewer boosted trees are required with increased tree depth.

Further, we would expect the increase complexity provided by deeper individual trees to result in greater overfitting of the training data which would be exacerbated by having more trees, in turn resulting in a lower cross validation score. We don’t see this here as our trees are not that deep nor do we have too many. Exploring this expectation is left as an exercise you could explore yourself.

In this post, you discovered how to tune the number and depth of decision trees when using gradient boosting with XGBoost in Python.

Specifically, you learned:

- How to tune the number of decision trees in an XGBoost model.
- How to tune the depth of decision trees in an XGBoost model.
- How to jointly tune the number of trees and tree depth in an XGBoost model

Do you have any questions about the number or size of decision trees in your gradient boosting model or about this post? Ask your questions in the comments and I will do my best to answer.

The post How to Tune the Number and Size of Decision Trees with XGBoost in Python appeared first on Machine Learning Mastery.

]]>