The post Multi-Class Imbalanced Classification appeared first on Machine Learning Mastery.

Most imbalanced classification examples focus on binary classification tasks, yet many of the tools and techniques for imbalanced classification also directly support multi-class classification problems.

In this tutorial, you will discover how to use the tools of imbalanced classification with a multi-class dataset.

After completing this tutorial, you will know:

- About the glass identification standard imbalanced multi-class prediction problem.
- How to use SMOTE oversampling for imbalanced multi-class classification.
- How to use cost-sensitive learning for imbalanced multi-class classification.

Let’s get started.

This tutorial is divided into three parts; they are:

- Glass Multi-Class Classification Dataset
- SMOTE Oversampling for Multi-Class Classification
- Cost-Sensitive Learning for Multi-Class Classification

In this tutorial, we will focus on the standard imbalanced multi-class classification problem referred to as “**Glass Identification**” or simply “*glass*.”

The dataset describes the chemical properties of glass and involves classifying samples of glass using their chemical properties as one of six classes. The dataset was credited to Vina Spiehler in 1987.

Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:

- RI: Refractive Index
- Na: Sodium
- Mg: Magnesium
- Al: Aluminum
- Si: Silicon
- K: Potassium
- Ca: Calcium
- Ba: Barium
- Fe: Iron

The chemical compositions are measured as the weight percent in corresponding oxide.

There are seven types of glass listed; they are:

- Class 1: building windows (float processed)
- Class 2: building windows (non-float processed)
- Class 3: vehicle windows (float processed)
- Class 4: vehicle windows (non-float processed)
- Class 5: containers
- Class 6: tableware
- Class 7: headlamps

Float glass refers to the process used to make the glass.

There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.

- Class 1: 70 examples
- Class 2: 76 examples
- Class 3: 17 examples
- Class 4: 0 examples
- Class 5: 13 examples
- Class 6: 9 examples
- Class 7: 29 examples

Although there are minority classes, all classes are equally important in this prediction problem.

The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). There are 163 examples of window glass and 51 examples of non-window glass.

- Window Glass: 163 examples
- Non-Window Glass: 51 examples

Another division of the observations would be between float processed glass and non-float processed glass, in the case of window glass only. This division is more balanced.

- Float Glass: 87 examples
- Non-Float Glass: 76 examples

You can learn more about the dataset here:

No need to download the dataset; we will download it automatically as part of the worked examples.

Below is a sample of the first few rows of the data.

```
1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
...
```

We can see that all inputs are numeric and the target variable in the final column is the integer encoded class label.

You can learn more about how to work through this dataset as part of a project in the tutorial:

Now that we are familiar with the glass multi-class classification dataset, let’s explore how we can use standard imbalanced classification tools with it.

Oversampling refers to copying or synthesizing new examples of the minority classes so that the number of examples in the minority class better resembles or matches the number of examples in the majority classes.

Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling TEchnique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique titled “SMOTE: Synthetic Minority Over-sampling Technique.”

You can learn more about SMOTE in the tutorial:

The imbalanced-learn library provides an implementation of SMOTE that we can use that is compatible with the popular scikit-learn library.

First, the library must be installed. We can install it using pip as follows:

```
sudo pip install imbalanced-learn
```

We can confirm that the installation was successful by printing the version of the installed library:

```python
# check version number
import imblearn
print(imblearn.__version__)
```

Running the example will print the version number of the installed library; for example:

0.6.2

Before we apply SMOTE, let’s first load the dataset and confirm the number of examples in each class.

```python
# load and summarize the dataset
from pandas import read_csv
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# summarize distribution
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()
```

Running the example first downloads the dataset and splits it into input and output elements.

The number of rows in each class is then reported, confirming that some classes, such as 0 and 1, have many more examples (more than 70) than other classes, such as 3 and 4 (less than 15).

```
Class=0, n=70 (32.710%)
Class=1, n=76 (35.514%)
Class=2, n=17 (7.944%)
Class=3, n=13 (6.075%)
Class=4, n=9 (4.206%)
Class=5, n=29 (13.551%)
```

A bar chart is created providing a visualization of the class breakdown of the dataset.

This gives a clearer idea that classes 0 and 1 have many more examples than classes 2, 3, 4 and 5.

Next, we can apply SMOTE to oversample the dataset.

By default, SMOTE will oversample all classes to have the same number of examples as the class with the most examples.

In this case, class 1 has the most examples with 76, therefore, SMOTE will oversample all classes to have 76 examples.

The complete example of oversampling the glass dataset with SMOTE is listed below.

```python
# example of oversampling a multi-class classification dataset
from pandas import read_csv
from imblearn.over_sampling import SMOTE
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize distribution
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()
```

Running the example first loads the dataset and applies SMOTE to it.

The distribution of examples in each class is then reported, confirming that each class now has 76 examples, as we expected.

```
Class=0, n=76 (16.667%)
Class=1, n=76 (16.667%)
Class=2, n=76 (16.667%)
Class=3, n=76 (16.667%)
Class=4, n=76 (16.667%)
Class=5, n=76 (16.667%)
```

A bar chart of the class distribution is also created, providing a strong visual indication that all classes now have the same number of examples.

Instead of using the default strategy of SMOTE to oversample all classes to the number of examples in the majority class, we could instead specify the number of examples to oversample in each class.

For example, we could oversample to 100 examples in classes 0 and 1 and 200 examples in the remaining classes. This can be achieved by creating a dictionary that maps class labels to the desired number of examples in each class, then specifying it via the “*sampling_strategy*” argument to the SMOTE class.

```python
...
# transform the dataset
strategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}
oversample = SMOTE(sampling_strategy=strategy)
X, y = oversample.fit_resample(X, y)
```

Tying this together, the complete example of using a custom oversampling strategy for SMOTE is listed below.

```python
# example of oversampling a multi-class classification dataset with a custom strategy
from pandas import read_csv
from imblearn.over_sampling import SMOTE
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# transform the dataset
strategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}
oversample = SMOTE(sampling_strategy=strategy)
X, y = oversample.fit_resample(X, y)
# summarize distribution
counter = Counter(y)
for k, v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()
```

Running the example creates the desired sampling and summarizes the effect on the dataset, confirming the intended result.

```
Class=0, n=100 (10.000%)
Class=1, n=100 (10.000%)
Class=2, n=200 (20.000%)
Class=3, n=200 (20.000%)
Class=4, n=200 (20.000%)
Class=5, n=200 (20.000%)
```

Note: you may see warnings that can be safely ignored for the purposes of this example, such as:

UserWarning: After over-sampling, the number of samples (200) in class 5 will be larger than the number of samples in the majority class (class #1 -> 76)

A bar chart of the class distribution is also created confirming the specified class distribution after data sampling.

**Note**: when using data sampling like SMOTE, it must only be applied to the training dataset, not the entire dataset. I recommend using a Pipeline to ensure that the SMOTE method is correctly used when evaluating models and making predictions with models.

You can see an example of the correct usage of SMOTE in a Pipeline in this tutorial:

Most machine learning algorithms assume that all classes have an equal number of examples.

This is not the case in multi-class imbalanced classification. Algorithms can be modified to change the way learning is performed to bias towards those classes that have fewer examples in the training dataset. This is generally called cost-sensitive learning.

For more on cost-sensitive learning, see the tutorial:

The RandomForestClassifier class in scikit-learn supports cost-sensitive learning via the “*class_weight*” argument.

By default, the random forest class assigns equal weight to each class.

We can evaluate the classification accuracy of the default random forest class weighting on the glass imbalanced multi-class classification dataset.

The complete example is listed below.

```python
# baseline model and test harness for the glass identification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the reference model
model = RandomForestClassifier(n_estimators=1000)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the default random forest algorithm with 1,000 trees on the glass dataset using repeated stratified k-fold cross-validation.

The mean and standard deviation classification accuracy are reported at the end of the run.

Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the default model achieved a classification accuracy of about 79.6 percent.

Mean Accuracy: 0.796 (0.047)

We can set the “*class_weight*” argument to the value “*balanced*”, which will automatically calculate a class weighting that ensures each class receives an equal weighting during the training of the model.

```python
...
# define the model
model = RandomForestClassifier(n_estimators=1000, class_weight='balanced')
```

Tying this together, the complete example is listed below.

```python
# cost sensitive random forest with balanced class weights
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
model = RandomForestClassifier(n_estimators=1000, class_weight='balanced')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset.

Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the cost-sensitive model achieved a lift in classification accuracy over the cost-insensitive version of the algorithm, with 80.2 percent classification accuracy vs. 79.6 percent.

Mean Accuracy: 0.802 (0.044)
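For reference, the “balanced” mode computes each class weight as n_samples / (n_classes * n_samples_in_class), so rarer classes receive proportionally larger weights. The small sketch below (using the glass class counts listed earlier; *compute_class_weight()* is scikit-learn's helper for this calculation) shows the weights the model receives:

```python
# sketch: what class_weight='balanced' computes for the glass class counts
from numpy import repeat
from numpy import unique
from sklearn.utils.class_weight import compute_class_weight

# class counts from the glass dataset (labels after encoding)
counts = [70, 76, 17, 13, 9, 29]
y = repeat(range(6), counts)
classes = unique(y)
# weight_k = n_samples / (n_classes * n_samples_in_class_k)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
for c, w in zip(classes, weights):
    print('Class=%d, weight=%.3f' % (c, w))
```

Note how the smallest class (class 4, with 9 examples) receives roughly eight times the weight of the largest class (class 1, with 76 examples).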

The “*class_weight*” argument takes a dictionary of class labels mapped to a class weighting value.

We can use this to specify a custom weighting, such as a default weighting of 1.0 for classes 0 and 1 that have many examples, and a double class weighting of 2.0 for the other classes.

```python
...
# define the model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights)
```

Tying this together, the complete example of using a custom class weighting for cost-sensitive learning on the glass multi-class imbalanced classification problem is listed below.

```python
# cost sensitive random forest with custom class weightings
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset with custom weights.

Your specific results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that we achieved a further lift in accuracy from about 80.2 percent with balanced class weighting to 80.8 percent with a more biased class weighting.

Mean Accuracy: 0.808 (0.059)

This section provides more resources on the topic if you are looking to go deeper.

- Imbalanced Multiclass Classification with the Glass Identification Dataset
- SMOTE for Imbalanced Classification with Python
- Cost-Sensitive Logistic Regression for Imbalanced Classification
- Cost-Sensitive Learning for Imbalanced Classification

In this tutorial, you discovered how to use the tools of imbalanced classification with a multi-class dataset.

Specifically, you learned:

- About the glass identification standard imbalanced multi-class prediction problem.
- How to use SMOTE oversampling for imbalanced multi-class classification.
- How to use cost-sensitive learning for imbalanced multi-class classification.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use XGBoost for Time Series Forecasting appeared first on Machine Learning Mastery.

XGBoost is both fast and efficient, performing well, if not the best, on a wide range of predictive modeling tasks and is a favorite among data science competition winners, such as those on Kaggle.

XGBoost can also be used for time series forecasting, although it requires that the time series dataset be transformed into a supervised learning problem first. It also requires the use of a specialized technique for evaluating the model called walk-forward validation, as evaluating the model using k-fold cross validation would result in optimistically biased results.

In this tutorial, you will discover how to develop an XGBoost model for time series forecasting.

After completing this tutorial, you will know:

- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.

Let’s get started.

This tutorial is divided into three parts; they are:

- XGBoost Ensemble
- Time Series Data Preparation
- XGBoost for Time Series Forecasting

XGBoost is short for Extreme Gradient Boosting and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.

— XGBoost: A Scalable Tree Boosting System, 2016.

It is an ensemble of decision trees algorithm where new trees fix errors of those trees that are already part of the model. Trees are added until no further improvements can be made to the model.

XGBoost provides a highly efficient implementation of the stochastic gradient boosting algorithm and access to a suite of model hyperparameters designed to provide control over the model training process.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.

— XGBoost: A Scalable Tree Boosting System, 2016.

XGBoost is designed for classification and regression on tabular datasets, although it can be used for time series forecasting.

For more on the gradient boosting and XGBoost implementation, see the tutorial:

First, the XGBoost library must be installed.

You can install it using pip, as follows:

```
sudo pip install xgboost
```

Once installed, you can confirm that it was installed successfully and that you are using a modern version by running the following code:

```python
# xgboost
import xgboost
print("xgboost", xgboost.__version__)
```

Running the code, you should see the following version number or higher.

xgboost 1.0.1

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBRegressor wrapper class.

An instance of the model can be instantiated and used just like any other scikit-learn class for model evaluation. For example:

```python
...
# define model
model = XGBRegressor()
```

Now that we are familiar with XGBoost, let’s look at how we can prepare a time series dataset for supervised learning.

Time series data can be phrased as supervised learning.

Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We can do this by using previous time steps as input variables and use the next time step as the output variable.

Let’s make this concrete with an example. Imagine we have a time series as follows:

```
time, measure
1, 100
2, 110
3, 108
4, 115
5, 120
```

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step.

Reorganizing the time series dataset this way, the data would look as follows:

```
X, y
?, 100
100, 110
110, 108
108, 115
115, 120
120, ?
```

Note that the time column is dropped and some rows of data are unusable for training a model, such as the first and the last.

This representation is called a sliding window, as the window of inputs and expected outputs is shifted forward through time to create new “*samples*” for a supervised learning model.

For more on the sliding window approach to preparing time series forecasting data, see the tutorial:

We can use the shift() function in Pandas to automatically create new framings of time series problems given the desired length of input and output sequences.

This would be a useful tool as it would allow us to explore different framings of a time series problem with machine learning algorithms to see which might result in better-performing models.

The function below will take a time series provided as a NumPy array with one or more columns and transform it into a supervised learning problem with the specified number of inputs and outputs.

```python
# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values
```

We can use this function to prepare a time series dataset for XGBoost.

For more on the step-by-step development of this function, see the tutorial:
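As a quick check, the sketch below (self-contained, so the function is repeated with its pandas imports) applies *series_to_supervised()* to the five-observation series from the example above; the rows containing missing values are dropped, leaving four usable (X, y) pairs:

```python
# sketch: framing a tiny series as supervised learning
from numpy import asarray
from pandas import DataFrame
from pandas import concat

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together and drop rows with NaN values
    agg = concat(cols, axis=1)
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# the five-observation series from the example above
values = asarray([100, 110, 108, 115, 120]).reshape(5, 1)
data = series_to_supervised(values, n_in=1)
print(data)
```

The output has one column of inputs (the previous value) and one column of outputs (the next value), matching the table above.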

Once the dataset is prepared, we must be careful in how it is used to fit and evaluate a model.

For example, it would not be valid to fit the model on data from the future and have it predict the past. The model must be trained on the past and predict the future.

This means that methods that randomize the dataset during evaluation, like k-fold cross-validation, cannot be used. Instead, we must use a technique called walk-forward validation.

In walk-forward validation, the dataset is first split into train and test sets by selecting a cut point, e.g. all data except the last 12 months is used for training and the last 12 months is used for testing.

If we are interested in making a one-step forecast, e.g. one month, then we can evaluate the model by training on the training dataset and predicting the first step in the test dataset. We can then add the real observation from the test set to the training dataset, refit the model, then have the model predict the second step in the test dataset.

Repeating this process for the entire test dataset will give a one-step prediction for the entire test dataset from which an error measure can be calculated to evaluate the skill of the model.

For more on walk-forward validation, see the tutorial:

The function below performs walk-forward validation.

It takes the entire supervised learning version of the time series dataset and the number of rows to use as the test set as arguments.

It then steps through the test set, calling the *xgboost_forecast()* function to make a one-step forecast. An error measure is calculated and the details are returned for analysis.

```python
# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error against the output column
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions
```

The *train_test_split()* function is called to split the dataset into train and test sets.

We can define this function below.

```python
# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]
```

We can use the *XGBRegressor* class to make a one-step forecast.

The *xgboost_forecast()* function below implements this, taking the training dataset and test input row as input, fitting a model, and making a one-step prediction.

```python
# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]
```

Now that we know how to prepare time series data for forecasting and evaluate an XGBoost model, next we can look at using XGBoost on a real dataset.

In this section, we will explore how to use XGBoost for time series forecasting.

We will use a standard univariate time series dataset with the intent of using the model to make a one-step forecast.

You can use the code in this section as the starting point in your own project and easily adapt it for multivariate inputs, multivariate forecasts, and multi-step forecasts.

We will use the daily female births dataset; that is, the number of daily female births recorded across one year, 1959.

You can download the dataset from here and place it in your current working directory with the filename “*daily-total-female-births.csv*“.

The first few lines of the dataset look as follows:

```
"Date","Births"
"1959-01-01",35
"1959-01-02",32
"1959-01-03",30
"1959-01-04",31
"1959-01-05",44
...
```

First, let’s load and plot the dataset.

The complete example is listed below.

```python
# load and plot the time series dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# plot dataset
pyplot.plot(values)
pyplot.show()
```

Running the example creates a line plot of the dataset.

We can see there is no obvious trend or seasonality.

A persistence model can achieve a MAE of about 6.7 births when predicting the last 12 observations. This provides a baseline in performance above which a model may be considered skillful.

Next, we can evaluate the XGBoost model on the dataset when making one-step forecasts for the last 12 observations in the dataset.

We will use only the previous three time steps as input to the model and default model hyperparameters, except we will change the loss to ‘*reg:squarederror*‘ (to avoid a warning message) and use 1,000 trees in the ensemble (to avoid underlearning).

The complete example is listed below.

```python
# forecast daily births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
from matplotlib import pyplot

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# split a univariate dataset into train/test sets
def train_test_split(data, n_test):
    return data[:-n_test, :], data[-n_test:, :]

# fit an xgboost model and make a one step prediction
def xgboost_forecast(train, testX):
    # transform list into array
    train = asarray(train)
    # split into input and output columns
    trainX, trainy = train[:, :-1], train[:, -1]
    # fit model
    model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
    model.fit(trainX, trainy)
    # make a one-step prediction
    yhat = model.predict(asarray([testX]))
    return yhat[0]

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test):
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # split test row into input and output columns
        testX, testy = test[i, :-1], test[i, -1]
        # fit model on history and make a prediction
        yhat = xgboost_forecast(history, testX)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
        # summarize progress
        print('>expected=%.1f, predicted=%.1f' % (testy, yhat))
    # estimate prediction error against the output column
    error = mean_absolute_error(test[:, -1], predictions)
    return error, test[:, -1], predictions

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
data = series_to_supervised(values, n_in=3)
# evaluate
mae, y, yhat = walk_forward_validation(data, 12)
print('MAE: %.3f' % mae)
# plot expected vs predicted
pyplot.plot(y, label='Expected')
pyplot.plot(yhat, label='Predicted')
pyplot.legend()
pyplot.show()
```

Running the example reports the expected and predicted values for each step in the test set, then the MAE for all predicted values.

We can see that the model performs better than a persistence model, achieving an MAE of about 5.3 births, compared to 6.7 births.

**Can you do better?**

You can test different XGBoost hyperparameters and numbers of time steps as input to see if you can achieve better performance. Share your results in the comments below.

>expected=42.0, predicted=48.3
>expected=53.0, predicted=43.0
>expected=39.0, predicted=41.0
>expected=40.0, predicted=34.9
>expected=38.0, predicted=48.9
>expected=44.0, predicted=43.3
>expected=34.0, predicted=43.5
>expected=37.0, predicted=40.1
>expected=52.0, predicted=42.8
>expected=48.0, predicted=37.2
>expected=55.0, predicted=48.6
>expected=50.0, predicted=48.9
MAE: 5.356

A line plot is created comparing the series of expected values and predicted values for the last 12 months of the dataset.

This gives a visual indication of how well the model performed on the test set.

Once a final XGBoost model configuration is chosen, a model can be finalized and used to make a prediction on new data.

This is called an **out-of-sample forecast**, i.e. predicting beyond the training dataset. This is identical to making a prediction during the evaluation of the model, as we always want to evaluate a model using the same procedure that we expect to use when the model makes predictions on new data.

The example below demonstrates fitting a final XGBoost model on all available data and making a one-step prediction beyond the end of the dataset.

# finalize model and make a prediction for monthly births with xgboost
from numpy import asarray
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from xgboost import XGBRegressor

# transform a time series dataset into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg.values

# load the dataset
series = read_csv('daily-total-female-births.csv', header=0, index_col=0)
values = series.values
# transform the time series data into supervised learning
train = series_to_supervised(values, n_in=3)
# split into input and output columns
trainX, trainy = train[:, :-1], train[:, -1]
# fit model on all available data
model = XGBRegressor(objective='reg:squarederror', n_estimators=1000)
model.fit(trainX, trainy)
# construct an input for a new prediction
row = values[-3:].flatten()
# make a one-step prediction
yhat = model.predict(asarray([row]))
print('Input: %s, Predicted: %.3f' % (row, yhat[0]))

Running the example fits an XGBoost model on all available data.

A new row of input is prepared using the last three months of known data and the next month beyond the end of the dataset is predicted.

Input: [48 55 50], Predicted: 46.736

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- Time Series Forecasting as Supervised Learning
- How to Convert a Time Series to a Supervised Learning Problem in Python
- How To Backtest Machine Learning Models for Time Series Forecasting

In this tutorial, you discovered how to develop an XGBoost model for time series forecasting.

Specifically, you learned:

- XGBoost is an implementation of the gradient boosting ensemble algorithm for classification and regression.
- Time series datasets can be transformed into supervised learning using a sliding-window representation.
- How to fit, evaluate, and make predictions with an XGBoost model for time series forecasting.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use XGBoost for Time Series Forecasting appeared first on Machine Learning Mastery.


A single run of the k-fold cross-validation procedure may result in a noisy estimate of model performance. Different splits of the data may result in very different results.

Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true, unknown mean performance of the model on the dataset, and the error in this estimate can be quantified using the standard error.

In this tutorial, you will discover repeated k-fold cross-validation for model evaluation.

After completing this tutorial, you will know:

- The mean performance reported from a single run of k-fold cross-validation may be noisy.
- Repeated k-fold cross-validation provides a way to reduce the error in the estimate of mean model performance.
- How to evaluate machine learning models using repeated k-fold cross-validation in Python.

Let’s get started.

This tutorial is divided into three parts; they are:

- k-Fold Cross-Validation
- Repeated k-Fold Cross-Validation
- Repeated k-Fold Cross-Validation in Python

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.
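As a minimal illustration of the fold mechanics described above (a sketch added here, not part of the original tutorial), the KFold class can be used directly to inspect the train/test indices it generates, confirming that the test folds are non-overlapping and together cover the whole dataset:

```python
# illustrate how KFold partitions a small dataset into non-overlapping folds
from numpy import arange
from sklearn.model_selection import KFold

X = arange(10).reshape(10, 1)  # a toy dataset of ten samples
cv = KFold(n_splits=5, shuffle=True, random_state=1)

test_folds = list()
for train_ix, test_ix in cv.split(X):
    # each sample appears in exactly one test fold
    test_folds.append(set(test_ix))
    print('train=%s test=%s' % (train_ix, test_ix))

# the five test folds together cover every sample exactly once
all_test = set().union(*test_folds)
```

With 10 samples and k=5, each held-back test fold contains two samples and the union of the test folds is the full dataset.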

For more on the k-fold cross-validation procedure, see the tutorial A Gentle Introduction to k-fold Cross-Validation, listed in the Further Reading section below.

The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.

First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.

The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 1,000 samples each with 20 input features, 15 of which contribute to the target variable.

The example below creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it contains 1,000 samples and 20 input variables.

The fixed seed for the pseudorandom number generator ensures that we get the same samples each time the dataset is generated.

(1000, 20) (1000,)

Next, we can evaluate a model on this dataset using k-fold cross-validation.

We will evaluate a LogisticRegression model and use the KFold class to perform the cross-validation, configured to shuffle the dataset and set k=10, a popular default.

The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.

The complete example is listed below.

# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation. The mean classification accuracy on the dataset is then reported.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an estimated classification accuracy of about 86.8 percent.

Accuracy: 0.868 (0.032)

Now that we are familiar with k-fold cross-validation, let’s look at an extension that repeats the procedure.

The estimate of model performance via k-fold cross-validation can be noisy.

This means that each time the procedure is run, a different split of the dataset into k-folds can be implemented, and in turn, the distribution of performance scores can be different, resulting in a different mean estimate of model performance.

The amount of difference in the estimated performance from one run of k-fold cross-validation to another is dependent upon the model that is being used and on the dataset itself.

A noisy estimate of model performance can be frustrating as it may not be clear which result should be used to compare and select a final model to address the problem.

One solution to reduce the noise in the estimated model performance is to increase the k-value. This will reduce the bias in the model's estimated performance, although it will increase the variance, i.e. tie the result more to the specific dataset used in the evaluation.

An alternate approach is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats. This approach is generally referred to as repeated k-fold cross-validation.

… repeated k-fold cross-validation replicates the procedure […] multiple times. For example, if 10-fold cross-validation was repeated five times, 50 different held-out sets would be used to estimate model efficacy.

— Page 70, Applied Predictive Modeling, 2013.

Importantly, each repeat of the k-fold cross-validation process must be performed on the same dataset, with each repeat using a different split of the data into folds.

Repeated k-fold cross-validation has the benefit of improving the estimate of the mean model performance at the cost of fitting and evaluating many more models.

Common numbers of repeats include 3, 5, and 10. For example, if 3 repeats of 10-fold cross-validation are used to estimate the model performance, this means that (3 * 10) or 30 different models would need to be fit and evaluated.
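This arithmetic can be confirmed directly. As a small sketch (not part of the original tutorial), scikit-learn's RepeatedKFold class reports the total number of train/test splits, i.e. folds times repeats, via its get_n_splits() method:

```python
# confirm that repeats multiply the number of model fits
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# total train/test splits = n_splits * n_repeats
print(cv.get_n_splits())  # 30
```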

**Appropriate**: for small datasets and simple models (e.g. linear).

As such, the approach is suited for small- to modestly-sized datasets and/or models that are not too computationally costly to fit and evaluate. This suggests that the approach may be appropriate for linear models and not appropriate for slow-to-fit models like deep learning neural networks.

Like k-fold cross-validation itself, repeated k-fold cross-validation is easy to parallelize, where each fold or each repeated cross-validation process can be executed on different cores or different machines.

The scikit-learn Python machine learning library provides an implementation of repeated k-fold cross-validation via the RepeatedKFold class.

The main parameters are the number of folds (*n_splits*), which is the “*k*” in k-fold cross-validation, and the number of repeats (*n_repeats*).

A good default for k is k=10.

A good default for the number of repeats depends on how noisy the estimate of model performance is on the dataset. A value of 3, 5, or 10 repeats is probably a good start. More repeats than 10 are probably not required.

...
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

The example below demonstrates repeated k-fold cross-validation of our test dataset.

# evaluate a logistic regression model using repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation with three repeats. The mean classification accuracy on the dataset is then reported.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an estimated classification accuracy of about 86.7 percent, which is lower than the single run result reported previously of 86.8 percent. This may suggest that the single run result may be optimistic and that the result from three repeats might be a better estimate of the true mean model performance.

Accuracy: 0.867 (0.031)

The expectation of repeated k-fold cross-validation is that the repeated mean would be a more reliable estimate of model performance than the result of a single k-fold cross-validation procedure.

This may mean less statistical noise.

One way this could be measured is by comparing the distributions of mean performance scores under differing numbers of repeats.

We can imagine that there is a true unknown underlying mean performance of a model on a dataset and that repeated k-fold cross-validation runs estimate this mean. We can estimate the error in the mean performance from the true unknown underlying mean performance using a statistical tool called the standard error.

For a given sample size, the standard error indicates the amount of error, or spread of error, that may be expected between the sample mean and the underlying, unknown population mean.

Standard error can be calculated as follows:

- standard_error = sample_standard_deviation / sqrt(number of observations in the sample)

We can calculate the standard error for a sample using the sem() scipy function.
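As a quick check (a sketch added here, with a hypothetical list of scores for illustration), scipy's sem() function matches the manual formula above; note that sem() uses the sample standard deviation (ddof=1) by default:

```python
# compare scipy's sem() to the manual standard error formula
from math import sqrt
from numpy import std
from scipy.stats import sem

# a hypothetical sample of accuracy scores (for illustration only)
scores = [0.86, 0.88, 0.85, 0.87, 0.89, 0.84]
# manual standard error: sample standard deviation (ddof=1) / sqrt(n)
manual = std(scores, ddof=1) / sqrt(len(scores))
print('sem=%.6f manual=%.6f' % (sem(scores), manual))
```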

Ideally, we would like to select a number of repeats that shows both minimization of the standard error and stabilizing of the mean estimated performance compared to other numbers of repeats.

The example below demonstrates this by reporting model performance with 10-fold cross-validation with 1 to 15 repeats of the procedure.

We would expect that more repeats of the procedure would result in a more accurate estimate of the mean model performance, given the law of large numbers. However, the trials are not independent, so the underlying statistical assumptions are not strictly met.

# compare the number of repeats for repeated k-fold cross-validation
from scipy.stats import sem
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# evaluate a model with a given number of repeats
def evaluate_model(X, y, repeats):
    # prepare the cross-validation procedure
    cv = RepeatedKFold(n_splits=10, n_repeats=repeats, random_state=1)
    # create model
    model = LogisticRegression()
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# configurations to test
repeats = range(1,16)
results = list()
for r in repeats:
    # evaluate using a given number of repeats
    scores = evaluate_model(X, y, r)
    # summarize
    print('>%d mean=%.4f se=%.3f' % (r, mean(scores), sem(scores)))
    # store
    results.append(scores)
# plot the results
pyplot.boxplot(results, labels=[str(r) for r in repeats], showmeans=True)
pyplot.show()

Running the example reports the mean and standard error classification accuracy using 10-fold cross-validation with different numbers of repeats.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the default of one repeat appears optimistic compared to the other results with an accuracy of about 86.80 percent compared to 86.73 percent and lower with differing numbers of repeats.

We can see that the mean seems to coalesce around a value of about 86.5 percent. We might take this as the stable estimate of model performance and, in turn, choose the number of repeats that first approximates this value, in this case 5 or 6.

Looking at the standard error, we can see that it decreases with an increase in the number of repeats and stabilizes with a value around 0.003 at around 9 or 10 repeats, although 5 repeats achieve a standard error of 0.005, half of that achieved with a single repeat.

>1 mean=0.8680 se=0.011
>2 mean=0.8675 se=0.008
>3 mean=0.8673 se=0.006
>4 mean=0.8670 se=0.006
>5 mean=0.8658 se=0.005
>6 mean=0.8655 se=0.004
>7 mean=0.8651 se=0.004
>8 mean=0.8651 se=0.004
>9 mean=0.8656 se=0.003
>10 mean=0.8658 se=0.003
>11 mean=0.8655 se=0.003
>12 mean=0.8654 se=0.003
>13 mean=0.8652 se=0.003
>14 mean=0.8651 se=0.003
>15 mean=0.8653 se=0.003

A box and whisker plot is created to summarize the distribution of scores for each number of repeats.

The orange line indicates the median of the distribution and the green triangle represents the arithmetic mean. If these values coincide, it suggests a reasonably symmetric distribution and that the mean may capture the central tendency well.

This might provide an additional heuristic for choosing an appropriate number of repeats for your test harness.

Taking this into consideration, using five repeats with this chosen test harness and algorithm appears to be a good choice.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to k-fold Cross-Validation
- How to Fix k-Fold Cross-Validation for Imbalanced Classification

- sklearn.model_selection.KFold API.
- sklearn.model_selection.RepeatedKFold API.
- sklearn.model_selection.LeaveOneOut API.
- sklearn.model_selection.cross_val_score API.

In this tutorial, you discovered repeated k-fold cross-validation for model evaluation.

Specifically, you learned:

- The mean performance reported from a single run of k-fold cross-validation may be noisy.
- Repeated k-fold cross-validation provides a way to reduce the error in the estimate of mean model performance.
- How to evaluate machine learning models using repeated k-fold cross-validation in Python.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Repeated k-Fold Cross-Validation for Model Evaluation in Python appeared first on Machine Learning Mastery.

The post How to Configure k-Fold Cross-Validation appeared first on Machine Learning Mastery.

A common value for *k* is 10, but how do we know that this configuration is appropriate for our dataset and our algorithms?

One approach is to explore the effect of different *k* values on the estimate of model performance and compare this to an ideal test condition. This can help to choose an appropriate value for *k*.

Once a *k*-value is chosen, it can be used to evaluate a suite of different algorithms on the dataset and the distribution of results can be compared to an evaluation of the same algorithms using an ideal test condition to see if they are highly correlated or not. If correlated, it confirms the chosen configuration is a robust approximation for the ideal test condition.

In this tutorial, you will discover how to configure and evaluate configurations of k-fold cross-validation.

After completing this tutorial, you will know:

- How to evaluate a machine learning algorithm using k-fold cross-validation on a dataset.
- How to perform a sensitivity analysis of k-values for k-fold cross-validation.
- How to calculate the correlation between a cross-validation test harness and an ideal test condition.

Let’s get started.

This tutorial is divided into three parts; they are:

- k-Fold Cross-Validation
- Sensitivity Analysis for k
- Correlation of Test Harness With Target

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held-back test set, whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k hold-out test sets and the mean performance is reported.

For more on the k-fold cross-validation procedure, see the tutorial A Gentle Introduction to k-fold Cross-Validation.

The k-fold cross-validation procedure can be implemented easily using the scikit-learn machine learning library.

First, let’s define a synthetic classification dataset that we can use as the basis of this tutorial.

The make_classification() function can be used to create a synthetic binary classification dataset. We will configure it to generate 100 samples each with 20 input features, 15 of which contribute to the target variable.

The example below creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it contains 100 samples and 20 input variables.

The fixed seed for the pseudorandom number generator ensures that we get the same samples each time the dataset is generated.

(100, 20) (100,)

Next, we can evaluate a model on this dataset using k-fold cross-validation.

We will evaluate a LogisticRegression model and use the KFold class to perform the cross-validation, configured to shuffle the dataset and set k=10, a popular default.

The cross_val_score() function will be used to perform the evaluation, taking the dataset and cross-validation configuration and returning a list of scores calculated for each fold.

The complete example is listed below.

# evaluate a logistic regression model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example creates the dataset, then evaluates a logistic regression model on it using 10-fold cross-validation. The mean classification accuracy on the dataset is then reported.

In this case, we can see that the model achieved an estimated classification accuracy of about 85.0 percent.

Accuracy: 0.850 (0.128)

Now that we are familiar with k-fold cross-validation, let’s look at how we might configure the procedure.

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds into which a given dataset is split.

Common values are k=3, k=5, and k=10, and by far the most popular value used in applied machine learning to evaluate models is k=10. Empirical studies found that k=10 provides a good trade-off between low computational cost and low bias in the estimate of model performance.

**How do we know what value of k to use when evaluating models on our own dataset?**

You can choose k=10, but how do you know this makes sense for your dataset?

One approach to answering this question is to perform a sensitivity analysis for different k values. That is, evaluate the performance of the same model on the same dataset with different values of k and see how they compare.

The expectation is that low values of k will result in a noisy estimate of model performance and large values of k will result in a less noisy estimate of model performance.

But noisy compared to what?

We don’t know the true performance of the model when making predictions on new/unseen data, as we don’t have access to new/unseen data. If we did, we would make use of it in the evaluation of the model.

Nevertheless, we can choose a test condition that represents an “*ideal*” or as-best-as-we-can-achieve “*ideal*” estimate of model performance.

One approach would be to train the model on all available data and estimate the performance on a separate large and representative hold-out dataset. The performance on this hold-out dataset would represent the “*true*” performance of the model and any cross-validation performances on the training dataset would represent an estimate of this score.

This is rarely possible as we often do not have enough data to hold some back and use it as a test set. Kaggle machine learning competitions are one exception to this, where we do have a hold-out test set, a sample of which is evaluated via submissions.

Instead, we can simulate this case using the leave-one-out cross-validation (LOOCV), a computationally expensive version of cross-validation where *k=N*, and *N* is the total number of examples in the training dataset. That is, each sample in the training set is given an opportunity to be used alone as the held-out test set. It is rarely used for large datasets as it is computationally expensive, although it can provide a good estimate of model performance given the available data.
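To make the k=N relationship concrete (an illustrative sketch, not part of the original tutorial), LeaveOneOut produces exactly one split per sample, each with a single-example test set:

```python
# show that LeaveOneOut is k-fold cross-validation with k equal to the dataset size
from numpy import arange
from sklearn.model_selection import LeaveOneOut

X = arange(20).reshape(20, 1)  # a toy dataset of N=20 samples
splits = list(LeaveOneOut().split(X))
# one split per sample, each holding out a single example as the test set
print('splits: %d' % len(splits))
test_sizes = [len(test_ix) for _, test_ix in splits]
```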

We can then compare the mean classification accuracy for different k values to the mean classification accuracy from LOOCV on the same dataset. The difference between the scores provides a rough proxy for how well a k value approximates the ideal model evaluation test condition.

Let’s explore how to implement a sensitivity analysis of k-fold cross-validation.

First, let’s define a function to create the dataset. This allows you to change the dataset to your own if you desire.

# create the dataset
def get_dataset(n_samples=100):
    X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    return X, y

Next, we can define a function to create the model to evaluate.

Again, this separation allows you to change the model to your own if you desire.

# retrieve the model to be evaluated
def get_model():
    model = LogisticRegression()
    return model

Next, you can define a function to evaluate the model on the dataset given a test condition. The test condition could be an instance of the KFold configured with a given k-value, or it could be an instance of LeaveOneOut that represents our ideal test condition.

The function returns the mean classification accuracy as well as the min and max accuracy from the folds. We can use the min and max to summarize the distribution of scores.

# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset
    X, y = get_dataset()
    # get the model
    model = get_model()
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()

Next, we can calculate the model performance using the LOOCV procedure.

...
# calculate the ideal test condition
ideal, _, _ = evaluate_model(LeaveOneOut())
print('Ideal: %.3f' % ideal)

We can then define the k values to evaluate. In this case, we will test values between 2 and 30.

...
# define folds to test
folds = range(2,31)

We can then evaluate each value in turn and store the results as we go.

...
# record mean and min/max of each set of results
means, mins, maxs = list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
    # store mean accuracy
    means.append(k_mean)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)

Finally, we can plot the results for interpretation.

...
# line plot of k mean values with min/max error bars
pyplot.errorbar(folds, means, yerr=[mins, maxs], fmt='o')
# plot the ideal case in a separate color
pyplot.plot(folds, [ideal for _ in range(len(folds))], color='r')
# show the plot
pyplot.show()

Tying this together, the complete example is listed below.

# sensitivity analysis of k in k-fold cross-validation
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# create the dataset
def get_dataset(n_samples=100):
    X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    return X, y

# retrieve the model to be evaluated
def get_model():
    model = LogisticRegression()
    return model

# evaluate the model using a given test condition
def evaluate_model(cv):
    # get the dataset
    X, y = get_dataset()
    # get the model
    model = get_model()
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores), scores.min(), scores.max()

# calculate the ideal test condition
ideal, _, _ = evaluate_model(LeaveOneOut())
print('Ideal: %.3f' % ideal)
# define folds to test
folds = range(2,31)
# record mean and min/max of each set of results
means, mins, maxs = list(),list(),list()
# evaluate each k value
for k in folds:
    # define the test condition
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    # evaluate k value
    k_mean, k_min, k_max = evaluate_model(cv)
    # report performance
    print('> folds=%d, accuracy=%.3f (%.3f,%.3f)' % (k, k_mean, k_min, k_max))
    # store mean accuracy
    means.append(k_mean)
    # store min and max relative to the mean
    mins.append(k_mean - k_min)
    maxs.append(k_max - k_mean)
# line plot of k mean values with min/max error bars
pyplot.errorbar(folds, means, yerr=[mins, maxs], fmt='o')
# plot the ideal case in a separate color
pyplot.plot(folds, [ideal for _ in range(len(folds))], color='r')
# show the plot
pyplot.show()

Running the example first reports the LOOCV, then the mean, min, and max accuracy for each k value that was evaluated.

In this case, we can see that the LOOCV result was about 84 percent, slightly lower than the k=10 result of 85 percent.

Ideal: 0.840
> folds=2, accuracy=0.740 (0.700,0.780)
> folds=3, accuracy=0.749 (0.697,0.824)
> folds=4, accuracy=0.790 (0.640,0.920)
> folds=5, accuracy=0.810 (0.600,0.950)
> folds=6, accuracy=0.820 (0.688,0.941)
> folds=7, accuracy=0.799 (0.571,1.000)
> folds=8, accuracy=0.811 (0.385,0.923)
> folds=9, accuracy=0.829 (0.636,1.000)
> folds=10, accuracy=0.850 (0.600,1.000)
> folds=11, accuracy=0.829 (0.667,1.000)
> folds=12, accuracy=0.785 (0.250,1.000)
> folds=13, accuracy=0.839 (0.571,1.000)
> folds=14, accuracy=0.807 (0.429,1.000)
> folds=15, accuracy=0.821 (0.571,1.000)
> folds=16, accuracy=0.827 (0.500,1.000)
> folds=17, accuracy=0.816 (0.600,1.000)
> folds=18, accuracy=0.831 (0.600,1.000)
> folds=19, accuracy=0.826 (0.600,1.000)
> folds=20, accuracy=0.830 (0.600,1.000)
> folds=21, accuracy=0.814 (0.500,1.000)
> folds=22, accuracy=0.820 (0.500,1.000)
> folds=23, accuracy=0.802 (0.250,1.000)
> folds=24, accuracy=0.804 (0.250,1.000)
> folds=25, accuracy=0.810 (0.250,1.000)
> folds=26, accuracy=0.804 (0.250,1.000)
> folds=27, accuracy=0.818 (0.250,1.000)
> folds=28, accuracy=0.821 (0.250,1.000)
> folds=29, accuracy=0.822 (0.250,1.000)
> folds=30, accuracy=0.822 (0.333,1.000)

A line plot is created comparing the mean accuracy scores to the LOOCV result with the min and max of each result distribution indicated using error bars.

The results suggest that for this model on this dataset, most k values underestimate the performance of the model compared to the ideal case. The results suggest that k=10 alone may be slightly optimistic and that k=13 might provide a more accurate estimate.

This provides a template that you can use to perform a sensitivity analysis of k values of your chosen model on your dataset against a given ideal test condition.

Once a test harness is chosen, another consideration is how well it matches the ideal test condition across different algorithms.

It is possible that for some algorithms and some configurations, the k-fold cross-validation will be a better approximation of the ideal test condition compared to other algorithms and algorithm configurations.

We can evaluate and report on this relationship explicitly. This can be achieved by calculating how well the k-fold cross-validation results across a range of algorithms match the evaluation of the same algorithms on the ideal test condition.

The Pearson’s correlation coefficient can be calculated between the two groups of scores to measure how closely they match. That is, do they change together in the same ways: when one algorithm looks better than another via k-fold cross-validation, does this hold on the ideal test condition?

We expect to see a strong positive correlation between the scores, such as 0.5 or higher. A low correlation suggests the need to change the k-fold cross-validation test harness to better match the ideal test condition.
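As a hedged sketch of this check, using two made-up lists of per-algorithm mean accuracies (the values below are illustrative, not results from this tutorial), the correlation can be computed directly with scipy's pearsonr() function:

```python
# sketch: correlate mean accuracies from two test harnesses (illustrative values)
from scipy.stats import pearsonr

# hypothetical mean accuracies for five algorithms under each harness
cv_scores = [0.85, 0.83, 0.79, 0.63, 0.88]     # e.g. 10-fold cross-validation
ideal_scores = [0.84, 0.83, 0.76, 0.69, 0.90]  # e.g. LOOCV

# pearsonr returns the coefficient and a p-value; we only need the coefficient
corr, _ = pearsonr(cv_scores, ideal_scores)
# a value well above 0.5 suggests the harness tracks the ideal condition
print('Correlation: %.3f' % corr)
```

A high coefficient here means the two harnesses rank the algorithms in broadly the same way, which is the property we care about when selecting among models.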

First, we can define a function that will create a list of standard machine learning models to evaluate via each test harness.

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(LogisticRegression())
    models.append(RidgeClassifier())
    models.append(SGDClassifier())
    models.append(PassiveAggressiveClassifier())
    models.append(KNeighborsClassifier())
    models.append(DecisionTreeClassifier())
    models.append(ExtraTreeClassifier())
    models.append(LinearSVC())
    models.append(SVC())
    models.append(GaussianNB())
    models.append(AdaBoostClassifier())
    models.append(BaggingClassifier())
    models.append(RandomForestClassifier())
    models.append(ExtraTreesClassifier())
    models.append(GaussianProcessClassifier())
    models.append(GradientBoostingClassifier())
    models.append(LinearDiscriminantAnalysis())
    models.append(QuadraticDiscriminantAnalysis())
    return models

We will use k=10 for the chosen test harness.

We can then enumerate each model and evaluate it using 10-fold cross-validation and our ideal test condition, in this case, LOOCV.

...
# define test conditions
ideal_cv = LeaveOneOut()
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# get the list of models to consider
models = get_models()
# collect results
ideal_results, cv_results = list(), list()
# evaluate each model
for model in models:
    # evaluate model using each test condition
    cv_mean = evaluate_model(cv, model)
    ideal_mean = evaluate_model(ideal_cv, model)
    # check for invalid results
    if isnan(cv_mean) or isnan(ideal_mean):
        continue
    # store results
    cv_results.append(cv_mean)
    ideal_results.append(ideal_mean)
    # summarize progress
    print('>%s: ideal=%.3f, cv=%.3f' % (type(model).__name__, ideal_mean, cv_mean))

We can then calculate the correlation between the mean classification accuracy from the 10-fold cross-validation test harness and the LOOCV test harness.

...
# calculate the correlation between each test condition
corr, _ = pearsonr(cv_results, ideal_results)
print('Correlation: %.3f' % corr)

Finally, we can create a scatter plot of the two sets of results and draw a line of best fit to visually see how well they change together.

...
# scatter plot of results
pyplot.scatter(cv_results, ideal_results)
# plot the line of best fit
coeff, bias = polyfit(cv_results, ideal_results, 1)
line = coeff * asarray(cv_results) + bias
pyplot.plot(cv_results, line, color='r')
# show the plot
pyplot.show()

Tying all of this together, the complete example is listed below.

# correlation between test harness and ideal test condition
from numpy import mean
from numpy import isnan
from numpy import asarray
from numpy import polyfit
from scipy.stats import pearsonr
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# create the dataset
def get_dataset(n_samples=100):
    X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    return X, y

# get a list of models to evaluate
def get_models():
    models = list()
    models.append(LogisticRegression())
    models.append(RidgeClassifier())
    models.append(SGDClassifier())
    models.append(PassiveAggressiveClassifier())
    models.append(KNeighborsClassifier())
    models.append(DecisionTreeClassifier())
    models.append(ExtraTreeClassifier())
    models.append(LinearSVC())
    models.append(SVC())
    models.append(GaussianNB())
    models.append(AdaBoostClassifier())
    models.append(BaggingClassifier())
    models.append(RandomForestClassifier())
    models.append(ExtraTreesClassifier())
    models.append(GaussianProcessClassifier())
    models.append(GradientBoostingClassifier())
    models.append(LinearDiscriminantAnalysis())
    models.append(QuadraticDiscriminantAnalysis())
    return models

# evaluate the model using a given test condition
def evaluate_model(cv, model):
    # get the dataset
    X, y = get_dataset()
    # evaluate the model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # return scores
    return mean(scores)

# define test conditions
ideal_cv = LeaveOneOut()
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# get the list of models to consider
models = get_models()
# collect results
ideal_results, cv_results = list(), list()
# evaluate each model
for model in models:
    # evaluate model using each test condition
    cv_mean = evaluate_model(cv, model)
    ideal_mean = evaluate_model(ideal_cv, model)
    # check for invalid results
    if isnan(cv_mean) or isnan(ideal_mean):
        continue
    # store results
    cv_results.append(cv_mean)
    ideal_results.append(ideal_mean)
    # summarize progress
    print('>%s: ideal=%.3f, cv=%.3f' % (type(model).__name__, ideal_mean, cv_mean))
# calculate the correlation between each test condition
corr, _ = pearsonr(cv_results, ideal_results)
print('Correlation: %.3f' % corr)
# scatter plot of results
pyplot.scatter(cv_results, ideal_results)
# plot the line of best fit
coeff, bias = polyfit(cv_results, ideal_results, 1)
line = coeff * asarray(cv_results) + bias
pyplot.plot(cv_results, line, color='r')
# label the plot
pyplot.title('10-fold CV vs LOOCV Mean Accuracy')
pyplot.xlabel('Mean Accuracy (10-fold CV)')
pyplot.ylabel('Mean Accuracy (LOOCV)')
# show the plot
pyplot.show()

Running the example reports the mean classification accuracy for each algorithm calculated via each test harness.

You may see some warnings that you can safely ignore, such as:

Variables are collinear

We can see that for some algorithms, the test harness over-estimates the accuracy compared to LOOCV, and in other cases, it under-estimates the accuracy. This is to be expected.

At the end of the run, we can see that the correlation between the two sets of results is reported. In this case, we can see that a correlation of 0.746 is reported, which is a strong positive correlation. The results suggest that 10-fold cross-validation does provide a good approximation for the LOOCV test harness on this dataset as calculated with 18 popular machine learning algorithms.

>LogisticRegression: ideal=0.840, cv=0.850
>RidgeClassifier: ideal=0.830, cv=0.830
>SGDClassifier: ideal=0.730, cv=0.790
>PassiveAggressiveClassifier: ideal=0.780, cv=0.760
>KNeighborsClassifier: ideal=0.760, cv=0.770
>DecisionTreeClassifier: ideal=0.690, cv=0.630
>ExtraTreeClassifier: ideal=0.710, cv=0.620
>LinearSVC: ideal=0.850, cv=0.830
>SVC: ideal=0.900, cv=0.880
>GaussianNB: ideal=0.730, cv=0.720
>AdaBoostClassifier: ideal=0.740, cv=0.740
>BaggingClassifier: ideal=0.770, cv=0.740
>RandomForestClassifier: ideal=0.810, cv=0.790
>ExtraTreesClassifier: ideal=0.820, cv=0.820
>GaussianProcessClassifier: ideal=0.790, cv=0.760
>GradientBoostingClassifier: ideal=0.820, cv=0.820
>LinearDiscriminantAnalysis: ideal=0.830, cv=0.830
>QuadraticDiscriminantAnalysis: ideal=0.610, cv=0.760
Correlation: 0.746

Finally, a scatter plot is created comparing the distribution of mean accuracy scores for the test harness (x-axis) vs. the accuracy scores via LOOCV (y-axis).

A red line of best fit is drawn through the results showing the strong linear correlation.

This provides a harness for comparing your chosen test harness to an ideal test condition on your own dataset.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to k-fold Cross-Validation
- How to Fix k-Fold Cross-Validation for Imbalanced Classification

- sklearn.model_selection.KFold API.
- sklearn.model_selection.LeaveOneOut API.
- sklearn.model_selection.cross_val_score API.

In this tutorial, you discovered how to configure and evaluate configurations of k-fold cross-validation.

Specifically, you learned:

- How to evaluate a machine learning algorithm using k-fold cross-validation on a dataset.
- How to perform a sensitivity analysis of k-values for k-fold cross-validation.
- How to calculate the correlation between a cross-validation test harness and an ideal test condition.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Configure k-Fold Cross-Validation appeared first on Machine Learning Mastery.

The post Nested Cross-Validation for Machine Learning with Python appeared first on Machine Learning Mastery.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called **double cross-validation** or **nested cross-validation** and is the preferred way to evaluate and compare tuned machine learning models.

In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.

After completing this tutorial, you will know:

- Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
- Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
- How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Let’s get started.

This tutorial is divided into three parts; they are:

- Combined Hyperparameter Tuning and Model Selection
- What Is Nested Cross-Validation
- Nested Cross-Validation With Scikit-Learn

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.
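The fold mechanics described above can be sketched with scikit-learn's KFold class (a minimal illustration on a toy dataset, not part of this tutorial's own listings): each row of the dataset is held out exactly once across the k splits.

```python
# sketch: each example appears in exactly one held-out test fold across k splits
from numpy import arange
from sklearn.model_selection import KFold

X = arange(10).reshape(10, 1)  # ten toy examples
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# collect the row indexes used as test data in each of the 5 splits
test_rows = []
for train_ix, test_ix in cv.split(X):
    test_rows.extend(test_ix)

# the union of the test folds covers every row exactly once
print(sorted(test_rows))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This non-overlap of test folds is what makes the k per-fold scores independent assessments of the model on unseen data.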

For more on the k-fold cross-validation procedure, see the tutorial:

The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training. It is less biased than some other techniques, such as a single train-test split, for small- to modestly-sized datasets. Common values for k are k=3, k=5, and k=10.

Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics for configuring the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that perform well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.

This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.

The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.

Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset. Specifically, an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.

A procedure is required that allows both well-performing hyperparameters to be selected for the dataset and a model to be selected from among a collection of well-configured models.

One approach to this problem is called **nested cross-validation**.

Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.

In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.

As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the *k*-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “*double cross-validation*.”

Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “*train dataset*” and the held-out fold as the “*test dataset*.”

Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into *k* folds, not the original dataset.

This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.

In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.

If *n * k* models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to *k * n * k* as the procedure is then performed *k* more times for each fold in the outer loop of nested cross-validation.

To make this concrete, you might use *k=5* for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate *5 * 100* or 500 models. Nested cross-validation with *k=10* folds in the outer loop would fit and evaluate 5,000 models. A 10x increase in this case.
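The arithmetic above can be checked in a few lines (using the same figures as the text: 100 hyperparameter combinations, k=5 for the search, and k=10 outer folds):

```python
# counting model fits: traditional vs nested hyperparameter search
n_configs = 100   # hyperparameter combinations tested
k_inner = 5       # folds used to evaluate each combination
k_outer = 10      # outer folds in nested cross-validation

# a traditional search runs one full grid over the whole dataset
traditional = n_configs * k_inner
# nested cross-validation repeats that full search once per outer fold
nested = k_outer * traditional

print(traditional, nested)  # → 500 5000
```

This is why the text describes a 10x increase: the entire inner search is repeated for every fold of the outer loop.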

The k value for the inner loop and the outer loop should be set as you would set the *k*-value for a single *k*-fold cross-validation procedure.

You must choose a *k*-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against an unbiased estimate of model performance.

It is common to use *k=10* for the outer loop and a smaller value of k for the inner loop, such as *k=3* or *k=5*.

The final model is configured and fit using the same procedure applied internally within the outer loop, as follows:

- An algorithm is selected based on its performance on the outer loop of nested cross-validation.
- Then the inner-procedure is applied to the entire dataset.
- The hyperparameters found during this final search are then used to configure a final model.
- The final model is fit on the entire dataset.

This model can then be used to make predictions on new data. We know how well it will perform on average based on the score provided during the final model tuning procedure.
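Those final-model steps can be sketched as follows: run the inner search procedure once over the entire dataset and keep the refit best estimator. This is a hedged illustration, not the tutorial's full worked example; the dataset size and the two-value search space here are deliberately small and illustrative.

```python
# sketch: fit the final model by applying the inner search to the entire dataset
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# toy dataset standing in for "the entire dataset"
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# inner search configuration (illustrative, smaller grid than the tutorial uses)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)

# execute the search on all data; refit=True refits the best config on all rows
result = search.fit(X, y)
final_model = result.best_estimator_
print(result.best_params_)
```

The `final_model` can then be used to call `predict()` on new rows of data.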

Now that we are familiar with nested-cross validation, let’s review how we can implement it in practice.

The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.

The class is configured with the number of folds (splits), then the *split()* function is called, passing in the dataset. The results of the *split()* function are enumerated to give the row indexes for the train and test sets for each fold.

For example:

...
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# perform cross-validation procedure
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit and evaluate a model
    ...

This class can be used to perform the outer-loop of the nested-cross validation procedure.

The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The procedure is configured by creating the class and specifying the model, dataset, hyperparameters to search, and cross-validation procedure.

For example:

...
# configure the cross-validation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define search space
space = dict()
...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)

These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.

We can tie these elements together and implement the nested cross-validation procedure.

Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “*refit*” argument to True, then retrieving the model via the “*best_estimator_*” attribute on the search result.

...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_

This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.

...
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.

We will keep things simple and tune just two hyperparameters with three values each, e.g. (*3 * 3*) 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (*10 * 9 * 3*) or 270 model evaluations.

The complete example is listed below.

# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    # define the model
    model = RandomForestClassifier(random_state=1)
    # define search space
    space = dict()
    space['n_estimators'] = [10, 100, 500]
    space['max_features'] = [2, 4, 6]
    # define search
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)
    # evaluate the model
    acc = accuracy_score(y_test, yhat)
    # store the result
    outer_results.append(acc)
    # report progress
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))

Running the example evaluates random forest using nested-cross validation on a synthetic classification dataset.

You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.

Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.

This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.

A final mean classification accuracy is then reported.

>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}
>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}
>acc=0.900, est=0.937, cfg={'max_features': 4, 'n_estimators': 500}
Accuracy: 0.927 (0.019)

A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. It can be applied to the configured *GridSearchCV* directly, which will automatically refit the best performing model on each outer training set and evaluate it on the held-out test set from the outer loop.

This greatly reduces the amount of code required to perform the nested cross-validation.

The complete example is listed below.

# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.

Accuracy: 0.927 (0.019)

This section provides more resources on the topic if you are looking to go deeper.

- Cross-validatory choice and assessment of statistical predictions, 1974.
- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
- Cross-validation pitfalls when selecting and assessing regression and classification models, 2014.
- Nested cross-validation when selecting classifiers is overzealous for most practical applications, 2018.

- Cross-validation: evaluating estimator performance, scikit-learn.
- Nested versus non-nested cross-validation, scikit-learn example.
- sklearn.model_selection.KFold API.
- sklearn.model_selection.GridSearchCV API.
- sklearn.ensemble.RandomForestClassifier API.
- sklearn.model_selection.cross_val_score API.

In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.

Specifically, you learned:

- Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
- Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
- How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post LOOCV for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

It is a computationally expensive procedure to perform, although it results in a reliable and unbiased estimate of model performance. Although it is simple to use and requires no configuration to specify, there are times when the procedure should not be used, such as when you have a very large dataset or a computationally expensive model to evaluate.

In this tutorial, you will discover how to evaluate machine learning models using leave-one-out cross-validation.

After completing this tutorial, you will know:

- The leave-one-out cross-validation procedure is appropriate when you have a small dataset or when an accurate estimate of model performance is more important than the computational cost of the method.
- How to use the scikit-learn machine learning library to perform the leave-one-out cross-validation procedure.
- How to evaluate machine learning algorithms for classification and regression using leave-one-out cross-validation.

Let’s get started.

This tutorial is divided into three parts; they are:

- LOOCV Model Evaluation
- LOOCV Procedure in Scikit-Learn
- LOOCV to Evaluate Machine Learning Models
- LOOCV for Classification
- LOOCV for Regression

Cross-validation, or k-fold cross-validation, is a procedure used to estimate the performance of a machine learning algorithm when making predictions on data not used during the training of the model.

The cross-validation has a single hyperparameter “*k*” that controls the number of subsets that a dataset is split into. Once split, each subset is given the opportunity to be used as a test set while all other subsets together are used as a training dataset.

This means that k-fold cross-validation involves fitting and evaluating *k* models. This, in turn, provides k estimates of a model’s performance on the dataset, which can be reported using summary statistics such as the mean and standard deviation. This score can then be used to compare and ultimately select a model and configuration to use as the “*final model*” for a dataset.

Typical values for k are k=3, k=5, and k=10, with 10 representing the most common value. This is because, given extensive testing, 10-fold cross-validation provides a good balance of low computational cost and low bias in the estimate of model performance as compared to other k values and a single train-test split.
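As a quick sketch of the mechanics, scikit-learn’s KFold class can enumerate the k train/test splits directly (the six-row toy array and the seed below are illustrative choices, not part of the tutorial’s examples):

```python
# enumerate the train/test splits produced by 3-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold

# a toy dataset of six rows
X = array([[10], [20], [30], [40], [50], [60]])
# k=3: each fold of two rows is used as the test set exactly once
cv = KFold(n_splits=3, shuffle=True, random_state=1)
for fold, (train_ix, test_ix) in enumerate(cv.split(X)):
    print('Fold %d: train=%s, test=%s' % (fold, train_ix, test_ix))
```

Across the three folds, every row appears in exactly one test set, which is what gives the procedure its k performance estimates.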

For more on k-fold cross-validation, see the tutorial:

Leave-one-out cross-validation, or LOOCV, is a configuration of k-fold cross-validation where *k* is set to the number of examples in the dataset.

LOOCV is an extreme version of k-fold cross-validation that has the maximum computational cost. It requires one model to be created and evaluated for each example in the training dataset.
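This equivalence can be checked with a minimal sketch (the five-row array is illustrative): LeaveOneOut yields one split per example, the same splits as KFold with n_splits set to the dataset size:

```python
# confirm that LOOCV is k-fold cross-validation with k equal to the dataset size
from numpy import array
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut

X = array([[1], [2], [3], [4], [5]])
loo_splits = list(LeaveOneOut().split(X))
kfold_splits = list(KFold(n_splits=len(X)).split(X))
# one split per row, each with a single-row test set
print('LOOCV splits: %d' % len(loo_splits))
for (_, loo_test), (_, kf_test) in zip(loo_splits, kfold_splits):
    assert list(loo_test) == list(kf_test)
```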

The benefit of fitting and evaluating so many models is a more robust estimate of model performance, as each row of data is given an opportunity to represent the entirety of the test dataset.

Given the computational cost, LOOCV is not appropriate for very large datasets such as more than tens or hundreds of thousands of examples, or for models that are costly to fit, such as neural networks.

**Don’t Use LOOCV**: Large datasets or costly models to fit.

Given the improved estimate of model performance, LOOCV is appropriate when an accurate estimate of model performance is critical. This is particularly the case when the dataset is small, such as fewer than a few thousand examples, where holding data out of training can lead to model overfitting and biased estimates of model performance.

Further, given that no sampling of the training dataset is used, this estimation procedure is deterministic, unlike train-test splits and other k-fold cross-validation configurations that provide a stochastic estimate of model performance.

**Use LOOCV**: Small datasets or when estimated model performance is critical.

Once models have been evaluated using LOOCV and a final model and configuration chosen, a final model is then fit on all available data and used to make predictions on new data.

Now that we are familiar with the LOOCV procedure, let’s look at how we can use the method in Python.

The scikit-learn Python machine learning library provides an implementation of the LOOCV via the LeaveOneOut class.

The method has no configuration; therefore, no arguments are provided when creating an instance of the class.

...
# create loocv procedure
cv = LeaveOneOut()

Once created, the *split()* function can be called and provided the dataset to enumerate.

Each iteration will return the row indices that can be used for the train and test sets from the provided dataset.

...
for train_ix, test_ix in cv.split(X):
    ...

These indices can be used on the input (*X*) and output (*y*) columns of the dataset array to split the dataset.

...
# split data
X_train, X_test = X[train_ix, :], X[test_ix, :]
y_train, y_test = y[train_ix], y[test_ix]

The training set can be used to fit a model and the test set can be used to evaluate it by first making a prediction and calculating a performance metric on the predicted values versus the expected values.

...
# fit model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# evaluate model
yhat = model.predict(X_test)

Scores can be saved from each evaluation and a final mean estimate of model performance can be presented.

We can tie this together and demonstrate how to use LOOCV to evaluate a RandomForestClassifier model for a synthetic binary classification dataset created with the make_blobs() function.

The complete example is listed below.

# loocv to manually evaluate the performance of a random forest classifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create loocv procedure
cv = LeaveOneOut()
# enumerate splits
y_true, y_pred = list(), list()
for train_ix, test_ix in cv.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit model
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train, y_train)
    # evaluate model
    yhat = model.predict(X_test)
    # store
    y_true.append(y_test[0])
    y_pred.append(yhat[0])
# calculate accuracy
acc = accuracy_score(y_true, y_pred)
print('Accuracy: %.3f' % acc)

Running the example manually estimates the performance of the random forest classifier on the synthetic dataset.

Given that the dataset has 100 examples, it means that 100 train/test splits of the dataset were created, with each single row of the dataset given an opportunity to be used as the test set. Similarly, 100 models are created and evaluated.

The classification accuracy across all predictions is then reported, in this case as 99 percent.

Accuracy: 0.990

A downside of enumerating the folds manually is that it is slow and involves a lot of code that could introduce bugs.

An alternative to evaluating a model using LOOCV is to use the cross_val_score() function.

This function takes the model, the dataset, and the instantiated LOOCV object set via the “*cv*” argument. A sample of accuracy scores is then returned that can be summarized by calculating the mean and standard deviation.

We can also set the “*n_jobs*” argument to -1 to use all CPU cores, greatly decreasing the computational cost in fitting and evaluating so many models.

The example below demonstrates evaluating the *RandomForestClassifier* using LOOCV on the same synthetic dataset using the *cross_val_score()* function.

# loocv to automatically evaluate the performance of a random forest classifier
from numpy import mean
from numpy import std
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_blobs(n_samples=100, random_state=1)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestClassifier(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example automatically estimates the performance of the random forest classifier on the synthetic dataset.

The mean classification accuracy across all folds matches our manual estimate previously.

Accuracy: 0.990 (0.099)

Now that we are familiar with how to use the LeaveOneOut class, let’s look at how we can use it to evaluate a machine learning model on real datasets.

In this section, we will explore using the LOOCV procedure to evaluate machine learning models on standard classification and regression predictive modeling datasets.

We will demonstrate how to use LOOCV to evaluate a random forest algorithm on the sonar dataset.

The sonar dataset is a standard machine learning dataset comprising 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We can now evaluate a model using LOOCV.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we define the LOOCV procedure.

...
# create loocv procedure
cv = LeaveOneOut()

We can then define the model to evaluate.

...
# create model
model = RandomForestClassifier(random_state=1)

Then use the *cross_val_score()* function to enumerate the folds, fit models, then make and evaluate predictions. We can then report the mean and standard deviation of model performance.

...
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example is listed below.

# loocv evaluate random forest on the sonar dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestClassifier(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The model is then evaluated using LOOCV and the estimated performance when making predictions on new data has an accuracy of about 82.2 percent.

(208, 60) (208,)
Accuracy: 0.822 (0.382)

We will demonstrate how to use LOOCV to evaluate a random forest algorithm on the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)

Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 variables in total).

(506, 14)

We can now evaluate a model using LOOCV.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we define the LOOCV procedure.

...
# create loocv procedure
cv = LeaveOneOut()

We can then define the model to evaluate.

...
# create model
model = RandomForestRegressor(random_state=1)

Then use the *cross_val_score()* function to enumerate the folds, fit models, then make and evaluate predictions. We can then report the mean and standard deviation of model performance.

In this case, we use the mean absolute error (MAE) performance metric appropriate for regression.

...
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example is listed below.

# loocv evaluate random forest on the housing dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# create loocv procedure
cv = LeaveOneOut()
# create model
model = RandomForestRegressor(random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force positive
scores = absolute(scores)
# report performance
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The model is evaluated using LOOCV and the performance of the model when making predictions on new data is a mean absolute error of about 2.180 (thousands of dollars).

(506, 13) (506,)
MAE: 2.180 (2.346)

This section provides more resources on the topic if you are looking to go deeper.

- Cross-validation: evaluating estimator performance, scikit-learn.
- sklearn.model_selection.LeaveOneOut API.
- sklearn.model_selection.cross_val_score API.

In this tutorial, you discovered how to evaluate machine learning models using leave-one-out cross-validation.

Specifically, you learned:

- The leave-one-out cross-validation procedure is appropriate when you have a small dataset or when an accurate estimate of model performance is more important than the computational cost of the method.
- How to use the scikit-learn machine learning library to perform the leave-one-out cross-validation procedure.
- How to evaluate machine learning algorithms for classification and regression using leave-one-out cross-validation.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post LOOCV for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

]]>The post Train-Test Split for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

]]>It is a fast and easy procedure to perform, the results of which allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although the procedure is simple to use and interpret, there are times when it should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced.

In this tutorial, you will discover how to evaluate machine learning models using the train-test split.

After completing this tutorial, you will know:

- The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
- How to use the scikit-learn machine learning library to perform the train-test split procedure.
- How to evaluate machine learning algorithms for classification and regression using the train-test split.

Let’s get started.

This tutorial is divided into three parts; they are:

- Train-Test Split Evaluation
- When to Use the Train-Test Split
- How to Configure the Train-Test Split

- Train-Test Split Procedure in Scikit-Learn
- Repeatable Train-Test Splits
- Stratified Train-Test Splits

- Train-Test Split to Evaluate Machine Learning Models
- Train-Test Split for Classification
- Train-Test Split for Regression

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

**Train Dataset**: Used to fit the machine learning model.
**Test Dataset**: Used to evaluate the fit machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.

The train-test procedure is appropriate when there is a sufficiently large dataset available.

The idea of “sufficiently large” is specific to each predictive modeling problem. It means that there is enough data to split the dataset into train and test datasets, and that each of the train and test datasets is a suitable representation of the problem domain. This requires that the original dataset is also a suitable representation of the problem domain.

A suitable representation of the problem domain means that there are enough records to cover all common cases and most uncommon cases in the domain. This might mean combinations of input variables observed in practice. It might require thousands, hundreds of thousands, or millions of examples.

Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).

If you have insufficient data, then a suitable alternate model evaluation procedure would be the k-fold cross-validation procedure.
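As a sketch of that alternative (the synthetic dataset, the LogisticRegression model, and the seed below are illustrative choices, not from the tutorial), 10-fold cross-validation makes fuller use of a small dataset than a single split, because every row contributes to both training and testing:

```python
# evaluate a model on a small dataset with 10-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# a small synthetic dataset where a single train-test split would waste data
X, y = make_classification(n_samples=100, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
# every row is used for training in nine folds and for testing in one
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```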

In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.

Some models are very costly to train, and in that case, repeated evaluation used in other procedures is intractable. An example might be deep neural network models. In this case, the train-test procedure is commonly used.

Alternately, a project may have an efficient model and a vast dataset, but may require an estimate of model performance quickly. Again, the train-test split procedure is used in this situation.

Samples from the original training dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a percentage between 0 and 1 for either the train or test datasets. For example, a training set size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

- Computational cost in training the model.
- Computational cost in evaluating the model.
- Training set representativeness.
- Test set representativeness.

Nevertheless, common split percentages include:

- Train: 80%, Test: 20%
- Train: 67%, Test: 33%
- Train: 50%, Test: 50%
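For a hypothetical dataset of 1,000 rows, these percentages translate into the following row counts (a small sketch of the arithmetic only; the integer truncation here approximates, but is not identical to, how the function rounds fractional sizes):

```python
# train/test row counts for common split percentages on a 1,000-row dataset
n_rows = 1000
for train_pct in (80, 67, 50):
    n_train = n_rows * train_pct // 100
    n_test = n_rows - n_train
    print('Train: %d%% (%d rows), Test: %d%% (%d rows)'
          % (train_pct, n_train, 100 - train_pct, n_test))
```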

Now that we are familiar with the train-test split model evaluation procedure, let’s look at how we can use this procedure in Python.

The scikit-learn Python machine learning library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

The function takes a loaded dataset as input and returns the dataset split into two subsets.

...
# split into train test sets
train, test = train_test_split(dataset, ...)

Ideally, you can split your original dataset into input (*X*) and output (*y*) columns, then call the function passing both arrays and have them split appropriately into train and test subsets.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)

The size of the split can be specified via the “*test_size*” argument that takes a number of rows (integer) or a percentage (float) of the size of the dataset between 0 and 1.

The latter is the most common, with values used such as 0.33 where 33 percent of the dataset will be allocated to the test set and 67 percent will be allocated to the training set.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can demonstrate this using a synthetic classification dataset with 1,000 examples.

The complete example is listed below.

# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=1000)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset into train and test sets, then prints the size of the new dataset.

We can see that 670 examples (67 percent) were allocated to the training set and 330 examples (33 percent) were allocated to the test set, as we specified.

(670, 2) (330, 2) (670,) (330,)

Alternatively, the dataset can be split by specifying the “*train_size*” argument that can be either a number of rows (integer) or a percentage of the original dataset between 0 and 1, such as 0.67 for 67 percent.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)

Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset. If you are new to pseudo-random number generators, see the tutorial:

This can be achieved by setting the “*random_state*” to an integer value. Any value will do; it is not a tunable hyperparameter.

...
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

The example below demonstrates this and shows that two separate splits of the data result in the same result.

# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])

Running the example splits the dataset and prints the first five rows of the training dataset.

The dataset is split again and the first five rows of the training dataset are printed showing identical values, confirming that when we fix the seed for the pseudorandom number generator, we get an identical split of the original dataset.

[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]
[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]

One final consideration is for classification problems only.

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the “*stratify*” argument to the y component of the original dataset. This will be used by the *train_test_split()* function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “*y*” array.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

We can demonstrate this with an example of a classification dataset with 94 examples in one class and six examples in a second class.

First, we can split the dataset into train and test sets without the “*stratify*” argument. The complete example is listed below.

# split imbalanced dataset into train and test sets without stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
print(Counter(y_train))
print(Counter(y_test))

Running the example first reports the composition of the dataset by class label, showing the expected 94 percent vs. 6 percent.

Then the dataset is split and the composition of the train and test sets is reported. We can see that the train set has 45/5 examples while the test set has 49/1 examples. The compositions of the train and test sets differ, which is not desirable.

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})

Next, we can stratify the train-test split and compare the results.

# split imbalanced dataset into train and test sets with stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))

Given that we have used a 50 percent split for the train and test sets, we would expect both sets to contain 47/3 examples of the majority/minority classes respectively.

Running the example, we can see that in this case, the stratified version of the train-test split has created both the train and test datasets with 47/3 examples in the train/test sets as we expected.

Counter({0: 94, 1: 6})
Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})

Now that we are familiar with the *train_test_split()* function, let’s look at how we can use it to evaluate a machine learning model.

In this section, we will explore using the train-test split procedure to evaluate machine learning models on standard classification and regression predictive modeling datasets.

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the sonar dataset.

The sonar dataset is a standard machine learning dataset composed of 208 rows of data with 60 numerical input variables and a target variable with two class values, e.g. binary classification.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

Then use the fit model to make predictions and evaluate the predictions using the classification accuracy performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the sonar dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The dataset is split into train and test sets and we can see that there are 139 rows for training and 69 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data has an accuracy of about 78.3 percent.

(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the housing dataset.

The housing dataset is a standard machine learning dataset composed of 506 rows of data with 13 numerical input variables and a numerical target variable.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)

Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 variables in total).

(506, 14)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)

We can then use the fit model to make predictions and evaluate them using the mean absolute error (MAE) performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
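As a sanity check on the metric itself, MAE is just the average of the absolute differences between the predictions and the true values, reported in the units of the target variable. A small sketch with hypothetical numbers confirms the hand calculation matches scikit-learn's mean_absolute_error:

```python
from numpy import array, mean, abs
from sklearn.metrics import mean_absolute_error

# hypothetical true and predicted values, for illustration only
y_true = array([3.0, 5.0, 2.5, 7.0])
y_pred = array([2.5, 5.0, 4.0, 8.0])

# MAE is the average of the absolute differences: 0.5, 0.0, 1.5, 1.0
manual_mae = mean(abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)
print('manual MAE: %.3f' % manual_mae)
print('sklearn MAE: %.3f' % library_mae)
```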

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the housing dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

The dataset is split into train and test sets and we can see that there are 339 rows for training and 167 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data is a mean absolute error of about 2.157 (thousands of dollars).

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
MAE: 2.157

This section provides more resources on the topic if you are looking to go deeper.

- sklearn.model_selection.train_test_split API.
- sklearn.datasets.make_classification API.
- sklearn.datasets.make_blobs API.

In this tutorial, you discovered how to evaluate machine learning models using the train-test split.

Specifically, you learned:

- The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
- How to use the scikit-learn machine learning library to perform the train-test split procedure.
- How to evaluate machine learning algorithms for classification and regression using the train-test split.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Train-Test Split for Evaluating Machine Learning Algorithms appeared first on Machine Learning Mastery.

The post How to Selectively Scale Numerical Input Variables for Machine Learning appeared first on Machine Learning Mastery.

It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully **selecting which data transform to apply to each input variable** prior to modeling.

In this tutorial, you will discover how to apply selective scaling of numerical input variables.

After completing this tutorial, you will know:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into three parts; they are:

- Diabetes Numerical Dataset
- Non-Selective Scaling of Numerical Inputs
  - Normalize All Input Variables
  - Standardize All Input Variables
- Selective Scaling of Numerical Inputs
  - Normalize Only Non-Gaussian Input Variables
  - Standardize Only Gaussian-Like Input Variables
  - Selectively Normalize and Standardize Input Variables

As the basis of this tutorial, we will use the so-called “diabetes” dataset that has been widely studied as a machine learning dataset since the 1990s.

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

You can learn more about the dataset here:

- Diabetes Dataset (pima-indians-diabetes.csv)
- Diabetes Dataset Description (pima-indians-diabetes.names)

No need to download the dataset; we will download it automatically as part of the worked examples that follow.

Looking at the data, we can see that all of the variables are numerical: eight input variables and one target variable.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...

We can load this dataset into memory using the Pandas library.

The example below downloads and summarizes the diabetes dataset.

# load and summarize the diabetes dataset
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
dataset = read_csv(url, header=None)
# summarize the shape of the dataset
print(dataset.shape)
# histograms of the variables
dataset.hist()
pyplot.show()

Running the example first downloads the dataset and loads it as a DataFrame.

The shape of the dataset is printed, confirming the number of rows and the nine variables: eight input and one target.

(768, 9)

Finally, a plot is created showing a histogram for each variable in the dataset.

This is useful as we can see that some variables have a Gaussian or Gaussian-like distribution (1, 2, 5) and others have an exponential-like distribution (0, 3, 4, 6, 7). This may suggest the need for different numerical data transforms for the different types of input variables.
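The visual judgment from the histograms can also be backed up numerically, for example by computing the skewness of each column (scipy.stats.skew performs the same calculation). The sketch below uses synthetic stand-in data rather than the diabetes columns, and the reading of the values is an informal heuristic, not part of the original tutorial:

```python
import numpy as np

def skewness(x):
    # Fisher-Pearson coefficient of skewness: E[(x - mean)^3] / std^3
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / (x.std() ** 3)

# synthetic stand-ins for the two kinds of distribution seen in the histograms
rng = np.random.default_rng(1)
gaussian_like = rng.normal(loc=120, scale=30, size=768)
exponential_like = rng.exponential(scale=80, size=768)

# skew near 0 suggests a symmetric, Gaussian-like variable;
# a large positive skew suggests an exponential-like variable
print('gaussian-like skew: %.2f' % skewness(gaussian_like))
print('exponential-like skew: %.2f' % skewness(exponential_like))
```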

Now that we are a little familiar with the dataset, let’s try fitting and evaluating a model on the raw dataset.

We will use a logistic regression model as it is a robust and effective linear model for binary classification tasks. We will evaluate the model using repeated stratified k-fold cross-validation, a best practice, and use 10 folds and three repeats.

The complete example is listed below.

# evaluate a logistic regression model on the raw diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the model and reports the mean and standard deviation accuracy for fitting a logistic regression model on the raw dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 76.8 percent.

Accuracy: 0.768 (0.040)

Now that we have established a baseline in performance on the dataset, let’s see if we can improve the performance using data scaling.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Many algorithms prefer or require that input variables are scaled to a consistent range prior to fitting a model.

This includes the logistic regression model that assumes input variables have a Gaussian probability distribution. Standardizing the input variables may also produce a more numerically stable model. Nevertheless, even when these expectations are violated, logistic regression can still perform well, or even best, on a given dataset, as may be the case for the diabetes dataset.

Two common techniques for scaling numerical input variables are normalization and standardization.

Normalization scales each input variable to the range 0-1 and can be implemented using the MinMaxScaler class in scikit-learn. Standardization scales each input variable to have a mean of 0.0 and a standard deviation of 1.0 and can be implemented using the StandardScaler class in scikit-learn.
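The two formulas can be checked by hand. Using a toy column of hypothetical values, the manual calculations reproduce the MinMaxScaler and StandardScaler outputs (note that StandardScaler uses the population standard deviation, matching NumPy's default):

```python
from numpy import array, allclose
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# a single hypothetical input variable, shaped as one column
x = array([[50.0], [60.0], [80.0], [90.0]])

# normalization: (x - min) / (max - min), rescales to the range 0-1
normalized = (x - x.min()) / (x.max() - x.min())
print(allclose(normalized, MinMaxScaler().fit_transform(x)))

# standardization: (x - mean) / std, rescales to mean 0.0 and std 1.0
standardized = (x - x.mean()) / x.std()
print(allclose(standardized, StandardScaler().fit_transform(x)))
```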

To learn more about normalization, standardization, and how to use these methods in scikit-learn, see the tutorial:

A naive approach to data scaling applies a single transform to all input variables, regardless of their scale or probability distribution. And this is often effective.

Let’s try normalizing and standardizing all input variables directly and compare the performance to the baseline logistic regression model fit on the raw data.

We can update the baseline code example to use a modeling pipeline where the first step is to apply a scaler and the final step is to fit the model.

This ensures that the scaling operation is fit or prepared on the training set only and then applied to the train and test sets during the cross-validation process, avoiding data leakage. Data leakage can result in an optimistically biased estimate of model performance.

This can be achieved using the Pipeline class where each step in the pipeline is defined as a tuple with a name and the instance of the transform or model to use.

...
# define the modeling pipeline
scaler = MinMaxScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables normalized is listed below.

# evaluate a logistic regression model on the normalized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
scaler = MinMaxScaler()
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the normalized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the normalization of the input variables has resulted in a drop in the mean classification accuracy from 76.8 percent with a model fit on the raw data to about 76.4 percent for the pipeline with normalization.

Accuracy: 0.764 (0.045)

Next, let’s try standardizing all input variables.

We can update the modeling pipeline to use standardization instead of normalization for all input variables prior to fitting and evaluating the logistic regression model.

This might be an appropriate transform for those input variables with a Gaussian-like distribution, but perhaps not the other variables.

...
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on the diabetes dataset with all input variables standardized is listed below.

# evaluate a logistic regression model on the standardized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the standardized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that standardizing all numerical input variables has resulted in a lift in mean classification accuracy from 76.8 percent with a model evaluated on the raw dataset to about 77.2 percent for a model evaluated on the dataset with standardized input variables.

Accuracy: 0.772 (0.043)

So far, we have learned that normalizing all variables does not help performance, but standardizing all input variables does help performance.

Next, let’s explore if selectively applying scaling to the input variables can offer further improvement.

Data transforms can be applied selectively to input variables using the ColumnTransformer class in scikit-learn.

It allows you to specify the transform (or pipeline of transforms) to apply and the column indexes to apply them to. This can then be used as part of a modeling pipeline and evaluated using cross-validation.
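One detail worth knowing before using it: the transformed columns appear first in the output array, followed by any passthrough columns, so the column order can change. A minimal sketch on a hypothetical two-column array (not the diabetes data) shows this:

```python
from numpy import array
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# a tiny hypothetical array: scale column 1 only, pass column 0 through
X = array([[1.0, 10.0],
           [2.0, 20.0],
           [3.0, 30.0]])
t = [('scale', MinMaxScaler(), [1])]
ct = ColumnTransformer(transformers=t, remainder='passthrough')
Xt = ct.fit_transform(X)
# the scaled column (originally column 1) now comes first,
# and the passthrough column (originally column 0) comes second
print(Xt)
```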

You can learn more about how to use the ColumnTransformer in the tutorial:

We can explore using the ColumnTransformer to selectively apply normalization and standardization to the numerical input variables of the diabetes dataset in order to see if we can achieve further performance improvements.

First, let’s try normalizing just those input variables that do not have a Gaussian-like probability distribution and leave the rest of the input variables alone in the raw state.

We can define two groups of input variables using the column indexes, one for the variables with a Gaussian-like distribution, and one for the input variables with the exponential-like distribution.

...
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]

We can then selectively normalize the “*exp_ix*” group and let the other input variables pass through without any data preparation.

...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')

The selective transform can then be used as part of our modeling pipeline.

...
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective normalization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, we can see slightly better performance, with mean accuracy increasing from 76.8 percent for the baseline model fit on the raw dataset to about 76.9 percent with selective normalization of some input variables.

The results are not as good as standardizing all input variables though.

Accuracy: 0.769 (0.043)

We can repeat the experiment from the previous section, although in this case, selectively standardize those input variables that have a Gaussian-like distribution and leave the remaining input variables untouched.

...
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')

Tying this together, the complete example of evaluating a logistic regression model on data with selective standardizing of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective standardization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, we can see that we achieved a lift in performance over both the baseline model fit on the raw dataset with 76.8 percent and over the standardization of all input variables that achieved 77.2 percent. With selective standardization, we have achieved a mean accuracy of about 77.3 percent, a modest but measurable bump.

Accuracy: 0.773 (0.041)

The results so far raise the question as to whether we can get a further lift by combining the use of selective normalization and standardization on the dataset at the same time.

This can be achieved by defining both transforms and their respective column indexes for the ColumnTransformer class, with no remaining variables being passed through.

...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization and standardization of the input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective scaling
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

In this case, interestingly, we can see that we have achieved the same performance as standardizing all input variables with 77.2 percent.

Further, the results suggest that the chosen model performs better when the non-Gaussian-like variables are left as-is than when they are standardized or normalized.

I would not have guessed at this finding, which highlights the importance of careful experimentation.

Accuracy: 0.772 (0.040)

**Can you do better?**

Try other transforms or combinations of transforms and see if you can achieve better results.

Share your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- How to Use the ColumnTransformer for Data Preparation
- How to Use StandardScaler and MinMaxScaler Transforms in Python

In this tutorial, you discovered how to apply selective scaling of numerical input variables.

Specifically, you learned:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Selectively Scale Numerical Input Variables for Machine Learning appeared first on Machine Learning Mastery.

The post Add Binary Flags for Missing Values for Machine Learning appeared first on Machine Learning Mastery.

A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as per normal but gives no indication to the model that the row originally contained missing values.

One approach to address this issue is to include additional binary flag input features that indicate whether a row or a column contained a missing value that was imputed. This additional information may or may not be helpful to the model in predicting the target value.

In this tutorial, you will discover how to **add binary flags for missing values** for modeling.

After completing this tutorial, you will know:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

**Updated Jul/2020**: Fixed bug in the creation of the flag variable.

This tutorial is divided into three parts; they are:

- Imputing the Horse Colic Dataset
- Model With a Binary Flag for Missing Values
- Model With Indicators of All Missing Values

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2
1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2
2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1
1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1
...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “*na_values*” to load values of ‘?’ as missing, marked with a NaN value.
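The effect of the "na_values" argument can be seen on a tiny hypothetical CSV fragment: the "?" tokens load as NaN values rather than as strings:

```python
from io import StringIO
from pandas import read_csv

# a hypothetical two-row CSV fragment with '?' marking missing values
csv_text = '2,1,?,38.5\n1,?,3,39.2\n'

# without na_values, '?' would load as a string; with it, as NaN
dataframe = read_csv(StringIO(csv_text), header=None, na_values='?')
print(dataframe)
print('missing values: %d' % dataframe.isna().sum().sum())
```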

The example below downloads the dataset, marks “?” values as NaN (missing) and summarizes the shape of the dataset.

# summarize the horse colic dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape, y.shape)

Running the example downloads the dataset and reports the number of rows and columns, matching our expectations.

(300, 27) (300,)

Next, we can evaluate a model on this dataset.

We can use the SimpleImputer class to perform statistical imputation and replace the missing values with the mean of each column. We can then fit a random forest model on the dataset.
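Before applying it inside a pipeline, it may help to see what the SimpleImputer does on a tiny hypothetical array (mean imputation is its default strategy):

```python
from numpy import nan, array
from sklearn.impute import SimpleImputer

# a tiny hypothetical array with missing values marked as NaN
X = array([[1.0, nan],
           [3.0, 4.0],
           [nan, 8.0]])

# SimpleImputer defaults to strategy='mean': each NaN is replaced
# with the mean of the observed values in its column
imputer = SimpleImputer()
Xt = imputer.fit_transform(X)
print(Xt)
```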

For more on how to use the SimpleImputer class, see the tutorial:

To achieve this, we will define a pipeline that first performs imputation, then fits the model and evaluates this modeling pipeline using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

# evaluate mean imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the random forest with mean statistical imputation on the horse colic dataset.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, the pipeline achieved an estimated classification accuracy of about 86.2 percent.

Mean Accuracy: 0.862 (0.056)

Next, let’s see if we can improve the performance of the model by providing more information about missing values.

In the previous section, we replaced missing values with a calculated statistic.

The model is unaware that missing values were replaced.

It is possible that knowledge of whether a row contains a missing value or not will be useful to the model when making a prediction.

One approach to exposing the model to this knowledge is by providing an additional column that is a binary flag indicating whether the row had a missing value or not.

- 0: Row does not contain a missing value.
- 1: Row contains a missing value (which was/will be imputed).

This can be achieved directly on the loaded dataset. First, we can sum the values across each row to create a new column; if a row contains at least one NaN, then its sum will be NaN.

We can then mark all values in the new column as 1 if they contain a NaN, or 0 otherwise.

Finally, we can add this column to the loaded dataset.

Tying this together, the complete example of adding a binary flag to indicate one or more missing values in each row is listed below.

```python
# add a binary flag that indicates if a row contains a missing value
from numpy import isnan
from numpy import hstack
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all non-nan as 0
a[~isnan(a)] = 0
# mark all nan as 1
a[isnan(a)] = 1
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
print(X.shape)
```

Running the example first downloads the dataset and reports the number of rows and columns, as expected.

Then the new binary variable indicating whether a row contains a missing value is created and added to the end of the input variables. The shape of the input data is then reported, confirming the addition of the feature, from 27 to 28 columns.

(300, 27)
(300, 28)

We can then evaluate the model as we did in the previous section with the additional binary flag and see if it impacts model performance.

The complete example is listed below.

```python
# evaluate model performance with a binary flag for missing values and imputed missing
from numpy import isnan
from numpy import hstack
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all non-nan as 0
a[~isnan(a)] = 0
# mark all nan as 1
a[isnan(a)] = 1
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional feature and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a modest lift in performance from 86.2 percent to 86.3 percent. The difference is small and may not be statistically significant.

Mean Accuracy: 0.863 (0.055)

Most rows in this dataset have a missing value, and this approach might be more beneficial on datasets with fewer missing values.

Next, let’s see if we can provide even more information about the missing values to the model.

In the previous section, we added one additional column to indicate whether a row contains a missing value or not.

One step further is to indicate whether each input value was missing and imputed or not. This effectively adds one additional column for each input variable that contains missing values and may offer benefit to the model.

This can be achieved by setting the “*add_indicator*” argument to *True* when defining the SimpleImputer instance.

```python
...
# impute and mark missing values
X = SimpleImputer(add_indicator=True).fit_transform(X)
```

We can demonstrate this with a worked example.

The example below loads the horse colic dataset as before, then imputes the missing values on the entire dataset and adds indicator variables for each input variable that has missing values.

```python
# impute and add indicators for columns with missing values
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# impute and mark missing values
X = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X)
print(X.shape)
```

Running the example first downloads and summarizes the shape of the dataset as expected, then applies the imputation and adds the binary (1 and 0 values) columns indicating whether each row contains a missing value for a given input variable.

We can see that the number of input variables has increased from 27 to 48, indicating the addition of 21 binary input variables, and in turn, that 21 of the 27 input variables must contain at least one missing value.

(300, 27)
(300, 48)

Next, we can evaluate the model with this additional information.

The complete example below demonstrates this.

```python
# evaluate imputation with added indicators features on the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer(add_indicator=True)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional indicators features and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a nice lift in performance from 86.3 percent in the previous section to 86.7 percent.

This may provide strong evidence that adding one flag per column that was imputed is a better strategy on this dataset and chosen model.

Mean Accuracy: 0.867 (0.055)

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- Statistical Imputation for Missing Values in Machine Learning
- How to Handle Missing Data with Python

In this tutorial, you discovered how to add binary flags for missing values for modeling.

Specifically, you learned:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Add Binary Flags for Missing Values for Machine Learning appeared first on Machine Learning Mastery.

The post How to Create Custom Data Transforms for Scikit-Learn appeared first on Machine Learning Mastery.

There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually on the data, requiring custom code.

The scikit-learn library provides a way to wrap these **custom data transforms** in a standard way so they can be used just like any other transform, either on data directly or as a part of a modeling pipeline.

In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.

After completing this tutorial, you will know:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into four parts; they are:

- Custom Data Transforms in Scikit-Learn
- Oil Spill Dataset
- Custom Transform to Remove Columns
- Custom Transform to Replace Outliers

Data preparation refers to changing the raw data in some way that makes it more appropriate for predictive modeling with machine learning algorithms.

The scikit-learn Python machine learning library offers many different data preparation techniques directly, such as techniques for scaling numerical input variables and changing the probability distribution of variables.

These transforms can be fit and then applied on a dataset or used as part of a predictive modeling pipeline, allowing a sequence of transforms to be applied correctly without data leakage when evaluating model performance with data sampling techniques, such as k-fold cross-validation.

Although the data preparation techniques available in scikit-learn are extensive, there may be additional data preparation steps that are required.

Typically, these additional steps are performed manually prior to modeling and require writing custom code. The risk is that these data preparation steps may be performed inconsistently.

The solution is to create a custom data transform in scikit-learn using the FunctionTransformer class.

This class allows you to specify a function that is called to transform the data. You can define the function and perform any valid change, such as changing values or removing columns of data (not removing rows).

The class can then be used just like any other data transform in scikit-learn, e.g. to transform data directly, or used in a modeling pipeline.

The catch is that **the transform is stateless**, meaning that no state can be kept.

This means that the transform cannot be used to calculate statistics on the training dataset that are then used to transform the train and test datasets.
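To make this concrete, here is a minimal sketch (using a log transform as a stand-in for any custom function) showing that the wrapped function is simply applied to whatever data it is given; the call to *fit()* learns nothing:

```python
# minimal sketch: a stateless custom transform with FunctionTransformer
from numpy import array
from numpy import log1p
from sklearn.preprocessing import FunctionTransformer

# any valid function of the data can be wrapped; here, a log transform
trans = FunctionTransformer(log1p)

X = array([[1.0, 10.0], [3.0, 100.0]])
# fit() stores no statistics; transform() just calls the function on the data
X_trans = trans.fit_transform(X)
print(X_trans.shape)
```

The same transformer applied to a different dataset would simply apply the same function to that data, with no memory of what it saw before.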

In addition to custom scaling operations, this can be helpful for standard data cleaning operations, such as identifying and removing columns with few unique values and identifying and removing relative outliers.

We will explore both of these cases, but first, let’s define a dataset that we can use as the basis for exploration.

The so-called “oil spill” dataset is a standard machine learning dataset.

The task involves predicting whether a patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean, given a vector that describes the contents of a patch of a satellite image.

There are 937 cases. Each case is composed of 48 numerical computer vision derived features, a patch number, and a class label.

The normal case is no oil spill assigned the class label of 0, whereas an oil spill is indicated by a class label of 1. There are 896 cases for no oil spill and 41 cases of an oil spill.

You can access the entire dataset here:

Review the contents of the file.

The first few lines of the file should look as follows:

1,2558,1506.09,456.63,90,6395000,40.88,7.89,29780,0.19,214.7,0.21,0.26,0.49,0.1,0.4,99.59,32.19,1.84,0.16,0.2,87.65,0,0.47,132.78,-0.01,3.78,0.22,3.2,-3.71,-0.18,2.19,0,2.19,310,16110,0,138.68,89,69,2850,1000,763.16,135.46,3.73,0,33243.19,65.74,7.95,1
2,22325,79.11,841.03,180,55812500,51.11,1.21,61900,0.02,901.7,0.02,0.03,0.11,0.01,0.11,6058.23,4061.15,2.3,0.02,0.02,87.65,0,0.58,132.78,-0.01,3.78,0.84,7.09,-2.21,0,0,0,0,704,40140,0,68.65,89,69,5750,11500,9593.48,1648.8,0.6,0,51572.04,65.73,6.26,0
3,115,1449.85,608.43,88,287500,40.42,7.34,3340,0.18,86.1,0.21,0.32,0.5,0.17,0.34,71.2,16.73,1.82,0.19,0.29,87.65,0,0.46,132.78,-0.01,3.78,0.7,4.79,-3.36,-0.23,1.95,0,1.95,29,1530,0.01,38.8,89,69,1400,250,150,45.13,9.33,1,31692.84,65.81,7.84,1
4,1201,1562.53,295.65,66,3002500,42.4,7.97,18030,0.19,166.5,0.21,0.26,0.48,0.1,0.38,120.22,33.47,1.91,0.16,0.21,87.65,0,0.48,132.78,-0.01,3.78,0.84,6.78,-3.54,-0.33,2.2,0,2.2,183,10080,0,108.27,89,69,6041.52,761.58,453.21,144.97,13.33,1,37696.21,65.67,8.07,1
5,312,950.27,440.86,37,780000,41.43,7.03,3350,0.17,232.8,0.15,0.19,0.35,0.09,0.26,289.19,48.68,1.86,0.13,0.16,87.65,0,0.47,132.78,-0.01,3.78,0.02,2.28,-3.44,-0.44,2.19,0,2.19,45,2340,0,14.39,89,69,1320.04,710.63,512.54,109.16,2.58,0,29038.17,65.66,7.35,0
...

We can see that the first column contains integers for the patch number. We can also see that the computer vision derived features are real-valued with differing scales, such as thousands in the second column and fractions in other columns.

This dataset contains columns with very few unique values and columns with outliers that provide a good basis for data cleaning.

The example below downloads the dataset, loads it as a NumPy array, and summarizes the number of rows and columns.

```python
# load the oil dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)
```

Running the example loads the dataset and confirms the expected number of rows and columns.

(937, 49) (937,)

Now that we have a dataset that we can use as the basis for data transforms, let’s look at how we can define some custom data cleaning transforms using the *FunctionTransformer* class.

Columns that have few unique values are probably not contributing anything useful to predicting the target value.

This is not absolutely true, but it is true enough that you should test the performance of your model fit on a dataset with columns of this type removed.

This is a type of data cleaning, and there is a data transform provided in scikit-learn called the VarianceThreshold that attempts to address this using the variance of each column.
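For comparison, here is a minimal sketch of the built-in VarianceThreshold transform (the tiny example data is illustrative); with its default threshold of zero, it removes columns that have the same value in every row:

```python
# sketch: removing zero-variance columns with scikit-learn's VarianceThreshold
from numpy import array
from sklearn.feature_selection import VarianceThreshold

# the middle column is constant (a single unique value, zero variance)
X = array([[1.0, 7.0, 2.0],
           [3.0, 7.0, 4.0],
           [5.0, 7.0, 6.0]])
# the default threshold of 0.0 removes columns with zero variance
trans = VarianceThreshold()
X_sel = trans.fit_transform(X)
print(X_sel.shape)
```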

Another approach is to remove columns that have a number of unique values equal to or fewer than a specified minimum, such as 1.

We can develop a function that applies this transform and use the minimum number of unique values as a configurable default argument. We will also add some debugging to confirm it is working as we expect.

First, the number of unique values for each column can be calculated. Then, columns with a number of unique values equal to or fewer than the minimum can be identified. Finally, those identified columns can be removed from the dataset.

The *cust_transform()* function below implements this.

```python
# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
    # get number of unique values for each column
    counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
    if verbose:
        print('Unique Values: %s' % counts)
    # select columns to delete
    to_del = [i for i, v in enumerate(counts) if v <= min_values]
    if verbose:
        print('Deleting: %s' % to_del)
    if len(to_del) == 0:
        return X
    # select all but the columns that are being removed
    ix = [i for i in range(X.shape[1]) if i not in to_del]
    result = X[:, ix]
    return result
```

We can then use this function in the FunctionTransformer.

A limitation of this transform is that it selects columns to delete based on the data it is given. This means that if the train and test datasets differ greatly, then different columns could be removed from each, making model evaluation challenging or unstable. As such, it is best to keep the minimum number of unique values small, such as 1.

We can use this transform on the oil spill dataset. The complete example is listed below.

```python
# custom data transform for removing columns with few unique values
from numpy import unique
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
    # load the dataset
    df = read_csv(path, header=None)
    data = df.values
    # split data into inputs and outputs
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
    # get number of unique values for each column
    counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
    if verbose:
        print('Unique Values: %s' % counts)
    # select columns to delete
    to_del = [i for i, v in enumerate(counts) if v <= min_values]
    if verbose:
        print('Deleting: %s' % to_del)
    if len(to_del) == 0:
        return X
    # select all but the columns that are being removed
    ix = [i for i in range(X.shape[1]) if i not in to_del]
    result = X[:, ix]
    return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)
```

Running the example first reports the number of rows and columns in the raw dataset.

Next, a list is printed that shows the number of unique values observed for each column in the dataset. We can see that many columns have very few unique values.

The columns with one (or fewer) unique values are then identified and reported; in this case, only the column at index 22. This column is removed from the dataset.

Finally, the shape of the transformed dataset is reported, showing 48 instead of 49 columns, confirming that the column with a single unique value was deleted.

(937, 49) (937,)
Unique Values: [238, 297, 927, 933, 179, 375, 820, 618, 561, 57, 577, 59, 73, 107, 53, 91, 893, 810, 170, 53, 68, 9, 1, 92, 9, 8, 9, 308, 447, 392, 107, 42, 4, 45, 141, 110, 3, 758, 9, 9, 388, 220, 644, 649, 499, 2, 937, 169, 286]
Deleting: [22]
(937, 48)

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different minimum number of unique values.
- Use a percentage rather than an absolute number of unique values.
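As a sketch of the percentage-based extension (the function name, threshold, and example data below are illustrative, not part of the tutorial), the unique-value count for each column can be expressed as a fraction of the number of rows:

```python
# sketch of one extension: delete columns whose fraction of unique values
# (relative to the number of rows) falls below a threshold
from numpy import array
from numpy import unique

def cust_transform_pct(X, min_pct=0.01):
    # fraction of unique values per column
    pcts = [len(unique(X[:, i])) / X.shape[0] for i in range(X.shape[1])]
    # columns whose fraction falls below the threshold are deleted
    to_del = [i for i, v in enumerate(pcts) if v < min_pct]
    ix = [i for i in range(X.shape[1]) if i not in to_del]
    return X[:, ix]

# the second column has one unique value across four rows (fraction 0.25)
X = array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]])
print(cust_transform_pct(X, min_pct=0.5).shape)
```

As before, the function can be wrapped in a FunctionTransformer and used directly or as part of a pipeline.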

If you explore any of these extensions, let me know in the comments below.

Next, let’s look at a transform that replaces values in the dataset.

Outliers are observations that are different or unlike the other observations.

If we consider one variable at a time, an outlier would be a value that is far from the center of mass (the rest of the values), meaning it is rare or has a low probability of being observed.

There are standard ways for identifying outliers for common probability distributions. For Gaussian data, we can identify outliers as observations that are three or more standard deviations from the mean.

This may or may not be a desirable way to identify outliers for data that has many input variables, yet can be effective in some cases.

We can identify outliers in this way and replace their value with a correction, such as the mean.

Each column is considered one at a time and mean and standard deviation statistics are calculated. Using these statistics, upper and lower bounds of “*normal*” values are defined, then all values that fall outside these bounds can be identified. If one or more outliers are identified, their values are then replaced with the mean value that was already calculated.

The *cust_transform()* function below implements this as a function applied to the dataset, where we parameterize the number of standard deviations from the mean and whether or not debug information will be displayed.

```python
# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
    # copy the array
    result = X.copy()
    # enumerate each column
    for i in range(result.shape[1]):
        # retrieve values for column
        col = X[:, i]
        # calculate statistics
        mu, sigma = mean(col), std(col)
        # define bounds
        lower, upper = mu - (sigma * n_stdev), mu + (sigma * n_stdev)
        # select indexes that are out of bounds
        ix = where(logical_or(col < lower, col > upper))[0]
        if verbose and len(ix) > 0:
            print('>col=%d, outliers=%d' % (i, len(ix)))
        # replace values
        result[ix, i] = mu
    return result
```

We can then use this function in the FunctionTransformer.

The method of outlier detection assumes a Gaussian probability distribution and applies to each variable independently, both of which are strong assumptions.

An additional limitation of this implementation is that the mean and standard deviation statistics are calculated on the provided dataset, meaning that the definition of an outlier and its replacement value are both relative to the dataset. This means that different definitions of outliers and different replacement values could be used if the transform is used on the train and test sets.
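One way around this limitation, sketched below under the assumption that a stateful transform is acceptable (the class name is illustrative), is to implement a small custom transformer that learns the per-column statistics in *fit()* and reuses them in *transform()*, so the same bounds and replacement values are applied to both train and test sets:

```python
# sketch: a stateful alternative that learns the per-column mean and
# standard deviation on the training data only, then reuses them
from numpy import array, mean, std, where, logical_or
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierReplacer(BaseEstimator, TransformerMixin):
    def __init__(self, n_stdev=3):
        self.n_stdev = n_stdev

    def fit(self, X, y=None):
        # statistics are computed once, on the data passed to fit()
        self.mu_ = mean(X, axis=0)
        self.sigma_ = std(X, axis=0)
        return self

    def transform(self, X):
        # bounds learned from the training data are reused here
        result = X.copy()
        lower = self.mu_ - self.sigma_ * self.n_stdev
        upper = self.mu_ + self.sigma_ * self.n_stdev
        for i in range(result.shape[1]):
            col = result[:, i]
            ix = where(logical_or(col < lower[i], col > upper[i]))[0]
            result[ix, i] = self.mu_[i]
        return result

# fit on train, apply the same bounds and replacement values to new data
X_train = array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test = array([[10.0], [3.0]])
rep = OutlierReplacer(n_stdev=1).fit(X_train)
Xt_test = rep.transform(X_test)
print(Xt_test)
```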

We can use this transform on the oil spill dataset. The complete example is listed below.

```python
# custom data transform for replacing outliers
from numpy import mean
from numpy import std
from numpy import where
from numpy import logical_or
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
    # load the dataset
    df = read_csv(path, header=None)
    data = df.values
    # split data into inputs and outputs
    X, y = data[:, :-1], data[:, -1]
    # minimally prepare dataset
    X = X.astype('float')
    y = LabelEncoder().fit_transform(y.astype('str'))
    return X, y

# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
    # copy the array
    result = X.copy()
    # enumerate each column
    for i in range(result.shape[1]):
        # retrieve values for column
        col = X[:, i]
        # calculate statistics
        mu, sigma = mean(col), std(col)
        # define bounds
        lower, upper = mu - (sigma * n_stdev), mu + (sigma * n_stdev)
        # select indexes that are out of bounds
        ix = where(logical_or(col < lower, col > upper))[0]
        if verbose and len(ix) > 0:
            print('>col=%d, outliers=%d' % (i, len(ix)))
        # replace values
        result[ix, i] = mu
    return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)
```

Running the example first reports the shape of the dataset prior to any change.

Next, the number of outliers for each column is calculated and only those columns with one or more outliers are reported in the output. We can see that a total of 32 columns in the dataset have one or more outliers.

The outlier values are then replaced with the column means and the shape of the resulting dataset is reported, confirming no change in the number of rows or columns.

(937, 49) (937,)
>col=0, outliers=10
>col=1, outliers=8
>col=3, outliers=8
>col=5, outliers=7
>col=6, outliers=1
>col=7, outliers=12
>col=8, outliers=15
>col=9, outliers=14
>col=10, outliers=19
>col=11, outliers=17
>col=12, outliers=22
>col=13, outliers=2
>col=14, outliers=16
>col=15, outliers=8
>col=16, outliers=8
>col=17, outliers=6
>col=19, outliers=12
>col=20, outliers=20
>col=27, outliers=14
>col=28, outliers=18
>col=29, outliers=2
>col=30, outliers=13
>col=32, outliers=3
>col=34, outliers=14
>col=35, outliers=15
>col=37, outliers=13
>col=40, outliers=18
>col=41, outliers=13
>col=42, outliers=12
>col=43, outliers=12
>col=44, outliers=19
>col=46, outliers=21
(937, 49)

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different number of standard deviations from the mean, such as 2 or 4.
- Use a different definition of outlier, such as the IQR or a model.
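As a sketch of the IQR-based extension (the function name and example data are illustrative), outliers can instead be defined as values outside [Q1 - k*IQR, Q3 + k*IQR], with k commonly set to 1.5; this definition does not assume a Gaussian distribution:

```python
# sketch: identify outliers in a single column using the interquartile range
from numpy import array
from numpy import percentile
from numpy import where
from numpy import logical_or

def iqr_outlier_ix(col, k=1.5):
    # interquartile range bounds for a single column
    q1, q3 = percentile(col, 25), percentile(col, 75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # indexes of values outside the bounds
    return where(logical_or(col < lower, col > upper))[0]

# the extreme value at index 5 is the only value flagged
col = array([1.0, 2.0, 2.0, 3.0, 2.0, 100.0])
ix = iqr_outlier_ix(col)
print(len(ix))
```

The flagged indexes could then be replaced with the column median, which is itself more robust to outliers than the mean.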

If you explore any of these extensions, let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- How to Remove Outliers for Machine Learning
- How to Perform Data Cleaning for Machine Learning with Python

In this tutorial, you discovered how to define and use custom data transforms for scikit-learn.

Specifically, you learned:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Create Custom Data Transforms for Scikit-Learn appeared first on Machine Learning Mastery.
