The post How to Perform Feature Selection With Numerical Input Data appeared first on Machine Learning Mastery.

]]>Feature selection is often straightforward when working with real-valued input and output data, such as using the Pearson’s correlation coefficient, but can be challenging when working with numerical input data and a categorical target variable.

The two most commonly used feature selection methods for numerical input data when the target variable is categorical (e.g. classification predictive modeling) are the ANOVA f-test statistic and the mutual information statistic.

In this tutorial, you will discover how to perform feature selection with numerical input data for classification.

After completing this tutorial, you will know:

- The diabetes predictive modeling problem with numerical inputs and binary classification target variables.
- How to evaluate the importance of numerical features using the ANOVA f-test and mutual information statistics.
- How to perform feature selection for numerical data when fitting and evaluating a classification model.

Let’s get started.

This tutorial is divided into four parts; they are:

- Diabetes Numerical Dataset
- Numerical Feature Selection
- ANOVA f-test Feature Selection
- Mutual Information Feature Selection

- Modeling With Selected Features
- Model Built Using All Features
- Model Built Using ANOVA f-test Features
- Model Built Using Mutual Information Features

- Tune the Number of Selected Features

As the basis of this tutorial, we will use the so-called “*diabetes*” dataset that has been widely studied as a machine learning dataset since 1990.

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

A naive model can achieve an accuracy of about 65 percent on this dataset. A good score is about 77 percent +/- 5 percent. We will aim for this region but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection schemes.

You can download the dataset and save the file as “pima-indians-diabetes.csv” in your current working directory.

- Diabetes Dataset (pima-indians-diabetes.csv)
- Diabetes Dataset Description (pima-indians-diabetes.names)

Looking at the data, we can see that all nine input variables are numerical.

6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1 ...

We can load this dataset into memory using the Pandas library.

... # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values

Once loaded, we can split the columns into input (X) and output (y) for modeling.

... # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1]

We can tie all of this together into a helpful function that we can reuse later.

# load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y

Once loaded, we can split the data into training and test sets so we can fit and evaluate a learning model.

We will use the train_test_split() function form scikit-learn and use 67 percent of the data for training and 33 percent for testing.

... # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

# load and summarize the dataset from pandas import read_csv from sklearn.model_selection import train_test_split # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize print('Train', X_train.shape, y_train.shape) print('Test', X_test.shape, y_test.shape)

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 514 examples for training and 254 for testing.

Train (514, 8) (514, 1) Test (254, 8) (254, 1)

Now that we have loaded and prepared the diabetes dataset, we can explore feature selection.

There are two popular feature selection techniques that can be used for numerical input data and a categorical (class) target variable.

They are:

- ANOVA-f Statistic.
- Mutual Information Statistics.

Let’s take a closer look at each in turn.

ANOVA is an acronym for “analysis of variance” and is a parametric statistical hypothesis test for determining whether the means from two or more samples of data (often three or more) come from the same distribution or not.

An F-statistic, or F-test, is a class of statistical tests that calculate the ratio between variances values, such as the variance from two different samples or the explained and unexplained variance by a statistical test, like ANOVA. The ANOVA method is a type of F-statistic referred to here as an ANOVA f-test.

Importantly, ANOVA is used when one variable is numeric and one is categorical, such as numerical input variables and a classification target variable in a classification task.

The results of this test can be used for feature selection where those features that are independent of the target variable can be removed from the dataset.

When the outcome is numeric, and […] the predictor has more than two levels, the traditional ANOVA F-statistic can be calculated.

— Page 242, Feature Engineering and Selection, 2019.

The scikit-learn machine library provides an implementation of the ANOVA f-test in the f_classif() function. This function can be used in a feature selection strategy, such as selecting the top *k* most relevant features (largest values) via the SelectKBest class.

For example, we can define the *SelectKBest* class to use the *f_classif()* function and select all features, then transform the train and test sets.

... # configure to select all features fs = SelectKBest(score_func=f_classif, k='all') # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test)

We can then print the scores for each variable (larger is better) and plot the scores for each variable as a bar graph to get an idea of how many features we should select.

... # what are scores for the features for i in range(len(fs.scores_)): print('Feature %d: %f' % (i, fs.scores_[i])) # plot the scores pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_) pyplot.show()

Tying this together with the data preparation for the diabetes dataset in the previous section, the complete example is listed below.

# example of anova f-test feature selection for numerical data from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import f_classif from matplotlib import pyplot # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # feature selection def select_features(X_train, y_train, X_test): # configure to select all features fs = SelectKBest(score_func=f_classif, k='all') # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # feature selection X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test) # what are scores for the features for i in range(len(fs.scores_)): print('Feature %d: %f' % (i, fs.scores_[i])) # plot the scores pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_) pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

Note that your specific results may differ given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that some features stand out as perhaps being more relevant than others, with much larger test statistic values.

Perhaps features 1, 5, and 7 are most relevant.

Feature 0: 16.527385 Feature 1: 131.325562 Feature 2: 0.042371 Feature 3: 1.415216 Feature 4: 12.778966 Feature 5: 49.209523 Feature 6: 13.377142 Feature 7: 25.126440

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 1 might be the most relevant (according to test) and that perhaps six of the eight input features are the more relevant.

We could set k=6 when configuring the *SelectKBest* to select these top four features.

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

You can learn more about mutual information in the following tutorial.

Mutual information is straightforward when considering the distribution of two discrete (categorical or ordinal) variables, such as categorical input and categorical output data. Nevertheless, it can be adapted for use with numerical input and categorical output.

For technical details on how this can be achieved, see the 2014 paper titled “Mutual Information between Discrete and Continuous Data Sets.”

The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and categorical output variables via the *mutual_info_classif()* function.

Like *f_classif()*, it can be used in the *SelectKBest* feature selection strategy (and other strategies).

... # configure to select all features fs = SelectKBest(score_func=mutual_info_classif, k='all') # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test)

We can perform feature selection using mutual information on the diabetes dataset and print and plot the scores (larger is better) as we did in the previous section.

The complete example of using mutual information for numerical feature selection is listed below.

# example of mutual information feature selection for numerical input data from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import mutual_info_classif from matplotlib import pyplot # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # feature selection def select_features(X_train, y_train, X_test): # configure to select all features fs = SelectKBest(score_func=mutual_info_classif, k='all') # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # feature selection X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test) # what are scores for the features for i in range(len(fs.scores_)): print('Feature %d: %f' % (i, fs.scores_[i])) # plot the scores pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_) pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

Note: your specific results may differ. Try running the example a few times.

In this case, we can see that some of the features have a modestly low score, suggesting that perhaps they can be removed.

Perhaps features 1 and 5 are most relevant.

Feature 1: 0.118431 Feature 2: 0.019966 Feature 3: 0.041791 Feature 4: 0.019858 Feature 5: 0.084719 Feature 6: 0.018079 Feature 7: 0.033098

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

Now that we know how to perform feature selection on numerical input data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by ANOVA f-test and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

As a first step, we will evaluate a LogisticRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.

The complete example is listed below.

# evaluation of a model using all input features from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # fit the model model = LogisticRegression(solver='liblinear') model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.2f' % (accuracy*100))

Running the example prints the accuracy of the model on the training dataset.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves a classification accuracy of about 77 percent.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

Accuracy: 77.56

We can use the ANOVA f-test to score the features and select the four most relevant features.

The select_features() function below is updated to achieve this.

# feature selection def select_features(X_train, y_train, X_test): # configure to select a subset of features fs = SelectKBest(score_func=f_classif, k=4) # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

# evaluation of a model using 4 features chosen with anova f-test from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import f_classif from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # feature selection def select_features(X_train, y_train, X_test): # configure to select a subset of features fs = SelectKBest(score_func=f_classif, k=4) # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # feature selection X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test) # fit the model model = LogisticRegression(solver='liblinear') model.fit(X_train_fs, y_train) # evaluate the model yhat = model.predict(X_test_fs) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.2f' % (accuracy*100))

Running the example reports the performance of the model on just four of the eight input features selected using the ANOVA f-test statistic.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 78.74 percent, a lift in performance compared to the baseline that achieved 77.56 percent.

Accuracy: 78.74

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the *select_features()* function to achieve this is listed below.

# feature selection def select_features(X_train, y_train, X_test): # configure to select a subset of features fs = SelectKBest(score_func=mutual_info_classif, k=4) # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

# evaluation of a model using 4 features chosen with mutual information from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import mutual_info_classif from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # feature selection def select_features(X_train, y_train, X_test): # configure to select a subset of features fs = SelectKBest(score_func=mutual_info_classif, k=4) # learn relationship from training data fs.fit(X_train, y_train) # transform train input data X_train_fs = fs.transform(X_train) # transform test input data X_test_fs = fs.transform(X_test) return X_train_fs, X_test_fs, fs # load the dataset X, y = load_dataset('pima-indians-diabetes.csv') # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # feature selection X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test) # fit the model model = LogisticRegression(solver='liblinear') model.fit(X_train_fs, y_train) # evaluate the model yhat = model.predict(X_test_fs) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.2f' % (accuracy*100))

Running the example fits the model on the four top selected features chosen using mutual information.

Note that your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can make no difference compared to the baseline model. This is interesting as we know the method chose a different four features compared to the previous method.

Accuracy: 77.56

In the previous example, we selected four features, but how do we know that is a good or best number of features to select?

Instead of guessing, we can systematically test a range of different numbers of selected features and discover which results in the best performing model. This is called a grid search, where the *k* argument to the *SelectKBest* class can be tuned.

It is good practice to evaluate model configurations on classification tasks using repeated stratified k-fold cross-validation. We will use three repeats of 10-fold cross-validation via the RepeatedStratifiedKFold class.

... # define the evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

We can define a Pipeline that correctly prepares the feature selection transform on the training set and applies it to the train set and test set for each fold of the cross-validation.

In this case, we will use the ANOVA f-test statistical method for selecting features.

... # define the pipeline to evaluate model = LogisticRegression(solver='liblinear') fs = SelectKBest(score_func=f_classif) pipeline = Pipeline(steps=[('anova',fs), ('lr', model)])

We can then define the grid of values to evaluate as 1 to 8.

Note that the grid is a dictionary of parameters to values to search, and given that we are using a *Pipeline*, we can access the *SelectKBest* object via the name we gave it, ‘*anova*‘, and then the parameter name ‘*k*‘, separated by two underscores, or ‘*anova__k*‘.

... # define the grid grid = dict() grid['anova__k'] = [i+1 for i in range(X.shape[1])]

We can then define and run the search.

... # define the grid search search = GridSearchCV(pipeline, grid, scoring='accuracy', n_jobs=-1, cv=cv) # perform the search results = search.fit(X, y)

Tying this together, the complete example is listed below.

# compare different numbers of features selected using anova f-test from pandas import read_csv from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import f_classif from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCV from matplotlib import pyplot # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # define dataset X, y = load_dataset('pima-indians-diabetes.csv') # define the evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define the pipeline to evaluate model = LogisticRegression(solver='liblinear') fs = SelectKBest(score_func=f_classif) pipeline = Pipeline(steps=[('anova',fs), ('lr', model)]) # define the grid grid = dict() grid['anova__k'] = [i+1 for i in range(X.shape[1])] # define the grid search search = GridSearchCV(pipeline, grid, scoring='accuracy', n_jobs=-1, cv=cv) # perform the search results = search.fit(X, y) # summarize best print('Best Mean Accuracy: %.3f' % results.best_score_) print('Best Config: %s' % results.best_params_)

Running the example grid searches different numbers of selected features using ANOVA f-test, where each modeling pipeline is evaluated using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluating procedure. Try running the example a few times.

In this case, we can see that the best number of selected features is seven; that achieves an accuracy of about 77 percent.

Best Mean Accuracy: 0.770 Best Config: {'anova__k': 7}

We might want to see the relationship between the number of selected features and classification accuracy. In this relationship, we may expect that more features result in a better performance to a point.

This relationship can be explored by manually evaluating each configuration of *k* for the *SelectKBest* from 1 to 8, gathering the sample of accuracy scores, and plotting the results using box and whisker plots side-by-side. The spread and mean of these box plots would be expected to show any interesting relationship between the number of selected features and the classification accuracy of the pipeline.

The complete example of achieving this is listed below.

# compare different numbers of features selected using anova f-test from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import f_classif from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from matplotlib import pyplot # load the dataset def load_dataset(filename): # load the dataset as a pandas DataFrame data = read_csv(filename, header=None) # retrieve numpy array dataset = data.values # split into input (X) and output (y) variables X = dataset[:, :-1] y = dataset[:,-1] return X, y # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = load_dataset('pima-indians-diabetes.csv') # define number of features to evaluate num_features = [i+1 for i in range(X.shape[1])] # enumerate each number of features results = list() for k in num_features: # create pipeline model = LogisticRegression(solver='liblinear') fs = SelectKBest(score_func=f_classif, k=k) pipeline = Pipeline(steps=[('anova',fs), ('lr', model)]) # evaluate the model scores = evaluate_model(pipeline) results.append(scores) # summarize the results print('>%d %.3f (%.3f)' % (k, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=num_features, showmeans=True) pyplot.show()

Running the example first reports the mean and standard deviation accuracy for each number of selected features.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure. Try running the example a few times.

In this case, it looks like selecting five and seven features results in roughly the same accuracy.

>1 0.748 (0.048) >2 0.756 (0.042) >3 0.761 (0.044) >4 0.759 (0.042) >5 0.770 (0.041) >6 0.766 (0.042) >7 0.770 (0.042) >8 0.768 (0.040)

Box and whisker plots are created side-by-side showing the trend of increasing mean accuracy with the number of selected features to five features, after which it may become less stable.

Selecting five features might be an appropriate configuration in this case.

This section provides more resources on the topic if you are looking to go deeper.

- How to Choose a Feature Selection Method For Machine Learning
- How to Perform Feature Selection with Categorical Data
- What Is Information Gain and Mutual Information for Machine Learning

- Feature selection, Scikit-Learn User Guide.
- sklearn.feature_selection.f_classif API.
- sklearn.feature_selection.mutual_info_classif API.
- sklearn.feature_selection.SelectKBest API.

- Diabetes Dataset (pima-indians-diabetes.csv)
- Diabetes Dataset Description (pima-indians-diabetes.names)

In this tutorial, you discovered how to perform feature selection with numerical input data for classification.

Specifically, you learned:

- The diabetes predictive modeling problem with numerical inputs and binary classification target variables.
- How to evaluate the importance of numerical features using the ANOVA f-test and mutual information statistics.
- How to perform feature selection for numerical data when fitting and evaluating a classification model.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Perform Feature Selection With Numerical Input Data appeared first on Machine Learning Mastery.

]]>The post Iterative Imputation for Missing Values in Machine Learning appeared first on Machine Learning Mastery.

]]>As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A sophisticated approach involves defining a model to predict each missing feature as a function of all other features and to repeat this process of estimating feature values multiple times. The repetition allows the refined estimated values for other features to be used as input in subsequent iterations of predicting missing values. This is generally referred to as iterative imputation.

In this tutorial, you will discover how to use iterative imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

- Missing values must be marked with NaN values and can be replaced with iteratively estimated values.
- How to load a CSV value with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
- How to impute missing values with iterative models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

Let’s get started.

This tutorial is divided into three parts; they are:

- Iterative Imputation
- Horse Colic Dataset
- Iterative Imputation With IterativeImputer
- IterativeImputer Data Transform
- IterativeImputer and Model Evaluation
- IterativeImputer and Different Imputation Order
- IterativeImputer and Different Number of Iterations
- IterativeImputer Transform When Making a Prediction

A dataset may have missing values.

These are rows of data where one or more values or columns in that row are not present. The values may be missing completely or they may be marked with a special character or value, such as a question mark “?”.

Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or unavailability.

Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.

As such, it is common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.

One approach to imputing missing values is to use an **iterative imputation model**.

Iterative imputation refers to a process where each feature is modeled as a function of the other features, e.g. a regression problem where missing values are predicted. Each feature is imputed sequentially, one after the other, allowing prior imputed values to be used as part of a model in predicting subsequent features.

It is iterative because this process is repeated multiple times, allowing ever improved estimates of missing values to be calculated as missing values across all features are estimated.

This approach may be generally referred to as fully conditional specification (FCS) or multivariate imputation by chained equations (MICE).

This methodology is attractive if the multivariate distribution is a reasonable description of the data. FCS specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, FCS draws imputations by iterating over the conditional densities. A low number of iterations (say 10–20) is often sufficient.

— mice: Multivariate Imputation by Chained Equations in R, 2009.

Different regression algorithms can be used to estimate the missing values for each feature, although linear methods are often used for simplicity. The number of iterations of the procedure is often kept small, such as 10. Finally, the order that features are processed sequentially can be considered, such as from the feature with the least missing values to the feature with the most missing values.

Now that we are familiar with iterative methods for missing value imputation, let’s take a look at a dataset with missing values.

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

A naive model can achieve a classification accuracy of about 67 percent, and a top performing model can achieve an accuracy of about 85.2 percent using three repeats of 10-fold cross-validation. This defines the range of expected modeling performance on the dataset.

The dataset has many missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2 1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2 2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1 1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1 ...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “na_values” to load values of ‘?’ as missing, marked with a NaN value.

... # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?')

Once loaded, we can review the loaded data to confirm that “?” values are marked as NaN.

... # summarize the first few rows print(dataframe.head())

We can then enumerate each column and report the number of rows with missing values for the column.

... # summarize the number of rows with missing values for each column for i in range(dataframe.shape[1]): # count number of rows with missing values n_miss = dataframe[[i]].isnull().sum() perc = n_miss / dataframe.shape[0] * 100 print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# summarize the horse colic dataset from pandas import read_csv # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # summarize the first few rows print(dataframe.head()) # summarize the number of rows with missing values for each column for i in range(dataframe.shape[1]): # count number of rows with missing values n_miss = dataframe[[i]].isnull().sum() perc = n_miss / dataframe.shape[0] * 100 print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Running the example first loads the dataset and summarizes the first five rows.

We can see that the missing values that were marked with a “?” character have been replaced with NaN values.

0 1 2 3 4 5 6 ... 21 22 23 24 25 26 27 0 2.0 1 530101 38.5 66.0 28.0 3.0 ... NaN 2.0 2 11300 0 0 2 1 1.0 1 534817 39.2 88.0 20.0 NaN ... 2.0 3.0 2 2208 0 0 2 2 2.0 1 530334 38.3 40.0 24.0 1.0 ... NaN 1.0 2 0 0 0 1 3 1.0 9 5290409 39.1 164.0 84.0 4.0 ... 5.3 2.0 1 2208 0 0 1 4 2.0 1 530255 37.3 104.0 35.0 NaN ... NaN 2.0 2 4300 0 0 2 [5 rows x 28 columns]

Next, we can see the list of all columns in the dataset and the number and percentage of missing values.

We can see that some columns (e.g. column indexes 1 and 2) have no missing values and other columns (e.g. column indexes 15 and 21) have many or even a majority of missing values.

> 0, Missing: 1 (0.3%) > 1, Missing: 0 (0.0%) > 2, Missing: 0 (0.0%) > 3, Missing: 60 (20.0%) > 4, Missing: 24 (8.0%) > 5, Missing: 58 (19.3%) > 6, Missing: 56 (18.7%) > 7, Missing: 69 (23.0%) > 8, Missing: 47 (15.7%) > 9, Missing: 32 (10.7%) > 10, Missing: 55 (18.3%) > 11, Missing: 44 (14.7%) > 12, Missing: 56 (18.7%) > 13, Missing: 104 (34.7%) > 14, Missing: 106 (35.3%) > 15, Missing: 247 (82.3%) > 16, Missing: 102 (34.0%) > 17, Missing: 118 (39.3%) > 18, Missing: 29 (9.7%) > 19, Missing: 33 (11.0%) > 20, Missing: 165 (55.0%) > 21, Missing: 198 (66.0%) > 22, Missing: 1 (0.3%) > 23, Missing: 0 (0.0%) > 24, Missing: 0 (0.0%) > 25, Missing: 0 (0.0%) > 26, Missing: 0 (0.0%) > 27, Missing: 0 (0.0%)

Now that we are familiar with the horse colic dataset that has missing values, let’s look at how we can use iterative imputation.

The scikit-learn machine learning library provides the IterativeImputer class that supports iterative imputation.

In this section, we will explore how to effectively use the *IterativeImputer* class.

It is a data transform that is first configured based on the method used to estimate the missing values. By default, a BayesianRidge model is employed that uses a function of all other input features. Features are filled in ascending order, from those with the fewest missing values to those with the most.

... # define imputer imputer = IterativeImputer(estimator=BayesianRidge(), n_nearest_features=None, imputation_order='ascending')

Then the imputer is fit on a dataset.

... # fit on the dataset imputer.fit(X)

The fit imputer is then applied to a dataset to create a copy of the dataset with all missing values for each column replaced with an estimated value.

... # transform the dataset Xtrans = imputer.transform(X)

The *IterativeImputer* class cannot be used directly because it is experimental.

If you try to use it directly, you will get an error as follows:

ImportError: cannot import name 'IterativeImputer'

Instead, you must add an additional import statement to add support for the IterativeImputer class, as follows:

... from sklearn.experimental import enable_iterative_imputer

We can demonstrate its usage on the horse colic dataset and confirm it works by summarizing the total number of missing values in the dataset before and after the transform.

The complete example is listed below.

# iterative imputation transform for the horse colic dataset from numpy import isnan from pandas import read_csv from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # print total missing print('Missing: %d' % sum(isnan(X).flatten())) # define imputer imputer = IterativeImputer() # fit on the dataset imputer.fit(X) # transform the dataset Xtrans = imputer.transform(X) # print total missing print('Missing: %d' % sum(isnan(Xtrans).flatten()))

Running the example first loads the dataset and reports the total number of missing values in the dataset as 1,605.

The transform is configured, fit, and performed and the resulting new dataset has no missing values, confirming it was performed as we expected.

Each missing value was replaced with a value estimated by the model.

Missing: 1605 Missing: 0

It is a good practice to evaluate machine learning models on a dataset using k-fold cross-validation.

To correctly apply iterative missing data imputation and avoid data leakage, it is required that the models for each column are calculated on the training dataset only, then applied to the train and test sets for each fold in the dataset.

This can be achieved by creating a modeling pipeline where the first step is the iterative imputation, then the second step is the model. This can be achieved using the Pipeline class.

For example, the *Pipeline* below uses an *IterativeImputer* with the default strategy, followed by a random forest model.

... # define modeling pipeline model = RandomForestClassifier() imputer = IterativeImputer() pipeline = Pipeline(steps=[('i', imputer), ('m', model)])

We can evaluate the imputed dataset and random forest modeling pipeline for the horse colic dataset with repeated 10-fold cross-validation.

The complete example is listed below.

# evaluate iterative imputation and random forest for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # define modeling pipeline model = RandomForestClassifier() imputer = IterativeImputer() pipeline = Pipeline(steps=[('i', imputer), ('m', model)]) # define model evaluation cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example correctly applies data imputation to each fold of the cross-validation procedure.

The pipeline is evaluated using three repeats of 10-fold cross-validation and reports the mean classification accuracy on the dataset as about 81.4 percent which is a good score.

Mean Accuracy: 0.814 (0.063)

How do we know that using a default iterative strategy is good or best for this dataset?

The answer is that we don’t.

By default, imputation is performed in ascending order from the feature with the least missing values to the feature with the most.

This makes sense as we want to have more complete data when it comes time to estimating missing values for columns where the majority of values are missing.

Nevertheless, we can experiment with different imputation order strategies, such as descending, right-to-left (Arabic), left-to-right (Roman), and random.

The example below evaluates and compares each available imputation order configuration.

# compare iterative imputation strategies for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # evaluate each strategy on the dataset results = list() strategies = ['ascending', 'descending', 'roman', 'arabic', 'random'] for s in strategies: # create the modeling pipeline pipeline = Pipeline(steps=[('i', IterativeImputer(imputation_order=s)), ('m', RandomForestClassifier())]) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # store results results.append(scores) print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=strategies, showmeans=True) pyplot.xticks(rotation=45) pyplot.show()

Running the example evaluates each imputation order on the horse colic dataset using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.

The mean accuracy of each strategy is reported along the way. The results suggest little difference between most of the methods, with descending (opposite of the default) performing the best. The results suggest that right-to-left (Arabic) order might be better for this dataset with an accuracy of about 80.4 percent.

>ascending 0.801 (0.071) >descending 0.797 (0.059) >roman 0.802 (0.060) >arabic 0.804 (0.068) >random 0.802 (0.061)

At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.

By default, the IterativeImputer will repeat the number of iterations 10 times.

It is possible that a large number of iterations may begin to bias or skew the estimate and that few iterations may be preferred. The number of iterations of the procedure can be specified via the “*max_iter*” argument.

It may be interesting to evaluate different numbers of iterations. The example below compares different values for “*max_iter*” from 1 to 20.

# compare iterative imputation number of iterations for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # evaluate each strategy on the dataset results = list() strategies = [str(i) for i in range(1, 21)] for s in strategies: # create the modeling pipeline pipeline = Pipeline(steps=[('i', IterativeImputer(max_iter=int(s))), ('m', RandomForestClassifier())]) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # store results results.append(scores) print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=strategies, showmeans=True) pyplot.xticks(rotation=45) pyplot.show()

Running the example evaluates each number of iterations on the horse colic dataset using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.

The results suggest that very few iterations, such as 1 or 2, might be as or more effective than 9-12 iterations on this dataset.

>1 0.820 (0.072) >2 0.813 (0.078) >3 0.801 (0.066) >4 0.817 (0.067) >5 0.808 (0.071) >6 0.799 (0.059) >7 0.804 (0.058) >8 0.809 (0.070) >9 0.812 (0.068) >10 0.800 (0.058) >11 0.818 (0.064) >12 0.810 (0.073) >13 0.808 (0.073) >14 0.799 (0.067) >15 0.812 (0.075) >16 0.814 (0.057) >17 0.812 (0.060) >18 0.810 (0.069) >19 0.810 (0.057) >20 0.802 (0.067)

At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.

We may wish to create a final modeling pipeline with the iterative imputation and random forest algorithm, then make a prediction for new data.

This can be achieved by defining the pipeline and fitting it on all available data, then calling the *predict()* function, passing new data in as an argument.

Importantly, the row of new data must mark any missing values using the NaN value.

... # define new data row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]

The complete example is listed below.

# iterative imputation strategy and prediction for the hose colic dataset from numpy import nan from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.experimental import enable_iterative_imputer from sklearn.impute import IterativeImputer from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # create the modeling pipeline pipeline = Pipeline(steps=[('i', IterativeImputer()), ('m', RandomForestClassifier())]) # fit the model pipeline.fit(X, y) # define new data row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000] # make a prediction yhat = pipeline.predict([row]) # summarize prediction print('Predicted Class: %d' % yhat[0])

Running the example fits the modeling pipeline on all available data.

A new row of data is defined with missing values marked with NaNs and a classification prediction is made.

Predicted Class: 2

This section provides more resources on the topic if you are looking to go deeper.

- Results for Standard Classification and Regression Machine Learning Datasets
- How to Handle Missing Data with Python

- mice: Multivariate Imputation by Chained Equations in R, 2009.
- A Method of Estimation of Missing Values in Multivariate Data Suitable for use with an Electronic Computer, 1960.

In this tutorial, you discovered how to use iterative imputation strategies for missing data in machine learning.

Specifically, you learned:

- Missing values must be marked with NaN values and can be replaced with iteratively estimated values.
- How to load a CSV value with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
- How to impute missing values with iterative models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Iterative Imputation for Missing Values in Machine Learning appeared first on Machine Learning Mastery.

]]>The post Test-Time Augmentation For Tabular Data With Scikit-Learn appeared first on Machine Learning Mastery.

]]>It is typically used to improve the predictive performance of deep learning models on image datasets where predictions are averaged across multiple augmented versions of each image in the test dataset.

Although popular with image datasets and neural network models, test-time augmentation can be used with any machine learning algorithm on tabular datasets, such as those often seen in regression and classification predictive modeling problems.

In this tutorial, you will discover how to use test-time augmentation for tabular data in scikit-learn.

After completing this tutorial, you will know:

- Test-time augmentation is a technique for improving model performance and is commonly used for deep learning models on image datasets.
- How to implement test-time augmentation for regression and classification tabular datasets in Python with scikit-learn.
- How to tune the number of synthetic examples and amount of statistical noise used in test-time augmentation.

Let’s get started.

This tutorial is divided into three parts; they are:

- Test-Time Augmentation
- Standard Model Evaluation
- Test-Time Augmentation Example

Test-time augmentation, or TTA for short, is a technique for improving the skill of a predictive model.

It is a procedure implemented when using a fit model to make predictions, such as on a test dataset or on new data. The procedure involves creating multiple slightly modified copies of each example in the dataset. A prediction is made for each modified example and the predictions are averaged to give a more accurate prediction for the original example.

TTA is often used with image classification, where image data augmentation is used to create multiple modified versions of each image, such as crops, zooms, rotations, and other image-specific modifications. As such, the technique results in a lift in the performance of image classification algorithms on standard datasets.

In their 2015 paper that achieved then state-of-the-art results on the ILSVRC dataset titled “*Very Deep Convolutional Networks for Large-Scale Image Recognition*,” the authors use horizontal flip test-time augmentation:

We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.

— Very Deep Convolutional Networks for Large-Scale Image Recognition, 2015.

For more on test-time augmentation with image data, see the tutorial:

Although often used for image data, test-time augmentation can also be used for other data types, such as tabular data (e.g. rows and columns of numbers).

There are many ways that TTA can be used with tabular data. One simple approach involves creating copies of rows of data with small Gaussian noise added. The predictions from the copied rows can then be averaged to result in an improved prediction for regression or classification.

We will explore how this might be achieved using the scikit-learn Python machine learning library.

First, let’s define a standard approach for evaluating a model.

In this section, we will explore the typical way of evaluating a machine learning model before we introduce test-time augmentation in the next section.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 100 examples, each with 20 input variables.

The example creates and summarizes the dataset.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(100, 20) (100,)

This is a binary classification task and we will fit and evaluate a linear model, specifically, a logistic regression model.

A good practice when evaluating machine learning models is to use repeated k-fold cross-validation. When the dataset is a classification problem, it is important to ensure that a stratified version of k-fold cross-validation is used. As such, we will use repeated stratified k-fold cross-validation with 10 folds and 5 repeats.

... # prepare the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)

We will enumerate the folds and repeats manually so that later we can perform test-time augmentation.

Each loop, we must define and fit the model, then use the fit model to make a prediction, evaluate the predictions, and store the result.

... scores = list() for train_ix, test_ix in cv.split(X, y): # split the data X_train, X_test = X[train_ix], X[test_ix] y_train, y_test = y[train_ix], y[test_ix] # fit model model = LogisticRegression() model.fit(X_train, y_train) # evaluate model y_hat = model.predict(X_test) acc = accuracy_score(y_test, y_hat) scores.append(acc)

At the end, we can report the mean classification accuracy across all folds and repeats.

... # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the synthetic binary classification dataset is listed below.

# evaluate logistic regression using repeated stratified k-fold cross-validation from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # create dataset X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1) # prepare the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1) scores = list() for train_ix, test_ix in cv.split(X, y): # split the data X_train, X_test = X[train_ix], X[test_ix] y_train, y_test = y[train_ix], y[test_ix] # fit model model = LogisticRegression() model.fit(X_train, y_train) # evaluate model y_hat = model.predict(X_test) acc = accuracy_score(y_test, y_hat) scores.append(acc) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the logistic regression using repeated stratified k-fold cross-validation.

Your specific results may differ given the stochastic nature of the learning algorithm. Consider running the example a few times,

In this case, we can see that the model achieved the mean classification accuracy of 79.8 percent.

Accuracy: 0.798 (0.110)

Next, let’s explore how we might update this example to use test-time augmentation.

Implementing test-time augmentation involves two steps.

The first step is to select a method for creating modified versions of each row in the test set.

In this tutorial, we will add Gaussian random noise to each feature. An alternate approach might be to add uniformly random noise or even copy feature values from examples in the test dataset.

The normal() NumPy function will be used to create a vector of random Gaussian values with a zero mean and small standard deviation. The standard deviation should be proportional to the distribution for each variable in the training dataset. In this case, we will keep the example simple and use a value of 0.02.

... # create vector of random gaussians gauss = normal(loc=0.0, scale=feature_scale, size=len(row)) # add to test case new_row = row + gauss

Given a row of data from the test set, we can create a given number of modified copies. It is a good idea to use an odd number of copies, such as 3, 5, or 7, as when we average the labels assigned to each later, we want to break ties automatically.

The *create_test_set()* function below implements this; given a row of data, it will return a test set that contains the row as well as “*n_cases*” modified copies, defaulting to 3 (so the test set size is 4).

# create a test set for a row of real data with an unknown label def create_test_set(row, n_cases=3, feature_scale=0.2): test_set = list() test_set.append(row) # make copies of row for _ in range(n_cases): # create vector of random gaussians gauss = normal(loc=0.0, scale=feature_scale, size=len(row)) # add to test case new_row = row + gauss # store in test set test_set.append(new_row) return test_set

An improvement to this approach would be to standardize or normalize the train and test datasets each loop and then use a standard deviation for the normal() that is consistent across features meaningful to the standard normal. This is left as an exercise for the reader.

The second setup is to make use of the *create_test_set()* for each example in the test set, make a prediction for the constructed test set, and record the predicted label using a summary statistic across the predictions. Given that the prediction is categorical, the statistical mode would be appropriate, via the mode() scipy function. If the dataset was regression or we were predicting probabilities, the mean or median would be more appropriate.

... # create the test set test_set = create_test_set(row) # make a prediction for all examples in the test set labels = model.predict(test_set) # select the label as the mode of the distribution label, _ = mode(labels)

The *test_time_augmentation()* function below implements this; given a model and a test set, it returns an array of predictions where each prediction was made using test-time augmentation.

# make predictions using test-time augmentation def test_time_augmentation(model, X_test): # evaluate model y_hat = list() for i in range(X_test.shape[0]): # retrieve the row row = X_test[i] # create the test set test_set = create_test_set(row) # make a prediction for all examples in the test set labels = model.predict(test_set) # select the label as the mode of the distribution label, _ = mode(labels) # store the prediction y_hat.append(label) return y_hat

Tying all of this together, the complete example of evaluating the logistic regression model on the dataset using test-time augmentation is listed below.

# evaluate logistic regression using test-time augmentation from numpy.random import seed from numpy.random import normal from numpy import mean from numpy import std from scipy.stats import mode from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score # create a test set for a row of real data with an unknown label def create_test_set(row, n_cases=3, feature_scale=0.2): test_set = list() test_set.append(row) # make copies of row for _ in range(n_cases): # create vector of random gaussians gauss = normal(loc=0.0, scale=feature_scale, size=len(row)) # add to test case new_row = row + gauss # store in test set test_set.append(new_row) return test_set # make predictions using test-time augmentation def test_time_augmentation(model, X_test): # evaluate model y_hat = list() for i in range(X_test.shape[0]): # retrieve the row row = X_test[i] # create the test set test_set = create_test_set(row) # make a prediction for all examples in the test set labels = model.predict(test_set) # select the label as the mode of the distribution label, _ = mode(labels) # store the prediction y_hat.append(label) return y_hat # initialize numpy random number generator seed(1) # create dataset X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1) # prepare the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1) scores = list() for train_ix, test_ix in cv.split(X, y): # split the data X_train, X_test = X[train_ix], X[test_ix] y_train, y_test = y[train_ix], y[test_ix] # fit model model = LogisticRegression() model.fit(X_train, y_train) # make predictions using test-time augmentation y_hat = test_time_augmentation(model, X_test) # calculate the accuracy for this iteration acc = accuracy_score(y_test, y_hat) # store the result scores.append(acc) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the logistic regression using repeated stratified k-fold cross-validation and test-time augmentation.

Your specific results may differ given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved the mean classification accuracy of 81.0 percent, which is better than the test harness that does not use test-time augmentation that achieved an accuracy of 79.8 percent.

Accuracy: 0.810 (0.114)

It might be interesting to grid search the number of synthetic examples created each time a prediction is made during test-time augmentation.

The example below explores values between 1 and 20 and plots the results.

# compare the number of synthetic examples created during the test-time augmentation from numpy.random import seed from numpy.random import normal from numpy import mean from numpy import std from scipy.stats import mode from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score from matplotlib import pyplot # create a test set for a row of real data with an unknown label def create_test_set(row, n_cases=3, feature_scale=0.2): test_set = list() test_set.append(row) # make copies of row for _ in range(n_cases): # create vector of random gaussians gauss = normal(loc=0.0, scale=feature_scale, size=len(row)) # add to test case new_row = row + gauss # store in test set test_set.append(new_row) return test_set # make predictions using test-time augmentation def test_time_augmentation(model, X_test, cases): # evaluate model y_hat = list() for i in range(X_test.shape[0]): # retrieve the row row = X_test[i] # create the test set test_set = create_test_set(row, n_cases=cases) # make a prediction for all examples in the test set labels = model.predict(test_set) # select the label as the mode of the distribution label, _ = mode(labels) # store the prediction y_hat.append(label) return y_hat # evaluate different number of synthetic examples created at test time examples = range(1, 21) results = list() for e in examples: # initialize numpy random number generator seed(1) # create dataset X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1) # prepare the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1) scores = list() for train_ix, test_ix in cv.split(X, y): # split the data X_train, X_test = X[train_ix], X[test_ix] y_train, y_test = y[train_ix], y[test_ix] # fit model model = LogisticRegression() model.fit(X_train, y_train) # make predictions using test-time augmentation y_hat = test_time_augmentation(model, X_test, e) # calculate the accuracy for this iteration acc = accuracy_score(y_test, y_hat) # store the result scores.append(acc) # report performance print('>%d, acc: %.3f (%.3f)' % (e, mean(scores), std(scores))) results.append(mean(scores)) # plot the results pyplot.plot(examples, results) pyplot.show()

Running the example reports the accuracy for different numbers of synthetic examples created during test-time augmentation.

Your specific results may differ given the stochastic nature of the learning algorithm. Consider running the example a few times.

Recall that we used three examples in the previous example.

In this case, it looks like a value of three might be optimal for this test harness, as all other values seem to result in lower performance.

>1, acc: 0.800 (0.118) >2, acc: 0.806 (0.114) >3, acc: 0.810 (0.114) >4, acc: 0.798 (0.105) >5, acc: 0.802 (0.109) >6, acc: 0.798 (0.107) >7, acc: 0.800 (0.111) >8, acc: 0.802 (0.110) >9, acc: 0.806 (0.105) >10, acc: 0.802 (0.110) >11, acc: 0.798 (0.112) >12, acc: 0.806 (0.110) >13, acc: 0.802 (0.110) >14, acc: 0.802 (0.109) >15, acc: 0.798 (0.110) >16, acc: 0.796 (0.111) >17, acc: 0.806 (0.112) >18, acc: 0.796 (0.111) >19, acc: 0.800 (0.113) >20, acc: 0.804 (0.109)

A line plot of number of examples vs. classification accuracy is created showing that perhaps odd numbers of examples generally result in better performance than even numbers of examples.

This might be expected due to their ability to break ties when using the mode of the predictions.

We can also perform the same sensitivity analysis with the amount of random noise added to examples in the test set during test-time augmentation.

The example below demonstrates this with noise values between 0.01 and 0.3 with a grid of 0.01.

# compare amount of noise added to examples created during the test-time augmentation from numpy.random import seed from numpy.random import normal from numpy import arange from numpy import mean from numpy import std from scipy.stats import mode from sklearn.datasets import make_classification from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score from matplotlib import pyplot # create a test set for a row of real data with an unknown label def create_test_set(row, n_cases=3, feature_scale=0.2): test_set = list() test_set.append(row) # make copies of row for _ in range(n_cases): # create vector of random gaussians gauss = normal(loc=0.0, scale=feature_scale, size=len(row)) # add to test case new_row = row + gauss # store in test set test_set.append(new_row) return test_set # make predictions using test-time augmentation def test_time_augmentation(model, X_test, noise): # evaluate model y_hat = list() for i in range(X_test.shape[0]): # retrieve the row row = X_test[i] # create the test set test_set = create_test_set(row, feature_scale=noise) # make a prediction for all examples in the test set labels = model.predict(test_set) # select the label as the mode of the distribution label, _ = mode(labels) # store the prediction y_hat.append(label) return y_hat # evaluate different number of synthetic examples created at test time noise = arange(0.01, 0.31, 0.01) results = list() for n in noise: # initialize numpy random number generator seed(1) # create dataset X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1) # prepare the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1) scores = list() for train_ix, test_ix in cv.split(X, y): # split the data X_train, X_test = X[train_ix], X[test_ix] y_train, y_test = y[train_ix], y[test_ix] # fit model model = LogisticRegression() model.fit(X_train, y_train) # make predictions using test-time augmentation y_hat = test_time_augmentation(model, X_test, n) # calculate the accuracy for this iteration acc = accuracy_score(y_test, y_hat) # store the result scores.append(acc) # report performance print('>noise=%.3f, acc: %.3f (%.3f)' % (n, mean(scores), std(scores))) results.append(mean(scores)) # plot the results pyplot.plot(noise, results) pyplot.show()

Running the example reports the accuracy for different amounts of statistical noise added to examples created during test-time augmentation.

Your specific results may differ given the stochastic nature of the learning algorithm. Consider running the example a few times.

Recall that we used a standard deviation of 0.02 in the first example.

In this case, it looks like a value of about 0.230 might be optimal for this test harness, resulting in a slightly higher accuracy of 81.2 percent.

>noise=0.010, acc: 0.798 (0.110) >noise=0.020, acc: 0.798 (0.110) >noise=0.030, acc: 0.798 (0.110) >noise=0.040, acc: 0.800 (0.113) >noise=0.050, acc: 0.802 (0.112) >noise=0.060, acc: 0.804 (0.111) >noise=0.070, acc: 0.806 (0.108) >noise=0.080, acc: 0.806 (0.108) >noise=0.090, acc: 0.806 (0.108) >noise=0.100, acc: 0.806 (0.108) >noise=0.110, acc: 0.806 (0.108) >noise=0.120, acc: 0.806 (0.108) >noise=0.130, acc: 0.806 (0.108) >noise=0.140, acc: 0.806 (0.108) >noise=0.150, acc: 0.808 (0.111) >noise=0.160, acc: 0.808 (0.111) >noise=0.170, acc: 0.808 (0.111) >noise=0.180, acc: 0.810 (0.114) >noise=0.190, acc: 0.810 (0.114) >noise=0.200, acc: 0.810 (0.114) >noise=0.210, acc: 0.810 (0.114) >noise=0.220, acc: 0.810 (0.114) >noise=0.230, acc: 0.812 (0.114) >noise=0.240, acc: 0.812 (0.114) >noise=0.250, acc: 0.812 (0.114) >noise=0.260, acc: 0.812 (0.114) >noise=0.270, acc: 0.810 (0.114) >noise=0.280, acc: 0.808 (0.116) >noise=0.290, acc: 0.808 (0.116) >noise=0.300, acc: 0.808 (0.116)

A line plot of the amount of noise added to examples vs. classification accuracy is created, showing that perhaps a small range of noise around a standard deviation of 0.250 might be optimal on this test harness.

SMOTE is a popular oversampling method for rebalancing observations for each class in a training dataset. It can create synthetic examples but requires knowledge of the class labels which does not make it easy for use in test-time augmentation.

One approach might be to take a given example for which a prediction is required and assume it belongs to a given class. Then generate synthetic samples from the training dataset using the new example as the focal point of the synthesis, and classify them. This is then repeated for each class label. The total or average classification response (perhaps probability) can be tallied for each class group and the group with the largest response can be taken as the prediction.

This is just off the cuff, I have not actually tried this approach. Have a go and let me know if it works.

This section provides more resources on the topic if you are looking to go deeper.

- How to Use Test-Time Augmentation to Make Better predictions
- How to Generate Random Numbers in Python

- sklearn.datasets.make_classification API.
- sklearn.model_selection.RepeatedStratifiedKFold API.
- numpy.random.normal API.
- scipy.stats.mode API.

In this tutorial, you discovered how to use test-time augmentation for tabular data in scikit-learn.

Specifically, you learned:

- Test-time augmentation is a technique for improving model performance and is commonly used for deep learning models on image datasets.
- How to implement test-time augmentation for regression and classification tabular datasets in Python with scikit-learn.
- How to tune the number of synthetic examples and amount of statistical noise used in test-time augmentation.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Test-Time Augmentation For Tabular Data With Scikit-Learn appeared first on Machine Learning Mastery.

]]>The post How to Use Polynomial Feature Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>These interactions can be identified and modeled by a learning algorithm. Another approach is to engineer new features that expose these interactions and see if they improve model performance. Additionally, transforms like raising input variables to a power can help to better expose the important relationships between input variables and the target variable.

These features are called interaction and polynomial features and allow the use of simpler modeling algorithms as some of the complexity of interpreting the input variables and their relationships is pushed back to the data preparation stage. Sometimes these features can result in improved modeling performance, although at the cost of adding thousands or even millions of additional input variables.

In this tutorial, you will discover how to use polynomial feature transforms for feature engineering with numerical input variables.

After completing this tutorial, you will know:

- Some machine learning algorithms prefer or perform better with polynomial input features.
- How to use the polynomial features transform to create new versions of input variables for predictive modeling.
- How the degree of the polynomial impacts the number of input features created by the transform.

Let’s get started.

This tutorial is divided into five parts; they are:

- Polynomial Features
- Polynomial Feature Transform
- Sonar Dataset
- Polynomial Feature Transform Example
- Effect of Polynomial Degree

Polynomial features are those features created by raising existing features to an exponent.

For example, if a dataset had one input feature X, then a polynomial feature would be the addition of a new feature (column) where values were calculated by squaring the values in X, e.g. X^2. This process can be repeated for each input variable in the dataset, creating a transformed version of each.

As such, polynomial features are a type of feature engineering, e.g. the creation of new input features based on the existing features.

The “*degree*” of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables for each input variable. Typically a small degree is used such as 2 or 3.

Generally speaking, it is unusual to use d greater than 3 or 4 because for large values of d, the polynomial curve can become overly flexible and can take on some very strange shapes.

— Page 266, An Introduction to Statistical Learning with Applications in R, 2014.

It is also common to add new variables that represent the interaction between features, e.g a new column that represents one variable multiplied by another. This too can be repeated for each input variable creating a new “*interaction*” variable for each pair of input variables.

A squared or cubed version of an input variable will change the probability distribution, separating the small and large values, a separation that is increased with the size of the exponent.

This separation can help some machine learning algorithms make better predictions and is common for regression predictive modeling tasks and generally tasks that have numerical input variables.

Typically linear algorithms, such as linear regression and logistic regression, respond well to the use of polynomial input variables.

Linear regression is linear in the model parameters and adding polynomial terms to the model can be an effective way of allowing the model to identify nonlinear patterns.

— Page 11, Feature Engineering and Selection, 2019.

For example, when used as input to a linear regression algorithm, the method is more broadly referred to as polynomial regression.

Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power. For example, a cubic regression uses three variables, X, X2, and X3, as predictors. This approach provides a simple way to provide a non-linear fit to data.

— Page 265, An Introduction to Statistical Learning with Applications in R, 2014.

The polynomial features transform is available in the scikit-learn Python machine learning library via the PolynomialFeatures class.

The features created include:

- The bias (the value of 1.0)
- Values raised to a power for each degree (e.g. x^1, x^2, x^3, …)
- Interactions between all pairs of features (e.g. x1 * x2, x1 * x3, …)

For example, with two input variables with values 2 and 3 and a degree of 2, the features created would be:

- 1 (the bias)
- 2^1 = 2
- 3^1 = 3
- 2^2 = 4
- 3^2 = 9
- 2 * 3 = 6

We can demonstrate this with an example:

# demonstrate the types of features created from numpy import asarray from sklearn.preprocessing import PolynomialFeatures # define the dataset data = asarray([[2,3],[2,3],[2,3]]) print(data) # perform a polynomial features transform of the dataset trans = PolynomialFeatures(degree=2) data = trans.fit_transform(data) print(data)

Running the example first reports the raw data with two features (columns) and each feature has the same value, either 2 or 3.

Then the polynomial features are created, resulting in six features, matching what was described above.

[[2 3] [2 3] [2 3]] [[1. 2. 3. 4. 6. 9.] [1. 2. 3. 4. 6. 9.] [1. 2. 3. 4. 6. 9.]]

The “*degree*” argument controls the number of features created and defaults to 2.

The “*interaction_only*” argument means that only the raw values (degree 1) and the interaction (pairs of values multiplied with each other) are included, defaulting to *False*.

The “*include_bias*” argument defaults to *True* to include the bias feature.

We will take a closer look at how to use the polynomial feature transforms on a real dataset.

First, let’s introduce a real dataset.

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. Top performance on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.

The dataset describes radar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically from our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

# load and summarize the sonar dataset from pandas import read_csv from pandas.plotting import scatter_matrix from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # summarize the shape of the dataset print(dataset.shape) # summarize each variable print(dataset.describe()) # histograms of the variables dataset.hist() pyplot.show()

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.

(208, 61) 0 1 2 ... 57 58 59 count 208.000000 208.000000 208.000000 ... 208.000000 208.000000 208.000000 mean 0.029164 0.038437 0.043832 ... 0.007949 0.007941 0.006507 std 0.022991 0.032960 0.038428 ... 0.006470 0.006181 0.005031 min 0.001500 0.000600 0.001500 ... 0.000300 0.000100 0.000600 25% 0.013350 0.016450 0.018950 ... 0.003600 0.003675 0.003100 50% 0.022800 0.030800 0.034300 ... 0.005800 0.006400 0.005300 75% 0.035550 0.047950 0.057950 ... 0.010350 0.010325 0.008525 max 0.137100 0.233900 0.305900 ... 0.044000 0.036400 0.043900 [8 rows x 60 columns]

Finally, a histogram is created for each input variable.

If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation. The complete example is listed below.

# evaluate knn on the raw sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define and configure the model model = KNeighborsClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report model performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates a KNN model on the raw sonar dataset.

We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).

Accuracy: 0.797 (0.073)

Next, let’s explore a polynomial features transform of the dataset.

We can apply the polynomial features transform to the Sonar dataset directly.

In this case, we will use a degree of 3.

... # perform a polynomial features transform of the dataset trans = PolynomialFeatures(degree=3) data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a polynomial features transform of the sonar dataset and summarizing the created features is below.

# visualize a polynomial features transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import PolynomialFeatures from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a polynomial features transform of the dataset trans = PolynomialFeatures(degree=3) data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # summarize print(dataset.shape)

Running the example performs the polynomial features transform on the sonar dataset.

We can see that our features increased from 61 (60 input features) for the raw dataset to 39,711 features (39,710 input features).

(208, 39711)

Next, let’s evaluate the same KNN model as the previous section, but in this case on a polynomial features transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with polynomial features transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = PolynomialFeatures(degree=3) model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the polynomial features transform results in a lift in performance from 79.7 percent accuracy without the transform to about 80.0 percent with the transform.

Accuracy: 0.800 (0.077)

Next, let’s explore the effect of different scaling ranges.

The degree of the polynomial dramatically increases the number of input features.

To get an idea of how much this impacts the number of features, we can perform the transform with a range of different degrees and compare the number of features in the dataset.

The complete example is listed below.

# compare the effect of the degree on the number of created features from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import PolynomialFeatures from matplotlib import pyplot # get the dataset def get_dataset(): # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # define dataset X, y = get_dataset() # calculate change in number of features num_features = list() degress = [i for i in range(1, 6)] for d in degress: # create transform trans = PolynomialFeatures(degree=d) # fit and transform data = trans.fit_transform(X) # record number of features num_features.append(data.shape[1]) # summarize print('Degree: %d, Features: %d' % (d, data.shape[1])) # plot degree vs number of features pyplot.plot(degress, num_features) pyplot.show()

Running the example first reports the degree from 1 to 5 and the number of features in the dataset.

We can see that a degree of 1 has no effect and that the number of features dramatically increases from 2 through to 5.

This highlights that for anything other than very small datasets, a degree of 2 or 3 should be used to avoid a dramatic increase in input variables.

Degree: 1, Features: 61 Degree: 2, Features: 1891 Degree: 3, Features: 39711 Degree: 4, Features: 635376 Degree: 5, Features: 8259888

More features may result in more overfitting, and in turn, worse results.

It may be a good idea to treat the degree for the polynomial features transform as a hyperparameter and test different values for your dataset.

The example below explores degree values from 1 to 4 and evaluates their effect on classification accuracy with the chosen model.

# explore the effect of degree on accuracy for the polynomial features transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import PolynomialFeatures from sklearn.preprocessing import LabelEncoder from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # get a list of models to evaluate def get_models(): models = dict() for d in range(1,5): # define the pipeline trans = PolynomialFeatures(degree=d) model = KNeighborsClassifier() models[str(d)] = Pipeline(steps=[('t', trans), ('m', model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example reports the mean classification accuracy for each polynomial degree.

In this case, we can see that performance is generally worse than no transform (degree 1) except for a degree 3.

It might be interesting to explore scaling the data before or after performing the transform to see how it impacts model performance.

>1 0.797 (0.073) >2 0.793 (0.085) >3 0.800 (0.077) >4 0.795 (0.079)

Box and whisker plots are created to summarize the classification accuracy scores for each polynomial degree.

We can see that performance remains flat, perhaps with the first signs of overfitting with a degree of 4.

This section provides more resources on the topic if you are looking to go deeper.

- An Introduction to Statistical Learning with Applications in R, 2014.
- Feature Engineering and Selection, 2019.

In this tutorial, you discovered how to use polynomial feature transforms for feature engineering with numerical input variables.

Specifically, you learned:

- Some machine learning algorithms prefer or perform better with polynomial input features.
- How to use the polynomial features transform to create new versions of input variables for predictive modeling.
- How the degree of the polynomial impacts the number of input features created by the transform.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Polynomial Feature Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Scale Data With Outliers for Machine Learning appeared first on Machine Learning Mastery.

]]>This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

Standardizing is a popular scaling technique that subtracts the mean from values and divides by the standard deviation, transforming the probability distribution for an input variable to a standard Gaussian (zero mean and unit variance). Standardization can become skewed or biased if the input variable contains outlier values.

To overcome this, the median and interquartile range can be used when standardizing numerical input variables, generally referred to as robust scaling.

In this tutorial, you will discover how to use robust scaler transforms to standardize numerical input variables for classification and regression.

After completing this tutorial, you will know:

- Many machine learning algorithms prefer or perform better when numerical input variables are scaled.
- Robust scaling techniques that use percentiles can be used to scale numerical input variables that contain outliers.
- How to use the RobustScaler to scale numerical input variables using the median and interquartile range.

Let’s get started.

This tutorial is divided into five parts; they are:

- Scaling Data
- Robust Scaler Transforms
- Sonar Dataset
- IQR Robust Scaler Transform
- Explore Robust Scaler Range

It is common to scale data prior to fitting a machine learning model.

This is because data often consists of many different input variables or features (columns) and each may have a different range of values or units of measure, such as feet, miles, kilograms, dollars, etc.

If there are input variables that have very large values relative to the other input variables, these large values can dominate or skew some machine learning algorithms. The result is that the algorithms pay most of their attention to the large values and ignore the variables with smaller values.

This includes algorithms that use a weighted sum of inputs like linear regression, logistic regression, and artificial neural networks, as well as algorithms that use distance measures between examples, such as k-nearest neighbors and support vector machines.

As such, it is normal to scale input variables to a common range as a data preparation technique prior to fitting a model.

One approach to data scaling involves calculating the mean and standard deviation of each variable and using these values to scale the values to have a mean of zero and a standard deviation of one, a so-called “*standard normal*” probability distribution. This process is called standardization and is most useful when input variables have a Gaussian probability distribution.

Standardization is calculated by subtracting the mean value and dividing by the standard deviation.

- value = (value – mean) / stdev

Sometimes an input variable may have outlier values. These are values on the edge of the distribution that may have a low probability of occurrence, yet are overrepresented for some reason. Outliers can skew a probability distribution and make data scaling using standardization difficult as the calculated mean and standard deviation will be skewed by the presence of the outliers.

One approach to standardizing input variables in the presence of outliers is to ignore the outliers from the calculation of the mean and standard deviation, then use the calculated values to scale the variable.

This is called robust standardization or **robust data scaling**.

This can be achieved by calculating the median (50th percentile) and the 25th and 75th percentiles. The values of each variable then have their median subtracted and are divided by the interquartile range (IQR) which is the difference between the 75th and 25th percentiles.

- value = (value – median) / (p75 – p25)

The resulting variable has a zero mean and median and a standard deviation of 1, although not skewed by outliers and the outliers are still present with the same relative relationships to other values.

The robust scaler transform is available in the scikit-learn Python machine learning library via the RobustScaler class.

The “*with_centering*” argument controls whether the value is centered to zero (median is subtracted) and defaults to *True*.

The “*with_scaling*” argument controls whether the value is scaled to the IQR (standard deviation set to one) or not and defaults to *True*.

Interestingly, the definition of the scaling range can be specified via the “*quantile_range*” argument. It takes a tuple of two integers between 0 and 100 and defaults to the percentile values of the IQR, specifically (25, 75). Changing this will change the definition of outliers and the scope of the scaling.

We will take a closer look at how to use the robust scaler transforms on a real dataset.

First, let’s introduce a real dataset.

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. Top performance on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.

The dataset describes radar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically from our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

# load and summarize the sonar dataset from pandas import read_csv from pandas.plotting import scatter_matrix from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # summarize the shape of the dataset print(dataset.shape) # summarize each variable print(dataset.describe()) # histograms of the variables dataset.hist() pyplot.show()

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.

(208, 61) 0 1 2 ... 57 58 59 count 208.000000 208.000000 208.000000 ... 208.000000 208.000000 208.000000 mean 0.029164 0.038437 0.043832 ... 0.007949 0.007941 0.006507 std 0.022991 0.032960 0.038428 ... 0.006470 0.006181 0.005031 min 0.001500 0.000600 0.001500 ... 0.000300 0.000100 0.000600 25% 0.013350 0.016450 0.018950 ... 0.003600 0.003675 0.003100 50% 0.022800 0.030800 0.034300 ... 0.005800 0.006400 0.005300 75% 0.035550 0.047950 0.057950 ... 0.010350 0.010325 0.008525 max 0.137100 0.233900 0.305900 ... 0.044000 0.036400 0.043900 [8 rows x 60 columns]

Finally, a histogram is created for each input variable.

If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.

The dataset provides a good candidate for using a robust scaler transform to standardize the data in the presence of skewed distributions and outliers.

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation. The complete example is listed below.

# evaluate knn on the raw sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define and configure the model model = KNeighborsClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report model performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates a KNN model on the raw sonar dataset.

We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).

Accuracy: 0.797 (0.073)

Next, let’s explore a robust scaling transform of the dataset.

We can apply the robust scaler to the Sonar dataset directly.

We will use the default configuration and scale values to the IQR. First, a RobustScaler instance is defined with default hyperparameters. Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a quantile transformed version of our dataset.

... # perform a robust scaler transform of the dataset trans = RobustScaler() data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a robust scaler transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a robust scaler transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import RobustScaler from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a robust scaler transform of the dataset trans = RobustScaler() data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # summarize print(dataset.describe()) # histograms of the variables dataset.hist() pyplot.show()

Running the example first reports a summary of each input variable.

We can see that the distributions have been adjusted. The median values are now zero and the standard deviation values are now close to 1.0.

0 1 ... 58 59 count 208.000000 208.000000 ... 2.080000e+02 208.000000 mean 0.286664 0.242430 ... 2.317814e-01 0.222527 std 1.035627 1.046347 ... 9.295312e-01 0.927381 min -0.959459 -0.958730 ... -9.473684e-01 -0.866359 25% -0.425676 -0.455556 ... -4.097744e-01 -0.405530 50% 0.000000 0.000000 ... 6.591949e-17 0.000000 75% 0.574324 0.544444 ... 5.902256e-01 0.594470 max 5.148649 6.447619 ... 4.511278e+00 7.115207 [8 rows x 60 columns]

Histogram plots of the variables are created, although the distributions don’t look much different from their original distributions seen in the previous section.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a robust scaler transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with robust scaler transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import RobustScaler from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = RobustScaler(with_centering=False, with_scaling=True) model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the robust scaler transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.9 percent with the transform.

Accuracy: 0.819 (0.076)

Next, let’s explore the effect of different scaling ranges.

The range used to scale each variable is chosen by default as the IQR is bounded by the 25th and 75th percentiles.

This is specified by the “*quantile_range*” argument as a tuple.

Other values can be specified and might improve the performance of the model, such as a wider range, allowing fewer values to be considered outliers, or a more narrow range, allowing more values to be considered outliers.

The example below explores the effect of different definitions of the range from 1st to the 99th percentiles to 30th to 70th percentiles.

The complete example is listed below.

# explore the scaling range of the robust scaler transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import RobustScaler from sklearn.preprocessing import LabelEncoder from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # get a list of models to evaluate def get_models(): models = dict() for value in [1, 5, 10, 15, 20, 25, 30]: # define the pipeline trans = RobustScaler(quantile_range=(value, 100-value)) model = KNeighborsClassifier() models[str(value)] = Pipeline(steps=[('t', trans), ('m', model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example reports the mean classification accuracy for each value-defined IQR range.

We can see that the default of 25th to 75th percentile achieves the best results, although the values of 20-80 and 30-70 achieve results that are very similar.

>1 0.818 (0.069) >5 0.813 (0.085) >10 0.812 (0.076) >15 0.811 (0.081) >20 0.811 (0.080) >25 0.819 (0.076) >30 0.816 (0.072)

Box and whisker plots are created to summarize the classification accuracy scores for each IQR range.

We can see a marked difference in the distribution and mean accuracy with the larger ranges of 25-75 and 30-70 percentiles.

This section provides more resources on the topic if you are looking to go deeper.

- Standardization, or mean removal and variance scaling, scikit-learn.
- sklearn.preprocessing.RobustScaler API.

In this tutorial, you discovered how to use robust scaler transforms to standardize numerical input variables for classification and regression.

Specifically, you learned:

- Many machine learning algorithms prefer or perform better when numerical input variables are scaled.
- Robust scaling techniques that use percentiles can be used to scale numerical input variables that contain outliers.
- How to use the RobustScaler to scale numerical input variables using the median and interquartile range.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Scale Data With Outliers for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Recursive Feature Elimination (RFE) for Feature Selection in Python appeared first on Machine Learning Mastery.

]]>RFE is popular because it is easy to configure and use and because it is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable.

There are two important configuration options when using RFE: the choice in the number of features to select and the choice of the algorithm used to help choose features. Both of these hyperparameters can be explored, although the performance of the method is not strongly dependent on these hyperparameters being configured well.

In this tutorial, you will discover how to use Recursive Feature Elimination (RFE) for feature selection in Python.

After completing this tutorial, you will know:

- RFE is an efficient approach for eliminating features from a training dataset for feature selection.
- How to use RFE for feature selection for classification and regression predictive modeling problems.
- How to explore the number of selected features and wrapped algorithm used by the RFE procedure.

Let’s get started.

This tutorial is divided into three parts; they are:

- Recursive Feature Elimination
- RFE With scikit-learn
- RFE for Classification
- RFE for Regression

- RFE Hyperparameters
- Explore Number of Features
- Automatically Select the Number of Features
- Which Features Were Selected
- Explore Base Algorithm

Recursive Feature Elimination, or RFE for short, is a feature selection algorithm.

A machine learning dataset for classification or regression is comprised of rows and columns, like an excel spreadsheet. Rows are often referred to as samples and columns are referred to as features, e.g. features of an observation in a problem domain.

Feature selection refers to techniques that select a subset of the most relevant features (columns) for a dataset. Fewer features can allow machine learning algorithms to run more efficiently (less space or time complexity) and be more effective. Some machine learning algorithms can be misled by irrelevant input features, resulting in worse predictive performance.

For more on feature selection generally, see the tutorial:

RFE is a wrapper-type feature selection algorithm. This means that a different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features. This is in contrast to filter-based feature selections that score each feature and select those features with the largest (or smallest) score.

Technically, RFE is a wrapper-style feature selection algorithm that also uses filter-based feature selection internally.

RFE works by searching for a subset of features by starting with all features in the training dataset and successfully removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.

When the full model is created, a measure of variable importance is computed that ranks the predictors from most important to least. […] At each stage of the search, the least important predictors are iteratively eliminated prior to rebuilding the model.

— Pages 494-495, Applied Predictive Modeling, 2013.

Features are scored either using the provided machine learning model (e.g. some algorithms like decision trees offer importance scores) or by using a statistical method.

The importance calculations can be model based (e.g., the random forest importance criterion) or using a more general approach that is independent of the full model.

— Page 494, Applied Predictive Modeling, 2013.

Now that we are familiar with the RFE procedure, let’s review how we can use it in our projects.

RFE can be implemented from scratch, although it can be challenging for beginners.

The scikit-learn Python machine learning library provides an implementation of RFE for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

The RFE method is available via the RFE class in scikit-learn.

RFE is a transform. To use it, first the class is configured with the chosen algorithm specified via the “*estimator*” argument and the number of features to select via the “*n_features_to_select*” argument.

The algorithm must provide a way to calculate important scores, such as a decision tree. The algorithm used in RFE does not have to be the algorithm that is fit on the selected features; different algorithms can be used.

Once configured, the class must be fit on a training dataset to select the features by calling the *fit()* function. After the class is fit, the choice of input variables can be seen via the “*support_*” attribute that provides a *True* or *False* for each input variable.

It can then be applied to the training and test datasets by calling the *transform()* function.

... # define the method rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=3) # fit the model rfe.fit(X, y) # transform the data X, y = rfe.transform(X, y)

It is common to use k-fold cross-validation to evaluate a machine learning algorithm on a dataset. When using cross-validation, it is good practice to perform data transforms like RFE as part of a Pipeline to avoid data leakage.

Now that we are familiar with the RFE API, let’s take a look at how to develop a RFE for both classification and regression.

In this section, we will look at using RFE for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 10 input features, five of which are important and five of which are redundant.

The complete example is listed below.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 10) (1000,)

Next, we can evaluate an RFE feature selection algorithm on this dataset. We will use a DecisionTreeClassifier to choose features and set the number of features to five. We will then fit a new *DecisionTreeClassifier* model on the selected features.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

The complete example is listed below.

# evaluate RFE for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) # create pipeline rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5) model = DecisionTreeClassifier() pipeline = Pipeline(steps=[('s',rfe),('m',model)]) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE that uses a decision tree and selects five features and then fits a decision tree on the selected features achieves a classification accuracy of about 88.6 percent.

Accuracy: 0.886 (0.030)

We can also use the RFE model pipeline as a final model and make predictions for classification.

First, the RFE and model are fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with an RFE pipeline from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) # create pipeline rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5) model = DecisionTreeClassifier() pipeline = Pipeline(steps=[('s',rfe),('m',model)]) # fit the model on all available data pipeline.fit(X, y) # make a prediction for one example data = [[2.56999479,-0.13019997,3.16075093,-4.35936352,-1.61271951,-1.39352057,-2.48924933,-1.93094078,3.26130366,2.05692145]] yhat = pipeline.predict(data) print('Predicted Class: %d' % (yhat))

Running the example fits the RFE pipeline on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 1

Now that we are familiar with using RFE for classification, let’s look at the API for regression.

In this section, we will look at using RFE for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 10 input features, five of which are important and five of which are redundant.

The complete example is listed below.

# test regression dataset from sklearn.datasets import make_regression # define dataset X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 10) (1000,)

Next, we can evaluate an REFE algorithm on this dataset.

As we did with the last section, we will evaluate the pipeline with a decision tree using repeated k-fold cross-validation, with three repeats and 10 folds.

We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

# evaluate RFE for regression from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeRegressor from sklearn.pipeline import Pipeline # define dataset X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1) # create pipeline rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5) model = DecisionTreeRegressor() pipeline = Pipeline(steps=[('s',rfe),('m',model)]) # evaluate model cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise') # report performance print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE pipeline with a decision tree model achieves a MAE of about 26.

MAE: -26.853 (2.696)

We can also use the c as a final model and make predictions for regression.

First, the Pipeline is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

# make a regression prediction with an RFE pipeline from numpy import mean from numpy import std from sklearn.datasets import make_regression from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeRegressor from sklearn.pipeline import Pipeline # define dataset X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1) # create pipeline rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=5) model = DecisionTreeRegressor() pipeline = Pipeline(steps=[('s',rfe),('m',model)]) # fit the model on all available data pipeline.fit(X, y) # make a prediction for one example data = [[-2.02220122,0.31563495,0.82797464,-0.30620401,0.16003707,-1.44411381,0.87616892,-0.50446586,0.23009474,0.76201118]] yhat = pipeline.predict(data) print('Predicted: %.3f' % (yhat))

Running the example fits the RFE pipeline on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted: -84.288

Now that we are familiar with using the scikit-learn API to evaluate and use RFE for feature selection, let’s look at configuring the model.

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the RFE method for feature selection and their effect on model performance.

An important hyperparameter for the RFE algorithm is the number of features to select.

In the previous section, we used an arbitrary number of selected features, five, which matches the number of informative features in the synthetic dataset. In practice, we cannot know the best number of features to select with RFE; instead, it is good practice to test different values.

The example below demonstrates selecting different numbers of features from 2 to 10 on the synthetic binary classification dataset.

# explore the number of selected features for RFE from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) return X, y # get a list of models to evaluate def get_models(): models = dict() for i in range(2, 10): rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=i) model = DecisionTreeClassifier() models[str(i)] = Pipeline(steps=[('s',rfe),('m',model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each configured number of input features.

In this case, we can see that performance improves as the number of features increase and perhaps peaks around 4-to-7 as we might expect, given that only five features are relevant to the target variable.

>2 0.715 (0.044) >3 0.825 (0.031) >4 0.876 (0.033) >5 0.887 (0.030) >6 0.890 (0.031) >7 0.888 (0.025) >8 0.885 (0.028) >9 0.884 (0.025)

A box and whisker plot is created for the distribution of accuracy scores for each configured number of features.

It is also possible to automatically select the number of features chosen by RFE.

This can be achieved by performing cross-validation evaluation of different numbers of features as we did in the previous section and automatically selecting the number of features that resulted in the best mean score.

The RFECV class implements this for us.

The *RFECV* is configured just like the RFE class regarding the choice of the algorithm that is wrapped. Additionally, the minimum number of features to be considered can be specified via the “*min_features_to_select*” argument (defaults to 1) and we can also specify the type of cross-validation and scoring to use via the “*cv*” (defaults to 5) and “*scoring*” arguments (uses accuracy for classification).

... # automatically choose the number of features rfe = RFECV(estimator=DecisionTreeClassifier())

We can demonstrate this on our synthetic binary classification problem and use RFECV in our pipeline instead of RFE to automatically choose the number of selected features.

The complete example is listed below.

# automatically select the number of features for RFE from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import RFECV from sklearn.tree import DecisionTreeClassifier from sklearn.pipeline import Pipeline # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) # create pipeline rfe = RFECV(estimator=DecisionTreeClassifier()) model = DecisionTreeClassifier() pipeline = Pipeline(steps=[('s',rfe),('m',model)]) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see the RFE that uses a decision tree and automatically selects a number of features and then fits a decision tree on the selected features achieves a classification accuracy of about 88.6 percent.

Accuracy: 0.886 (0.026)

When using RFE, we may be interested to know which features were selected and which were removed.

This can be achieved by reviewing the attributes of the fit RFE object (or fit RFECV object). The “*support_*” attribute reports true or false as to which features in order of column index were included and the “*ranking_*” attribute reports the relative ranking of features in the same order.

The example below fits an RFE model on the whole dataset and selects five features, then reports each feature column index (0 to 9), whether it was selected or not (*True* or *False*), and the relative feature ranking.

# report which features were selected by RFE from sklearn.datasets import make_classification from sklearn.feature_selection import RFE from sklearn.tree import DecisionTreeClassifier # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) # define RFE rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5) # fit RFE rfe.fit(X, y) # summarize all features for i in range(X.shape[1]): print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))

Running the example lists of the 10 input features and whether or not they were selected as well as their relative ranking of importance.

Column: 0, Selected False, Rank: 5.000 Column: 1, Selected False, Rank: 4.000 Column: 2, Selected True, Rank: 1.000 Column: 3, Selected True, Rank: 1.000 Column: 4, Selected True, Rank: 1.000 Column: 5, Selected False, Rank: 6.000 Column: 6, Selected True, Rank: 1.000 Column: 7, Selected False, Rank: 3.000 Column: 8, Selected True, Rank: 1.000 Column: 9, Selected False, Rank: 2.000

There are many algorithms that can be used in the core RFE, as long as they provide some indication of variable importance.

Most decision tree algorithms are likely to report the same general trends in feature importance, but this is not guaranteed. It might be helpful to explore the use of different algorithms wrapped by RFE.

The example below demonstrates how you might explore this configuration option.

# explore the algorithm wrapped by RFE from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.feature_selection import RFE from sklearn.linear_model import LogisticRegression from sklearn.linear_model import Perceptron from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1) return X, y # get a list of models to evaluate def get_models(): models = dict() # lr rfe = RFE(estimator=LogisticRegression(), n_features_to_select=5) model = DecisionTreeClassifier() models['lr'] = Pipeline(steps=[('s',rfe),('m',model)]) # perceptron rfe = RFE(estimator=Perceptron(), n_features_to_select=5) model = DecisionTreeClassifier() models['per'] = Pipeline(steps=[('s',rfe),('m',model)]) # cart rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5) model = DecisionTreeClassifier() models['cart'] = Pipeline(steps=[('s',rfe),('m',model)]) # rf rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5) model = DecisionTreeClassifier() models['rf'] = Pipeline(steps=[('s',rfe),('m',model)]) # gbm rfe = RFE(estimator=GradientBoostingClassifier(), n_features_to_select=5) model = DecisionTreeClassifier() models['gbm'] = Pipeline(steps=[('s',rfe),('m',model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example first reports the mean accuracy for each wrapped algorithm.

In this case, the results suggest that linear algorithms like logistic regression might select better features more reliably than the chosen decision tree and ensemble of decision tree algorithms.

>lr 0.893 (0.030) >per 0.843 (0.040) >cart 0.887 (0.033) >rf 0.858 (0.038) >gbm 0.891 (0.030)

A box and whisker plot is created for the distribution of accuracy scores for each configured wrapped algorithm.

We can see the general trend of good performance with logistic regression, CART and perhaps GBM. This highlights that even thought the actual model used to fit the chosen features is the same in each case, the model used within RFE can make an important difference to which features are selected and in turn the performance on the prediction problem.

This section provides more resources on the topic if you are looking to go deeper.

- Applied Predictive Modeling, 2013.

- Recursive feature elimination, scikit-learn Documentation.
- sklearn.feature_selection.RFE API.
- sklearn.feature_selection.RFECV API.

In this tutorial, you discovered how to use Recursive Feature Elimination (RFE) for feature selection in Python.

Specifically, you learned:

- RFE is an efficient approach for eliminating features from a training dataset for feature selection.
- How to use RFE for feature selection for classification and regression predictive modeling problems.
- How to explore the number of selected features and wrapped algorithm used by the RFE procedure.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Recursive Feature Elimination (RFE) for Feature Selection in Python appeared first on Machine Learning Mastery.

]]>The post How to Use Discretization Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.

Many machine learning algorithms prefer or perform better when numerical input variables have a standard probability distribution.

The discretization transform provides an automatic way to change a numeric input variable to have a different data distribution, which in turn can be used as input to a predictive model.

In this tutorial, you will discover how to use discretization transforms to map numerical values to discrete categories for machine learning

After completing this tutorial, you will know:

- Many machine learning algorithms prefer or perform better when numerical with non-standard probability distributions are made discrete.
- Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.
- How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.

Let’s get started.

This tutorial is divided into six parts; they are:

- Change Data Distribution
- Discretization Transforms
- Sonar Dataset
- Uniform Discretization Transform
- K-means Discretization Transform
- Quantile Discretization Transform

Some machine learning algorithms may prefer or require categorical or ordinal input variables, such as some decision tree and rule-based algorithms.

Some classification and clustering algorithms deal with nominal attributes only and cannot handle ones measured on a numeric scale.

— Page 296, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Further, the performance of many machine learning algorithms degrades for variables that have non-standard probability distributions.

This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.

Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. Some input variables may have outliers that cause the distribution to be highly spread.

These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.

As such, it is often desirable to transform each input variable to have a standard probability distribution.

One approach is to use transform of the numerical variable to have a discrete probability distribution where each numerical value is assigned a label and the labels have an ordered (ordinal) relationship.

This is called a **binning** or a **discretization transform** and can improve the performance of some machine learning models for datasets by making the probability distribution of numerical input variables discrete.

A discretization transform will map numerical variables onto discrete values.

Binning, also known as categorization or discretization, is the process of translating a quantitative variable into a set of two or more qualitative buckets (i.e., categories).

— Page 129, Feature Engineering and Selection, 2019.

Values for the variable are grouped together into discrete bins and each bin is assigned a unique integer such that the ordinal relationship between the bins is preserved.

The use of bins is often referred to as binning or *k*-bins, where *k* refers to the number of groups to which a numeric variable is mapped.

The mapping provides a high-order ranking of values that can smooth out the relationships between observations. The transformation can be applied to each numeric input variable in the training dataset and then provided as input to a machine learning model to learn a predictive modeling task.

The determination of the bins must be included inside of the resampling process.

— Page 132, Feature Engineering and Selection, 2019.

Different methods for grouping the values into k discrete bins can be used; common techniques include:

**Uniform**: Each bin has the same width in the span of possible values for the variable.**Quantile**: Each bin has the same number of values, split based on percentiles.**Clustered**: Clusters are identified and examples are assigned to each group.

The discretization transform is available in the scikit-learn Python machine learning library via the KBinsDiscretizer class.

The “*strategy*” argument controls the manner in which the input variable is divided, as either “*uniform*,” “*quantile*,” or “*kmeans*.”

The “*n_bins*” argument controls the number of bins that will be created and must be set based on the choice of strategy, e.g. “*uniform*” is flexible, “*quantile*” must have a “*n_bins*” less than the number of observations or sensible percentiles, and “*kmeans*” must use a value for the number of clusters that can be reasonably found.

The “*encode*” argument controls whether the transform will map each value to an integer value by setting “*ordinal*” or a one-hot encoding “*onehot*.” An ordinal encoding is almost always preferred, although a one-hot encoding may allow a model to learn non-ordinal relationships between the groups, such as in the case of *k*-means clustering strategy.

We can demonstrate the *KBinsDiscretizer* with a small worked example. We can generate a sample of random Gaussian numbers. The KBinsDiscretizer can then be used to convert the floating values into fixed number of discrete categories with an ranked ordinal relationship.

The complete example is listed below.

# demonstration of the discretization transform from numpy.random import randn from sklearn.preprocessing import KBinsDiscretizer from matplotlib import pyplot # generate gaussian data sample data = randn(1000) # histogram of the raw data pyplot.hist(data, bins=25) pyplot.show() # reshape data to have rows and columns data = data.reshape((len(data),1)) # discretization transform the raw data kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform') data_trans = kbins.fit_transform(data) # summarize first few rows print(data_trans[:10, :]) # histogram of the transformed data pyplot.hist(data_trans, bins=10) pyplot.show()

Running the example first creates a sample of 1,000 random Gaussian floating-point values and plots the data as a histogram.

Next the KBinsDiscretizer is used to map the numerical values to categorical values. We configure the transform to create 10 categories (0 to 9), to output the result in ordinal format (integers) and to divide the range of the input data uniformly.

A sample of the transformed data is printed, clearly showing the integer format of the data as expected.

[[5.] [3.] [2.] [6.] [7.] [5.] [3.] [4.] [4.] [2.]]

Finally, a histogram is created showing the 10 discrete categories and how the observations are distributed across these groups, following the same pattern as the original data with a Gaussian shape.

In the following sections will take a closer look at how to use the discretization transform on a real dataset.

Next, let’s introduce the dataset.

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. Top performance on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.

The dataset describes radar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically from our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

# load and summarize the sonar dataset from pandas import read_csv from pandas.plotting import scatter_matrix from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # summarize the shape of the dataset print(dataset.shape) # summarize each variable print(dataset.describe()) # histograms of the variables dataset.hist() pyplot.show()

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.

(208, 61) 0 1 2 ... 57 58 59 count 208.000000 208.000000 208.000000 ... 208.000000 208.000000 208.000000 mean 0.029164 0.038437 0.043832 ... 0.007949 0.007941 0.006507 std 0.022991 0.032960 0.038428 ... 0.006470 0.006181 0.005031 min 0.001500 0.000600 0.001500 ... 0.000300 0.000100 0.000600 25% 0.013350 0.016450 0.018950 ... 0.003600 0.003675 0.003100 50% 0.022800 0.030800 0.034300 ... 0.005800 0.006400 0.005300 75% 0.035550 0.047950 0.057950 ... 0.010350 0.010325 0.008525 max 0.137100 0.233900 0.305900 ... 0.044000 0.036400 0.043900 [8 rows x 60 columns]

Finally, a histogram is created for each input variable.

If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation.

The complete example is listed below.

# evaluate knn on the raw sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define and configure the model model = KNeighborsClassifier() # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report model performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates a KNN model on the raw sonar dataset.

We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).

Accuracy: 0.797 (0.073)

Next, let’s explore a uniform discretization transform of the dataset.

A uniform discretization transform will preserve the probability distribution of each input variable but will make it discrete with the specified number of ordinal groups or labels.

We can apply the uniform discretization transform using the KBinsDiscretizer class and setting the “*strategy*” argument to “*uniform*.” We must also set the desired number of bins set via the “*n_bins*” argument; in this case, we will use 10.

Once defined, we can call the *fit_transform()* function and pass it our dataset to create a quantile transformed version of our dataset.

... # perform a uniform discretization transform of the dataset trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform') data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a uniform discretization transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a uniform ordinal discretization transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import KBinsDiscretizer from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a uniform discretization transform of the dataset trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform') data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms generally matches the shape of the raw dataset, although in this case, each variable has a fixed number of 10 values or ordinal groups.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a uniform discretization transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with uniform ordinal discretization transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import KBinsDiscretizer from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the uniform discretization transform results in a lift in performance from 79.7 percent accuracy without the transform to about 82.7 percent with the transform.

Accuracy: 0.827 (0.082)

Next, let’s take a closer look at the k-means discretization transform.

A K-means discretization transform will attempt to fit k clusters for each input variable and then assign each observation to a cluster.

Unless the empirical distribution of the variable is complex, the number of clusters is likely to be small, such as 3-to-5.

We can apply the K-means discretization transform using the *KBinsDiscretizer* class and setting the “*strategy*” argument to “*kmeans*.” We must also set the desired number of bins set via the “*n_bins*” argument; in this case, we will use three.

Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a quantile transformed version of our dataset.

... # perform a k-means discretization transform of the dataset trans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans') data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a K-means discretization transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a k-means ordinal discretization transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import KBinsDiscretizer from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a k-means discretization transform of the dataset trans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans') data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the observations for each input variable are organized into one of three groups, some of which appear to be quite even in terms of observations, and others much less so.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a K-means discretization transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with k-means ordinal discretization transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import KBinsDiscretizer from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the K-means discretization transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.4 percent with the transform, although slightly less than the uniform distribution in the previous section.

Accuracy: 0.814 (0.088)

Next, let’s take a closer look at the quantile discretization transform.

A quantile discretization transform will attempt to split the observations for each input variable into k groups, where the number of observations assigned to each group is approximately equal.

Unless there are a large number of observations or a complex empirical distribution, the number of bins must be kept small, such as 5-10.

We can apply the quantile discretization transform using the *KBinsDiscretizer* class and setting the “*strategy*” argument to “*quantile*.” We must also set the desired number of bins set via the “*n_bins*” argument; in this case, we will use 10.

... # perform a quantile discretization transform of the dataset trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile') data = trans.fit_transform(data)

The example below applies the quantile discretization transform and creates histogram plots of each of the transformed variables.

# visualize a quantile ordinal discretization transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import KBinsDiscretizer from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a quantile discretization transform of the dataset trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile') data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the histograms all show a uniform probability distribution for each input variable, where each of the 10 groups has the same number of observations.

Next, let’s evaluate the same KNN model as the previous section, but in this case, on a quantile discretization transform of the raw dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with quantile ordinal discretization transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import KBinsDiscretizer from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the uniform transform results in a lift in performance from 79.7 percent accuracy without the transform to about 84.0 percent with the transform, better than the uniform and K-means methods of the previous sections.

Accuracy: 0.840 (0.072)

We chose the number of bins as an arbitrary number; in this case, 10.

This hyperparameter can be tuned to explore the effect of the resolution of the transform on the resulting skill of the model.

The example below performs this experiment and plots the mean accuracy for different “*n_bins*” values from two to 10.

# explore number of discrete bins on classification accuracy from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import KBinsDiscretizer from sklearn.preprocessing import LabelEncoder from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # get a list of models to evaluate def get_models(): models = dict() for i in range(2,11): # define the pipeline trans = KBinsDiscretizer(n_bins=i, encode='ordinal', strategy='quantile') model = KNeighborsClassifier() models[str(i)] = Pipeline(steps=[('t', trans), ('m', model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # get the dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): scores = evaluate_model(model) results.append(scores) names.append(name) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example reports the mean classification accuracy for each value of the “*n_bins*” argument.

We can see that surprisingly smaller values resulted in better accuracy, with values such as three achieving an accuracy of about 86.7 percent.

>2 0.806 (0.080) >3 0.867 (0.070) >4 0.835 (0.083) >5 0.838 (0.070) >6 0.836 (0.071) >7 0.854 (0.071) >8 0.837 (0.077) >9 0.841 (0.069) >10 0.840 (0.072)

Box and whisker plots are created to summarize the classification accuracy scores for each number of discrete bins on the dataset.

We can see a small bump in accuracy at three bins and the scores drop and remain flat for larger values.

The results highlight that there is likely some benefit in exploring different numbers of discrete bins for the chosen method to see if better performance can be achieved.

This section provides more resources on the topic if you are looking to go deeper.

- Continuous Probability Distributions for Machine Learning
- How to Transform Target Variables for Regression With Scikit-Learn

- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Feature Engineering and Selection, 2019.

In this tutorial, you discovered how to use discretization transforms to map numerical values to discrete categories for machine learning.

Specifically, you learned:

- Many machine learning algorithms prefer or perform better when numerical with non-standard probability distributions are made discrete.
- Discretization transforms are a technique for transforming numerical input or output variables to have discrete ordinal labels.
- How to use the KBinsDiscretizer to change the structure and distribution of numeric variables to improve the performance of predictive models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Discretization Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Use Quantile Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.

Many machine learning algorithms prefer or perform better when numerical input variables and even output variables in the case of regression have a standard probability distribution, such as a Gaussian (normal) or a uniform distribution.

The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model.

In this tutorial, you will discover how to use quantile transforms to change the distribution of numeric variables for machine learning.

After completing this tutorial, you will know:

- Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.
- Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.
- How to use the QuantileTransformer to change the probability distribution of numeric variables to improve the performance of predictive models.

Let’s get started.

This tutorial is divided into five parts; they are:

- Change Data Distribution
- Quantile Transforms
- Sonar Dataset
- Normal Quantile Transform
- Uniform Quantile Transform

Many machine learning algorithms perform better when the distribution of variables is Gaussian.

Recall that the observations for each variable may be thought to be drawn from a probability distribution. The Gaussian is a common distribution with the familiar bell shape. It is so common that it is often referred to as the “*normal*” distribution.

For more on the Gaussian probability distribution, see the tutorial:

Some algorithms, like linear regression and logistic regression, explicitly assume the real-valued variables have a Gaussian distribution. Other nonlinear algorithms may not have this assumption, yet often perform better when variables have a Gaussian distribution.

This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.

Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. Some input variables may have outliers that cause the distribution to be highly spread.

These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.

As such, it is often desirable to transform each input variable to have a standard probability distribution, such as a Gaussian (normal) distribution or a uniform distribution.

A quantile transform will map a variable’s probability distribution to another probability distribution.

Recall that a quantile function, also called a percent-point function (PPF), is the inverse of the cumulative probability distribution (CDF). A CDF is a function that returns the probability of a value at or below a given value. The PPF is the inverse of this function and returns the value at or below a given probability.

The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.

The transformation can be applied to each numeric input variable in the training dataset and then provided as input to a machine learning model to learn a predictive modeling task.

This quantile transform is available in the scikit-learn Python machine learning library via the QuantileTransformer class.

The class has an “*output_distribution*” argument that can be set to “*uniform*” or “*normal*” and defaults to “*uniform*“.

It also provides a “*n_quantiles*” that determines the resolution of the mapping or ranking of the observations in the dataset. This must be set to a value less than the number of observations in the dataset and defaults to 1,000.

We can demonstrate the *QuantileTransformer* with a small worked example. We can generate a sample of random Gaussian numbers and impose a skew on the distribution by calculating the exponent. The *QuantileTransformer* can then be used to transform the dataset to be another distribution, in this cases back to a Gaussian distribution.

The complete example is listed below.

# demonstration of the quantile transform from numpy import exp from numpy.random import randn from sklearn.preprocessing import QuantileTransformer from matplotlib import pyplot # generate gaussian data sample data = randn(1000) # add a skew to the data distribution data = exp(data) # histogram of the raw data with a skew pyplot.hist(data, bins=25) pyplot.show() # reshape data to have rows and columns data = data.reshape((len(data),1)) # quantile transform the raw data quantile = QuantileTransformer(output_distribution='normal') data_trans = quantile.fit_transform(data) # histogram of the transformed data pyplot.hist(data_trans, bins=25) pyplot.show()

Running the example first creates a sample of 1,000 random Gaussian values and adds a skew to the dataset.

A histogram is created from the skewed dataset and clearly shows the distribution pushed to the far left.

Then a *QuantileTransformer* is used to map the data distribution Gaussian and standardize the result, centering the values on the mean value of 0 and a standard deviation of 1.0.

A histogram of the transform data is created showing a Gaussian shaped data distribution.

In the following sections will take a closer look at how to use the quantile transform on a real dataset.

Next, let’s introduce the dataset.

The sonar dataset is a standard machine learning dataset for binary classification.

The dataset describes sonar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically from our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

Finally a histogram is created for each input variable.

The dataset provides a good candidate for using a quantile transform to make the variables more-Gaussian.

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation. The complete example is listed below.

Running the example evaluates a KNN model on the raw sonar dataset.

Accuracy: 0.797 (0.073)

Next, let’s explore a normal quantile transform of the dataset.

It is often desirable to transform an input variable to have a normal probability distribution to improve the modeling performance.

We can apply the Quantile transform using the *QuantileTransformer* class and set the “*output_distribution*” argument to “*normal*“. We must also set the “*n_quantiles*” argument to a value less than the number of observations in the training dataset, in this case, 100.

Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a quantile transformed version of our dataset.

... # perform a normal quantile transform of the dataset trans = QuantileTransformer(n_quantiles=100, output_distribution='normal') data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a normal quantile transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a normal quantile transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import QuantileTransformer from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a normal quantile transform of the dataset trans = QuantileTransformer(n_quantiles=100, output_distribution='normal') data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable looks very Gaussian as compared to the raw data.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a normal quantile transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with normal quantile transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import QuantileTransformer from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = QuantileTransformer(n_quantiles=100, output_distribution='normal') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the normal quantile transform results in a lift in performance from 79.7% accuracy without the transform to about 81.7% with the transform.

Accuracy: 0.817 (0.087)

Next, let’s take a closer look at the uniform quantile transform.

Sometimes it can be beneficial to transform a highly exponential or multi-modal distribution to have a uniform distribution.

This is especially useful for data with a large and sparse range of values, e.g. outliers that are common rather than rare.

We can apply the transform by defining a *QuantileTransformer* class and setting the “*output_distribution*” argument to “*uniform*” (the default).

... # perform a uniform quantile transform of the dataset trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform') data = trans.fit_transform(data)

The example below applies the uniform quantile transform and creates histogram plots of each of the transformed variables.

# visualize a uniform quantile transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import QuantileTransformer from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a uniform quantile transform of the dataset trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform') data = trans.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable looks very uniform compared to the raw data.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a uniform quantile transform of the raw dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with uniform quantile transform from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import QuantileTransformer from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('t', trans), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the uniform transform results in a lift in performance from 79.7 percent accuracy without the transform to about 84.5 percent with the transform, better than the normal transform that achieved a score of 81.7 percent.

Accuracy: 0.845 (0.074)

We chose the number of quantiles as an arbitrary number, in this case, 100.

This hyperparameter can be tuned to explore the effect of the resolution of the transform on the resulting skill of the model.

The example below performs this experiment and plots the mean accuracy for different “*n_quantiles*” values from 1 to 99.

# explore number of quantiles on classification accuracy from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import LabelEncoder from sklearn.pipeline import Pipeline from matplotlib import pyplot # get the dataset def get_dataset(): # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # get a list of models to evaluate def get_models(): models = dict() for i in range(1,100): # define the pipeline trans = QuantileTransformer(n_quantiles=i, output_distribution='uniform') model = KNeighborsClassifier() models[str(i)] = Pipeline(steps=[('t', trans), ('m', model)]) return models # evaluate a give model using cross-validation def evaluate_model(model): cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results = list() for name, model in models.items(): scores = evaluate_model(model) results.append(mean(scores)) print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.plot(results) pyplot.show()

Running the example reports the mean classification accuracy for each value of the “*n_quantiles*” argument.

We can see that surprisingly smaller values resulted in better accuracy, with values such as 4 achieving an accuracy of about 85.4 percent.

>1 0.466 (0.016) >2 0.813 (0.085) >3 0.840 (0.080) >4 0.854 (0.075) >5 0.848 (0.072) >6 0.851 (0.071) >7 0.845 (0.071) >8 0.848 (0.066) >9 0.848 (0.071) >10 0.843 (0.074) ...

A line plot is created showing the number of quantiles used in the transform versus the classification accuracy of the resulting model.

We can see a bump with values less than 10 and drop and flat performance after that.

The results highlight that there is likely some benefit in exploring different distributions and number of quantiles to see if better performance can be achieved.

This section provides more resources on the topic if you are looking to go deeper.

- Continuous Probability Distributions for Machine Learning
- How to Transform Target Variables for Regression With Scikit-Learn

In this tutorial, you discovered how to use quantile transforms to change the distribution of numeric variables for machine learning.

Specifically, you learned:

- Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.
- Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.
- How to use the QuantileTransformer to change the probability distribution of numeric variables to improve the performance of predictive models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Quantile Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Use Power Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>Your data may not have a Gaussian distribution and instead may have a Gaussian-like distribution (e.g. nearly Gaussian but with outliers or a skew) or a totally different distribution (e.g. exponential).

As such, you may be able to achieve better performance on a wide range of machine learning algorithms by transforming input and/or output variables to have a Gaussian or more-Gaussian distribution. Power transforms like the Box-Cox transform and the Yeo-Johnson transform provide an automatic way of performing these transforms on your data and are provided in the scikit-learn Python machine learning library.

In this tutorial, you will discover how to use power transforms in scikit-learn to make variables more Gaussian for modeling.

After completing this tutorial, you will know:

- Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian probability distribution.
- Power transforms are a technique for transforming numerical input or output variables to have a Gaussian or more-Gaussian-like probability distribution.
- How to use the PowerTransform in scikit-learn to use the Box-Cox and Yeo-Johnson transforms when preparing data for predictive modeling.

Let’s get started.

This tutorial is divided into five parts; they are:

- Make Data More Gaussian
- Power Transforms
- Sonar Dataset
- Box-Cox Transform
- Yeo-Johnson Transform

Many machine learning algorithms perform better when the distribution of variables is Gaussian.

Recall that the observations for each variable may be thought to be drawn from a probability distribution. The Gaussian is a common distribution with the familiar bell shape. It is so common that it is often referred to as the “*normal*” distribution.

For more on the Gaussian probability distribution, see the tutorial:

Some algorithms like linear regression and logistic regression explicitly assume the real-valued variables have a Gaussian distribution. Other nonlinear algorithms may not have this assumption, yet often perform better when variables have a Gaussian distribution.

This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.

There are data preparation techniques that can be used to transform each variable to make the distribution Gaussian, or if not Gaussian, then more Gaussian like.

These transforms are most effective when the data distribution is nearly-Gaussian to begin with and is afflicted with a skew or outliers.

Another common reason for transformations is to remove distributional skewness. An un-skewed distribution is one that is roughly symmetric. This means that the probability of falling on either side of the distribution’s mean is roughly equal

— Page 31, Applied Predictive Modeling, 2013.

Power transforms refer to a class of techniques that use a power function (like a logarithm or exponent) to make the probability distribution of a variable Gaussian or more-Gaussian like.

For more on the topic of making variables Gaussian, see the tutorial:

A power transform will make the probability distribution of a variable more Gaussian.

This is often described as removing a skew in the distribution, although more generally is described as stabilizing the variance of the distribution.

The log transform is a specific example of a family of transformations known as power transforms. In statistical terms, these are variance-stabilizing transformations.

— Page 23, Feature Engineering for Machine Learning, 2018.

We can apply a power transform directly by calculating the log or square root of the variable, although this may or may not be the best power transform for a given variable.

Replacing the data with the log, square root, or inverse may help to remove the skew.

— Page 31, Applied Predictive Modeling, 2013.

Instead, we can use a generalized version of the transform that finds a parameter (*lambda*) that best transforms a variable to a Gaussian probability distribution.

There are two popular approaches for such automatic power transforms; they are:

- Box-Cox Transform
- Yeo-Johnson Transform

The transformed training dataset can then be fed to a machine learning model to learn a predictive modeling task.

A hyperparameter, often referred to as lambda is used to control the nature of the transform.

… statistical methods can be used to empirically identify an appropriate transformation. Box and Cox (1964) propose a family of transformations that are indexed by a parameter, denoted as lambda

— Page 32, Applied Predictive Modeling, 2013.

Below are some common values for lambda

*lambda*= -1. is a reciprocal transform.*lambda*= -0.5 is a reciprocal square root transform.*lambda*= 0.0 is a log transform.*lambda*= 0.5 is a square root transform.*lambda*= 1.0 is no transform.

The optimal value for this hyperparameter used in the transform for each variable can be stored and reused to transform new data in the future in an identical manner, such as a test dataset or new data in the future.

These power transforms are available in the scikit-learn Python machine learning library via the PowerTransformer class.

The class takes an argument named “*method*” that can be set to ‘*yeo-johnson*‘ or ‘*box-cox*‘ for the preferred method. It will also standardize the data automatically after the transform, meaning each variable will have a zero mean and unit variance. This can be turned off by setting the “*standardize*” argument to *False*.

We can demonstrate the *PowerTransformer* with a small worked example. We can generate a sample of random Gaussian numbers and impose a skew on the distribution by calculating the exponent. The PowerTransformer can then be used to automatically remove the skew from the data.

The complete example is listed below.

# demonstration of the power transform on data with a skew from numpy import exp from numpy.random import randn from sklearn.preprocessing import PowerTransformer from matplotlib import pyplot # generate gaussian data sample data = randn(1000) # add a skew to the data distribution data = exp(data) # histogram of the raw data with a skew pyplot.hist(data, bins=25) pyplot.show() # reshape data to have rows and columns data = data.reshape((len(data),1)) # power transform the raw data power = PowerTransformer(method='yeo-johnson', standardize=True) data_trans = power.fit_transform(data) # histogram of the transformed data pyplot.hist(data_trans, bins=25) pyplot.show()

Running the example first creates a sample of 1,000 random Gaussian values and adds a skew to the dataset.

A histogram is created from the skewed dataset and clearly shows the distribution pushed to the far left.

Then a *PowerTransformer* is used to make the data distribution more-Gaussian and standardize the result, centering the values on the mean value of 0 and a standard deviation of 1.0.

A histogram of the transform data is created showing a more-Gaussian shaped data distribution.

In the following sections will take a closer look at how to use these two power transforms on a real dataset.

Next, let’s introduce the dataset.

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a 2-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

The dataset describes radar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically from our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

Finally, a histogram is created for each input variable.

The dataset provides a good candidate for using a power transform to make the variables more-Gaussian.

Next, let’s fit and evaluate a machine learning model on the raw dataset.

Running the example evaluates a KNN model on the raw sonar dataset.

Accuracy: 0.797 (0.073)

Next, let’s explore a Box-Cox power transform of the dataset.

The Box-Cox transform is named for the two authors of the method.

It is a power transform that assumes the values of the input variable to which it is applied are **strictly positive**. That means 0 and negative values are not supported.

It is important to note that the Box-Cox procedure can only be applied to data that is strictly positive.

— Page 123, Feature Engineering and Selection, 2019.

We can apply the Box-Cox transform using the *PowerTransformer* class and setting the “*method*” argument to “*box-cox*“. Once defined, we can call the *fit_transform()* function and pass it to our dataset to create a Box-Cox transformed version of our dataset.

... pt = PowerTransformer(method='box-cox') data = pt.fit_transform(data)

Our dataset does not have negative values but may have zero values. This may cause a problem.

Let’s try anyway.

The complete example of creating a Box-Cox transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a box-cox transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import PowerTransformer from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a box-cox transform of the dataset pt = PowerTransformer(method='box-cox') data = pt.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example results in an error as follows:

ValueError: The Box-Cox transformation can only be applied to strictly positive data

As expected, we cannot use the transform on the raw data because it is **not strictly positive**.

One way to solve this problem is to use a MixMaxScaler transform first to scale the data to positive values, then apply the transform.

We can use a Pipeline object to apply both transforms in sequence; for example:

... # perform a box-cox transform of the dataset scaler = MinMaxScaler(feature_range=(1, 2)) power = PowerTransformer(method='box-cox') pipeline = Pipeline(steps=[('s', scaler),('p', power)]) data = pipeline.fit_transform(data)

The updated version of applying the Box-Cox transform to the scaled dataset is listed below.

# visualize a box-cox transform of the scaled sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import PowerTransformer from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a box-cox transform of the dataset scaler = MinMaxScaler(feature_range=(1, 2)) power = PowerTransformer(method='box-cox') pipeline = Pipeline(steps=[('s', scaler),('p', power)]) data = pipeline.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable looks more Gaussian than the raw data.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a Box-Cox transform of the scaled dataset.

The complete example is listed below.

# evaluate knn on the box-cox sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import PowerTransformer from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline scaler = MinMaxScaler(feature_range=(1, 2)) power = PowerTransformer(method='box-cox') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('s', scaler),('p', power), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the Box-Cox transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.1 percent with the transform.

Accuracy: 0.811 (0.085)

Next, let’s take a closer look at the Yeo-Johnson transform.

The Yeo-Johnson transform is also named for the authors.

Unlike the Box-Cox transform, it does not require the values for each input variable to be strictly positive. It supports zero values and negative values. This means we can apply it to our dataset without scaling it first.

We can apply the transform by defining a *PowerTransform* object and setting the “*method*” argument to “*yeo-johnson*” (the default).

... # perform a yeo-johnson transform of the dataset pt = PowerTransformer(method='yeo-johnson') data = pt.fit_transform(data)

The example below applies the Yeo-Johnson transform and creates histogram plots of each of the transformed variables.

# visualize a yeo-johnson transform of the sonar dataset from pandas import read_csv from pandas import DataFrame from pandas.plotting import scatter_matrix from sklearn.preprocessing import PowerTransformer from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) # retrieve just the numeric input values data = dataset.values[:, :-1] # perform a yeo-johnson transform of the dataset pt = PowerTransformer(method='yeo-johnson') data = pt.fit_transform(data) # convert the array back to a dataframe dataset = DataFrame(data) # histograms of the variables dataset.hist() pyplot.show()

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable look more Gaussian than the raw data, much like the box-cox transform.

Next, let’s evaluate the same KNN model as the previous section, but in this case on a Yeo-Johnson transform of the raw dataset.

The complete example is listed below.

# evaluate knn on the yeo-johnson sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import PowerTransformer from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline power = PowerTransformer(method='yeo-johnson') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('p', power), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the Yeo-Johnson transform results in a lift in performance from 79.7 percent accuracy without the transform to about 80.8 percent with the transform, less than the Box-Cox transform that achieved about 81.1 percent.

Accuracy: 0.808 (0.082)

Sometimes a lift in performance can be achieved by first standardizing the raw dataset prior to performing a Yeo-Johnson transform.

We can explore this by adding a StandardScaler as a first step in the pipeline.

The complete example is listed below.

# evaluate knn on the yeo-johnson standardized sonar dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.neighbors import KNeighborsClassifier from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import PowerTransformer from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv" dataset = read_csv(url, header=None) data = dataset.values # separate into input and output columns X, y = data[:, :-1], data[:, -1] # ensure inputs are floats and output is an integer label X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # define the pipeline scaler = StandardScaler() power = PowerTransformer(method='yeo-johnson') model = KNeighborsClassifier() pipeline = Pipeline(steps=[('s', scaler), ('p', power), ('m', model)]) # evaluate the pipeline cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') # report pipeline performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that standardizing the data prior to the Yeo-Johnson transform resulted in a small lift in performance from about 80.8 percent to about 81.6 percent, a small lift over the results for the Box-Cox transform.

Accuracy: 0.816 (0.077)

This section provides more resources on the topic if you are looking to go deeper.

- Continuous Probability Distributions for Machine Learning
- How to Use Power Transforms for Time Series Forecast Data with Python
- How to Transform Target Variables for Regression With Scikit-Learn
- How to Transform Data to Better Fit The Normal Distribution
- 4 Common Machine Learning Data Transforms for Time Series Forecasting

- Feature Engineering for Machine Learning, 2018.
- Applied Predictive Modeling, 2013.
- Feature Engineering and Selection, 2019.

In this tutorial, you discovered how to use power transforms in scikit-learn to make variables more Gaussian for modeling.

Specifically, you learned:

- Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian probability distribution.
- Power transforms are a technique for transforming numerical input or output variables to have a Gaussian or more-Gaussian-like probability distribution.
- How to use the PowerTransform in scikit-learn to use the Box-Cox and Yeo-Johnson transforms when preparing data for predictive modeling.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Power Transforms for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Statistical Imputation for Missing Values in Machine Learning appeared first on Machine Learning Mastery.

]]>As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and replace all missing values for that column with the statistic. It is a popular approach because the statistic is easy to calculate using the training dataset and because it often results in good performance.

In this tutorial, you will discover how to use statistical imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

- Missing values must be marked with NaN values and can be replaced with statistical measures to calculate the column of values.
- How to load a CSV value with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
- How to impute missing values with statistics as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

Let’s get started.

This tutorial is divided into three parts; they are:

- Statistical Imputation
- Horse Colic Dataset
- Statistical Imputation With SimpleImputer
- SimpleImputer Data Transform
- SimpleImputer and Model Evaluation
- Comparing Different Imputed Statistics
- SimpleImputer Transform When Making a Prediction

A dataset may have missing values.

These are rows of data where one or more values or columns in that row are not present. The values may be missing completely or they may be marked with a special character or value, such as a question mark “?”.

These values can be expressed in many ways. I’ve seen them show up as nothing at all […], an empty string […], the explicit string NULL or undefined or N/A or NaN, and the number 0, among others. No matter how they appear in your dataset, knowing what to expect and checking to make sure the data matches that expectation will reduce problems as you start to use the data.

— Page 10, Bad Data Handbook, 2012.

Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or data unavailability.

They may occur for a number of reasons, such as malfunctioning measurement equipment, changes in experimental design during data collection, and collation of several similar but not identical datasets.

— Page 63, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.

As such, it is common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.

A simple and popular approach to data imputation involves using statistical methods to estimate a value for a column from those values that are present, then replace all missing values in the column with the calculated statistic.

It is simple because statistics are fast to calculate and it is popular because it often proves very effective.

Common statistics calculated include:

- The column mean value.
- The column median value.
- The column mode value.
- A constant value.

Now that we are familiar with statistical methods for missing value imputation, let’s take a look at a dataset with missing values.

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

A naive model can achieve a classification accuracy of about 67 percent, and a top-performing model can achieve an accuracy of about 85.2 percent using three repeats of 10-fold cross-validation. This defines the range of expected modeling performance on the dataset.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2 1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2 2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1 1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1 ...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “*na_values*” to load values of ‘*?*‘ as missing, marked with a NaN value.

... # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?')

Once loaded, we can review the loaded data to confirm that “?” values are marked as NaN.

... # summarize the first few rows print(dataframe.head())

We can then enumerate each column and report the number of rows with missing values for the column.

... # summarize the number of rows with missing values for each column for i in range(dataframe.shape[1]): # count number of rows with missing values n_miss = dataframe[[i]].isnull().sum() perc = n_miss / dataframe.shape[0] * 100 print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# summarize the horse colic dataset from pandas import read_csv # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # summarize the first few rows print(dataframe.head()) # summarize the number of rows with missing values for each column for i in range(dataframe.shape[1]): # count number of rows with missing values n_miss = dataframe[[i]].isnull().sum() perc = n_miss / dataframe.shape[0] * 100 print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Running the example first loads the dataset and summarizes the first five rows.

We can see that the missing values that were marked with a “?” character have been replaced with NaN values.

0 1 2 3 4 5 6 ... 21 22 23 24 25 26 27 0 2.0 1 530101 38.5 66.0 28.0 3.0 ... NaN 2.0 2 11300 0 0 2 1 1.0 1 534817 39.2 88.0 20.0 NaN ... 2.0 3.0 2 2208 0 0 2 2 2.0 1 530334 38.3 40.0 24.0 1.0 ... NaN 1.0 2 0 0 0 1 3 1.0 9 5290409 39.1 164.0 84.0 4.0 ... 5.3 2.0 1 2208 0 0 1 4 2.0 1 530255 37.3 104.0 35.0 NaN ... NaN 2.0 2 4300 0 0 2 [5 rows x 28 columns]

Next, we can see the list of all columns in the dataset and the number and percentage of missing values.

We can see that some columns (e.g. column indexes 1 and 2) have no missing values and other columns (e.g. column indexes 15 and 21) have many or even a majority of missing values.

> 0, Missing: 1 (0.3%) > 1, Missing: 0 (0.0%) > 2, Missing: 0 (0.0%) > 3, Missing: 60 (20.0%) > 4, Missing: 24 (8.0%) > 5, Missing: 58 (19.3%) > 6, Missing: 56 (18.7%) > 7, Missing: 69 (23.0%) > 8, Missing: 47 (15.7%) > 9, Missing: 32 (10.7%) > 10, Missing: 55 (18.3%) > 11, Missing: 44 (14.7%) > 12, Missing: 56 (18.7%) > 13, Missing: 104 (34.7%) > 14, Missing: 106 (35.3%) > 15, Missing: 247 (82.3%) > 16, Missing: 102 (34.0%) > 17, Missing: 118 (39.3%) > 18, Missing: 29 (9.7%) > 19, Missing: 33 (11.0%) > 20, Missing: 165 (55.0%) > 21, Missing: 198 (66.0%) > 22, Missing: 1 (0.3%) > 23, Missing: 0 (0.0%) > 24, Missing: 0 (0.0%) > 25, Missing: 0 (0.0%) > 26, Missing: 0 (0.0%) > 27, Missing: 0 (0.0%)

Now that we are familiar with the horse colic dataset that has missing values, let’s look at how we can use statistical imputation.

The scikit-learn machine learning library provides the SimpleImputer class that supports statistical imputation.

In this section, we will explore how to effectively use the SimpleImputer class.

The SimpleImputer is a data transform that is first configured based on the type of statistic to calculate for each column, e.g. mean.

... # define imputer imputer = SimpleImputer(strategy='mean')

Then the imputer is fit on a dataset to calculate the statistic for each column.

... # fit on the dataset imputer.fit(X)

The fit imputer is then applied to a dataset to create a copy of the dataset with all missing values for each column replaced with a statistic value.

... # transform the dataset Xtrans = imputer.transform(X)

We can demonstrate its usage on the horse colic dataset and confirm it works by summarizing the total number of missing values in the dataset before and after the transform.

The complete example is listed below.

# statistical imputation transform for the horse colic dataset from numpy import isnan from pandas import read_csv from sklearn.impute import SimpleImputer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # print total missing print('Missing: %d' % sum(isnan(X).flatten())) # define imputer imputer = SimpleImputer(strategy='mean') # fit on the dataset imputer.fit(X) # transform the dataset Xtrans = imputer.transform(X) # print total missing print('Missing: %d' % sum(isnan(Xtrans).flatten()))

Running the example first loads the dataset and reports the total number of missing values in the dataset as 1,605.

The transform is configured, fit, and performed and the resulting new dataset has no missing values, confirming it was performed as we expected.

Each missing value was replaced with the mean value of its column.

Missing: 1605 Missing: 0

It is a good practice to evaluate machine learning models on a dataset using k-fold cross-validation.

To correctly apply statistical missing data imputation and avoid data leakage, it is required that the statistics calculated for each column are calculated on the training dataset only, then applied to the train and test sets for each fold in the dataset.

If we are using resampling to select tuning parameter values or to estimate performance, the imputation should be incorporated within the resampling.

— Page 42, Applied Predictive Modeling, 2013.

This can be achieved by creating a modeling pipeline where the first step is the statistical imputation, then the second step is the model. This can be achieved using the Pipeline class.

For example, the *Pipeline* below uses a *SimpleImputer* with a ‘*mean*‘ strategy, followed by a random forest model.

... # define modeling pipeline model = RandomForestClassifier() imputer = SimpleImputer(strategy='mean') pipeline = Pipeline(steps=[('i', imputer), ('m', model)])

We can evaluate the mean-imputed dataset and random forest modeling pipeline for the horse colic dataset with repeated 10-fold cross-validation.

The complete example is listed below.

# evaluate mean imputation and random forest for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # define modeling pipeline model = RandomForestClassifier() imputer = SimpleImputer(strategy='mean') pipeline = Pipeline(steps=[('i', imputer), ('m', model)]) # define model evaluation cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise') print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example correctly applies data imputation to each fold of the cross-validation procedure.

The pipeline is evaluated using three repeats of 10-fold cross-validation and reports the mean classification accuracy on the dataset as about 84.2 percent, which is a good score.

Mean Accuracy: 0.842 (0.049)

How do we know that using a ‘*mean*‘ statistical strategy is good or best for this dataset?

The answer is that we don’t and that it was chosen arbitrarily.

We can design an experiment to test each statistical strategy and discover what works best for this dataset, comparing the mean, median, mode (most frequent), and constant (0) strategies. The mean accuracy of each approach can then be compared.

The complete example is listed below.

# compare statistical imputation strategies for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from matplotlib import pyplot # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # evaluate each strategy on the dataset results = list() strategies = ['mean', 'median', 'most_frequent', 'constant'] for s in strategies: # create the modeling pipeline pipeline = Pipeline(steps=[('i', SimpleImputer(strategy=s)), ('m', RandomForestClassifier())]) # evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # store results results.append(scores) print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=strategies, showmeans=True) pyplot.xticks(rotation=45) pyplot.show()

Running the example evaluates each statistical imputation strategy on the horse colic dataset using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.

The mean accuracy of each strategy is reported along the way. The results suggest that using a constant value, e.g. 0, results in the best performance of about 86.7 percent, which is an outstanding result.

>mean 0.851 (0.053) >median 0.844 (0.052) >most_frequent 0.838 (0.056) >constant 0.867 (0.044)

At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.

We can clearly see that the distribution of accuracy scores for the constant strategy is better than the other strategies.

We may wish to create a final modeling pipeline with the constant imputation strategy and random forest algorithm, then make a prediction for new data.

This can be achieved by defining the pipeline and fitting it on all available data, then calling the *predict()* function passing new data in as an argument.

Importantly, the row of new data must mark any missing values using the NaN value.

... # define new data row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]

The complete example is listed below.

# constant imputation strategy and prediction for the hose colic dataset from numpy import nan from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values X, y = data[:, :-1], data[:, -1] # create the modeling pipeline pipeline = Pipeline(steps=[('i', SimpleImputer(strategy='constant')), ('m', RandomForestClassifier())]) # fit the model pipeline.fit(X, y) # define new data row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000] # make a prediction yhat = pipeline.predict([row]) # summarize prediction print('Predicted Class: %d' % yhat[0])

Running the example fits the modeling pipeline on all available data.

A new row of data is defined with missing values marked with NaNs and a classification prediction is made.

Predicted Class: 2

This section provides more resources on the topic if you are looking to go deeper.

- Results for Standard Classification and Regression Machine Learning Datasets
- How to Handle Missing Data with Python

- Bad Data Handbook, 2012.
- Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- Applied Predictive Modeling, 2013.

In this tutorial, you discovered how to use statistical imputation strategies for missing data in machine learning.

Specifically, you learned:

- Missing values must be marked with NaN values and can be replaced with statistical measures to calculate the column of values.
- How to impute missing values with statistics as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Statistical Imputation for Missing Values in Machine Learning appeared first on Machine Learning Mastery.

]]>