The post Tune Hyperparameters for Classification Machine Learning Algorithms appeared first on Machine Learning Mastery.

Machine learning algorithms have hyperparameters that allow you to tailor the behavior of the algorithm to your specific dataset.

Hyperparameters are different from parameters, which are the internal coefficients or weights for a model found by the learning algorithm. Unlike parameters, hyperparameters are specified by the practitioner when configuring the model.

Typically, it is challenging to know what values to use for the hyperparameters of a given algorithm on a given dataset. Therefore, it is common to use random or grid search strategies to try different hyperparameter values.
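To make the two strategies concrete, here is a minimal sketch using only the Python standard library. The search space and the budget of five random draws are illustrative assumptions, not recommendations. Grid search enumerates every combination, while random search evaluates only a sample of them.

```python
import random
from itertools import product

# Illustrative search space (assumed values, for demonstration only).
space = {
    "C": [100, 10, 1.0, 0.1, 0.01],
    "solver": ["newton-cg", "lbfgs", "liblinear"],
}

# Grid search: every combination of the listed values.
grid_candidates = [dict(zip(space, combo)) for combo in product(*space.values())]

# Random search: a fixed budget of candidates drawn without replacement.
rng = random.Random(1)
random_candidates = rng.sample(grid_candidates, k=5)

print(len(grid_candidates), "grid candidates")
print(len(random_candidates), "random candidates")
```

In scikit-learn, these two roles are played by GridSearchCV and RandomizedSearchCV respectively.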

The more hyperparameters of an algorithm that you need to tune, the slower the tuning process. Therefore, it is desirable to select a minimum subset of model hyperparameters to search or tune.
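A rough way to see the cost is to count model fits. The sketch below assumes an illustrative four-hyperparameter grid and repeated 10-fold cross-validation with 3 repeats; the helper name `count_fits` is hypothetical. Each candidate is fit once per fold per repeat, so the total grows multiplicatively with every hyperparameter added.

```python
from itertools import product

def count_fits(grid, n_splits, n_repeats):
    """Return (number of candidates, total model fits) for a full grid search."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * n_splits * n_repeats

# Illustrative grids (assumed values).
full_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [10, 100, 1000],
    "subsample": [0.5, 0.7, 1.0],
    "max_depth": [3, 7, 9],
}
small_grid = {key: full_grid[key] for key in ("learning_rate", "n_estimators")}

full_candidates, full_fits = count_fits(full_grid, n_splits=10, n_repeats=3)
small_candidates, small_fits = count_fits(small_grid, n_splits=10, n_repeats=3)
print(full_candidates, full_fits)    # candidates and fits for four hyperparameters
print(small_candidates, small_fits)  # candidates and fits for two hyperparameters
```

Dropping two of the four hyperparameters cuts the work by a factor of nine in this example.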

Not all model hyperparameters are equally important. Some hyperparameters have an outsized effect on the behavior, and in turn, the performance of a machine learning algorithm.

As a machine learning practitioner, you must know which hyperparameters to focus on to get a good result quickly.

In this tutorial, you will discover those hyperparameters that are most important for some of the top machine learning algorithms.

Let’s get started.

We will take a closer look at the important hyperparameters of the top machine learning algorithms that you may use for classification.

We will look at the hyperparameters you need to focus on and suggested values to try when tuning the model on your dataset.

The suggestions are based both on advice from textbooks on the algorithms and practical advice suggested by practitioners, as well as a little of my own experience.

The seven classification algorithms we will look at are as follows:

- Logistic Regression
- Ridge Classifier
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Bagged Decision Trees (Bagging)
- Random Forest
- Stochastic Gradient Boosting

We will consider these algorithms in the context of their scikit-learn implementation (Python); nevertheless, you can use the same hyperparameter suggestions with other platforms, such as Weka and R.

A small grid searching example is also given for each algorithm that you can use as a starting point for your own classification predictive modeling project.

**Note**: if you have had success with different hyperparameter values or even different hyperparameters than those suggested in this tutorial, let me know in the comments below. I’d love to hear about it.

Let’s dive in.

Logistic regression does not really have any critical hyperparameters to tune.

Sometimes, you can see useful differences in performance or convergence with different solvers (*solver*).

**solver** in [‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’]

Regularization (*penalty*) can sometimes be helpful.

**penalty** in [‘none’, ‘l1’, ‘l2’, ‘elasticnet’]

**Note**: not all solvers support all regularization terms.
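One way to avoid invalid combinations is to build one small grid per solver. The sketch below is an assumption-laden illustration: the `SUPPORTED` table is my summary of the scikit-learn documentation and should be checked against your installed version, and `valid_grids` is a hypothetical helper, not part of any library.

```python
# Assumed solver/penalty compatibility, summarized from the scikit-learn docs.
# 'none' is spelled as in this tutorial; newer releases may expect None instead.
SUPPORTED = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "elasticnet", "none"},  # 'elasticnet' also needs l1_ratio
}

def valid_grids(solvers, penalties, c_values):
    """Build one parameter grid per solver, keeping only supported penalties."""
    grids = []
    for solver in solvers:
        allowed = [p for p in penalties if p in SUPPORTED[solver]]
        if allowed:
            grids.append({"solver": [solver], "penalty": allowed, "C": c_values})
    return grids

grids = valid_grids(
    solvers=["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    penalties=["none", "l1", "l2", "elasticnet"],
    c_values=[100, 10, 1.0, 0.1, 0.01],
)
for grid in grids:
    print(grid["solver"], grid["penalty"])
```

GridSearchCV accepts a list of dictionaries like this as its `param_grid` argument, and it searches each dictionary as a separate grid.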

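The C parameter controls the strength of the penalty and can also be worth tuning. In scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization.

**C** in [100, 10, 1.0, 0.1, 0.01]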

For the full list of hyperparameters, see the scikit-learn API documentation for LogisticRegression.

The example below demonstrates grid searching the key hyperparameters for LogisticRegression on a synthetic binary classification dataset.

Some combinations were omitted to cut back on the warnings/errors.

    # example of grid searching key hyperparameters for logistic regression
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = LogisticRegression()
    solvers = ['newton-cg', 'lbfgs', 'liblinear']
    penalty = ['l2']
    c_values = [100, 10, 1.0, 0.1, 0.01]
    # define grid search
    grid = dict(solver=solvers,penalty=penalty,C=c_values)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.945333 using {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
    0.936333 (0.016829) with: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
    0.937667 (0.017259) with: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
    0.938667 (0.015861) with: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
    0.936333 (0.017413) with: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
    0.938333 (0.017904) with: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
    0.939000 (0.016401) with: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
    0.937333 (0.017114) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
    0.939000 (0.017195) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
    0.939000 (0.015780) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
    0.940000 (0.015706) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
    0.940333 (0.014941) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
    0.941000 (0.017000) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
    0.943000 (0.016763) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
    0.943000 (0.016763) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'lbfgs'}
    0.945333 (0.017651) with: {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}

Ridge regression is a penalized linear regression model for predicting a numerical value.

Nevertheless, it can be very effective when applied to classification.

Perhaps the most important parameter to tune is the regularization strength (*alpha*). A good starting point might be values in the range [0.1 to 1.0].

**alpha** in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

For the full list of hyperparameters, see the scikit-learn API documentation for RidgeClassifier.

The example below demonstrates grid searching the key hyperparameters for RidgeClassifier on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for ridge classifier
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import RidgeClassifier
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = RidgeClassifier()
    alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    # define grid search
    grid = dict(alpha=alpha)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.974667 using {'alpha': 0.1}
    0.974667 (0.014545) with: {'alpha': 0.1}
    0.974667 (0.014545) with: {'alpha': 0.2}
    0.974667 (0.014545) with: {'alpha': 0.3}
    0.974667 (0.014545) with: {'alpha': 0.4}
    0.974667 (0.014545) with: {'alpha': 0.5}
    0.974667 (0.014545) with: {'alpha': 0.6}
    0.974667 (0.014545) with: {'alpha': 0.7}
    0.974667 (0.014545) with: {'alpha': 0.8}
    0.974667 (0.014545) with: {'alpha': 0.9}
    0.974667 (0.014545) with: {'alpha': 1.0}

The most important hyperparameter for KNN is the number of neighbors (*n_neighbors*).

Test values between at least 1 and 21, perhaps just the odd numbers.

**n_neighbors** in [1 to 21]
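One detail worth checking when turning this suggestion into code is that Python's `range()` excludes its stop value. A minimal sketch, with variable names chosen only for illustration:

```python
# range(start, stop, step) stops before `stop`, so 21 itself is not included here.
values_up_to_19 = list(range(1, 21, 2))

# Use a stop of 22 if the odd values should run all the way to 21.
values_up_to_21 = list(range(1, 22, 2))

print(values_up_to_19)
print(values_up_to_21)
```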

It may also be interesting to test different distance metrics (*metric*) for choosing the composition of the neighborhood.

**metric** in [‘euclidean’, ‘manhattan’, ‘minkowski’]

For a fuller list, see the scikit-learn documentation on distance metrics.
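For intuition, the three metrics named above can be written in a few lines of plain Python. This is a sketch for illustration, not how scikit-learn computes them internally; the function names are my own.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    # Euclidean distance is the Minkowski distance with p=2.
    return minkowski(a, b, p=2)

def manhattan(a, b):
    # Manhattan distance is the Minkowski distance with p=1.
    return minkowski(a, b, p=1)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, p=3))
```

Because KNeighborsClassifier uses p=2 by default, ‘minkowski’ and ‘euclidean’ give the same neighbors unless *p* is also varied in the grid.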

It may also be interesting to test the contribution of members of the neighborhood via different weightings (*weights*).

**weights** in [‘uniform’, ‘distance’]
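The difference between the two weightings is easiest to see on a tiny hand-made neighborhood. The sketch below is a simplified illustration (the `knn_vote` helper is hypothetical and assumes all distances are greater than zero): uniform weighting counts each neighbor once, while distance weighting lets closer neighbors count for more.

```python
from collections import defaultdict

def knn_vote(neighbors, weights="uniform"):
    """Pick a label from (distance, label) pairs using the given weighting."""
    votes = defaultdict(float)
    for distance, label in neighbors:
        # 'uniform': one vote each; 'distance': vote weighted by 1 / distance.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(votes, key=votes.get)

# One very close neighbor of class 'a' and two distant neighbors of class 'b'.
neighbors = [(0.1, "a"), (0.9, "b"), (1.0, "b")]

uniform_label = knn_vote(neighbors, weights="uniform")
distance_label = knn_vote(neighbors, weights="distance")
print(uniform_label, distance_label)
```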

For the full list of hyperparameters, see the scikit-learn API documentation for KNeighborsClassifier.

The example below demonstrates grid searching the key hyperparameters for KNeighborsClassifier on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for KNeighborsClassifier
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = KNeighborsClassifier()
    n_neighbors = range(1, 21, 2)
    weights = ['uniform', 'distance']
    metric = ['euclidean', 'manhattan', 'minkowski']
    # define grid search
    grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.937667 using {'metric': 'manhattan', 'n_neighbors': 13, 'weights': 'uniform'}
    0.833667 (0.031674) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'uniform'}
    0.833667 (0.031674) with: {'metric': 'euclidean', 'n_neighbors': 1, 'weights': 'distance'}
    0.895333 (0.030081) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'uniform'}
    0.895333 (0.030081) with: {'metric': 'euclidean', 'n_neighbors': 3, 'weights': 'distance'}
    0.909000 (0.021810) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'uniform'}
    0.909000 (0.021810) with: {'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
    0.925333 (0.020774) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'uniform'}
    0.925333 (0.020774) with: {'metric': 'euclidean', 'n_neighbors': 7, 'weights': 'distance'}
    0.929000 (0.027368) with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'uniform'}
    0.929000 (0.027368) with: {'metric': 'euclidean', 'n_neighbors': 9, 'weights': 'distance'}
    ...

The SVM algorithm, like gradient boosting, is very popular, very effective, and provides a large number of hyperparameters to tune.

Perhaps the first important parameter is the choice of kernel that will control the manner in which the input variables will be projected. There are many to choose from, but linear, polynomial, and RBF are the most common, perhaps just linear and RBF in practice.

**kernel** in [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]

If the polynomial kernel works out, then it is a good idea to dive into the degree hyperparameter.

Another critical parameter is the penalty (*C*) that can take on a range of values and has a dramatic effect on the shape of the resulting regions for each class. A log scale might be a good starting point.

**C** in [100, 10, 1.0, 0.1, 0.001]

For the full list of hyperparameters, see the scikit-learn API documentation for SVC.

The example below demonstrates grid searching the key hyperparameters for SVC on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for SVC
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define model and parameters
    model = SVC()
    kernel = ['poly', 'rbf', 'sigmoid']
    C = [50, 10, 1.0, 0.1, 0.01]
    gamma = ['scale']
    # define grid search
    grid = dict(kernel=kernel,C=C,gamma=gamma)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.974333 using {'C': 1.0, 'gamma': 'scale', 'kernel': 'poly'}
    0.973667 (0.012512) with: {'C': 50, 'gamma': 'scale', 'kernel': 'poly'}
    0.970667 (0.018062) with: {'C': 50, 'gamma': 'scale', 'kernel': 'rbf'}
    0.945333 (0.024594) with: {'C': 50, 'gamma': 'scale', 'kernel': 'sigmoid'}
    0.973667 (0.012512) with: {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
    0.970667 (0.018062) with: {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
    0.957000 (0.016763) with: {'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}
    0.974333 (0.012565) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'poly'}
    0.971667 (0.016948) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}
    0.966333 (0.016224) with: {'C': 1.0, 'gamma': 'scale', 'kernel': 'sigmoid'}
    0.972333 (0.013585) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'poly'}
    0.974000 (0.013317) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'rbf'}
    0.971667 (0.015934) with: {'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'}
    0.972333 (0.013585) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'poly'}
    0.973667 (0.014716) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'rbf'}
    0.974333 (0.013828) with: {'C': 0.01, 'gamma': 'scale', 'kernel': 'sigmoid'}

The most important parameter for bagged decision trees is the number of trees (*n_estimators*).

Ideally, this should be increased until no further improvement is seen in the model.

Good values might be a log scale from 10 to 1,000.

**n_estimators** in [10, 100, 1000]

For the full list of hyperparameters, see the scikit-learn API documentation for BaggingClassifier.

The example below demonstrates grid searching the key hyperparameters for BaggingClassifier on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for BaggingClassifier
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import BaggingClassifier
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = BaggingClassifier()
    n_estimators = [10, 100, 1000]
    # define grid search
    grid = dict(n_estimators=n_estimators)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.873667 using {'n_estimators': 1000}
    0.839000 (0.038588) with: {'n_estimators': 10}
    0.869333 (0.030434) with: {'n_estimators': 100}
    0.873667 (0.035070) with: {'n_estimators': 1000}
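The results above show diminishing returns: going from 100 to 1,000 trees adds far less accuracy than going from 10 to 100. One simple way to act on this, sketched below, is to pick the smallest ensemble whose score is within a tolerance of the best. The tolerance value and the helper name `smallest_good_enough` are assumptions for illustration, and the scores are the mean accuracies reported above.

```python
# Mean accuracy per n_estimators value, copied from the results above.
scores = {10: 0.839000, 100: 0.869333, 1000: 0.873667}

def smallest_good_enough(scores, tolerance=0.005):
    """Return the smallest setting whose score is within `tolerance` of the best."""
    best = max(scores.values())
    return min(n for n, score in scores.items() if best - score <= tolerance)

chosen = smallest_good_enough(scores)
strict_choice = smallest_good_enough(scores, tolerance=0.001)
print(chosen, strict_choice)
```

A looser tolerance favors cheaper models; a stricter one favors the best score regardless of training cost.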

The most important parameter for random forest is the number of random features to sample at each split point (*max_features*).

You could try a range of integer values, such as 1 to 20, or 1 to half the number of input features.

**max_features** in [1 to 20]

Alternately, you could try a suite of different default value calculators.

**max_features** in [‘sqrt’, ‘log2’]
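To get a feel for what these options mean in practice, the sketch below computes the number of features each would consider for a dataset with 100 input features. It reflects my understanding of how scikit-learn resolves the string options (rounding down, with a minimum of one feature), so treat it as an approximation and check the documentation for your version.

```python
import math

n_features = 100  # matches the synthetic dataset used in this tutorial

# Number of features considered per split for the string options (assumed rule).
sqrt_k = max(1, int(math.sqrt(n_features)))
log2_k = max(1, int(math.log2(n_features)))

# Integer alternative: every value from 1 to half the number of input features.
half_range = list(range(1, n_features // 2 + 1))

print("sqrt:", sqrt_k, "log2:", log2_k, "largest integer option:", half_range[-1])
```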

Another important parameter for random forest is the number of trees (*n_estimators*).

Ideally, this should be increased until no further improvement is seen in the model.

Good values might be a log scale from 10 to 1,000.

**n_estimators** in [10, 100, 1000]

For the full list of hyperparameters, see the scikit-learn API documentation for RandomForestClassifier.

The example below demonstrates grid searching the key hyperparameters for RandomForestClassifier on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for RandomForestClassifier
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = RandomForestClassifier()
    n_estimators = [10, 100, 1000]
    max_features = ['sqrt', 'log2']
    # define grid search
    grid = dict(n_estimators=n_estimators,max_features=max_features)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.952000 using {'max_features': 'log2', 'n_estimators': 1000}
    0.841000 (0.032078) with: {'max_features': 'sqrt', 'n_estimators': 10}
    0.938333 (0.020830) with: {'max_features': 'sqrt', 'n_estimators': 100}
    0.944667 (0.024998) with: {'max_features': 'sqrt', 'n_estimators': 1000}
    0.817667 (0.033235) with: {'max_features': 'log2', 'n_estimators': 10}
    0.940667 (0.021592) with: {'max_features': 'log2', 'n_estimators': 100}
    0.952000 (0.019562) with: {'max_features': 'log2', 'n_estimators': 1000}

Stochastic gradient boosting is also called the Gradient Boosting Machine (GBM), or is named for a specific implementation, such as XGBoost.

The gradient boosting algorithm has many parameters to tune.

There are some parameter pairings that are important to consider. The first is the learning rate, also called shrinkage or eta (*learning_rate*) and the number of trees in the model (*n_estimators*). Both could be considered on a log scale, although in different directions.

**learning_rate** in [0.001, 0.01, 0.1]

**n_estimators** in [10, 100, 1000]
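A back-of-the-envelope calculation shows why these two hyperparameters should be tuned together. The sketch below uses a deliberately idealized model, which is an assumption and not how a real ensemble behaves: if every tree fit the current residual perfectly and its contribution were scaled by the learning rate, then a fraction (1 - learning_rate) of the residual would remain after each tree.

```python
learning_rates = [0.001, 0.01, 0.1]
tree_counts = [10, 100, 1000]

# Fraction of the initial residual left after n trees in the idealized model.
remaining = {
    (lr, n): (1.0 - lr) ** n
    for lr in learning_rates
    for n in tree_counts
}

for (lr, n), fraction in sorted(remaining.items()):
    print("learning_rate=%.3f n_estimators=%4d remaining=%.6f" % (lr, n, fraction))
```

Under this simplification, a learning rate of 0.001 with only 10 trees barely moves the model, while the same rate with 1,000 trees makes substantial progress, which is why smaller learning rates generally call for more trees.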

Another pairing is the number of rows or subset of the data to consider for each tree (*subsample*) and the depth of each tree (*max_depth*). These could be grid searched at a 0.1 and 1 interval respectively, although common values can be tested directly.

**subsample** in [0.5, 0.7, 1.0]

**max_depth** in [3, 7, 9]

For more detailed advice on tuning the XGBoost implementation, see:

For the full list of hyperparameters, see the scikit-learn API documentation for GradientBoostingClassifier.

The example below demonstrates grid searching the key hyperparameters for GradientBoostingClassifier on a synthetic binary classification dataset.

    # example of grid searching key hyperparameters for GradientBoostingClassifier
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import GradientBoostingClassifier
    # define dataset
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # define models and parameters
    model = GradientBoostingClassifier()
    n_estimators = [10, 100, 1000]
    learning_rate = [0.001, 0.01, 0.1]
    subsample = [0.5, 0.7, 1.0]
    max_depth = [3, 7, 9]
    # define grid search
    grid = dict(learning_rate=learning_rate, n_estimators=n_estimators, subsample=subsample, max_depth=max_depth)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
    grid_result = grid_search.fit(X, y)
    # summarize results
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example prints the best result as well as the results from all combinations evaluated.

    Best: 0.936667 using {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.5}
    0.803333 (0.042058) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.5}
    0.783667 (0.042386) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 0.7}
    0.711667 (0.041157) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 10, 'subsample': 1.0}
    0.832667 (0.040244) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.5}
    0.809667 (0.040040) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
    0.741333 (0.043261) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
    0.881333 (0.034130) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.5}
    0.866667 (0.035150) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 0.7}
    0.838333 (0.037424) with: {'learning_rate': 0.001, 'max_depth': 3, 'n_estimators': 1000, 'subsample': 1.0}
    0.838333 (0.036614) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 0.5}
    0.821667 (0.040586) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 0.7}
    0.729000 (0.035903) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 10, 'subsample': 1.0}
    0.884667 (0.036854) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.5}
    0.871333 (0.035094) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 0.7}
    0.729000 (0.037625) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 100, 'subsample': 1.0}
    0.905667 (0.033134) with: {'learning_rate': 0.001, 'max_depth': 7, 'n_estimators': 1000, 'subsample': 0.5}
    ...

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the top hyperparameters and how to configure them for top machine learning algorithms.

Do you have other hyperparameter suggestions? Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop Super Learner Ensembles in Python appeared first on Machine Learning Mastery.

Selecting a machine learning algorithm for a predictive modeling problem involves evaluating many different models and model configurations using k-fold cross-validation.

The super learner is an ensemble machine learning algorithm that combines all of the models and model configurations that you might investigate for a predictive modeling problem and uses them to make a prediction as-good-as or better than any single model that you may have investigated.

The super learner algorithm is an application of stacked generalization, called stacking or blending, to k-fold cross-validation where all models use the same k-fold splits of the data and a meta-model is fit on the out-of-fold predictions from each model.

In this tutorial, you will discover the super learner ensemble machine learning algorithm.

After completing this tutorial, you will know:

- Super learner is the application of stacked generalization using out-of-fold predictions during k-fold cross-validation.
- The super learner ensemble algorithm is straightforward to implement in Python using scikit-learn models.
- The ML-Ensemble (mlens) library provides a convenient implementation that allows the super learner to be fit and used in just a few lines of code.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is the Super Learner?
- Manually Develop a Super Learner With scikit-learn
- Super Learner With ML-Ensemble Library

There are many hundreds of models to choose from for a predictive modeling problem; which one is best?

Then, after a model is chosen, how do you best configure it for your specific dataset?

These are open questions in applied machine learning. The best answer we have at the moment is to use empirical experimentation to test and discover what works best for your dataset.

In practice, it is generally impossible to know a priori which learner will perform best for a given prediction problem and data set.

— Super Learner, 2007.

This involves selecting many different algorithms that may be appropriate for your regression or classification problem and evaluating their performance on your dataset using a resampling technique, such as k-fold cross-validation.

The algorithm that performs the best on your dataset according to k-fold cross-validation is then selected, fit on all available data, and you can then start using it to make predictions.

There is an alternative approach.

Consider that you have already fit many different algorithms on your dataset, and some algorithms have been evaluated many times with different configurations. You may have many tens or hundreds of different models of your problem. Why not use all those models instead of the best model from the group?

This is the intuition behind the so-called “*super learner*” ensemble algorithm.

The super learner algorithm involves first pre-defining the k-fold split of your data, then evaluating all different algorithms and algorithm configurations on the same split of the data. All out-of-fold predictions are then kept and used to train a meta-model that learns how to best combine the predictions.

The algorithms may differ in the subset of the covariates used, the basis functions, the loss functions, the searching algorithm, and the range of tuning parameters, among others.

— Super Learner In Prediction, 2010.

The results of this model should be no worse than the best performing model evaluated during k-fold cross-validation, and it has a good chance of performing better than any single model.

The super learner algorithm was proposed by Mark van der Laan, Eric Polley, and Alan Hubbard from Berkeley in their 2007 paper titled “Super Learner.” It was published in a biological journal, which may have kept it out of view of the broader machine learning community.

The super learner technique is an example of the general method called “*stacked generalization*,” or “*stacking*” for short, and is known in applied machine learning as blending, as often a linear model is used as the meta-model.

The super learner is related to the stacking algorithm introduced in neural networks context …

— Super Learner In Prediction, 2010.

For more on the topic of stacking, see the posts:

- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python

We can think of the “*super learner*” as the application of stacking specifically to k-fold cross-validation.

I have sometimes seen this type of blending ensemble referred to as a cross-validation ensemble.

The procedure can be summarized as follows:

- 1. Select a k-fold split of the training dataset.
- 2. Select m base-models or model configurations.
- 3. For each base-model:
    - a. Evaluate using k-fold cross-validation.
    - b. Store all out-of-fold predictions.
    - c. Fit the model on the full training dataset and store.
- 4. Fit a meta-model on the out-of-fold predictions.
- 5. Evaluate the model on a holdout dataset or use model to make predictions.
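The steps above can be sketched in plain Python. This is a deliberately small illustration under stated assumptions, not the implementation from the paper: it uses one input variable, two toy base-models (a mean predictor and a one-variable least-squares line), and a meta-model that is just a single blending weight found by a coarse search. All function names are hypothetical.

```python
import random
from statistics import mean

def kfold_indices(n_rows, k, seed=1):
    """Step 1: shuffle the row indices once and deal them into k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Toy base-model: always predict the mean of the training targets."""
    mu = mean(ys)
    return lambda x: mu

def fit_line(xs, ys):
    """Toy base-model: one-variable least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def out_of_fold(fit, xs, ys, folds):
    """Steps 3a and 3b: predict each row with a model that never saw it."""
    oof = [None] * len(xs)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            oof[i] = model(xs[i])
    return oof

def fit_meta(oof_a, oof_b, ys, steps=100):
    """Step 4: choose weight w for w*a + (1-w)*b that minimizes squared error."""
    def sse(w):
        return sum((w * a + (1 - w) * b - y) ** 2 for a, b, y in zip(oof_a, oof_b, ys))
    return min((i / steps for i in range(steps + 1)), key=sse)

# Synthetic data: a noisy line (assumed purely for the demonstration).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [3 * x + rng.gauss(0, 0.1) for x in xs]

folds = kfold_indices(len(xs), k=10)
oof_mean = out_of_fold(fit_mean, xs, ys, folds)
oof_line = out_of_fold(fit_line, xs, ys, folds)
full_mean, full_line = fit_mean(xs, ys), fit_line(xs, ys)  # step 3c
w = fit_meta(oof_mean, oof_line, ys)

def predict(x):
    """Step 5: blend the full-data base-models using the learned weight."""
    return w * full_mean(x) + (1 - w) * full_line(x)

print("weight on mean model:", w, "prediction at 0.5:", predict(0.5))
```

Because the data really are linear, the blending weight on the mean predictor should come out close to zero, so the super learner effectively falls back on the better base-model.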

The image below, taken from the original paper, summarizes this data flow.

Let’s take a closer look at some common sticking points you may have with this procedure.

**Q. What are the inputs and outputs for the meta-model?**

The meta-model takes in predictions from base-models as input and predicts the target for the training dataset as output:

- **Input**: Predictions from base-models.
- **Output**: Prediction for training dataset.

For example, if we had 50 base-models, then one input sample would be a vector with 50 values, each value in the vector representing a prediction from one of the base-models for one sample of the training dataset.

If we had 1,000 examples (rows) in the training dataset and 50 models, then the input data for the meta-model would be 1,000 rows and 50 columns.
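The shape of this input can be checked with a short sketch. The predictions here are placeholder numbers, assumed only to show how per-model columns become per-sample rows.

```python
n_rows, n_models = 1000, 50

# One list of out-of-fold predictions per base-model (placeholder values).
predictions = [[float(m)] * n_rows for m in range(n_models)]

# Transpose so that each row holds the 50 predictions for one training sample.
meta_X = [list(row) for row in zip(*predictions)]

print(len(meta_X), "rows x", len(meta_X[0]), "columns")
```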

**Q. Won’t the meta-model overfit the training data?**

Probably not.

This is the trick of the super learner, and the stacked generalization procedure in general.

The input to the meta-model is the out-of-fold (out-of-sample) predictions. In aggregate, the out-of-fold predictions for a model represent the model’s skill or capability in making predictions on data not seen during training.

By training a meta-model on out-of-sample predictions of other models, the meta-model learns how to both correct the out-of-sample predictions for each model and to best combine the out-of-sample predictions from multiple models; actually, it does both tasks at the same time.

Importantly, to get an idea of the true capability of the meta-model, it must be evaluated on new out-of-sample data. That is, data not used to train the base models.

**Q. Can this work for regression and classification?**

Yes, it was described in the papers for regression (predicting a numerical value).

It can work just as well for classification (predicting a class label), although it is probably best to predict probabilities to give the meta-model more granularity when combining predictions.
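A tiny example illustrates the extra granularity. The probabilities below are made-up outputs from two hypothetical base-models for three rows; thresholding them at 0.5 (an assumed cut-off) gives the class labels.

```python
# Hypothetical predicted probabilities of class 1 for three rows.
probs = {
    "model_a": [0.51, 0.95, 0.20],
    "model_b": [0.49, 0.90, 0.45],
}

# Crisp labels obtained by thresholding at 0.5.
labels = {name: [int(p >= 0.5) for p in ps] for name, ps in probs.items()}

# Meta-model inputs: one row per sample, one column per base-model.
meta_X_probs = [list(row) for row in zip(*probs.values())]
meta_X_labels = [list(row) for row in zip(*labels.values())]

print(meta_X_probs)
print(meta_X_labels)
```

For the first row, the labels make the two models look like they flatly disagree, while the probabilities show that both are close to undecided. The meta-model can only use that information if it is given the probabilities.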

**Q. Why do we fit each base-model on the entire training dataset?**

Each base-model is fit on the entire training dataset so that the model can be used later to make predictions on new examples not seen during training.

This step is not strictly required until predictions are needed by the super learner.

**Q. How do we make a prediction?**

To make a prediction on a new sample (row of data), first, the row of data is provided as input to each base model to generate a prediction from each model.

The predictions from the base-models are then concatenated into a vector and provided as input to the meta-model. The meta-model then makes a final prediction for the row of data.

We can summarize this procedure as follows:

- 1. Take a sample not seen by the models during training.
- 2. For each base-model:
    - a. Make a prediction given the sample.
    - b. Store prediction.
- 3. Concatenate predictions from the base-models into a single vector.
- 4. Provide vector as input to the meta-model to make a final prediction.
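The prediction procedure is short enough to express as a single function. In this sketch the base-models and the meta-model are stand-in callables, assumed only for illustration; with scikit-learn models you would call their `predict()` methods instead.

```python
def super_learner_predict(base_models, meta_model, row):
    """Predict one row: collect base-model predictions, then apply the meta-model."""
    meta_row = [model(row) for model in base_models]  # steps 2 and 3
    return meta_model(meta_row)                       # step 4

# Stand-in models (hypothetical): each maps a row of inputs to a number.
base_models = [
    lambda row: sum(row),       # pretend base-model 1
    lambda row: 2.0 * row[0],   # pretend base-model 2
]
# Stand-in meta-model: a fixed weighted combination of the two predictions.
meta_model = lambda preds: 0.75 * preds[0] + 0.25 * preds[1]

prediction = super_learner_predict(base_models, meta_model, [1.0, 3.0])
print(prediction)
```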

Now that we are familiar with the super learner algorithm, let’s look at a worked example.

The Super Learner algorithm is relatively straightforward to implement on top of the scikit-learn Python machine learning library.

In this section, we will develop an example of super learning for both regression and classification that you can adapt to your own problems.

We will use the make_regression() test problem and generate 1,000 examples (rows) with 100 features (columns). This is a simple regression problem with a linear relationship between input and output, with added noise.

We will split the data so that 50 percent is used for training the model and 50 percent is held back to evaluate the final super model and base-models.

    ...
    # create the inputs and outputs
    X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

Next, we will define a bunch of different regression models.

In this case, we will use nine different algorithms with modest configuration. You can use any models or model configurations you like.

The *get_models()* function below defines all of the models and returns them as a list.

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LinearRegression())
        models.append(ElasticNet())
        models.append(SVR(gamma='scale'))
        models.append(DecisionTreeRegressor())
        models.append(KNeighborsRegressor())
        models.append(AdaBoostRegressor())
        models.append(BaggingRegressor(n_estimators=10))
        models.append(RandomForestRegressor(n_estimators=10))
        models.append(ExtraTreesRegressor(n_estimators=10))
        return models

Next, we will use k-fold cross-validation to make out-of-fold predictions that will be used as the dataset to train the meta-model or “*super learner*.”

This involves first splitting the data into k folds; we will use 10. For each fold, we will fit the model on the training part of the split and make out-of-fold predictions on the test part of the split. This is repeated for each model and all out-of-fold predictions are stored.

Each out-of-fold prediction will be a column for the meta-model input. For each fold of the data, we will collect one column of predictions from each algorithm and horizontally stack these columns. Then we will vertically stack the blocks from all folds into one long dataset with 500 rows and nine columns.
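The stacking described above can be illustrated with toy arrays before any models are involved. This is a minimal sketch: the number of models, the fold sizes, and the fake prediction values are all made up for illustration and stand in for calls to *predict()*.

```python
# toy illustration of how out-of-fold predictions are assembled
from numpy import arange
from numpy import hstack
from numpy import vstack

n_models = 3            # hypothetical number of base-models
fold_sizes = [4, 4, 2]  # hypothetical test-fold sizes (10 rows in total)

meta_X = list()
for fold, n_rows in enumerate(fold_sizes):
    fold_yhats = list()
    for m in range(n_models):
        # stand-in for model.predict(test_X): one value per row in the fold
        yhat = arange(n_rows, dtype=float) + 10 * m + 100 * fold
        fold_yhats.append(yhat.reshape(n_rows, 1))
    # one column per model, side by side: shape (n_rows, n_models)
    meta_X.append(hstack(fold_yhats))
# one block per fold, on top of each other: shape (total rows, n_models)
meta_X = vstack(meta_X)
print(meta_X.shape)
```

Each row of the result corresponds to one training example and each column to one base-model, which is the layout the meta-model expects.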

The *get_out_of_fold_predictions()* function below does this for a given training dataset and list of models; it will return the input and output dataset required to train the meta-model.

    # collect out of fold predictions from k-fold cross validation
    def get_out_of_fold_predictions(X, y, models):
        meta_X, meta_y = list(), list()
        # define split of data
        kfold = KFold(n_splits=10, shuffle=True)
        # enumerate splits
        for train_ix, test_ix in kfold.split(X):
            fold_yhats = list()
            # get data
            train_X, test_X = X[train_ix], X[test_ix]
            train_y, test_y = y[train_ix], y[test_ix]
            meta_y.extend(test_y)
            # fit and make predictions with each sub-model
            for model in models:
                model.fit(train_X, train_y)
                yhat = model.predict(test_X)
                # store columns
                fold_yhats.append(yhat.reshape(len(yhat),1))
            # store fold yhats as columns
            meta_X.append(hstack(fold_yhats))
        return vstack(meta_X), asarray(meta_y)

We can then call the function to get the models and the function to prepare the meta-model dataset.

    ...
    # get models
    models = get_models()
    # get out of fold predictions
    meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
    print('Meta ', meta_X.shape, meta_y.shape)
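As an aside, scikit-learn's *cross_val_predict()* function can produce the same kind of out-of-fold predictions. The sketch below is an alternative to the manual loop, not the approach used in this tutorial; the smaller dataset, the two models, and the fixed *random_state* values are assumptions chosen to keep the example quick.

```python
# sketch: out-of-fold predictions via cross_val_predict (an alternative to the manual loop)
from numpy import hstack
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# small illustrative dataset (sizes are assumptions)
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=1)
models = [LinearRegression(), KNeighborsRegressor()]
# a fixed seed so every model sees the same folds
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
columns = list()
for model in models:
    # each row is predicted by a copy of the model that did not see it during fitting
    yhat = cross_val_predict(model, X, y, cv=kfold)
    columns.append(yhat.reshape(len(yhat), 1))
meta_X = hstack(columns)
# predictions come back in the original row order, so y can be used as-is
meta_y = y
print('Meta', meta_X.shape, meta_y.shape)
```

Note that *cross_val_predict()* works on copies of each model, so the base-models still need to be fit on the entire training dataset afterwards.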

Next, we can fit all of the base-models on the entire training dataset.

    # fit all base models on the training dataset
    def fit_base_models(X, y, models):
        for model in models:
            model.fit(X, y)

Then, we can fit the meta-model on the prepared dataset.

In this case, we will use a linear regression model as the meta-model, as was used in the original paper.

    # fit a meta model
    def fit_meta_model(X, y):
        model = LinearRegression()
        model.fit(X, y)
        return model

Next, we can evaluate the base-models on the holdout dataset.

    # evaluate a list of models on a dataset
    def evaluate_models(X, y, models):
        for model in models:
            yhat = model.predict(X)
            mse = mean_squared_error(y, yhat)
            print('%s: RMSE %.3f' % (model.__class__.__name__, sqrt(mse)))

And, finally, use the super learner (base and meta-model) to make predictions on the holdout dataset and evaluate the performance of the approach.

The super_learner_predictions() function below will use the meta-model to make predictions for new data.

    # make predictions with stacked model
    def super_learner_predictions(X, models, meta_model):
        meta_X = list()
        for model in models:
            yhat = model.predict(X)
            meta_X.append(yhat.reshape(len(yhat),1))
        meta_X = hstack(meta_X)
        # predict
        return meta_model.predict(meta_X)

We can call this function and evaluate the results.

    ...
    # evaluate meta model
    yhat = super_learner_predictions(X_val, models, meta_model)
    print('Super Learner: RMSE %.3f' % (sqrt(mean_squared_error(y_val, yhat))))

Tying this all together, the complete example of a super learner algorithm for regression using scikit-learn models is listed below.

    # example of a super learner model for regression
    from math import sqrt
    from numpy import hstack
    from numpy import vstack
    from numpy import asarray
    from sklearn.datasets import make_regression
    from sklearn.model_selection import KFold
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import ElasticNet
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.ensemble import BaggingRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import ExtraTreesRegressor

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LinearRegression())
        models.append(ElasticNet())
        models.append(SVR(gamma='scale'))
        models.append(DecisionTreeRegressor())
        models.append(KNeighborsRegressor())
        models.append(AdaBoostRegressor())
        models.append(BaggingRegressor(n_estimators=10))
        models.append(RandomForestRegressor(n_estimators=10))
        models.append(ExtraTreesRegressor(n_estimators=10))
        return models

    # collect out of fold predictions from k-fold cross validation
    def get_out_of_fold_predictions(X, y, models):
        meta_X, meta_y = list(), list()
        # define split of data
        kfold = KFold(n_splits=10, shuffle=True)
        # enumerate splits
        for train_ix, test_ix in kfold.split(X):
            fold_yhats = list()
            # get data
            train_X, test_X = X[train_ix], X[test_ix]
            train_y, test_y = y[train_ix], y[test_ix]
            meta_y.extend(test_y)
            # fit and make predictions with each sub-model
            for model in models:
                model.fit(train_X, train_y)
                yhat = model.predict(test_X)
                # store columns
                fold_yhats.append(yhat.reshape(len(yhat),1))
            # store fold yhats as columns
            meta_X.append(hstack(fold_yhats))
        return vstack(meta_X), asarray(meta_y)

    # fit all base models on the training dataset
    def fit_base_models(X, y, models):
        for model in models:
            model.fit(X, y)

    # fit a meta model
    def fit_meta_model(X, y):
        model = LinearRegression()
        model.fit(X, y)
        return model

    # evaluate a list of models on a dataset
    def evaluate_models(X, y, models):
        for model in models:
            yhat = model.predict(X)
            mse = mean_squared_error(y, yhat)
            print('%s: RMSE %.3f' % (model.__class__.__name__, sqrt(mse)))

    # make predictions with stacked model
    def super_learner_predictions(X, models, meta_model):
        meta_X = list()
        for model in models:
            yhat = model.predict(X)
            meta_X.append(yhat.reshape(len(yhat),1))
        meta_X = hstack(meta_X)
        # predict
        return meta_model.predict(meta_X)

    # create the inputs and outputs
    X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
    # get models
    models = get_models()
    # get out of fold predictions
    meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
    print('Meta ', meta_X.shape, meta_y.shape)
    # fit base models
    fit_base_models(X, y, models)
    # fit the meta model
    meta_model = fit_meta_model(meta_X, meta_y)
    # evaluate base models
    evaluate_models(X_val, y_val, models)
    # evaluate meta model
    yhat = super_learner_predictions(X_val, models, meta_model)
    print('Super Learner: RMSE %.3f' % (sqrt(mean_squared_error(y_val, yhat))))

Running the example first reports the shape of the prepared dataset, then the shape of the dataset for the meta-model.

Next, the performance of each base-model is reported on the holdout dataset, and finally, the performance of the super learner on the holdout dataset.

Your specific results will differ given the stochastic nature of the dataset and learning algorithms. Try running the example a few times.

In this case, we can see that the linear models perform well on the dataset and the nonlinear algorithms not so well.

We can also see that the super learner out-performed all of the base-models.

    Train (500, 100) (500,) Test (500, 100) (500,)
    Meta (500, 9) (500,)
    LinearRegression: RMSE 0.548
    ElasticNet: RMSE 67.142
    SVR: RMSE 172.717
    DecisionTreeRegressor: RMSE 159.137
    KNeighborsRegressor: RMSE 154.064
    AdaBoostRegressor: RMSE 98.422
    BaggingRegressor: RMSE 108.915
    RandomForestRegressor: RMSE 115.637
    ExtraTreesRegressor: RMSE 105.749
    Super Learner: RMSE 0.546

You can imagine plugging in all kinds of different models into this example, including XGBoost and Keras deep learning models.
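For comparison, scikit-learn also ships a *StackingRegressor* class that fits a final estimator on cross-validated predictions of a set of base estimators. The sketch below is not the manual super learner developed above; it is a closely related built-in shown under assumptions of my own (a smaller dataset, three base-models, fixed seeds, and the short names given to each estimator).

```python
# sketch: scikit-learn's built-in stacking as a point of comparison
from math import sqrt
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# small illustrative dataset (sizes and seeds are assumptions)
X, y = make_regression(n_samples=300, n_features=20, noise=0.5, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.50, random_state=1)
# base-models are given as (name, estimator) pairs; the names are arbitrary
estimators = [
    ('lr', LinearRegression()),
    ('en', ElasticNet()),
    ('knn', KNeighborsRegressor()),
]
# the final estimator plays the role of the meta-model
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=10)
stack.fit(X_train, y_train)
yhat = stack.predict(X_val)
rmse = sqrt(mean_squared_error(y_val, yhat))
print('Stacking RMSE: %.3f' % rmse)
```

Writing the procedure by hand, as in this tutorial, makes each step visible; the built-in class trades that visibility for less code.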

Now that we have seen how to develop a super learner for regression, let’s look at an example for classification.

The super learner algorithm for classification is much the same.

The inputs to the meta learner can be class labels or class probabilities, with the latter more likely to be useful given the increased granularity or uncertainty captured in the predictions.

In this problem, we will use the make_blobs() test classification problem and use 1,000 examples with 100 input variables and two class labels.

    ...
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)

Next, we can change the *get_models()* function to define a suite of linear and nonlinear classification algorithms.

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LogisticRegression(solver='liblinear'))
        models.append(DecisionTreeClassifier())
        models.append(SVC(gamma='scale', probability=True))
        models.append(GaussianNB())
        models.append(KNeighborsClassifier())
        models.append(AdaBoostClassifier())
        models.append(BaggingClassifier(n_estimators=10))
        models.append(RandomForestClassifier(n_estimators=10))
        models.append(ExtraTreesClassifier(n_estimators=10))
        return models

Next, we can change the *get_out_of_fold_predictions()* function to predict probabilities by a call to the *predict_proba()* function.

    # collect out of fold predictions from k-fold cross validation
    def get_out_of_fold_predictions(X, y, models):
        meta_X, meta_y = list(), list()
        # define split of data
        kfold = KFold(n_splits=10, shuffle=True)
        # enumerate splits
        for train_ix, test_ix in kfold.split(X):
            fold_yhats = list()
            # get data
            train_X, test_X = X[train_ix], X[test_ix]
            train_y, test_y = y[train_ix], y[test_ix]
            meta_y.extend(test_y)
            # fit and make predictions with each sub-model
            for model in models:
                model.fit(train_X, train_y)
                yhat = model.predict_proba(test_X)
                # store columns
                fold_yhats.append(yhat)
            # store fold yhats as columns
            meta_X.append(hstack(fold_yhats))
        return vstack(meta_X), asarray(meta_y)
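Because *predict_proba()* returns one column per class, each base-model contributes two columns to the meta-model input on a two-class problem. With nine base-models that gives 18 columns, which is why the meta dataset is reported with 18 columns in the output further down. A minimal sketch of the effect, assuming a small made-up dataset and only two models, fit and predicted on the same rows purely to show the shapes:

```python
# sketch: predict_proba() yields one column per class for each model
from numpy import hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# small illustrative two-class dataset (sizes are assumptions)
X, y = make_blobs(n_samples=100, centers=2, n_features=5, cluster_std=5, random_state=1)
models = [LogisticRegression(solver='liblinear'), GaussianNB()]
probs = list()
for model in models:
    # fit and predict on the same rows, only to inspect the output shape
    model.fit(X, y)
    yhat = model.predict_proba(X)
    print(model.__class__.__name__, yhat.shape)
    probs.append(yhat)
# two models x two classes = four columns
meta_X = hstack(probs)
print('Meta', meta_X.shape)
```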

A Logistic Regression algorithm instead of a Linear Regression algorithm will be used as the meta-algorithm in the *fit_meta_model()* function.

    # fit a meta model
    def fit_meta_model(X, y):
        model = LogisticRegression(solver='liblinear')
        model.fit(X, y)
        return model

And classification accuracy will be used to report model performance.

The complete example of the super learner algorithm for classification using scikit-learn models is listed below.

    # example of a super learner model for binary classification
    from numpy import hstack
    from numpy import vstack
    from numpy import asarray
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import KFold
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LogisticRegression(solver='liblinear'))
        models.append(DecisionTreeClassifier())
        models.append(SVC(gamma='scale', probability=True))
        models.append(GaussianNB())
        models.append(KNeighborsClassifier())
        models.append(AdaBoostClassifier())
        models.append(BaggingClassifier(n_estimators=10))
        models.append(RandomForestClassifier(n_estimators=10))
        models.append(ExtraTreesClassifier(n_estimators=10))
        return models

    # collect out of fold predictions from k-fold cross validation
    def get_out_of_fold_predictions(X, y, models):
        meta_X, meta_y = list(), list()
        # define split of data
        kfold = KFold(n_splits=10, shuffle=True)
        # enumerate splits
        for train_ix, test_ix in kfold.split(X):
            fold_yhats = list()
            # get data
            train_X, test_X = X[train_ix], X[test_ix]
            train_y, test_y = y[train_ix], y[test_ix]
            meta_y.extend(test_y)
            # fit and make predictions with each sub-model
            for model in models:
                model.fit(train_X, train_y)
                yhat = model.predict_proba(test_X)
                # store columns
                fold_yhats.append(yhat)
            # store fold yhats as columns
            meta_X.append(hstack(fold_yhats))
        return vstack(meta_X), asarray(meta_y)

    # fit all base models on the training dataset
    def fit_base_models(X, y, models):
        for model in models:
            model.fit(X, y)

    # fit a meta model
    def fit_meta_model(X, y):
        model = LogisticRegression(solver='liblinear')
        model.fit(X, y)
        return model

    # evaluate a list of models on a dataset
    def evaluate_models(X, y, models):
        for model in models:
            yhat = model.predict(X)
            acc = accuracy_score(y, yhat)
            print('%s: %.3f' % (model.__class__.__name__, acc*100))

    # make predictions with stacked model
    def super_learner_predictions(X, models, meta_model):
        meta_X = list()
        for model in models:
            yhat = model.predict_proba(X)
            meta_X.append(yhat)
        meta_X = hstack(meta_X)
        # predict
        return meta_model.predict(meta_X)

    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
    # get models
    models = get_models()
    # get out of fold predictions
    meta_X, meta_y = get_out_of_fold_predictions(X, y, models)
    print('Meta ', meta_X.shape, meta_y.shape)
    # fit base models
    fit_base_models(X, y, models)
    # fit the meta model
    meta_model = fit_meta_model(meta_X, meta_y)
    # evaluate base models
    evaluate_models(X_val, y_val, models)
    # evaluate meta model
    yhat = super_learner_predictions(X_val, models, meta_model)
    print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))

As before, the shape of the dataset and the prepared meta dataset is reported, followed by the performance of the base-models on the holdout dataset and finally the super model itself on the holdout dataset.

Your specific results will differ given the stochastic nature of the dataset and learning algorithms. Try running the example a few times.

In this case, we can see that the super learner has slightly better performance than the base learner algorithms.

    Train (500, 100) (500,) Test (500, 100) (500,)
    Meta (500, 18) (500,)
    LogisticRegression: 96.600
    DecisionTreeClassifier: 74.400
    SVC: 97.400
    GaussianNB: 97.800
    KNeighborsClassifier: 95.400
    AdaBoostClassifier: 93.200
    BaggingClassifier: 84.400
    RandomForestClassifier: 82.800
    ExtraTreesClassifier: 82.600
    Super Learner: 98.000

Implementing the super learner manually is a good exercise but is not ideal.

We may introduce bugs in the implementation and the example as listed does not make use of multiple cores to speed up the execution.

Thankfully, Sebastian Flennerhag provides an efficient and tested implementation of the Super Learner algorithm and other ensemble algorithms in his ML-Ensemble (mlens) Python library. It is specifically designed to work with scikit-learn models.

First, the library must be installed, which can be achieved via pip, as follows:

sudo pip install mlens

Next, a SuperLearner class can be defined, models added via a call to the *add()* function, the meta learner added via a call to the *add_meta()* function, then the model used like any other scikit-learn model.

    ...
    # configure model
    ensemble = SuperLearner(...)
    # add list of base learners
    ensemble.add(...)
    # add meta learner
    ensemble.add_meta(...)
    # use model
    ...

We can use this class on the regression and classification problems from the previous section.

First, we can define a function to calculate RMSE for our problem that the super learner can use to evaluate base-models.

    # cost function for base models
    def rmse(yreal, yhat):
        return sqrt(mean_squared_error(yreal, yhat))
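As a quick sanity check, the RMSE function can be applied to a handful of values where the answer is easy to work out by hand. The numbers below are made up for illustration.

```python
# sanity check of the RMSE cost function on hand-checkable values
from math import sqrt
from sklearn.metrics import mean_squared_error

# cost function for base models
def rmse(yreal, yhat):
    return sqrt(mean_squared_error(yreal, yhat))

# made-up expected and predicted values
yreal = [1.0, 2.0, 3.0, 4.0]
yhat = [1.0, 2.0, 3.0, 6.0]
# squared errors are 0, 0, 0 and 4; their mean is 1.0, and its square root is 1.0
score = rmse(yreal, yhat)
print(score)
```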

Next, we can configure the SuperLearner with 10-fold cross-validation, our evaluation function, and the use of the entire training dataset when preparing out-of-fold predictions to use as input for the meta-model.

The *get_super_learner()* function below implements this.

    # create the super learner
    def get_super_learner(X):
        ensemble = SuperLearner(scorer=rmse, folds=10, shuffle=True, sample_size=len(X))
        # add base models
        models = get_models()
        ensemble.add(models)
        # add the meta model
        ensemble.add_meta(LinearRegression())
        return ensemble

We can then fit the model on the training dataset.

    ...
    # fit the super learner
    ensemble.fit(X, y)

Once fit, we can get a nice report of the performance of each of the base-models on the training dataset using k-fold cross-validation by accessing the “*data*” attribute on the model.

    ...
    # summarize base learners
    print(ensemble.data)

And that’s all there is to it.

Tying this together, the complete example of evaluating a super learner using the mlens library for regression is listed below.

    # example of a super learner for regression using the mlens library
    from math import sqrt
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import ElasticNet
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.ensemble import BaggingRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import ExtraTreesRegressor
    from mlens.ensemble import SuperLearner

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LinearRegression())
        models.append(ElasticNet())
        models.append(SVR(gamma='scale'))
        models.append(DecisionTreeRegressor())
        models.append(KNeighborsRegressor())
        models.append(AdaBoostRegressor())
        models.append(BaggingRegressor(n_estimators=10))
        models.append(RandomForestRegressor(n_estimators=10))
        models.append(ExtraTreesRegressor(n_estimators=10))
        return models

    # cost function for base models
    def rmse(yreal, yhat):
        return sqrt(mean_squared_error(yreal, yhat))

    # create the super learner
    def get_super_learner(X):
        ensemble = SuperLearner(scorer=rmse, folds=10, shuffle=True, sample_size=len(X))
        # add base models
        models = get_models()
        ensemble.add(models)
        # add the meta model
        ensemble.add_meta(LinearRegression())
        return ensemble

    # create the inputs and outputs
    X, y = make_regression(n_samples=1000, n_features=100, noise=0.5)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
    # create the super learner
    ensemble = get_super_learner(X)
    # fit the super learner
    ensemble.fit(X, y)
    # summarize base learners
    print(ensemble.data)
    # evaluate meta model
    yhat = ensemble.predict(X_val)
    print('Super Learner: RMSE %.3f' % (rmse(y_val, yhat)))

Running the example first reports the RMSE (score-m) for each base-model, then reports the RMSE for the super learner itself.

Fitting and evaluating is very fast given the use of multi-threading in the backend allowing all cores of your machine to be used.

Your specific results will differ given the stochastic nature of the dataset and learning algorithms. Try running the example a few times.

In this case, we can see that the super learner performs well.

Note that we cannot compare the base learner scores in the table to the super learner as the base learners were evaluated on the training dataset only, not the holdout dataset.

    [MLENS] backend: threading
    Train (500, 100) (500,) Test (500, 100) (500,)
                                      score-m  score-s  ft-m  ft-s  pt-m  pt-s
    layer-1  adaboostregressor          86.67     9.35  0.56  0.02  0.03  0.01
    layer-1  baggingregressor           94.46    11.70  0.22  0.01  0.01  0.00
    layer-1  decisiontreeregressor     137.99    12.29  0.03  0.00  0.00  0.00
    layer-1  elasticnet                 62.79     5.51  0.01  0.00  0.00  0.00
    layer-1  extratreesregressor        84.18     7.87  0.15  0.03  0.00  0.01
    layer-1  kneighborsregressor       152.42     9.85  0.00  0.00  0.00  0.00
    layer-1  linearregression            0.59     0.07  0.02  0.01  0.00  0.00
    layer-1  randomforestregressor      93.19    10.10  0.20  0.02  0.00  0.00
    layer-1  svr                       162.56    12.48  0.03  0.00  0.00  0.00

    Super Learner: RMSE 0.571

The ML-Ensemble is also very easy to use for classification problems, following the same general pattern.

In this case, we will use our list of classifier models and a logistic regression model as the meta-model.

The complete example of fitting and evaluating a super learner model for a test classification problem with the mlens library is listed below.

    # example of a super learner using the mlens library
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import BaggingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from mlens.ensemble import SuperLearner

    # create a list of base-models
    def get_models():
        models = list()
        models.append(LogisticRegression(solver='liblinear'))
        models.append(DecisionTreeClassifier())
        models.append(SVC(gamma='scale', probability=True))
        models.append(GaussianNB())
        models.append(KNeighborsClassifier())
        models.append(AdaBoostClassifier())
        models.append(BaggingClassifier(n_estimators=10))
        models.append(RandomForestClassifier(n_estimators=10))
        models.append(ExtraTreesClassifier(n_estimators=10))
        return models

    # create the super learner
    def get_super_learner(X):
        ensemble = SuperLearner(scorer=accuracy_score, folds=10, shuffle=True, sample_size=len(X))
        # add base models
        models = get_models()
        ensemble.add(models)
        # add the meta model
        ensemble.add_meta(LogisticRegression(solver='lbfgs'))
        return ensemble

    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.50)
    print('Train', X.shape, y.shape, 'Test', X_val.shape, y_val.shape)
    # create the super learner
    ensemble = get_super_learner(X)
    # fit the super learner
    ensemble.fit(X, y)
    # summarize base learners
    print(ensemble.data)
    # make predictions on hold out set
    yhat = ensemble.predict(X_val)
    print('Super Learner: %.3f' % (accuracy_score(y_val, yhat) * 100))

Running the example summarizes the shape of the dataset, the performance of the base-models, and finally the performance of the super learner on the holdout dataset.

Again, we can see that the super learner performs well on this test problem, and more importantly, is fit and evaluated very quickly as compared to the manual example in the previous section.

    [MLENS] backend: threading
    Train (500, 100) (500,) Test (500, 100) (500,)
                                       score-m  score-s  ft-m  ft-s  pt-m  pt-s
    layer-1  adaboostclassifier           0.90     0.04  0.51  0.05  0.04  0.01
    layer-1  baggingclassifier            0.83     0.06  0.21  0.01  0.01  0.00
    layer-1  decisiontreeclassifier       0.68     0.07  0.03  0.00  0.00  0.00
    layer-1  extratreesclassifier         0.80     0.05  0.09  0.01  0.00  0.00
    layer-1  gaussiannb                   0.96     0.04  0.01  0.00  0.00  0.00
    layer-1  kneighborsclassifier         0.90     0.03  0.00  0.00  0.03  0.01
    layer-1  logisticregression           0.93     0.03  0.01  0.00  0.00  0.00
    layer-1  randomforestclassifier       0.81     0.06  0.09  0.03  0.00  0.00
    layer-1  svc                          0.96     0.03  0.10  0.01  0.00  0.00

    Super Learner: 97.400

This section provides more resources on the topic if you are looking to go deeper.

- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python
- How to Create a Bagging Ensemble of Deep Learning Models in Keras
- How to Use Out-of-Fold Predictions in Machine Learning

- Targeted Learning: Causal Inference for Observational and Experimental Data, 2011.
- Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies, 2018.

- Super Learner, 2007.
- Super Learner In Prediction, 2010.
- Super Learning, 2011.
- Super Learning, Slides.

- SuperLearner: Super Learner Prediction, CRAN.
- SuperLearner: Prediction model ensembling method, GitHub.
- Guide to SuperLearner, Vignette, 2017.

In this tutorial, you discovered the super learner ensemble machine learning algorithm.

Specifically, you learned:

- Super learner is the application of stacked generalization using out-of-fold predictions during k-fold cross-validation.
- The super learner ensemble algorithm is straightforward to implement in Python using scikit-learn models.
- The ML-Ensemble (mlens) library provides a convenient implementation that allows the super learner to be fit and used in just a few lines of code.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Super Learner Ensembles in Python appeared first on Machine Learning Mastery.

The post Develop an Intuition for Bayes Theorem With Worked Examples appeared first on Machine Learning Mastery.

Bayes Theorem provides a principled way for calculating a conditional probability.

It is a deceptively simple calculation, providing a method that is easy to use for scenarios where our intuition often fails.

The best way to develop an intuition for Bayes Theorem is to think about the meaning of the terms in the equation and to apply the calculation many times in a range of different real-world scenarios. This will provide the context for what is being calculated and examples that can be used as a starting point when applying the calculation in new scenarios in the future.

In this tutorial, you will discover an intuition for calculating Bayes Theorem by working through multiple realistic scenarios.

After completing this tutorial, you will know:

- Bayes Theorem is a technique for calculating a conditional probability.
- The common and helpful names used for the terms in the Bayes Theorem equation.
- How to work through three realistic scenarios using Bayes Theorem to find a solution.

Let’s get started.

This tutorial is divided into five parts; they are:

- Introduction to Bayes Theorem
- Naming the Terms in the Theorem
- Example 1: Elderly Fall and Death
- Example 2: Email and Spam Detection
- Example 3: Liars and Lie Detectors

Conditional probability is the probability of one event given the occurrence of another event, often described in terms of events *A* and *B* from two dependent random variables e.g. *X* and *Y*.

**Conditional Probability**: Probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B).

The conditional probability can be calculated using the joint probability; for example:

- P(A | B) = P(A and B) / P(B)

The conditional probability is not symmetrical; for example:

- P(A | B) != P(B | A)
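Both points can be checked with a small joint distribution over two binary events. This is a minimal sketch; the four joint probabilities are made-up values chosen only so that they sum to one.

```python
# conditional probabilities from a made-up joint distribution of two binary events
p_a_and_b = 0.10
p_a_and_not_b = 0.20
p_not_a_and_b = 0.30
p_not_a_and_not_b = 0.40

# marginal probabilities are sums over the joint
p_a = p_a_and_b + p_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
# P(B | A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a

print('P(A|B) = %.3f' % p_a_given_b)
print('P(B|A) = %.3f' % p_b_given_a)
```

The two conditional probabilities share the same joint probability in the numerator but divide by different marginals, which is why they differ.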

Nevertheless, one conditional probability can be calculated using the other conditional probability.

This is called Bayes Theorem, named for Reverend Thomas Bayes, and can be stated as follows:

- P(A|B) = P(B|A) * P(A) / P(B)

Bayes Theorem provides a principled way for calculating a conditional probability and an alternative to using the joint probability.

This alternate approach to calculating the conditional probability is useful either when the joint probability is challenging to calculate, or when the reverse conditional probability is available or easy to calculate.

**Bayes Theorem**: Principled way of calculating a conditional probability without the joint probability.

It is often the case that we do not have access to the denominator directly, e.g. P(B).

We can calculate it an alternative way; for example:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

This gives a formulation of Bayes Theorem that uses the alternate calculation of P(B), described below:

- P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))

**Note**: the denominator is simply the expansion we gave above.

As such, if we have P(A), then we can calculate P(not A) as its complement; for example:

- P(not A) = 1 – P(A)

Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for example:

- P(B|not A) = 1 – P(not B|not A)
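The expanded form and the two complements can be combined into one small function. This is a sketch under assumptions: the function name, its argument names, and the input probabilities are all made up for illustration.

```python
# sketch: Bayes Theorem using the expanded denominator and complements
def bayes_theorem_expanded(p_a, p_b_given_a, p_not_b_given_not_a):
    # P(not A) = 1 - P(A)
    p_not_a = 1.0 - p_a
    # P(B|not A) = 1 - P(not B|not A)
    p_b_given_not_a = 1.0 - p_not_b_given_not_a
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    # P(A|B) = P(B|A) * P(A) / P(B)
    return (p_b_given_a * p_a) / p_b

# made-up inputs: P(A) = 0.30, P(B|A) = 0.80, P(not B|not A) = 0.90
result = bayes_theorem_expanded(0.30, 0.80, 0.90)
print('P(A|B) = %.3f%%' % (result * 100))
```

Here P(B) works out to 0.80 * 0.30 + 0.10 * 0.70 = 0.31, so the result is 0.24 / 0.31.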

Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the meaning of the terms in the equation.

The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.

It can be helpful to think about the calculation from these different perspectives, as this can help you map your problem onto the equation.

Firstly, in general, the result P(A|B) is referred to as the **posterior probability** and P(A) is referred to as the **prior probability**.

- P(A|B): Posterior probability.
- P(A): Prior probability.

Sometimes P(B|A) is referred to as the **likelihood** and P(B) is referred to as the **evidence**.

- P(B|A): Likelihood.
- P(B): Evidence.

This allows Bayes Theorem to be restated as:

- Posterior = Likelihood * Prior / Evidence

We can make this clear with a smoke and fire case.

What is the probability that there is fire given that there is smoke?

Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the evidence:

- P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)

You can imagine the same situation with rain and clouds.

We can also think about the calculation in the terms of a binary classifier.

For example, P(B|A) may be referred to as the True Positive Rate (TPR) or the **sensitivity**, P(B|not A) may be referred to as the False Positive Rate (FPR), the complement P(not B|not A) may be referred to as the True Negative Rate (TNR) or **specificity**, and the value we are calculating P(A|B) may be referred to as the Positive Predictive Value (PPV) or **precision**.

- P(not B|not A): True Negative Rate or TNR (**specificity**).
- P(B|not A): False Positive Rate or FPR.
- P(not B|A): False Negative Rate or FNR.
- P(B|A): True Positive Rate or TPR (**sensitivity** or recall).
- P(A|B): Positive Predictive Value or PPV (**precision**).

For example, we may re-state the calculation using these terms as follows:

- PPV = (TPR * P(A)) / (TPR * P(A) + FPR * P(not A))
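The restated calculation can be checked against counts from a confusion matrix, where the positive predictive value can also be read off directly. The counts below are made up for illustration.

```python
# check: PPV computed via Bayes Theorem matches PPV computed from raw counts
# made-up confusion matrix counts
tp, fn, fp, tn = 90, 10, 45, 855
total = tp + fn + fp + tn

# P(A): base rate of the positive class
p_a = (tp + fn) / total
# P(B|A): true positive rate (sensitivity)
tpr = tp / (tp + fn)
# P(B|not A): false positive rate
fpr = fp / (fp + tn)

# PPV via the restated Bayes Theorem
ppv_bayes = (tpr * p_a) / (tpr * p_a + fpr * (1.0 - p_a))
# PPV directly from the counts
ppv_direct = tp / (tp + fp)
print('PPV via Bayes: %.3f, PPV from counts: %.3f' % (ppv_bayes, ppv_direct))
```

Both routes give the same value because multiplying the rates by the base rates recovers the proportions of true positives and false positives in the data.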

This is a useful perspective on Bayes Theorem and is elaborated further in the tutorial:

Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at some scenarios where we can calculate it.

Note that all of the following examples are contrived; they are not based on real-world probabilities.

Consider the case where an elderly person (over 80 years of age) falls; what is the probability that they will die from the fall?

Let’s assume that the base rate of an elderly person dying, P(A), is 10%, that the base rate of an elderly person falling, P(B), is 5%, and that 7% of the elderly people who die had a fall, P(B|A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Die|Fall) = P(Fall|Die) * P(Die) / P(Fall)

or

- P(Die|Fall) = 0.07 * 0.10 / 0.05
- P(Die|Fall) = 0.14

That is, if an elderly person falls, then there is a 14 percent probability that they will die from the fall.

To make this concrete, we can perform the calculation in Python, first defining what we know, then using Bayes Theorem to calculate the outcome.

The complete example is listed below.

    # calculate P(A|B) given P(B|A), P(A) and P(B)
    def bayes_theorem(p_a, p_b, p_b_given_a):
        # calculate P(A|B) = P(B|A) * P(A) / P(B)
        p_a_given_b = (p_b_given_a * p_a) / p_b
        return p_a_given_b

    # P(A)
    p_a = 0.10
    # P(B)
    p_b = 0.05
    # P(B|A)
    p_b_given_a = 0.07
    # calculate P(A|B)
    result = bayes_theorem(p_a, p_b, p_b_given_a)
    # summarize
    print('P(A|B) = %.3f%%' % (result * 100))

Running the example confirms the value we calculated manually.

    P(A|B) = 14.000%

Consider the case where we receive an email and the spam detector puts it in the spam folder; what is the probability it was spam?

Let’s assume some details such as 2 percent of the email we receive is spam P(A). Let’s assume that the spam detector is really good and when an email is spam, it detects it P(B|A) with an accuracy of 99 percent, and when an email is not spam, it will mark it as spam with a very low rate of 0.1 percent P(B|not A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Spam|Detected) = P(Detected|Spam) * P(Spam) / P(Detected)

or

- P(Spam|Detected) = 0.99 * 0.02 / P(Detected)

We don’t know P(B), that is P(Detected), but we can calculate it using:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

Or in terms of our problem:

- P(Detected) = P(Detected|Spam) * P(Spam) + P(Detected|not Spam) * P(not Spam)

We know P(Detected|not Spam), which is 0.1 percent and we can calculate P(not Spam) as 1 – P(Spam); for example:

- P(not Spam) = 1 – P(Spam)
- P(not Spam) = 1 – 0.02
- P(not Spam) = 0.98

Therefore, we can calculate P(Detected) as:

- P(Detected) = 0.99 * 0.02 + 0.001 * 0.98
- P(Detected) = 0.0198 + 0.00098
- P(Detected) = 0.02078

That is, about 2 percent of all emails are detected as spam, regardless of whether they are spam or not.

Now we can calculate the answer as:

- P(Spam|Detected) = 0.99 * 0.02 / 0.02078
- P(Spam|Detected) = 0.0198 / 0.02078
- P(Spam|Detected) = 0.95283926852743

That is, if an email is in the spam folder, there is a 95.3 percent probability that it is, in fact, spam.

Again, let’s confirm this result by calculating it with an example in Python.

The complete example is listed below.

    # calculate the probability of an email in the spam folder being spam

    # calculate P(A|B) given P(A), P(B|A), P(B|not A)
    def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
        # calculate P(not A)
        not_a = 1 - p_a
        # calculate P(B)
        p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
        # calculate P(A|B)
        p_a_given_b = (p_b_given_a * p_a) / p_b
        return p_a_given_b

    # P(A)
    p_a = 0.02
    # P(B|A)
    p_b_given_a = 0.99
    # P(B|not A)
    p_b_given_not_a = 0.001
    # calculate P(A|B)
    result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)
    # summarize
    print('P(A|B) = %.3f%%' % (result * 100))

Running the example gives the same result, confirming our manual calculation.

P(A|B) = 95.284%
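Another way to build intuition for this number is to simulate it. The sketch below is an assumption-laden illustration rather than part of the calculation: it draws a large number of imaginary emails using the three rates from the scenario, then counts what fraction of the emails sent to the spam folder really were spam. The seed and the number of emails are arbitrary choices.

```python
# sketch: estimate P(Spam|Detected) by simulation instead of Bayes Theorem
from random import random
from random import seed

# fix the seed so the run is repeatable (arbitrary choice)
seed(1)
n_emails = 200000
detected = 0
detected_and_spam = 0
for _ in range(n_emails):
    # P(Spam) = 0.02
    is_spam = random() < 0.02
    # P(Detected|Spam) = 0.99, P(Detected|not Spam) = 0.001
    p_detect = 0.99 if is_spam else 0.001
    if random() < p_detect:
        detected += 1
        if is_spam:
            detected_and_spam += 1
# fraction of detected emails that were actually spam
estimate = detected_and_spam / detected
print('Simulated P(Spam|Detected) = %.3f' % estimate)
```

The estimate should land close to the analytical result, with some sampling noise that shrinks as the number of simulated emails grows.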

Consider the case where a person is tested with a lie detector and the test suggests they are lying. What is the probability that the person is indeed lying?

Let’s assume some details, such as most people that are tested are telling the truth, such as 98 percent, meaning (1 – 0.98) or 2 percent are liars P(A). Let’s also assume that when someone is lying, that the test can detect them well, but not great, such as 72 percent of the time P(B|A). Let’s also assume that when the machine says they are not lying, this is true 97 percent of the time P(not B | not A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Lying|Positive) = P(Positive|Lying) * P(Lying) / P(Positive)

Or:

- P(Lying|Positive) = 0.72 * 0.02 / P(Positive)

Again, we don’t know P(B), or in this case how often the detector returns a positive result in general.

We can calculate this using the formula:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

Or:

- P(Positive) = P(Positive|Lying) * P(Lying) + P(Positive|not Lying) * P(not Lying)

Or, with numbers:

- P(Positive) = 0.72 * 0.02 + P(Positive|not Lying) * (1 – 0.02)
- P(Positive) = 0.72 * 0.02 + P(Positive|not Lying) * 0.98

In this case, we don’t know the probability of a positive detection result given that the person was not lying; that is we don’t know the false positive rate or the false alarm rate.

This can be calculated as follows:

- P(B|not A) = 1 – P(not B|not A)

Or:

- P(Positive|not Lying) = 1 – P(not Positive|not Lying)
- P(Positive|not Lying) = 1 – 0.97
- P(Positive|not Lying) = 0.03

Therefore, we can calculate P(B) or P(Positive) as:

- P(Positive) = 0.72 * 0.02 + 0.03 * 0.98
- P(Positive) = 0.0144 + 0.0294
- P(Positive) = 0.0438

That is, the test returns a positive result about 4 percent of the time, regardless of whether the person is lying or not.

We can now calculate Bayes Theorem for this scenario:

- P(Lying|Positive) = 0.72 * 0.02 / 0.0438
- P(Lying|Positive) = 0.0144 / 0.0438
- P(Lying|Positive) = 0.328767123287671

That is, if the lie detector test comes back with a positive result, then there is a 32.9 percent probability that they are, in fact, lying. It’s a poor test!

Finally, let’s confirm this calculation in Python.

The complete example is listed below.

    # calculate the probability of a person lying given a positive lie detector result

    # calculate P(A|B) given P(A), P(B|A), P(not B|not A)
    def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
        # calculate P(not A)
        not_a = 1 - p_a
        # calculate P(B|not A)
        p_b_given_not_a = 1 - p_not_b_given_not_a
        # calculate P(B)
        p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
        # calculate P(A|B)
        p_a_given_b = (p_b_given_a * p_a) / p_b
        return p_a_given_b

    # P(A), base rate
    p_a = 0.02
    # P(B|A)
    p_b_given_a = 0.72
    # P(not B| not A)
    p_not_b_given_not_a = 0.97
    # calculate P(A|B)
    result = bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a)
    # summarize
    print('P(A|B) = %.3f%%' % (result * 100))

Running the example gives the same result, confirming our manual calculation.

P(A|B) = 32.877%
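Much of the reason the result is so low is the small base rate of liars. As a rough sketch, we can re-run the same calculation while varying P(A) and holding the detector's rates fixed. Only the 2 percent base rate comes from the scenario; the other base rates are hypothetical values added for comparison.

```python
# sketch: how the base rate P(A) changes P(A|B) for the same lie detector
def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a):
    # calculate P(not A)
    not_a = 1 - p_a
    # calculate P(B|not A)
    p_b_given_not_a = 1 - p_not_b_given_not_a
    # calculate P(B)
    p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
    # calculate P(A|B)
    return (p_b_given_a * p_a) / p_b

# only 0.02 is from the scenario; the remaining base rates are hypothetical
base_rates = [0.02, 0.10, 0.25, 0.50]
results = [bayes_theorem(p_a, 0.72, 0.97) for p_a in base_rates]
for p_a, result in zip(base_rates, results):
    print('P(A) = %.2f, P(A|B) = %.3f%%' % (p_a, result * 100))
```

The same detector looks far more convincing when lying is common among the people tested, which is why the prior cannot be ignored when interpreting a positive result.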

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered an intuition for calculating Bayes Theorem by working through multiple realistic scenarios.

Specifically, you learned:

- Bayes Theorem is a technique for calculating a conditional probability.
- The common and helpful names used for the terms in the Bayes Theorem equation.
- How to work through three realistic scenarios using Bayes Theorem to find a solution.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Develop an Intuition for Bayes Theorem With Worked Examples appeared first on Machine Learning Mastery.

The post How to Use Out-of-Fold Predictions in Machine Learning appeared first on Machine Learning Mastery.

Machine learning algorithms are typically evaluated using resampling techniques such as k-fold cross-validation.

During the k-fold cross-validation process, predictions are made on test sets comprised of data not used to train the model. These predictions are referred to as **out-of-fold predictions**, a type of out-of-sample prediction.

Out-of-fold predictions play an important role in machine learning, both in estimating the performance of a model when making predictions on new data in the future (the so-called generalization performance of the model) and in the development of ensemble models.

In this tutorial, you will discover a gentle introduction to out-of-fold predictions in machine learning.

After completing this tutorial, you will know:

- Out-of-fold predictions are a type of out-of-sample prediction made on data not used to train a model.
- Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
- Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Are Out-of-Fold Predictions?
- Out-of-Fold Predictions for Evaluation
- Out-of-Fold Predictions for Ensembles

It is common to evaluate the performance of a machine learning algorithm on a dataset using a resampling technique such as k-fold cross-validation.

The k-fold cross-validation procedure involves splitting a training dataset into *k* groups, then using each of the *k* groups of examples in turn as a test set while the remaining examples are used as a training set.

This means that *k* different models are trained and evaluated. The performance of the model is estimated using the predictions by the models made across all k-folds.

This procedure can be summarized as follows:

- 1. Shuffle the dataset randomly.
- 2. Split the dataset into k groups.
- 3. For each unique group:
    - a. Take the group as a holdout or test data set.
    - b. Take the remaining groups as a training data set.
    - c. Fit a model on the training set and evaluate it on the test set.
    - d. Retain the evaluation score and discard the model.
- 4. Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the holdout set 1 time and used to train the model k-1 times.
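This property is easy to check directly. The sketch below uses scikit-learn's KFold on a tiny made-up dataset of 20 rows with k=5 (both sizes are arbitrary assumptions) and counts how often each row index lands in a test set and in a training set.

```python
# sketch: confirm each example is held out once and used for training k-1 times
from collections import Counter
from numpy import arange
from sklearn.model_selection import KFold

# tiny placeholder dataset; only the row indexes matter here
n_examples, k = 20, 5
X = arange(n_examples).reshape((n_examples, 1))
test_counts, train_counts = Counter(), Counter()
kfold = KFold(n_splits=k, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    test_counts.update(test_ix.tolist())
    train_counts.update(train_ix.tolist())
# every row should be in a test set once and a training set k-1 times
held_out_once = all(test_counts[i] == 1 for i in range(n_examples))
trained_k_minus_1 = all(train_counts[i] == k - 1 for i in range(n_examples))
print(held_out_once, trained_k_minus_1)
```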

For more on the topic of k-fold cross-validation, see the tutorial:

An out-of-fold prediction is a prediction by the model during the k-fold cross-validation procedure.

That is, out-of-fold predictions are those predictions made on the holdout datasets during the resampling procedure. If performed correctly, there will be one prediction for each example in the training dataset.

Sometimes, out-of-fold is summarized with the acronym OOF.

**Out-of-Fold Predictions**: Predictions made by models during the k-fold cross-validation procedure on the holdout examples.
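If you only need the out-of-fold predictions themselves, scikit-learn provides the cross_val_predict() function, which runs the cross-validation loop and returns one prediction per training example. The sketch below is one possible way to use it; the synthetic dataset settings, the choice of a KNeighborsClassifier, and the random seeds are assumptions made for illustration.

```python
# sketch: collect out-of-fold predictions with cross_val_predict
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# synthetic binary classification dataset (settings are illustrative)
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
# each example is predicted by the model that did not see it during training
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
oof_yhat = cross_val_predict(KNeighborsClassifier(), X, y, cv=kfold)
# one prediction per row, in the same order as the rows of X
print(oof_yhat.shape)
```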

The notion of out-of-fold predictions is directly related to the idea of **out-of-sample predictions**, as the predictions in both cases are made on examples that were not used during the training of the model and can be used to estimate the performance of the model when used to make predictions on new data.

As such, out-of-fold predictions are a type of out-of-sample prediction, although described in the context of a model evaluated using k-fold cross-validation.

**Out-of-Sample Predictions**: Predictions made by a model on data not used during the training of the model.

Out-of-sample predictions may also be referred to as holdout predictions.

There are two main uses for out-of-fold predictions; they are:

- Estimate the performance of the model on unseen data.
- Fit an ensemble model.

Let’s take a closer look at these two cases.

The most common use for out-of-fold predictions is to estimate the performance of the model.

That is, predictions on data that were not used to train the model can be made and evaluated using a scoring metric such as error or accuracy. This metric provides an estimate of the performance of the model when used to make predictions on new data, such as when the model will be used in practice to make predictions.

Generally, predictions made on data not used to train a model provide insight into how the model will generalize to new situations. As such, scores that evaluate these predictions are referred to as the generalized performance of a machine learning model.

There are two main approaches to using these predictions to estimate the performance of the model.

The first is to score the model on the predictions made during each fold, then calculate the average of those scores. For example, if we are evaluating a classification model, then classification accuracy can be calculated on each group of out-of-fold predictions, then the mean accuracy can be reported.

**Approach 1**: Estimate performance as the mean score estimated on each group of out-of-fold predictions.

The second approach is to consider that each example appears in exactly one test set. That is, each example in the training dataset has a single prediction made during the k-fold cross-validation process. As such, we can collect all predictions and compare them to their expected outcome and calculate a score directly across the entire training dataset.

**Approach 2**: Estimate performance using the aggregate of all out-of-fold predictions.

Both are reasonable approaches and the scores that result from each procedure should be approximately equivalent.

Calculating the mean from each group of out-of-sample predictions may be the most common approach, as the variance of the estimate can also be calculated as the standard deviation or standard error.

The *k* resampled estimates of performance are summarized (usually with the mean and standard error) …

— Page 70, Applied Predictive Modeling, 2013.
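To make the mean and standard error summary concrete, here is a minimal sketch. The ten scores are hypothetical values hard-coded for illustration, and using the sample standard deviation (`ddof=1`) is an assumption on my part rather than something the quote prescribes.

```python
# sketch: summarize k fold scores with the mean and standard error
from math import sqrt
from numpy import mean
from numpy import std

# hypothetical accuracy scores from k=10 folds
scores = [0.95, 0.92, 0.95, 0.95, 0.91, 0.97, 0.96, 0.96, 0.98, 0.91]
mean_s = mean(scores)
# sample standard deviation of the fold scores
std_s = std(scores, ddof=1)
# standard error of the mean: standard deviation divided by sqrt(k)
se_s = std_s / sqrt(len(scores))
print('Mean: %.3f, Standard Error: %.3f' % (mean_s, se_s))
```

Keep in mind that the fold scores are not independent, because the training sets overlap, so this standard error is best read as a rough guide rather than an exact interval.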

We can demonstrate the difference between these two approaches to evaluating models using out-of-fold predictions with a small worked example.

We will use the make_blobs() scikit-learn function to create a test binary classification problem with 1,000 examples, two classes, and 100 input features.

The example below prepares a data sample and summarizes the shape of the input and output elements of the dataset.

    # example of creating a test dataset
    from sklearn.datasets import make_blobs
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # summarize the shape of the arrays
    print(X.shape, y.shape)

Running the example prints the shape of the input data showing 1,000 rows of data with 100 columns or input features and the corresponding classification labels.

(1000, 100) (1000,)

Next, we can use *k*-fold cross-validation to evaluate a KNeighborsClassifier model.

We will use *k*=10 for the KFold object, the sensible default, fit a model on each training dataset, and evaluate it on each holdout fold.

Accuracy scores will be stored in a list across each model evaluation, and we will report the mean and standard deviation of these scores.

The complete example is listed below.

    # evaluate model by averaging performance across each fold
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # k-fold cross validation
    scores = list()
    kfold = KFold(n_splits=10, shuffle=True)
    # enumerate splits
    for train_ix, test_ix in kfold.split(X):
        # get data
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        # fit model
        model = KNeighborsClassifier()
        model.fit(train_X, train_y)
        # evaluate model
        yhat = model.predict(test_X)
        acc = accuracy_score(test_y, yhat)
        # store score
        scores.append(acc)
        print('> ', acc)
    # summarize model performance
    mean_s, std_s = mean(scores), std(scores)
    print('Mean: %.3f, Standard Deviation: %.3f' % (mean_s, std_s))

Running the example reports the model classification accuracy on the holdout fold for each iteration.

At the end of the run, the mean and standard deviation of the accuracy scores are reported.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

    > 0.95
    > 0.92
    > 0.95
    > 0.95
    > 0.91
    > 0.97
    > 0.96
    > 0.96
    > 0.98
    > 0.91
    Mean: 0.946, Standard Deviation: 0.023

We can contrast this with the alternate approach that evaluates all predictions as a single group.

Instead of evaluating the model on each holdout fold, predictions are made and stored in a list. Then, at the end of the run, the predictions are compared to the expected values for each holdout test set and a single accuracy score is reported.

The complete example is listed below.

    # evaluate model by calculating the score across all predictions
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # k-fold cross validation
    data_y, data_yhat = list(), list()
    kfold = KFold(n_splits=10, shuffle=True)
    # enumerate splits
    for train_ix, test_ix in kfold.split(X):
        # get data
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        # fit model
        model = KNeighborsClassifier()
        model.fit(train_X, train_y)
        # make predictions
        yhat = model.predict(test_X)
        # store
        data_y.extend(test_y)
        data_yhat.extend(yhat)
    # evaluate the model
    acc = accuracy_score(data_y, data_yhat)
    print('Accuracy: %.3f' % (acc))

Running the example collects all of the expected and predicted values for each holdout dataset and reports a single accuracy score at the end of the run.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

Accuracy: 0.930

Again, both approaches are comparable and it may be a matter of taste as to the method you use on your own predictive modeling problem.
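Note that the two sample runs above used different random datasets and different random splits, so their scores are not directly comparable. The sketch below computes both estimates from the same split by fixing the random seed (`random_state=1` is an arbitrary assumption). With 1,000 examples and 10 folds, every fold holds exactly 100 examples, and in that equal-fold-size case the mean of the per-fold accuracies and the pooled accuracy work out to the same number.

```python
# sketch: compute both estimates from the same folds
from numpy import mean
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# fixed seeds so both approaches see identical data and splits
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores, data_y, data_yhat = list(), list(), list()
for train_ix, test_ix in kfold.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_ix], y[train_ix])
    yhat = model.predict(X[test_ix])
    # approach 1: score each fold separately
    scores.append(accuracy_score(y[test_ix], yhat))
    # approach 2: pool all out-of-fold predictions
    data_y.extend(y[test_ix])
    data_yhat.extend(yhat)
mean_acc = mean(scores)
pooled_acc = accuracy_score(data_y, data_yhat)
print('Mean of folds: %.3f, Pooled: %.3f' % (mean_acc, pooled_acc))
```

The two estimates can differ when the folds have unequal sizes, or when the metric is not a simple per-example average.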

Another common use for out-of-fold predictions is to use them in the development of an ensemble model.

An ensemble is a machine learning model that combines the predictions from two or more models prepared on the same training dataset.

This is a very common procedure to use when working on a machine learning competition.

The out-of-fold predictions in aggregate provide information about how the model performs on each example in the training dataset when not used to train the model. This information can be used to train a model to correct or improve upon those predictions.

First, the *k*-fold cross-validation procedure is performed on each base model of interest, and all of the out-of-fold predictions are collected. Importantly, the same split of the training data into *k*-folds is performed for each model. Now we have one aggregated group of out-of-sample predictions for each model, e.g. predictions for each example in the training dataset.

**Base-Models**: Models evaluated using *k*-fold cross-validation on the training dataset, with all out-of-fold predictions retained.

Next, a second higher-order model, called a meta-model, is trained on the predictions made by the other models. This meta-model may or may not also take the input data for each example as input when making predictions. The job of this model is to learn how to best combine and correct the predictions made by the other models using their out-of-fold predictions.

**Meta-Model**: Model that takes the out-of-fold predictions made by one or more models as input and learns how to best combine and correct the predictions.

For example, we may have a two-class classification predictive modeling problem and train a decision tree and a k-nearest neighbor model as the base models. Each model predicts a 0 or 1 for each example in the training dataset via out-of-fold predictions. These predictions, along with the input data, can then form a new input to the meta-model.

**Meta-Model Input**: Input portion of a given sample concatenated with the predictions made by each base model.

**Meta-Model Output**: Output portion of a given sample.

*Why use the out-of-fold predictions to train the meta-model?*

We could train each base model on the entire training dataset, then make a prediction for each example in the training dataset and use the predictions as input to the meta-model. The problem is the predictions will be optimistic because the samples were used in the training of each base model. This optimistic bias means that the predictions will be better than normal, and the meta-model will likely not learn what is required to combine and correct the predictions from the base models.

By using out-of-fold predictions from the base model to train the meta-model, the meta-model can see and harness the expected behavior of each base model when operating on unseen data, as will be the case when the ensemble is used in practice to make predictions on new data.
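The optimistic bias is easy to see with a decision tree, which can memorize its training data. The sketch below is an illustration under assumed settings (synthetic data and arbitrary random seeds): it compares the accuracy of predictions made on the same data used for training with the accuracy of out-of-fold predictions collected via scikit-learn's cross_val_predict().

```python
# sketch: in-sample predictions are optimistic compared to out-of-fold predictions
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# synthetic dataset (settings are illustrative)
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
# fit on all data, then predict the very same rows
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)
train_acc = accuracy_score(y, model.predict(X))
# out-of-fold predictions: each row is predicted by a model that never saw it
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
oof_yhat = cross_val_predict(DecisionTreeClassifier(random_state=1), X, y, cv=kfold)
oof_acc = accuracy_score(y, oof_yhat)
print('In-sample: %.3f, Out-of-fold: %.3f' % (train_acc, oof_acc))
```

A meta-model trained on the in-sample predictions would learn to trust the tree far more than its behavior on new data warrants.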

Finally, each of the base models is trained on the entire training dataset, and these final models and the meta-model can be used to make predictions on new data. The performance of this ensemble can be evaluated on a separate holdout test dataset not used during training.

This procedure can be summarized as follows:

- 1. For each base model:
    - a. Use k-fold cross-validation and collect out-of-fold predictions.
- 2. Train the meta-model on the out-of-fold predictions from all models.
- 3. Train each base model on the entire training dataset.

This procedure is called stacked generalization, or stacking for short. Because it is common to use a linear weighted sum as the meta-model, this procedure is sometimes called **blending**.

For more on the topic of stacking, see the tutorials:

- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python

We can make this procedure concrete with a worked example using the same dataset used in the previous section.

First, we will split the data into training and validation datasets. The training dataset will be used to fit the submodels and meta-model, and the validation dataset will be held back from training and used at the end to evaluate the meta-model and submodels.

    ...
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.33)

In this example, we will use k-fold cross-validation to fit a DecisionTreeClassifier and a KNeighborsClassifier model on each cross-validation fold, and use the fit models to make out-of-fold predictions.

The models will make predictions of probabilities instead of class labels in an attempt to provide more useful input features for the meta-model. This is a good practice.

We will also keep track of the input data (100 features) and output data (expected label) for the out-of-fold data.

    ...
    # collect out of sample predictions
    data_x, data_y, knn_yhat, cart_yhat = list(), list(), list(), list()
    kfold = KFold(n_splits=10, shuffle=True)
    for train_ix, test_ix in kfold.split(X):
        # get data
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        data_x.extend(test_X)
        data_y.extend(test_y)
        # fit and make predictions with cart
        model1 = DecisionTreeClassifier()
        model1.fit(train_X, train_y)
        yhat1 = model1.predict_proba(test_X)[:, 0]
        cart_yhat.extend(yhat1)
        # fit and make predictions with knn
        model2 = KNeighborsClassifier()
        model2.fit(train_X, train_y)
        yhat2 = model2.predict_proba(test_X)[:, 0]
        knn_yhat.extend(yhat2)

At the end of the run, we can then construct a dataset for a meta classifier comprised of 100 input features for the input data and the two columns of predicted probabilities from the kNN and decision tree models.

The *create_meta_dataset()* function below implements this, taking the out-of-fold data and predictions across the folds as input and constructs the input dataset for the meta-model.

    # create a meta dataset
    def create_meta_dataset(data_x, yhat1, yhat2):
        # convert to columns
        yhat1 = array(yhat1).reshape((len(yhat1), 1))
        yhat2 = array(yhat2).reshape((len(yhat2), 1))
        # stack as separate columns
        meta_X = hstack((data_x, yhat1, yhat2))
        return meta_X

We can then call this function to prepare data for the meta-model.

    ...
    # construct meta dataset (cart column first, then knn, matching stack_prediction())
    meta_X = create_meta_dataset(data_x, cart_yhat, knn_yhat)

We can then fit each of the submodels on the entire training dataset ready for making predictions on the validation dataset.

    ...
    # fit final submodels
    model1 = DecisionTreeClassifier()
    model1.fit(X, y)
    model2 = KNeighborsClassifier()
    model2.fit(X, y)

We can then fit the meta-model on the prepared dataset, in this case, a LogisticRegression model.

    ...
    # construct meta classifier
    meta_model = LogisticRegression(solver='liblinear')
    meta_model.fit(meta_X, data_y)

Finally, we can use the meta-model to make predictions on the holdout dataset.

This requires that the data first pass through the submodels, that their outputs be used to construct a dataset for the meta-model, and that the meta-model then make a prediction. We will wrap all of this up into a function named *stack_prediction()* that takes the models and the data for which the prediction will be made.

    # make predictions with stacked model
    def stack_prediction(model1, model2, meta_model, X):
        # make predictions
        yhat1 = model1.predict_proba(X)[:, 0]
        yhat2 = model2.predict_proba(X)[:, 0]
        # create input dataset
        meta_X = create_meta_dataset(X, yhat1, yhat2)
        # predict
        return meta_model.predict(meta_X)

We can then evaluate the submodels on the holdout dataset for reference, then use the meta-model to make a prediction on the holdout dataset and evaluate it.

We expect the meta-model to achieve performance on the holdout dataset that is as good as or better than that of any single submodel. If this is not the case, alternate submodels or meta-models could be used on the problem instead.

    ...
    # evaluate sub models on hold out dataset
    acc1 = accuracy_score(y_val, model1.predict(X_val))
    acc2 = accuracy_score(y_val, model2.predict(X_val))
    print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))
    # evaluate meta model on hold out dataset
    yhat = stack_prediction(model1, model2, meta_model, X_val)
    acc = accuracy_score(y_val, yhat)
    print('Meta Model Accuracy: %.3f' % (acc))

Tying this all together, the complete example is listed below.

    # example of a stacked model for binary classification
    from numpy import hstack
    from numpy import array
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import KFold
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # create a meta dataset
    def create_meta_dataset(data_x, yhat1, yhat2):
        # convert to columns
        yhat1 = array(yhat1).reshape((len(yhat1), 1))
        yhat2 = array(yhat2).reshape((len(yhat2), 1))
        # stack as separate columns
        meta_X = hstack((data_x, yhat1, yhat2))
        return meta_X

    # make predictions with stacked model
    def stack_prediction(model1, model2, meta_model, X):
        # make predictions
        yhat1 = model1.predict_proba(X)[:, 0]
        yhat2 = model2.predict_proba(X)[:, 0]
        # create input dataset
        meta_X = create_meta_dataset(X, yhat1, yhat2)
        # predict
        return meta_model.predict(meta_X)

    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
    # split
    X, X_val, y, y_val = train_test_split(X, y, test_size=0.33)
    # collect out of sample predictions
    data_x, data_y, knn_yhat, cart_yhat = list(), list(), list(), list()
    kfold = KFold(n_splits=10, shuffle=True)
    for train_ix, test_ix in kfold.split(X):
        # get data
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        data_x.extend(test_X)
        data_y.extend(test_y)
        # fit and make predictions with cart
        model1 = DecisionTreeClassifier()
        model1.fit(train_X, train_y)
        yhat1 = model1.predict_proba(test_X)[:, 0]
        cart_yhat.extend(yhat1)
        # fit and make predictions with knn
        model2 = KNeighborsClassifier()
        model2.fit(train_X, train_y)
        yhat2 = model2.predict_proba(test_X)[:, 0]
        knn_yhat.extend(yhat2)
    # construct meta dataset (cart column first, then knn, matching stack_prediction())
    meta_X = create_meta_dataset(data_x, cart_yhat, knn_yhat)
    # fit final submodels
    model1 = DecisionTreeClassifier()
    model1.fit(X, y)
    model2 = KNeighborsClassifier()
    model2.fit(X, y)
    # construct meta classifier
    meta_model = LogisticRegression(solver='liblinear')
    meta_model.fit(meta_X, data_y)
    # evaluate sub models on hold out dataset
    acc1 = accuracy_score(y_val, model1.predict(X_val))
    acc2 = accuracy_score(y_val, model2.predict(X_val))
    print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))
    # evaluate meta model on hold out dataset
    yhat = stack_prediction(model1, model2, meta_model, X_val)
    acc = accuracy_score(y_val, yhat)
    print('Meta Model Accuracy: %.3f' % (acc))

Running the example first reports the accuracy of the decision tree and kNN model, then the performance of the meta-model on the holdout dataset, not seen during training.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

In this case, we can see that the meta-model has out-performed both submodels.

    Model1 Accuracy: 0.670, Model2 Accuracy: 0.930
    Meta Model Accuracy: 0.955

It might be interesting to try an ablation study that re-runs the example with just model1, just model2, and neither model1 nor model2 as input to the meta-model, to confirm that the predictions from the submodels are actually adding value to the meta-model.
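For comparison, recent versions of scikit-learn (0.22 and later) include a StackingClassifier class that performs a similar procedure internally: it collects cross-validated predictions from the base models to fit the meta-model, then refits the base models on all of the training data. The sketch below is one possible configuration that resembles the manual example, not a drop-in equivalent. In particular, with an integer `cv` it uses stratified folds rather than shuffled KFold, it may keep a different probability column than the manual example, and the random seeds are arbitrary assumptions.

```python
# sketch: a similar stacking ensemble using scikit-learn's StackingClassifier
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score

# create the inputs and outputs, then hold back a validation set
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# base models whose cross-validated predictions feed the meta-model
estimators = [('cart', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())]
# passthrough=True also gives the meta-model the original input features
stack = StackingClassifier(estimators=estimators,
    final_estimator=LogisticRegression(solver='liblinear'),
    cv=10, stack_method='predict_proba', passthrough=True)
stack.fit(X_train, y_train)
# evaluate on the holdout dataset
yhat = stack.predict(X_val)
acc = accuracy_score(y_val, yhat)
print('Stacking Accuracy: %.3f' % (acc))
```

Working through the manual version first is still worthwhile, as it makes the role of the out-of-fold predictions explicit.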

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to k-fold Cross-Validation
- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python
- How to Create a Bagging Ensemble of Deep Learning Models in Keras
- Ensemble Learning Methods for Deep Learning Neural Networks

- Applied Predictive Modeling, 2013.

- sklearn.datasets.make_blobs API.
- sklearn.model_selection.KFold API.
- sklearn.neighbors.KNeighborsClassifier API.
- sklearn.tree.DecisionTreeClassifier API.
- sklearn.metrics.accuracy_score API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.model_selection.train_test_split API.

In this tutorial, you discovered out-of-fold predictions in machine learning.

Specifically, you learned:

- Out-of-fold predictions are a type of out-of-sample prediction made on data not used to train a model.
- Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
- Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Bayes Optimal Classifier appeared first on Machine Learning Mastery.

The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for a new example.

It is described using the Bayes Theorem that provides a principled way for calculating a conditional probability. It is also closely related to the Maximum a Posteriori: a probabilistic framework referred to as MAP that finds the most probable hypothesis for a training dataset.

In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead, simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.

In this post, you will discover the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

After reading this post, you will know:

- Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

Let’s get started.

This tutorial is divided into three parts; they are:

- Bayes Theorem
- Maximum a Posteriori (MAP)
- Bayes Optimal Classifier

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant of *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of B given A multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

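Or, written loosely with an equals sign (strictly, the right-hand side is the unnormalized posterior, so the two sides are proportional rather than equal):

- P(A | B) = P(B | A) * P(A)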

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.
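To see why the proportional form is good enough for optimization, here is a minimal sketch. The two outcomes and all of the probabilities are made-up numbers for illustration only; the point is that dividing by *P(B)* rescales every score by the same constant, so the winning outcome does not change.

```python
# Hypothetical prior P(A) and likelihood P(B | A) for two candidate outcomes
prior = {"A1": 0.7, "A2": 0.3}
likelihood = {"A1": 0.2, "A2": 0.9}

# Unnormalized posterior: P(B | A) * P(A)
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# The normalizing constant P(B) is the sum of the unnormalized scores
evidence = sum(unnormalized.values())
posterior = {a: unnormalized[a] / evidence for a in prior}

# The most probable outcome is the same with or without normalization
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)
```

Normalization is only needed when the actual probability value matters, not when we only want to know which outcome scores highest.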

For more on the topic of Bayes Theorem, see the post:

- A Gentle Introduction to Bayes Theorem for Machine Learning

Now that we are up to speed on Bayes Theorem, let’s also take a look at the Maximum a Posteriori framework.

Machine learning involves finding a model (hypothesis) that best explains the training data.

There are two probabilistic frameworks that underlie many different machine learning algorithms.

They are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

The objective of both of these frameworks in the context of machine learning is to locate the hypothesis that is most probable given the training dataset.

Specifically, they answer the question:

What is the most probable hypothesis given the training data?

Both approaches frame the problem of fitting a model as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

MLE is a frequentist approach, and MAP provides a Bayesian alternative.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.

Given the simplification of Bayes Theorem to a proportional quantity, we can use it to estimate the hypothesis and parameters (*theta*) that best explain our dataset (*X*), stated as:

- P(theta | X) is proportional to P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem for estimating the mode of the posterior probability distribution (i.e. its most probable value).

As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

- maximize P(X | theta) * P(theta)
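As a rough illustration of the difference between the two frameworks, the sketch below estimates the bias of a coin by a simple grid search. The data (7 heads, 3 tails) and the Beta(2, 2) prior are assumptions chosen for the example, not values from this post. MLE maximizes the likelihood alone, while MAP maximizes the likelihood multiplied by the prior (a sum in log space).

```python
import numpy as np

# Hypothetical observed data X: 7 heads and 3 tails
heads, tails = 7, 3
# Hypothetical Beta(a, b) prior on theta, the probability of heads
a, b = 2.0, 2.0

# Candidate values of theta to search over
theta = np.linspace(0.001, 0.999, 999)

# log P(X | theta), dropping constants that do not depend on theta
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
# log P(theta), again up to an additive constant
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

# MLE maximizes the likelihood; MAP maximizes likelihood times prior
theta_mle = theta[np.argmax(log_likelihood)]
theta_map = theta[np.argmax(log_likelihood + log_prior)]
print('MLE: %.3f, MAP: %.3f' % (theta_mle, theta_map))
```

With this prior, the MAP estimate is pulled from the observed frequency of heads toward 0.5, which is where the prior places most of its mass.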

For more on the topic of Maximum a Posteriori, see the post:

- A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning

Now that we are familiar with the MAP framework, we can take a closer look at the related concept of the Bayes optimal classifier.

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example, given the training dataset.

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

**Bayes Classifier**: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question:

What is the most probable classification of the new instance given the training data?

This is different from the MAP framework that seeks the most probable hypothesis (model). Instead, we are interested in making a specific prediction.

In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

— Page 175, Machine Learning, 1997.

The equation below demonstrates how to calculate the conditional probability of a candidate classification (*vj*) for a new instance given the training data (*D*) and a space of hypotheses (*H*).

- P(vj | D) = sum {hi in H} P(vj | hi) * P(hi | D)

Where *vj* is a candidate classification for the new instance, *H* is the set of hypotheses for classifying the instance, *hi* is a given hypothesis, *P(vj | hi)* is the probability of classification *vj* given hypothesis *hi*, and *P(hi | D)* is the posterior probability of the hypothesis *hi* given the data *D*.

Selecting the classification *vj*, from the set of possible classifications *V*, with the maximum probability is an example of a Bayes optimal classification.

- argmax {vj in V} sum {hi in H} P(vj | hi) * P(hi | D)
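A small worked instance of this calculation may help. The sketch below assumes a hypothesis space of only three hypotheses with illustrative posterior probabilities, each of which predicts a single label with certainty. It shows that the prediction of the single most probable (MAP) hypothesis can differ from the Bayes optimal prediction.

```python
# Illustrative posterior probability of each hypothesis, P(hi | D)
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(vj | hi): here each hypothesis predicts one label with certainty
p_label_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}
labels = ["+", "-"]

# P(vj | D) = sum over hi of P(vj | hi) * P(hi | D)
p_label = {
    v: sum(p_label_given_h[h][v] * posterior_h[h] for h in posterior_h)
    for v in labels
}

# Prediction made by the single most probable hypothesis (MAP)
map_hypothesis = max(posterior_h, key=posterior_h.get)
map_prediction = max(labels, key=lambda v: p_label_given_h[map_hypothesis][v])

# Bayes optimal prediction: the label with the largest combined probability
bayes_optimal_prediction = max(labels, key=p_label.get)
print(map_prediction, bayes_optimal_prediction, p_label)
```

Here the MAP hypothesis *h1* predicts the positive label, but the other two hypotheses together carry more posterior probability, so the Bayes optimal classification is the negative label.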

Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique, on average.

Any system that classifies new instances according to [the equation] is called a Bayes optimal classifier, or Bayes optimal learner. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

— Page 175, Machine Learning, 1997.

We have to let that sink in.

It is a big deal.

It means that any other algorithm that operates on the same data, the same set of hypotheses, and same prior probabilities cannot outperform this approach, on average. Hence the name “*optimal classifier*.”

Although the classifier makes optimal predictions, it is not perfect given the uncertainty in the training data and incomplete coverage of the problem domain and hypothesis space. As such, the model will make errors. These errors are often referred to as Bayes errors.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. […] The Bayes error rate is analogous to the irreducible error …

— Page 38, An Introduction to Statistical Learning with Applications in R, 2017.

Because the Bayes classifier is optimal, the Bayes error is the minimum possible error that can be made.

**Bayes Error**: The minimum possible error that can be made when making predictions.

Further, the model is often described in terms of classification, e.g. the Bayes Classifier. Nevertheless, the principle applies just as well to regression: that is, predictive modeling problems where a numerical value is predicted instead of a class label.

It is a theoretical model, but it is held up as an ideal that we may wish to pursue.

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

Because of the computational cost of this optimal strategy, we instead can work with direct simplifications of the approach.

Two of the most commonly used simplifications are to sample hypotheses according to their posterior probability, as in the Gibbs algorithm, or to adopt the simplifying assumptions of the Naive Bayes classifier.

- **Gibbs Algorithm**. Randomly sample hypotheses based on their posterior probability.
- **Naive Bayes**. Assume that variables in the input data are conditionally independent.
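The Gibbs algorithm can be sketched in a few lines. The hypothesis space, posterior probabilities, and predictions below are illustrative assumptions: rather than summing over every hypothesis, a single hypothesis is drawn at random in proportion to its posterior probability and used to classify the new instance.

```python
import random

# Illustrative posterior P(hi | D) and the label each hypothesis predicts
posterior_h = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
prediction_of = {"h1": "+", "h2": "-", "h3": "-"}

# Seeded generator so the sketch is repeatable
rng = random.Random(1)

def gibbs_predict():
    # Draw one hypothesis with probability equal to its posterior
    names = list(posterior_h)
    weights = list(posterior_h.values())
    h = rng.choices(names, weights=weights, k=1)[0]
    # Classify the new instance with the sampled hypothesis only
    return prediction_of[h]

# Repeat many times to see how often each label is predicted
draws = [gibbs_predict() for _ in range(10000)]
share_negative = draws.count("-") / len(draws)
print('Share of "-" predictions: %.3f' % share_negative)
```

Each individual prediction costs one sample instead of a sum over the whole hypothesis space, at the price of sometimes disagreeing with the Bayes optimal classification.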

For more on the topic of Naive Bayes, see the post:

- How to Develop a Naive Bayes Classifier from Scratch in Python

Nevertheless, many nonlinear machine learning algorithms are able to make predictions that are close approximations of the Bayes classifier in practice.

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning
- A Gentle Introduction to Bayes Theorem for Machine Learning
- How to Develop a Naive Bayes Classifier from Scratch in Python

- Section 6.7 Bayes Optimal Classifier, Machine Learning, 1997.
- Section 2.4.2 Bayes error and noise, Foundations of Machine Learning, 2nd edition, 2018.
- Section 2.2.3 The Classification Setting, An Introduction to Statistical Learning with Applications in R, 2017.
- Information Theory, Inference and Learning Algorithms, 2003.

- The Multilayer Perceptron As An Approximation To A Bayes Optimal Discriminant Function, 1990.
- Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains, 2010.
- Restricted bayes optimal classifiers, 2000.
- Bayes Classifier And Bayes Error, 2013.

In this post, you discovered the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

Specifically, you learned:

- Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic framework that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Model Selection for Machine Learning appeared first on Machine Learning Mastery.

Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward to fit many different machine learning models on a given predictive modeling dataset.

The challenge of applied machine learning, therefore, becomes how to choose among a range of different models that you can use for your problem.

Naively, you might believe that model performance is sufficient, but should you consider other concerns, such as how long the model takes to train or how easy it is to explain to project stakeholders? These concerns become more pressing if a chosen model must be used operationally for months or years.

Also, what are you choosing exactly: just the algorithm used to fit the model or the entire data preparation and model fitting pipeline?

In this post, you will discover the challenge of model selection for machine learning.

After reading this post, you will know:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Model Selection
- Considerations for Model Selection
- Model Selection Techniques

Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset.

Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM).

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one?

— Page 22, Machine Learning: A Probabilistic Perspective, 2012.

For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We do not know beforehand which model will perform best on this problem, as it is unknowable. Therefore, we fit and evaluate a suite of different models on the problem.

**Model selection** is the process of choosing one of the models as the final model that addresses the problem.

Model selection is different from **model assessment**.

For example, we evaluate or assess candidate models in order to choose the best one, and this is model selection. Whereas once a model is chosen, it can be evaluated in order to communicate how well it is expected to perform in general; this is model assessment.

The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.

Fitting models is relatively straightforward, although selecting among them is the true challenge of applied machine learning.

Firstly, we need to get over the idea of a “*best*” model.

All models have some predictive error, given the statistical noise in the data, the incompleteness of the data sample, and the limitations of each different model type. Therefore, the notion of a perfect or best model is not useful. Instead, we must seek a model that is “*good enough*.”

**What do we care about when choosing a final model?**

The project stakeholders may have specific requirements, such as maintainability and limited model complexity. As such, a model that has lower skill but is simpler and easier to understand may be preferred.

Alternately, if model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved.

Therefore, a “*good enough*” model may refer to many things and is specific to your project, such as:

- A model that meets the requirements and constraints of project stakeholders.
- A model that is sufficiently skillful given the time and resources available.
- A model that is skillful as compared to naive models.
- A model that is skillful relative to other tested models.
- A model that is skillful relative to the state-of-the-art.

Next, we must consider what is being selected.

For example, we are not selecting a fit model, as all models will be discarded. This is because once we choose a model, we will fit a new final model on all available data and start using it to make predictions.

Therefore, are we choosing among algorithms used to fit the models on the training dataset?

Some algorithms require specialized data preparation in order to best expose the structure of the problem to the learning algorithm. Therefore, we must go one step further and consider **model selection as the process of selecting among model development pipelines**.

Each pipeline may take in the same raw training dataset and outputs a model that can be evaluated in the same manner but may require different or overlapping computational steps, such as:

- Data filtering.
- Data transformation.
- Feature selection.
- Feature engineering.
- And more…

The closer you look at the challenge of model selection, the more nuance you will discover.

Now that we are familiar with some considerations involved in model selection, let’s review some common methods for selecting a model.

The best approach to model selection requires “*sufficient*” data, which may be nearly infinite depending on the complexity of the problem.

In this ideal situation, we would split the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and select them on the validation set, and report the performance of the final model on the test set.
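The three-way split can be sketched as follows. The synthetic dataset, the 60/20/20 split, and the polynomial candidate models are all assumptions made for illustration. Candidates are fit on the training set, one is chosen using the validation set, and only the chosen model is scored on the test set.

```python
import numpy as np

# Hypothetical dataset: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=300)

# Shuffle the row indices, then split 60/20/20 into train, validation, test
idx = rng.permutation(len(x))
train, val, test = idx[:180], idx[180:240], idx[240:]

def mse(coefs, rows):
    # Mean squared error of a polynomial model on the given rows
    return float(np.mean((np.polyval(coefs, x[rows]) - y[rows]) ** 2))

# Fit each candidate model (polynomial degree) on the training set only
candidates = {d: np.polyfit(x[train], y[train], d) for d in (1, 3, 5, 7)}

# Model selection: choose the candidate with the lowest validation error
val_error = {d: mse(c, val) for d, c in candidates.items()}
best_degree = min(val_error, key=val_error.get)

# Model assessment: report the chosen model's error on the untouched test set
test_error = mse(candidates[best_degree], test)
print('Selected degree %d, test MSE %.3f' % (best_degree, test_error))
```

Keeping the test set out of the selection step is what allows its error to be read as an estimate of performance on new data.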

If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.

This is impractical on most predictive modeling problems given that we rarely have sufficient data, or are able to even judge what would be sufficient.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance.

– Page 32, Pattern Recognition and Machine Learning, 2006.

Instead, there are two main classes of techniques to approximate the ideal case of model selection; they are:

- **Probabilistic Measures**: Choose a model via in-sample error and complexity.
- **Resampling Methods**: Choose a model via estimated out-of-sample error.

Let’s take a closer look at each in turn.

Probabilistic measures involve analytically scoring a candidate model using both its performance on the training dataset and the complexity of the model.

It is known that training error is optimistically biased, and therefore is not a good basis for choosing a model. The performance can be penalized based on how optimistic the training error is believed to be. This is typically achieved using algorithm-specific methods, often linear, that penalize the score based on the complexity of the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

– Page 33, Pattern Recognition and Machine Learning, 2006.

A model with fewer parameters is less complex and is therefore preferred, because it is likely to generalize better on average.

Four commonly used probabilistic model selection measures include:

- Akaike Information Criterion (AIC).
- Bayesian Information Criterion (BIC).
- Minimum Description Length (MDL).
- Structural Risk Minimization (SRM).

Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. the in-sample bias) is known and tractable.
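As a minimal sketch of how such a score is computed, the example below calculates AIC and BIC for polynomial regression models fit by least squares. The synthetic data and the candidate degrees are assumptions for illustration, and the log-likelihood assumes Gaussian noise with the variance estimated as the mean squared residual. The standard definitions are used: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where *k* is the number of fitted parameters and *n* is the number of samples.

```python
import numpy as np

# Hypothetical data generated from a quadratic function plus Gaussian noise
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=n)

def aic_bic(degree):
    # Fit a polynomial by least squares and get the residual sum of squares
    coefs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coefs, x) - y) ** 2))
    # Maximized Gaussian log-likelihood with variance estimated as rss / n
    sigma2 = rss / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    # Parameters: degree + 1 coefficients plus the noise variance
    k = degree + 2
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Score each candidate on the training data only; lower is better
scores = {d: aic_bic(d) for d in range(1, 7)}
best_by_aic = min(scores, key=lambda d: scores[d][0])
best_by_bic = min(scores, key=lambda d: scores[d][1])
print('Best degree by AIC: %d, by BIC: %d' % (best_by_aic, best_by_bic))
```

No data is held back here: the complexity penalty stands in for the optimism of the training error, and BIC applies the heavier penalty whenever ln(n) is greater than 2.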

Resampling methods seek to estimate the performance of a model (or more precisely, the model development process) on out-of-sample data.

This is achieved by splitting the training dataset into train and test subsets, fitting a model on the train subset, and evaluating it on the test subset. This process may then be repeated multiple times and the mean performance across all trials is reported.

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.

Three common resampling model selection methods include:

- Random train/test splits.
- Cross-Validation (k-fold, LOOCV, etc.).
- Bootstrap.

Most of the time, probabilistic measures (described in the previous section) are not available, and so resampling methods are used.

By far the most popular is the cross-validation family of methods that includes many subtypes.

Probably the simplest and most widely used method for estimating prediction error is cross-validation.

— Page 241, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.

An example is the widely used k-fold cross-validation, which splits the training dataset into k folds so that each example appears in a test set exactly once.

Another is leave-one-out cross-validation (LOOCV), where the test set consists of a single sample and each sample is given an opportunity to be the test set, requiring N models (where N is the number of samples in the training set) to be constructed and evaluated.
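The mechanics of both procedures can be sketched with index arithmetic alone. The helper function below is a hypothetical name written for this illustration (in practice a library class such as scikit-learn's KFold would normally be used); it shows that LOOCV is simply k-fold cross-validation with k equal to the number of samples.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    # Shuffle the row indices and cut them into k roughly equal folds
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        # Fold i is the test set; all remaining folds form the training set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

n = 10
# 5-fold cross-validation: five train/test splits
splits = list(k_fold_indices(n, k=5))
# Leave-one-out: one split per sample, each with a single test row
loocv_splits = list(k_fold_indices(n, k=n))

for train_idx, test_idx in splits:
    print('train: %s, test: %s' % (sorted(train_idx), sorted(test_idx)))
```

A model would be fit on each set of training indices and evaluated on the matching test indices, and the scores averaged across the splits.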

This section provides more resources on the topic if you are looking to go deeper.

- Probabilistic Model Selection with AIC, BIC, and MDL
- A Gentle Introduction to Statistical Sampling and Resampling
- A Gentle Introduction to Monte Carlo Sampling for Probability
- A Gentle Introduction to k-fold Cross-Validation
- What is the Difference Between Test and Validation Datasets?

- Applied Predictive Modeling, 2013.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
- An Introduction to Statistical Learning: with Applications in R, 2017.
- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.

In this post, you discovered the challenge of model selection for machine learning.

Specifically, you learned:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Use an Empirical Distribution Function in Python appeared first on Machine Learning Mastery.

An empirical distribution function provides a way to model and sample cumulative probabilities for a data sample that does not fit a standard probability distribution.

As such, it is sometimes called the **empirical cumulative distribution function**, or ECDF for short.

In this tutorial, you will discover the empirical probability distribution function.

After completing this tutorial, you will know:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

Let’s get started.

This tutorial is divided into three parts; they are:

- Empirical Distribution Function
- Bimodal Data Distribution
- Sampling Empirical Distribution

Typically, the distribution of observations for a data sample fits a well-known probability distribution.

For example, the heights of humans approximately fit the normal (Gaussian) probability distribution.

This is not always the case. Sometimes the observations in a collected data sample do not fit any known probability distribution and cannot be easily forced into an existing distribution by data transforms or parameterization of the distribution function.

Instead, an empirical probability distribution must be used.

There are two main types of probability distribution functions we may need to sample; they are:

- Probability Density Function (PDF).
- Cumulative Distribution Function (CDF).

The PDF returns the probability density at a given value, which describes the relative likelihood of observing values near it. For discrete data, the equivalent function is the Probability Mass Function (PMF), which returns the probability of observing a value. The CDF returns the expected probability for observing a value less than or equal to a given value.

An empirical probability density function can be fit and used for a data sampling using a nonparametric density estimation method, such as Kernel Density Estimation (KDE).

An empirical cumulative distribution function is called the Empirical Distribution Function, or EDF for short. It is also referred to as the Empirical Cumulative Distribution Function, or ECDF.

The EDF is calculated by ordering all of the unique observations in the data sample and calculating the cumulative probability for each as the number of observations less than or equal to a given observation divided by the total number of observations.

As follows:

- EDF(x) = (number of observations <= x) / n

Like other cumulative distribution functions, the sum of probabilities will proceed from 0.0 to 1.0 as the observations in the domain are enumerated from smallest to largest.
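The calculation is simple enough to write by hand. The sketch below uses a hypothetical helper named *edf* and a small made-up sample to count the observations less than or equal to a value and divide by the sample size.

```python
import numpy as np

def edf(sample, x):
    # Fraction of observations in the sample that are <= x
    sample = np.asarray(sample)
    return np.count_nonzero(sample <= x) / sample.size

# Small made-up data sample with n = 8 observations
data = [3, 1, 4, 1, 5, 9, 2, 6]

# Evaluate the EDF below the minimum, at two observations, and at the maximum
for x in (0, 1, 4, 9):
    print('EDF(%d) = %.3f' % (x, edf(data, x)))
```

The value is 0.0 below the smallest observation and reaches 1.0 at the largest, stepping up by 1/n for each observation (or by more where values repeat).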

To make the empirical distribution function concrete, let’s look at an example with a dataset that clearly does not fit a known probability distribution.

We can define a dataset that clearly does not match a standard probability distribution function.

A common example is when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution).

We can construct a bimodal distribution by combining samples from two different normal distributions. Specifically, 300 examples with a mean of 20 and a standard deviation of five (the smaller peak), and 700 examples with a mean of 40 and a standard deviation of five (the larger peak).

The means were chosen close together to ensure the distributions overlap in the combined sample.

The complete example of creating this sample with a bimodal probability distribution and plotting the histogram is listed below.

```python
# example of a bimodal data sample
from matplotlib import pyplot
from numpy.random import normal
from numpy import hstack
# generate a sample
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = hstack((sample1, sample2))
# plot the histogram
pyplot.hist(sample, bins=50)
pyplot.show()
```

Running the example creates the data sample and plots the histogram.

Note that your results will differ given the random nature of the data sample. Try running the example a few times.

We have fewer samples with a mean of 20 than samples with a mean of 40, which we can see reflected in the histogram with a larger density of samples around 40 than around 20.

Data with this distribution does not nicely fit into a common probability distribution by design.

Below is a plot of the probability density function (PDF) of this data sample.

It is a good case for using an empirical distribution function.

An empirical distribution function can be fit for a data sample in Python.

The statsmodels Python library provides the ECDF class for fitting an empirical cumulative distribution function and calculating the cumulative probabilities for specific observations from the domain.

The distribution is fit by calling ECDF() and passing in the raw data sample.

```python
...
# fit a cdf
ecdf = ECDF(sample)
```

Once fit, the function can be called to calculate the cumulative probability for a given observation.

```python
...
# get cumulative probability for values
print('P(x<20): %.3f' % ecdf(20))
print('P(x<40): %.3f' % ecdf(40))
print('P(x<60): %.3f' % ecdf(60))
```

The class also provides an ordered list of unique observations in the data (the *.x* attribute) and their associated probabilities (*.y* attribute). We can access these attributes and plot the CDF function directly.

```python
...
# plot the cdf
pyplot.plot(ecdf.x, ecdf.y)
pyplot.show()
```

Tying this together, the complete example of fitting an empirical distribution function for the bimodal data sample is below.

```python
# fit an empirical cdf to a bimodal dataset
from matplotlib import pyplot
from numpy.random import normal
from numpy import hstack
from statsmodels.distributions.empirical_distribution import ECDF
# generate a sample
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = hstack((sample1, sample2))
# fit a cdf
ecdf = ECDF(sample)
# get cumulative probability for values
print('P(x<20): %.3f' % ecdf(20))
print('P(x<40): %.3f' % ecdf(40))
print('P(x<60): %.3f' % ecdf(60))
# plot the cdf
pyplot.plot(ecdf.x, ecdf.y)
pyplot.show()
```

Running the example fits the empirical CDF to the data sample, then prints the cumulative probability for observing three values.

Your specific results will vary given the stochastic nature of the data sample. Try running the example a few times.

```
P(x<20): 0.149
P(x<40): 0.654
P(x<60): 1.000
```

Then the cumulative probability for the entire domain is calculated and shown as a line plot.

Here, we can see the familiar S-shaped curve seen for most cumulative distribution functions, in this case with bumps around the mean of both peaks of the bimodal distribution.
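New values can also be drawn from the empirical distribution. One way to do this, sketched below with NumPy only, is inverse transform sampling: draw uniform random numbers between 0 and 1 and map each one back to the smallest observation whose cumulative probability reaches it. The function name and the seeded generator are assumptions made for this illustration and are not part of the statsmodels API.

```python
import numpy as np

# Recreate a bimodal sample like the one used above (seeded for repeatability)
rng = np.random.default_rng(0)
sample = np.hstack((rng.normal(20, 5, 300), rng.normal(40, 5, 700)))

def sample_from_edf(data, size, rng):
    # Sort the observations so that position maps to cumulative probability
    ordered = np.sort(data)
    n = ordered.size
    # Draw uniform cumulative probabilities
    u = rng.uniform(size=size)
    # Smallest observation whose EDF value is >= u
    idx = np.clip(np.ceil(u * n).astype(int) - 1, 0, n - 1)
    return ordered[idx]

# Draw new values that follow the empirical distribution
draws = sample_from_edf(sample, 5000, rng)
print('sample mean: %.2f, mean of draws: %.2f' % (sample.mean(), draws.mean()))
```

Every drawn value is one of the original observations, so this is equivalent to resampling the data with replacement; a smoothed alternative would require a density estimate such as KDE.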

This section provides more resources on the topic if you are looking to go deeper.

- Section 2.3.4 The empirical distribution, Machine Learning: A Probabilistic Perspective, 2012.
- Section 3.9.5 The Dirac Distribution and Empirical Distribution, Deep Learning, 2016.

- Empirical distribution function, Wikipedia.
- Cumulative distribution function, Wikipedia.
- Probability Density Function, Wikipedia.
- Kernel density estimation, Wikipedia.

In this tutorial, you discovered the empirical probability distribution function.

Specifically, you learned:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Choose a Feature Selection Method For Machine Learning appeared first on Machine Learning Mastery.

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Filter-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

In this post, you will discover how to choose statistical measures for filter-based feature selection with numerical and categorical data.

After reading this post, you will know:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Let’s get started.

**Update Nov/2019**: Added some worked examples for classification and regression.

This tutorial is divided into 4 parts; they are:

- Feature Selection Methods
- Statistics for Filter Feature Selection Methods
  - Numerical Input, Numerical Output
  - Numerical Input, Categorical Output
  - Categorical Input, Numerical Output
  - Categorical Input, Categorical Output
- Tips and Tricks for Feature Selection
  - Correlation Statistics
  - Selection Method
  - Transform Variables
  - What Is the Best Method?
- Worked Examples
  - Regression Feature Selection
  - Classification Feature Selection

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

Some predictive modeling problems have a large number of variables that can slow the development and training of models and require a large amount of system memory. Additionally, the performance of some models can degrade when including input variables that are not relevant to the target variable.

There are two main types of feature selection algorithms: wrapper methods and filter methods.

- Wrapper Feature Selection Methods.
- Filter Feature Selection Methods.

**Wrapper feature selection methods** create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

— Page 490, Applied Predictive Modeling, 2013.
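As a rough illustration of the wrapper idea, the sketch below applies RFE to a synthetic dataset. The dataset sizes, the choice of logistic regression as the wrapped model, and the number of features to keep are all assumptions made for the example, not recommendations.

```python
# minimal sketch of a wrapper method (RFE) on synthetic data
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# synthetic dataset: 10 inputs, 3 of them informative (arbitrary sizes)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)
# RFE repeatedly fits the wrapped model and drops the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
X_selected = rfe.fit_transform(X, y)
# shape of the reduced dataset and a boolean mask of the kept columns
print(X_selected.shape)
print(rfe.support_)
```

Because a model is refit at each elimination step, the cost grows with the number of features, which is the computational expense mentioned above.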

**Filter feature selection methods** use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.

— Page 490, Applied Predictive Modeling, 2013.

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection. As such, the choice of statistical measures is highly dependent upon the variable data types.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

- **Numerical Variables**
  - Integer Variables.
  - Floating Point Variables.
- **Categorical Variables**
  - Boolean Variables (dichotomous).
  - Ordinal Variables.
  - Nominal Variables.

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method.

In the next section, we will review some of the statistical measures that may be used for filter-based feature selection with different input and output variable data types.

In this section, we will consider two broad categories of variable types, numerical and categorical, and the two main groups of variables, input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

- **Numerical Output**: Regression predictive modeling problem.
- **Categorical Output**: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

— Page 499, Applied Predictive Modeling, 2013.

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

With numerical input variables and a numerical output variable, this is a regression predictive modeling problem.

The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear)
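The difference between the two coefficients can be seen on a small made-up example. The sketch below assumes SciPy's pearsonr() and spearmanr() and uses an exponential relationship chosen purely for illustration; Spearman's coefficient works on ranks, so it captures any monotonic relationship, linear or not.

```python
# compare Pearson's and Spearman's coefficients on a monotonic, nonlinear relationship
import numpy as np
from scipy.stats import pearsonr, spearmanr

# synthetic data: y increases with x, but not along a straight line
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = np.exp(x)
# each function returns the statistic and a p-value
pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)
print('Pearson: %.3f' % pearson_r)
print('Spearman: %.3f' % spearman_r)
```

The ranks of x and y agree perfectly here, so Spearman's coefficient reaches its maximum while Pearson's falls short of it.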

With numerical input variables and a categorical output variable, this is a classification predictive modeling problem.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation based, although in this case, they must take the categorical target into account.

- ANOVA F-statistic (linear).
- Kendall’s rank coefficient (nonlinear).

Kendall does assume that the categorical variable is ordinal.
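To make the ANOVA case concrete, the sketch below scores two hand-built numerical inputs against a binary target with scikit-learn's f_classif(). The data is synthetic and the class separation of three standard deviations is an arbitrary assumption.

```python
# score numerical inputs against a categorical target with the ANOVA F-statistic
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1)
# binary target: 50 examples of each class
y = np.repeat([0, 1], 50)
# first input shifts its mean with the class, second input is pure noise
informative = rng.normal(loc=y * 3.0, scale=1.0)
noise = rng.normal(size=100)
X = np.column_stack([informative, noise])
# f_classif returns one F-statistic and one p-value per input column
f_scores, p_values = f_classif(X, y)
print(f_scores)
print(p_values)
```

A larger F-statistic indicates that the class means of that input differ more relative to the spread within each class.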

With categorical input variables and a numerical output variable, this is a regression predictive modeling problem.

This is an unusual example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “*Numerical Input, Categorical Output*” methods (described above), but in reverse.

With categorical input variables and a categorical output variable, this is a classification predictive modeling problem.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

- Chi-Squared test (contingency tables).
- Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, i.e. it is agnostic to the data types.
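Both measures can be tried on integer-encoded categorical inputs. The sketch below builds two synthetic inputs, one that mostly copies the class label and one that is unrelated to it; the 10% flip rate and the sample size are assumptions for illustration. Note that scikit-learn's chi2() expects non-negative feature values.

```python
# score integer-encoded categorical inputs with chi-squared and mutual information
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
# related input: copies the class label, flipped 10% of the time
flip = rng.random(500) < 0.1
related = np.where(flip, 1 - y, y)
# unrelated input: three categories drawn independently of the label
unrelated = rng.integers(0, 3, size=500)
X = np.column_stack([related, unrelated])
# chi2 returns statistics and p-values; mutual_info_classif returns scores only
chi2_scores, chi2_pvalues = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True)
print(chi2_scores)
print(mi_scores)
```

Setting discrete_features=True tells the mutual information estimator to treat every column as categorical rather than continuous.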

This section provides some additional considerations when using filter-based feature selection.

The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:

- Pearson’s Correlation Coefficient: f_regression()
- ANOVA: f_classif()
- Chi-Squared: chi2()
- Mutual Information: mutual_info_classif() and mutual_info_regression()

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

- Select the top k variables: SelectKBest
- Select the top percentile variables: SelectPercentile

I often use *SelectKBest* myself.
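For comparison, a minimal sketch of SelectPercentile on a synthetic regression dataset is given below. The 10% threshold and the dataset sizes are arbitrary assumptions.

```python
# keep the top-scoring 10 percent of input variables
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import f_regression

# synthetic dataset with 100 inputs, 10 of them informative
X, y = make_regression(n_samples=100, n_features=100, n_informative=10, random_state=1)
# score each input with f_regression and keep the top 10 percent
fs = SelectPercentile(score_func=f_regression, percentile=10)
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

A percentile may be more convenient than a fixed k when the same pipeline is reused on datasets with different numbers of columns.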

Consider transforming the variables in order to access different statistical methods.

For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.

You can also make a numerical variable discrete (e.g. by grouping values into bins) and then try categorical-based measures.

Some statistical measures assume properties of the variables, such as Pearson’s, which assumes a Gaussian probability distribution for the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of the expectations, and compare the results.
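One way to try the discretization idea is sketched below, assuming scikit-learn's KBinsDiscretizer with five equal-width bins followed by chi-squared scoring. The number of bins, the binning strategy, and the value of k are all assumptions to experiment with.

```python
# discretize numerical inputs, then apply a categorical-style measure
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# synthetic numerical inputs with a categorical target
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)
# map each numerical input to an integer bin index (0 to 4)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# bin indices are non-negative, so chi2 can be applied
fs = SelectKBest(score_func=chi2, k=3)
X_selected = fs.fit_transform(X_binned, y)
print(X_selected.shape)
```

Whether the binned scores are more useful than scores on the raw values is something to check empirically on your own data.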

There is no best feature selection method.

Just like there is no best set of input variables or best machine learning algorithm. At least not universally.

Instead, you must discover what works best for your specific problem using careful systematic experimentation.

Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.
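This kind of experiment can be automated. The sketch below is one possible setup, assuming a scikit-learn Pipeline searched with GridSearchCV; the candidate values of k, the two scoring functions, the model, and the synthetic dataset are illustrative choices only.

```python
# grid search over the scoring function and the number of selected features
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# synthetic classification dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=1)
# feature selection sits inside the pipeline so it is refit on each training fold
pipeline = Pipeline([('select', SelectKBest()),
                     ('model', LogisticRegression(max_iter=1000))])
# candidate statistical measures and numbers of features to keep
grid = {'select__score_func': [f_classif, mutual_info_classif],
        'select__k': [2, 4, 8, 20]}
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=5)
search.fit(X, y)
print(search.best_params_)
print('Best accuracy: %.3f' % search.best_score_)
```

Keeping the selector inside the pipeline means the features are chosen using only the training portion of each fold, which avoids leaking information from the held-out data.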

It can be helpful to have some worked examples that you can copy-and-paste and adapt for your own project.

This section provides worked examples of feature selection cases that you can use as a starting point.


This section demonstrates feature selection for a regression problem that has numerical inputs and numerical outputs.

A test regression problem is prepared using the make_regression() function.

Feature selection is performed using Pearson’s Correlation Coefficient via the f_regression() function.

```python
# pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
# generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 10)


This section demonstrates feature selection for a classification problem that has numerical inputs and categorical outputs.

A test classification problem is prepared using the make_classification() function.

Feature selection is performed using ANOVA F measure via the f_classif() function.

```python
# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# define feature selection
fs = SelectKBest(score_func=f_classif, k=2)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)
```

Running the example first creates the classification dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 2)


For examples of feature selection with categorical inputs and categorical outputs, see the tutorial:

This section provides more resources on the topic if you are looking to go deeper.

- How to Calculate Nonparametric Rank Correlation in Python
- How to Calculate Correlation Between Variables in Python
- Feature Selection For Machine Learning in Python
- An Introduction to Feature Selection

- Feature selection, scikit-learn API.
- What are the feature selection options for categorical data? Quora.

In this post, you discovered how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Specifically, you learned:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable, and these scores are used to filter out all but the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Perform Feature Selection with Categorical Data appeared first on Machine Learning Mastery.

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

Feature selection is often straightforward when working with real-valued data, such as using the Pearson’s correlation coefficient, but can be challenging when working with categorical data.

The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic.

In this tutorial, you will discover how to perform feature selection with categorical input data.

After completing this tutorial, you will know:

- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Breast Cancer Categorical Dataset
- Categorical Feature Selection
- Modeling With Selected Features

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied as a machine learning dataset since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A naive model can achieve an accuracy of 70% on this dataset. A good score is about 76% +/- 3%. We will aim for this region, but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection schemes.

You can download the dataset and save the file as “*breast-cancer.csv*” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

```
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...
```

We can load this dataset into memory using the Pandas library.

```python
...
# load the dataset as a pandas DataFrame
data = read_csv(filename, header=None)
# retrieve numpy array
dataset = data.values
```

Once loaded, we can split the columns into input (*X*) and output for modeling.

```python
...
# split into input (X) and output (y) variables
X = dataset[:, :-1]
y = dataset[:,-1]
```

Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).

```python
...
# format all fields as string
X = X.astype(str)
```

We can tie all of this together into a helpful function that we can reuse later.

```python
# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y
```

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

```python
...
# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
```

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

```python
# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)
```

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.

```
Train (191, 9) (191,)
Test (95, 9) (95,)
```

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

**Note**: I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
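As a starting point for that exercise, the sketch below passes an explicit category order to the OrdinalEncoder. The two columns and their values are hypothetical stand-ins, not the actual columns of the breast cancer dataset, and the orderings are assumptions.

```python
# ordinal encoding with an explicitly specified category order
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# two hypothetical ordinal columns: a size band and a grade
X = np.array([['10-14', '3'],
              ['0-4', '1'],
              ['5-9', '2'],
              ['10-14', '1']])
# assumed natural order for each column; a plain string sort would put '10-14' before '5-9'
size_order = ['0-4', '5-9', '10-14']
grade_order = ['1', '2', '3']
oe = OrdinalEncoder(categories=[size_order, grade_order])
X_enc = oe.fit_transform(X)
print(X_enc)
```

Each value is replaced by its position in the supplied list, so the size band '10-14' is encoded as 2 rather than the 1 it would receive under the default sorted order.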

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below named *prepare_inputs()* takes the input data for the train and test sets and encodes it using an ordinal encoding.

```python
# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc
```

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the *OrdinalEncoder* and achieve the same result, although the *LabelEncoder* is designed for encoding a single variable.

The *prepare_targets()* function integer encodes the output data for the train and test sets.

```python
# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc
```

We can call these functions to prepare our data.

```python
...
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
```

Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

```python
# example of loading and preparing the breast cancer dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
```

Now that we have loaded and prepared the breast cancer dataset, we can explore feature selection.

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

- Chi-Squared Statistic.
- Mutual Information Statistic.

Let’s take a closer look at each in turn.

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.
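To see the test itself in isolation, the sketch below runs a chi-squared test of independence on a small contingency table using SciPy's chi2_contingency(). The counts are invented for illustration.

```python
# chi-squared test of independence on a hypothetical contingency table
import numpy as np
from scipy.stats import chi2_contingency

# rows: two categories of one input variable; columns: the two class labels
table = np.array([[30, 10],
                  [20, 40]])
# returns the statistic, the p-value, the degrees of freedom and the expected counts
stat, p, dof, expected = chi2_contingency(table)
print('statistic=%.3f, p=%.5f, dof=%d' % (stat, p, dof))
print(expected)
# a small p-value suggests the input and the class label are not independent
if p < 0.05:
    print('Dependent (reject independence)')
else:
    print('Independent (fail to reject independence)')
```

The expected counts are what each cell would hold if the input and the label were unrelated; the statistic grows as the observed counts move away from them.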

The scikit-learn machine learning library provides an implementation of the chi-squared test in the chi2() function. This function can be used in a feature selection strategy, such as selecting the top *k* most relevant features (largest values) via the SelectKBest class.

For example, we can define the *SelectKBest* class to use the *chi2()* function and select all features, then transform the train and test sets.

```python
...
fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X_train, y_train)
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
```

We can then print the scores for each variable (largest is better), and plot the scores for each variable as a bar graph to get an idea of how many features we should select.

```python
...
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```

Tying this together with the data preparation for the breast cancer dataset in the previous section, the complete example is listed below.

```python
# example of chi squared feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from matplotlib import pyplot

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```

Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.

In this case, we can see the scores are small and it is hard to get an idea from the number alone as to which features are more relevant.

Perhaps features 3, 4, 5, and 8 are most relevant.

```
Feature 0: 0.472553
Feature 1: 0.029193
Feature 2: 2.137658
Feature 3: 29.381059
Feature 4: 8.222601
Feature 5: 8.100183
Feature 6: 1.273822
Feature 7: 0.950682
Feature 8: 3.699989
```

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 3 might be the most relevant (according to chi-squared) and that perhaps four of the nine input features are the most relevant.

We could set k=4 when configuring the *SelectKBest* to select these top four features.

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
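The calculation can be sketched by hand for two categorical variables using only the standard library. The helper name mutual_information() and the toy data are assumptions for illustration; the result is in nats because the natural logarithm is used.

```python
# mutual information between two categorical variables, computed from counts
from collections import Counter
from math import log

def mutual_information(xs, ys):
    # hypothetical helper: sum of p(x,y) * log(p(x,y) / (p(x) * p(y)))
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * log(p_joint / p_indep)
    return mi

target = ['yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no']
# this input determines the target exactly
dependent = ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b']
# this input is spread evenly across both target values
independent = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
mi_dependent = mutual_information(dependent, target)
mi_independent = mutual_information(independent, target)
print('dependent: %.3f' % mi_dependent)
print('independent: %.3f' % mi_independent)
```

When the input fully determines a balanced binary target, knowing it removes all uncertainty and the score equals log(2); when the input tells us nothing about the target, the score is zero.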

You can learn more about mutual information in the following tutorial.

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the mutual_info_classif() function.

Like *chi2()*, it can be used in the *SelectKBest* feature selection strategy (and other strategies).

```python
# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
```

We can perform feature selection using mutual information on the breast cancer set and print and plot the scores (larger is better) as we did in the previous section.

The complete example of using mutual information for categorical feature selection is listed below.

```python
# example of mutual information feature selection for categorical data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from matplotlib import pyplot

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=mutual_info_classif, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
# what are scores for the features
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()
```

Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.

In this case, we can see that some of the features have a very low score, suggesting that perhaps they can be removed.

Perhaps features 3, 6, 2, and 5 are most relevant.

```
Feature 0: 0.003588
Feature 1: 0.000000
Feature 2: 0.025934
Feature 3: 0.071461
Feature 4: 0.000000
Feature 5: 0.038973
Feature 6: 0.064759
Feature 7: 0.003068
Feature 8: 0.000000
```

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

Now that we know how to perform feature selection on categorical data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by chi-squared and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

As a first step, we will evaluate a LogisticRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.

The complete example is listed below.

```python
# evaluation of a model using all input features
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load the dataset
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]
    # format all fields as string
    X = X.astype(str)
    return X, y

# prepare input data
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# prepare target
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
# fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_enc, y_train_enc)
# evaluate the model
yhat = model.predict(X_test_enc)
# evaluate predictions
accuracy = accuracy_score(y_test_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
```

Running the example prints the accuracy of the model on the test dataset.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves a classification accuracy of about 75%.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

Accuracy: 75.79

We can use the chi-squared test to score the features and select the four most relevant features.

The *select_features()* function below is updated to achieve this.

```python
# feature selection
def select_features(X_train, y_train, X_test):
    fs = SelectKBest(score_func=chi2, k=4)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs
```

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

    # evaluation of a model fit using chi squared input features
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=chi2, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_fs, y_train_enc)
    # evaluate the model
    yhat = model.predict(X_test_fs)
    # evaluate predictions
    accuracy = accuracy_score(y_test_enc, yhat)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example reports the performance of the model on just four of the nine input features selected using the chi-squared statistic.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 74%, a slight drop in performance.

It is possible that some of the features removed are, in fact, adding value directly or in concert with the selected features.

At this stage, we would probably prefer to use all of the input features.

Accuracy: 74.74

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the *select_features()* function to achieve this is listed below.

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs
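
By default, *mutual_info_classif()* treats a dense input array as continuous and uses a nearest-neighbor estimate with a small amount of random noise, which is one reason scores can differ between runs. For ordinal-encoded categorical inputs, one option is to pass `discrete_features=True`. The sketch below is a variation on the function above, not the method used in this tutorial; `select_features_discrete` and the demo arrays are hypothetical names.

```python
# mutual information feature selection that treats inputs as discrete
from functools import partial
from numpy import array
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# hypothetical variant of select_features() for integer-coded categories
def select_features_discrete(X_train, y_train, X_test, k):
    # bind discrete_features=True so SelectKBest can call the scorer as usual
    score_func = partial(mutual_info_classif, discrete_features=True)
    fs = SelectKBest(score_func=score_func, k=k)
    fs.fit(X_train, y_train)
    return fs.transform(X_train), fs.transform(X_test), fs

# made-up data: column 0 equals the class, column 1 is unrelated to it
X_demo = array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]])
y_demo = array([0, 0, 0, 1, 1, 1])
X_train_fs, X_test_fs, fs = select_features_discrete(X_demo, y_demo, X_demo, k=1)
print(fs.scores_)
print(fs.get_support(indices=True))
```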

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

    # evaluation of a model fit using mutual information input features
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_fs, y_train_enc)
    # evaluate the model
    yhat = model.predict(X_test_fs)
    # evaluate predictions
    accuracy = accuracy_score(y_test_enc, yhat)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example fits the model on the four top selected features chosen using mutual information.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a small lift in classification accuracy to 76%.

To be sure that the effect is real, it would be a good idea to repeat each experiment multiple times and compare the mean performance. It may also be a good idea to explore using k-fold cross-validation instead of a simple train/test split.
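
A minimal sketch of that idea is shown here, assuming integer-coded inputs. Because the dataset file is not bundled with this text, it uses made-up data (`X_demo`, `y_demo` are hypothetical); the feature selection step sits inside a Pipeline so that it is re-fit on the training portion of every fold.

```python
# repeated stratified k-fold evaluation with feature selection inside the pipeline
from numpy.random import RandomState
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# made-up data: 200 rows, nine integer-coded categorical inputs (assumption)
rng = RandomState(1)
X_demo = rng.randint(0, 4, size=(200, 9))
# made-up target that depends on the first two columns only
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 3).astype(int)

# selection and model are fit together on each training fold
pipeline = Pipeline(steps=[
    ('select', SelectKBest(score_func=chi2, k=4)),
    ('model', LogisticRegression(solver='lbfgs')),
])
# 10 folds repeated 3 times gives 30 accuracy scores
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X_demo, y_demo, scoring='accuracy', cv=cv)
print('Accuracy: %.2f (%.2f)' % (scores.mean() * 100, scores.std() * 100))
```

Swapping `chi2` for `mutual_info_classif`, or removing the selection step, would give comparable mean scores for the other configurations.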

Accuracy: 76.84

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Chi-Squared Test for Machine Learning
- An Introduction to Feature Selection
- Feature Selection For Machine Learning in Python
- What is Information Gain and Mutual Information for Machine Learning

- sklearn.model_selection.train_test_split API.
- sklearn.preprocessing.OrdinalEncoder API.
- sklearn.preprocessing.LabelEncoder API.
- sklearn.feature_selection.chi2 API
- sklearn.feature_selection.SelectKBest API
- sklearn.feature_selection.mutual_info_classif API.
- sklearn.linear_model.LogisticRegression API.

- Breast Cancer Data Set, UCI Machine Learning Repository.
- Breast Cancer Raw Dataset
- Breast Cancer Description

In this tutorial, you discovered how to perform feature selection with categorical input data.

Specifically, you learned:

- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Perform Feature Selection with Categorical Data appeared first on Machine Learning Mastery.

The post 3 Ways to Encode Categorical Variables for Deep Learning appeared first on Machine Learning Mastery.

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an **integer encoding** and a **one hot encoding**, although a newer technique called **learned embedding** may provide a useful middle ground between these two methods.

In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras.

After completing this tutorial, you will know:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Challenge With Categorical Data
- Breast Cancer Categorical Dataset
- How to Ordinal Encode Categorical Data
- How to One Hot Encode Categorical Data
- How to Use a Learned Embedding for Categorical Data

A categorical variable is a variable whose values are labels rather than numbers.

For example, the variable may be “*color*” and may take on the values “*red*,” “*green*,” and “*blue*.”

Sometimes, the categorical data may have an ordered relationship between the categories, such as “*first*,” “*second*,” and “*third*.” This type of categorical data is referred to as ordinal and the additional ordering information can be useful.

Machine learning algorithms and deep learning neural networks require that input and output variables are numbers.

This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.

There are many ways to encode categorical variables for modeling, although the three most common are as follows:

- **Integer Encoding**: Where each unique label is mapped to an integer.
- **One Hot Encoding**: Where each label is mapped to a binary vector.
- **Learned Embedding**: Where a distributed representation of the categories is learned.
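
The first two schemes can be illustrated with NumPy alone. This is a small sketch on a made-up "*color*" variable, not the scikit-learn encoders used later in the tutorial; note that labels are numbered in alphabetical order here, so the integer assigned to each color is an artifact of sorting.

```python
# integer and one hot encoding of a made-up color variable with numpy
from numpy import array
from numpy import eye
from numpy import unique

colors = array(['red', 'green', 'blue', 'green'])
# unique() returns the sorted labels and the integer code of each value
labels, int_encoded = unique(colors, return_inverse=True)
print(labels)       # sorted labels: blue, green, red
print(int_encoded)  # one integer per original value
# one hot: pick the row of the identity matrix that matches each integer
onehot = eye(len(labels), dtype=int)[int_encoded]
print(onehot)
```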

We will take a closer look at how to encode categorical data for training a deep learning neural network in Keras using each one of these methods.

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68% and 73%. We will aim for this region, but note that the models in this tutorial are not optimized: *they are designed to demonstrate encoding schemes*.

You can download the dataset and save the file as “*breast-cancer.csv*” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

    '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
    '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
    '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
    '40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
    '40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
    ...

We can load this dataset into memory using the Pandas library.

    ...
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values

Once loaded, we can split the columns into input (*X*) and output (*y*) for modeling.

    ...
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]

Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).

We can also reshape the output variable to be one column (e.g. a 2D shape).

    ...
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))

We can tie all of this together into a helpful function that we can reuse later.

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a deep learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

    ...
    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

    # load and summarize the dataset
    from pandas import read_csv
    from sklearn.model_selection import train_test_split

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # summarize
    print('Train', X_train.shape, y_train.shape)
    print('Test', X_test.shape, y_test.shape)

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.

    Train (191, 9) (191, 1)
    Test (95, 9) (95, 1)
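
The row counts follow from simple arithmetic on the 286 examples. The sketch below assumes the test count is rounded up and the remainder goes to training, which is consistent with the shapes reported above.

```python
# arithmetic behind the 191/95 split of 286 examples
from math import ceil

n_samples = 286
test_size = 0.33
# assumption: the test set size is rounded up to a whole number of rows
n_test = ceil(test_size * n_samples)
# everything else is used for training
n_train = n_samples - n_test
print('Train rows: %d, Test rows: %d' % (n_train, n_test))
```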

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

An ordinal encoding involves mapping each unique label to an integer value.

As such, it is sometimes referred to simply as an integer encoding.

This type of encoding is really only appropriate if there is a known relationship between the categories.

This relationship does exist for some of the variables in the dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

**Note**: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
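
As a starting point for that exercise, the sketch below shows how an explicit order might be passed through the `categories` argument of the OrdinalEncoder. The list of age ranges is a hypothetical ordering chosen for illustration; check the labels in your copy of the data before reusing it.

```python
# ordinal encoding with an explicit category order
from numpy import array
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordering for an age-range variable, youngest to oldest
age_order = ['20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
# one list of categories per input column, here a single column
oe = OrdinalEncoder(categories=[age_order])
ages = array([['50-59'], ['40-49'], ['70-79'], ['40-49']])
encoded = oe.fit_transform(ages)
# each label is replaced by its position in age_order
print(encoded.ravel())
```

With more than one column, the `categories` list would hold one ordered list per column, in column order.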

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below, named *prepare_inputs()*, takes the input data for the train and test sets and encodes it using an ordinal encoding.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1.

This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the *LabelEncoder* is designed for encoding a single variable.

The *prepare_targets()* function integer encodes the output data for the train and test sets.

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc
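
It can help to check which class label ends up as 0 and which as 1. The short sketch below fits a LabelEncoder on a few made-up target values (`y_demo` is a hypothetical name) and reads the mapping back from its `classes_` attribute, which holds the labels in sorted order.

```python
# inspect the integer assigned to each class label
from numpy import array
from sklearn.preprocessing import LabelEncoder

y_demo = array(['recurrence-events', 'no-recurrence-events', 'no-recurrence-events'])
le = LabelEncoder()
y_enc = le.fit_transform(y_demo)
# classes_ is sorted, so the position of a label is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(y_enc)
# the encoding can be reversed when reporting predictions
print(le.inverse_transform(y_enc))
```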

We can call these functions to prepare our data.

    ...
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

We can now define a neural network model.

We will use the same general model in all of these examples. Specifically, a MultiLayer Perceptron (MLP) neural network with one hidden layer with 10 nodes, and one node in the output layer for making binary classifications.

Without going into too much detail, the code below defines the model, fits it on the training dataset, and then evaluates it on the test dataset.

    ...
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))
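
As a rough check on how small this network is, the number of trainable parameters can be worked out by hand: a fully connected layer has one weight per input per node, plus one bias per node. The helper name `dense_params` is hypothetical and used only for this calculation.

```python
# count the trainable parameters of the MLP by hand
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

n_features = 9  # nine ordinal-encoded input columns
hidden = dense_params(n_features, 10)
output = dense_params(10, 1)
total = hidden + output
print('Hidden: %d, Output: %d, Total: %d' % (hidden, output, total))
```

The hidden layer grows with the number of input columns, so a wider input encoding produces a larger model even though the architecture is unchanged.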

If you are new to developing neural networks in Keras, I recommend this tutorial:

Tying all of this together, the complete example of preparing the data with an ordinal encoding and fitting and evaluating a neural network on the data is listed below.

    # example of ordinal encoding for a neural network
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from keras.models import Sequential
    from keras.layers import Dense

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example will fit the model in just a few seconds on any modern hardware (no GPU required).

The loss and the accuracy of the model are reported at the end of each training epoch, and finally, the accuracy of the model on the test dataset is reported.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 70% on the test dataset.

Not bad, given that an ordinal relationship only exists for some of the input variables, and for those where it does, it was not honored in the encoding.

    ...
    Epoch 95/100
     - 0s - loss: 0.5349 - acc: 0.7696
    Epoch 96/100
     - 0s - loss: 0.5330 - acc: 0.7539
    Epoch 97/100
     - 0s - loss: 0.5316 - acc: 0.7592
    Epoch 98/100
     - 0s - loss: 0.5302 - acc: 0.7696
    Epoch 99/100
     - 0s - loss: 0.5291 - acc: 0.7644
    Epoch 100/100
     - 0s - loss: 0.5277 - acc: 0.7644
    Accuracy: 70.53

This provides a good starting point when working with categorical data.

A better and more general approach is to use a one hot encoding.

A one hot encoding is appropriate for categorical data where no relationship exists between categories.

It involves representing each categorical variable with a binary vector that has one element for each unique label and marking the class label with a 1 and all other elements 0.

For example, if our variable was “*color*” and the labels were “*red*,” “*green*,” and “*blue*,” we would encode each of these labels as a three-element binary vector as follows:

- Red: [1, 0, 0]
- Green: [0, 1, 0]
- Blue: [0, 0, 1]

Then each label in the dataset would be replaced with a vector (one column becomes three). This is done for all categorical variables so that our nine input variables or columns become 43 in the case of the breast cancer dataset.
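
The width of the one hot encoded input is simply the sum of the number of unique labels in each column. The sketch below shows the calculation on a small made-up array with three columns (`X_demo` is a hypothetical name), not the full dataset.

```python
# number of one hot columns = sum of unique labels per input column
from numpy import array
from numpy import unique

# made-up sample with three categorical columns
X_demo = array([
    ['40-49', 'premeno', 'yes'],
    ['50-59', 'ge40', 'no'],
    ['50-59', 'ge40', 'no'],
    ['40-49', 'premeno', 'yes'],
    ['60-69', 'lt40', 'no'],
])
# count the distinct labels in each column
n_labels = [len(unique(X_demo[:, i])) for i in range(X_demo.shape[1])]
n_columns = sum(n_labels)
print('Labels per column:', n_labels)
print('One hot columns:', n_columns)
```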

The scikit-learn library provides the OneHotEncoder to automatically one hot encode one or more variables.

The *prepare_inputs()* function below provides a drop-in replacement function for the example in the previous section. Instead of using an *OrdinalEncoder*, it uses a *OneHotEncoder*.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        ohe = OneHotEncoder()
        ohe.fit(X_train)
        X_train_enc = ohe.transform(X_train)
        X_test_enc = ohe.transform(X_test)
        return X_train_enc, X_test_enc

Tying this together, the complete example of one hot encoding the breast cancer categorical dataset and modeling it with a neural network is listed below.

    # example of one hot encoding for a neural network
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from keras.models import Sequential
    from keras.layers import Dense

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        ohe = OneHotEncoder()
        ohe.fit(X_train)
        X_train_enc = ohe.transform(X_train)
        X_test_enc = ohe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

The example one hot encodes the input categorical data, and also label encodes the target variable as we did in the previous section. The same neural network model is then fit on the prepared dataset.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, achieving an accuracy of about 72%, close to what was seen in the previous section.

A fairer comparison would be to run each configuration 10 or 30 times and compare performance using the mean accuracy. Recall that we are more focused on how to encode categorical data in this tutorial than on getting the best score on this specific dataset.
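
The repeated-run comparison could be organized along the following lines. This is only a harness sketch: `repeated_evaluation` is a hypothetical helper, and `placeholder_evaluate` returns simulated numbers so the snippet runs on its own. In practice it would be replaced by a function that fits and evaluates one of the models above and returns its test accuracy.

```python
# summarize a configuration by the mean and spread of repeated runs
from random import Random
from statistics import mean
from statistics import stdev

def repeated_evaluation(evaluate_fn, n_repeats=10):
    # call the evaluation once per repeat, using the repeat index as a seed
    scores = [evaluate_fn(seed) for seed in range(n_repeats)]
    return mean(scores), stdev(scores), scores

def placeholder_evaluate(seed):
    # stand-in only: simulated accuracy, NOT a real model result
    return 0.70 + Random(seed).uniform(-0.03, 0.03)

avg, spread, scores = repeated_evaluation(placeholder_evaluate, n_repeats=10)
print('Mean accuracy: %.3f (std %.3f) over %d runs' % (avg, spread, len(scores)))
```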

    ...
    Epoch 95/100
     - 0s - loss: 0.3837 - acc: 0.8272
    Epoch 96/100
     - 0s - loss: 0.3823 - acc: 0.8325
    Epoch 97/100
     - 0s - loss: 0.3814 - acc: 0.8325
    Epoch 98/100
     - 0s - loss: 0.3795 - acc: 0.8325
    Epoch 99/100
     - 0s - loss: 0.3788 - acc: 0.8325
    Epoch 100/100
     - 0s - loss: 0.3773 - acc: 0.8325
    Accuracy: 72.63

Ordinal and one hot encoding are perhaps the two most popular methods.

A newer technique, called a learned embedding, is similar to one hot encoding and was designed for use with neural networks.

A learned embedding, or simply an “*embedding*,” is a distributed representation for categorical data.

Each category is mapped to a distinct vector, and the properties of the vector are adapted or learned while training a neural network. The vector space provides a projection of the categories, allowing those categories that are close or related to cluster together naturally.

This provides both the benefits of an ordinal relationship by allowing any such relationships to be learned from data, and a one hot encoding in providing a vector representation for each category. Unlike one hot encoding, the input vectors are not sparse (do not have lots of zeros). The downside is that it requires learning as part of the model and the creation of many more input variables (columns).
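
Mechanically, an embedding can be thought of as a lookup table with one row per category. The NumPy sketch below is a conceptual illustration, not the Keras implementation: the table is filled with random values that training would normally adjust, and it shows that indexing the table by integer code gives the same result as multiplying a one hot vector by it.

```python
# an embedding as a lookup table: one dense vector per category
from numpy import allclose
from numpy import array
from numpy import eye
from numpy.random import default_rng

n_labels, n_dims = 3, 4
rng = default_rng(1)
# randomly initialized table; in a real model these values are learned
weights = rng.normal(size=(n_labels, n_dims))
# integer-encoded categories for four examples
codes = array([2, 0, 2, 1])
# lookup: pick the row that belongs to each category
vectors = weights[codes]
print(vectors.shape)
# the same result via one hot vectors and a matrix product
onehot = eye(n_labels)[codes]
same = allclose(onehot @ weights, vectors)
print(same)
```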

The technique was originally developed to provide a distributed representation for words, e.g. allowing similar words to have similar vector representations. As such, the technique is often referred to as a word embedding, and in the case of text data, algorithms have been developed to learn a representation independent of a neural network. For more on this topic, see the post:

An additional benefit of using an embedding is that the learned vectors that each category is mapped to can be fit in a model that has modest skill, but the vectors can be extracted and used generally as input for the category on a range of different models and applications. That is, they can be learned and reused.

Embeddings can be used in Keras via the *Embedding* layer.

For an example of learning word embeddings for text data in Keras, see the post:

One embedding layer is required for each categorical variable, and the embedding expects the categories to be ordinal encoded, although no relationship between the categories is assumed.

Each embedding also requires the number of dimensions to use for the distributed representation (vector space). It is common in natural language applications to use 50, 100, or 300 dimensions. For our small example, we will fix the number of dimensions at 10, but this is arbitrary; you should experiment with other values.

First, we can prepare the input data using an ordinal encoding.

The model we will develop will have one separate embedding for each input variable. Therefore, the model will take nine different input datasets. As such, we will split the input variables and ordinal encode (integer encoding) each separately using the *LabelEncoder* and return a list of separate prepared train and test input datasets.

The *prepare_inputs()* function below implements this, enumerating over each input variable, integer encoding each correctly using best practices, and returning lists of encoded train and test variables (or one-variable datasets) that can be used as input for our model later.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        X_train_enc, X_test_enc = list(), list()
        # label encode each column
        for i in range(X_train.shape[1]):
            le = LabelEncoder()
            le.fit(X_train[:, i])
            # encode
            train_enc = le.transform(X_train[:, i])
            test_enc = le.transform(X_test[:, i])
            # store
            X_train_enc.append(train_enc)
            X_test_enc.append(test_enc)
        return X_train_enc, X_test_enc

Now we can construct the model.

We must construct the model differently in this case because we will have nine input layers, with nine embeddings the outputs of which (the nine different 10-element vectors) need to be concatenated into one long vector before being passed as input to the dense layers.

We can achieve this using the functional Keras API. If you are new to the Keras functional API, see the post:

First, we can enumerate each variable, construct an input layer, connect it to an embedding layer, and store both layers in lists. We need a reference to all of the input layers when defining the model, and we need a reference to each embedding layer to concatenate them with a merge layer.

    ...
    # prepare each input head
    in_layers = list()
    em_layers = list()
    for i in range(len(X_train_enc)):
        # calculate the number of unique inputs
        n_labels = len(unique(X_train_enc[i]))
        # define input layer
        in_layer = Input(shape=(1,))
        # define embedding layer
        em_layer = Embedding(n_labels, 10)(in_layer)
        # store layers
        in_layers.append(in_layer)
        em_layers.append(em_layer)

We can then merge all of the embedding layers, define the hidden layer and output layer, then define the model.

    ...
    # concat all embeddings
    merge = concatenate(em_layers)
    dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
    output = Dense(1, activation='sigmoid')(dense)
    model = Model(inputs=in_layers, outputs=output)

When using a model with multiple inputs, we will need to specify a list that has one dataset for each input, e.g. a list of nine arrays each with one column in the case of our dataset. Thankfully, this is the format we returned from our *prepare_inputs()* function.

Therefore, fitting and evaluating the model looks like it does in the previous section.

Additionally, we will plot the model by calling the *plot_model()* function and save it to file. This requires that pydot and Graphviz are installed, which can be a pain on some systems. **If you have trouble**, just comment out the import statement and call to *plot_model()*.

    ...
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # plot graph
    plot_model(model, show_shapes=True, to_file='embeddings.png')
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Tying this all together, the complete example of using a separate embedding for each categorical input variable in a multi-input layer model is listed below.

    # example of learned embedding encoding for a neural network
    from numpy import unique
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from keras.models import Model
    from keras.layers import Input
    from keras.layers import Dense
    from keras.layers import Embedding
    from keras.layers import concatenate
    from keras.utils import plot_model

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        X_train_enc, X_test_enc = list(), list()
        # label encode each column
        for i in range(X_train.shape[1]):
            le = LabelEncoder()
            le.fit(X_train[:, i])
            # encode
            train_enc = le.transform(X_train[:, i])
            test_enc = le.transform(X_test[:, i])
            # store
            X_train_enc.append(train_enc)
            X_test_enc.append(test_enc)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # make output 3d to match the model output, which keeps the length-1 sequence axis of the embeddings
    y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
    y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))
    # prepare each input head
    in_layers = list()
    em_layers = list()
    for i in range(len(X_train_enc)):
        # calculate the number of unique inputs
        n_labels = len(unique(X_train_enc[i]))
        # define input layer
        in_layer = Input(shape=(1,))
        # define embedding layer
        em_layer = Embedding(n_labels, 10)(in_layer)
        # store layers
        in_layers.append(in_layer)
        em_layers.append(em_layer)
    # concat all embeddings
    merge = concatenate(em_layers)
    dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
    output = Dense(1, activation='sigmoid')(dense)
    model = Model(inputs=in_layers, outputs=output)
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # plot graph
    plot_model(model, show_shapes=True, to_file='embeddings.png')
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the data as described above, fits the model, and reports the performance.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, matching what we saw for the one hot encoding in the previous section.

As the learned vectors were trained in a skilled model, it is possible to save them and use them as a general representation for these variables in other models that operate on the same data. This is a useful and compelling reason to explore this encoding.

    ...
    Epoch 15/20
     - 0s - loss: 0.4891 - acc: 0.7696
    Epoch 16/20
     - 0s - loss: 0.4845 - acc: 0.7749
    Epoch 17/20
     - 0s - loss: 0.4783 - acc: 0.7749
    Epoch 18/20
     - 0s - loss: 0.4763 - acc: 0.7906
    Epoch 19/20
     - 0s - loss: 0.4696 - acc: 0.7906
    Epoch 20/20
     - 0s - loss: 0.4660 - acc: 0.7958

    Accuracy: 72.63

To confirm our understanding of the model, a plot is created and saved to the file embeddings.png in the current working directory.

The plot shows the nine inputs each mapped to a 10 element vector, meaning that the actual input to the model is a 90 element vector.
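The arithmetic behind the 90 element vector can be sketched with NumPy alone. This is a rough illustration rather than what Keras does internally: the lookup tables are filled with random values standing in for learned weights, and the label counts in `n_labels` are hypothetical, not the real counts from the breast cancer dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

embedding_dim = 10  # each category is mapped to a 10-element vector

# hypothetical number of distinct labels for each of the nine input variables
n_labels = [6, 3, 11, 7, 2, 3, 2, 5, 2]

# one lookup table per variable; random values stand in for learned weights
tables = [rng.normal(size=(n, embedding_dim)) for n in n_labels]

# one integer-encoded example: a single label index for each variable
row = [rng.integers(0, n) for n in n_labels]

# look up each variable's vector, then join the nine vectors end to end
vectors = [table[idx] for table, idx in zip(tables, row)]
merged = np.concatenate(vectors)

print(merged.shape)  # (90,)
```

Note that the merged width depends only on the number of variables and the embedding size, not on how many labels each variable has.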


This section lists some common questions and answers when encoding categorical data.

What if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
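As a minimal sketch of this column-by-column approach, the example below uses only NumPy on a tiny made-up dataset. The column names, values, and the ordering of the ordinal labels are all assumptions for illustration; in practice you might use the scikit-learn encoders covered earlier in the tutorial instead.

```python
import numpy as np

# hypothetical toy data: one nominal column and one ordinal column
colour = np.array(['red', 'green', 'blue', 'green'])
size = np.array(['small', 'large', 'medium', 'small'])

# nominal column: one hot encode (one binary column per distinct label)
labels, codes = np.unique(colour, return_inverse=True)
colour_enc = np.eye(len(labels))[codes]

# ordinal column: integer encode using an explicit, meaningful order
order = {'small': 0, 'medium': 1, 'large': 2}
size_enc = np.array([order[s] for s in size]).reshape(-1, 1)

# concatenate the prepared columns back into a single input array
X = np.hstack([colour_enc, size_enc])
print(X.shape)  # (4, 4)
```

Each column is prepared independently, so any fitted mapping (here `labels` and `order`) should come from the training data only and then be reused on the test data.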

What if I concatenate many one hot encoded vectors to create a many-thousand-element input vector?

You can use a one hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.
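To make the size difference concrete, here is a quick back-of-the-envelope comparison. The label counts and the embedding size are hypothetical values chosen for illustration.

```python
# hypothetical number of distinct labels in each categorical column
cardinalities = [5000, 1200, 40]
embedding_dim = 10  # assumed size of each learned vector

# one hot: one binary element per label, summed across columns
one_hot_width = sum(cardinalities)

# embedding: a fixed-length vector per column, regardless of label count
embedding_width = len(cardinalities) * embedding_dim

print(one_hot_width, embedding_width)  # 6240 30
```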

Which encoding technique is best for a given dataset is unknowable in advance.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

This section provides more resources on the topic if you are looking to go deeper.

- Develop Your First Neural Network in Python Step-By-Step
- Why One-Hot Encode Data in Machine Learning?
- Data Preparation for Gradient Boosting with XGBoost in Python
- What Are Word Embeddings for Text?
- How to Use Word Embedding Layers for Deep Learning with Keras
- How to Use the Keras Functional API for Deep Learning

- sklearn.model_selection.train_test_split API.
- sklearn.preprocessing.OrdinalEncoder API.
- sklearn.preprocessing.LabelEncoder API.
- Embedding Keras API.
- Visualization Keras API.

- Breast Cancer Data Set, UCI Machine Learning Repository.
- Breast Cancer Raw Dataset
- Breast Cancer Description

In this tutorial, you discovered how to encode categorical data when developing neural network models in Keras.

Specifically, you learned:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 3 Ways to Encode Categorical Variables for Deep Learning appeared first on Machine Learning Mastery.

]]>