The post How to Use Out-of-Fold Predictions in Machine Learning appeared first on Machine Learning Mastery.

Machine learning algorithms are typically evaluated using resampling techniques such as k-fold cross-validation.

During the k-fold cross-validation process, predictions are made on test sets comprised of data not used to train the model. These predictions are referred to as **out-of-fold predictions**, a type of out-of-sample predictions.

Out-of-fold predictions play an important role in machine learning, both in estimating the performance of a model when making predictions on new data in the future (the so-called generalization performance of the model) and in the development of ensemble models.

In this tutorial, you will discover a gentle introduction to out-of-fold predictions in machine learning.

After completing this tutorial, you will know:

- Out-of-fold predictions are a type of out-of-sample predictions made on data not used to train a model.
- Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
- Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Are Out-of-Fold Predictions?
- Out-of-Fold Predictions for Evaluation
- Out-of-Fold Predictions for Ensembles

It is common to evaluate the performance of a machine learning algorithm on a dataset using a resampling technique such as k-fold cross-validation.

The k-fold cross-validation procedure involves splitting a training dataset into *k* groups, then using each of the *k* groups of examples in turn as a test set while the remaining examples are used as a training set.

This means that *k* different models are trained and evaluated. The performance of the model is estimated using the predictions made by these models across all *k* folds.

This procedure can be summarized as follows:

- 1. Shuffle the dataset randomly.
- 2. Split the dataset into k groups.
- 3. For each unique group:
    - a. Take the group as a holdout or test data set.
    - b. Take the remaining groups as a training data set.
    - c. Fit a model on the training set and evaluate it on the test set.
    - d. Retain the evaluation score and discard the model.
- 4. Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the holdout set 1 time and used to train the model k-1 times.
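The claim about group membership can be checked directly. The snippet below is a minimal sketch, assuming a toy dataset of 20 examples, *k*=5, and a fixed random seed (all illustrative choices); it counts how many times each example index lands in the test set and in the training set.

```python
# count how often each example is used for testing and for training
from numpy import zeros
from sklearn.model_selection import KFold

# toy sizes chosen for illustration only
n_examples, k = 20, 5
test_counts = zeros(n_examples, dtype=int)
train_counts = zeros(n_examples, dtype=int)
kfold = KFold(n_splits=k, shuffle=True, random_state=1)
# split() only needs the number of rows, so a placeholder array is enough
for train_ix, test_ix in kfold.split(zeros((n_examples, 1))):
    test_counts[test_ix] += 1
    train_counts[train_ix] += 1
# every example is held out once and used for training k-1 times
print(test_counts)
print(train_counts)
```

Every entry of the first array should be 1 and every entry of the second should be k-1.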

For more on the topic of k-fold cross-validation, see the tutorial:

- A Gentle Introduction to k-fold Cross-Validation

An out-of-fold prediction is a prediction by the model during the k-fold cross-validation procedure.

That is, out-of-fold predictions are those predictions made on the holdout datasets during the resampling procedure. If performed correctly, there will be one prediction for each example in the training dataset.

Sometimes, out-of-fold is summarized with the acronym OOF.

**Out-of-Fold Predictions**: Predictions made by models during the k-fold cross-validation procedure on the holdout examples.
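As a minimal sketch of this definition, scikit-learn's `cross_val_predict()` function returns exactly one out-of-fold prediction per example. The small dataset, the model, and the random seeds below are assumptions chosen for illustration.

```python
# collect one out-of-fold prediction for every example in the dataset
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# small synthetic dataset (illustrative sizes)
X, y = make_blobs(n_samples=200, centers=2, n_features=10, cluster_std=5, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
# each prediction comes from a model that did not see that example during training
oof_yhat = cross_val_predict(KNeighborsClassifier(), X, y, cv=kfold)
print(oof_yhat.shape, y.shape)
```

The array of predictions has the same length as the array of labels, one prediction per example.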

The notion of out-of-fold predictions is directly related to the idea of **out-of-sample predictions**, as the predictions in both cases are made on examples that were not used during the training of the model and can be used to estimate the performance of the model when used to make predictions on new data.

As such, out-of-fold predictions are a type of out-of-sample prediction, although described in the context of a model evaluated using k-fold cross-validation.

**Out-of-Sample Predictions**: Predictions made by a model on data not used during the training of the model.

Out-of-sample predictions may also be referred to as holdout predictions.

There are two main uses for out-of-fold predictions; they are:

- Estimate the performance of the model on unseen data.
- Fit an ensemble model.

Let’s take a closer look at these two cases.

The most common use for out-of-fold predictions is to estimate the performance of the model.

That is, predictions on data that were not used to train the model can be made and evaluated using a scoring metric such as error or accuracy. This metric provides an estimate of the performance of the model when used to make predictions on new data, such as when the model will be used in practice to make predictions.

Generally, predictions made on data not used to train a model provide insight into how the model will generalize to new situations. As such, scores that evaluate these predictions are referred to as the generalization performance of a machine learning model.

There are two main ways to use these predictions to estimate the performance of the model.

The first is to score the model on the predictions made during each fold, then calculate the average of those scores. For example, if we are evaluating a classification model, then classification accuracy can be calculated on each group of out-of-fold predictions, then the mean accuracy can be reported.

**Approach 1**: Estimate performance as the mean score estimated on each group of out-of-fold predictions.

The second approach is to consider that each example appears in exactly one test set. That is, each example in the training dataset has a single prediction made during the k-fold cross-validation process. As such, we can collect all predictions, compare them to their expected outcomes, and calculate a score directly across the entire training dataset.

**Approach 2**: Estimate performance using the aggregate of all out-of-fold predictions.

Both are reasonable approaches and the scores that result from each procedure should be approximately equivalent.

Calculating the mean of the scores from each group of out-of-fold predictions may be the most common approach, as the spread of the estimate can also be reported using the standard deviation or standard error of the fold scores.

The k resampled estimates of performance are summarized (usually with the mean and standard error) …

— Page 70, Applied Predictive Modeling, 2013.

We can demonstrate the difference between these two approaches to evaluating models using out-of-fold predictions with a small worked example.

We will use the make_blobs() scikit-learn function to create a test binary classification problem with 1,000 examples, two classes, and 100 input features.

The example below prepares a data sample and summarizes the shape of the input and output elements of the dataset.

```python
# example of creating a test dataset
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# summarize the shape of the arrays
print(X.shape, y.shape)
```

Running the example prints the shape of the input data showing 1,000 rows of data with 100 columns or input features and the corresponding classification labels.

```
(1000, 100) (1000,)
```

Next, we can use *k*-fold cross-validation to evaluate a KNeighborsClassifier model.

We will use *k*=10 for the KFold object, a sensible and widely used choice, fit a model on each training dataset, and evaluate it on each holdout fold.

Accuracy scores will be stored in a list across each model evaluation, and the mean and standard deviation of these scores will be reported.

The complete example is listed below.

```python
# evaluate model by averaging performance across each fold
from numpy import mean
from numpy import std
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# k-fold cross validation
scores = list()
kfold = KFold(n_splits=10, shuffle=True)
# enumerate splits
for train_ix, test_ix in kfold.split(X):
    # get data
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # fit model
    model = KNeighborsClassifier()
    model.fit(train_X, train_y)
    # evaluate model
    yhat = model.predict(test_X)
    acc = accuracy_score(test_y, yhat)
    # store score
    scores.append(acc)
    print('> ', acc)
# summarize model performance
mean_s, std_s = mean(scores), std(scores)
print('Mean: %.3f, Standard Deviation: %.3f' % (mean_s, std_s))
```

Running the example reports the model classification accuracy on the holdout fold for each iteration.

At the end of the run, the mean and standard deviation of the accuracy scores are reported.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

```
> 0.95
> 0.92
> 0.95
> 0.95
> 0.91
> 0.97
> 0.96
> 0.96
> 0.98
> 0.91
Mean: 0.946, Standard Deviation: 0.023
```
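To make the summary statistics concrete, the sketch below recomputes them from the ten fold scores listed above and adds the standard error. The standard error formula used here (sample standard deviation divided by the square root of the number of folds) is a common convention and is an assumption about what one would report; it is not part of the original example.

```python
# summarize the ten fold scores reported above
from math import sqrt
from numpy import mean
from numpy import std

scores = [0.95, 0.92, 0.95, 0.95, 0.91, 0.97, 0.96, 0.96, 0.98, 0.91]
mean_s = mean(scores)
# population standard deviation (ddof=0), matching numpy's default used in the example
std_s = std(scores)
# standard error of the mean, using the sample standard deviation (ddof=1)
std_err = std(scores, ddof=1) / sqrt(len(scores))
print('Mean: %.3f, Standard Deviation: %.3f, Standard Error: %.3f' % (mean_s, std_s, std_err))
```

Treating the fold scores as independent is itself an approximation, because the training sets of different folds overlap, so the standard error is best read as a rough guide.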

We can contrast this with the alternate approach that evaluates all predictions as a single group.

Instead of evaluating the model on each holdout fold, predictions are made and stored in a list. Then, at the end of the run, the predictions are compared to the expected values for each holdout test set and a single accuracy score is reported.

The complete example is listed below.

```python
# evaluate model by calculating the score across all predictions
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# k-fold cross validation
data_y, data_yhat = list(), list()
kfold = KFold(n_splits=10, shuffle=True)
# enumerate splits
for train_ix, test_ix in kfold.split(X):
    # get data
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # fit model
    model = KNeighborsClassifier()
    model.fit(train_X, train_y)
    # make predictions
    yhat = model.predict(test_X)
    # store
    data_y.extend(test_y)
    data_yhat.extend(yhat)
# evaluate the model
acc = accuracy_score(data_y, data_yhat)
print('Accuracy: %.3f' % (acc))
```

Running the example collects all of the expected and predicted values for each holdout dataset and reports a single accuracy score at the end of the run.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

```
Accuracy: 0.930
```

Again, both approaches are comparable and it may be a matter of taste as to the method you use on your own predictive modeling problem.
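The two listings above each draw a new random dataset and a new random split, so their reported scores are not directly comparable. The sketch below is one way to compare the approaches on identical folds, assuming fixed random seeds (an addition for reproducibility, not part of the original examples).

```python
# compare both summaries on the same data and the same folds
from numpy import mean
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# fixed seeds so that both summaries see identical data and splits
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
scores, data_y, data_yhat = list(), list(), list()
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    model = KNeighborsClassifier()
    model.fit(X[train_ix], y[train_ix])
    yhat = model.predict(X[test_ix])
    # approach 1: score each fold separately
    scores.append(accuracy_score(y[test_ix], yhat))
    # approach 2: pool the out-of-fold predictions
    data_y.extend(y[test_ix])
    data_yhat.extend(yhat)
mean_acc = mean(scores)
pooled_acc = accuracy_score(data_y, data_yhat)
print('Mean of fold scores: %.3f, Pooled score: %.3f' % (mean_acc, pooled_acc))
```

With accuracy as the metric and folds of equal size (here 10 folds of 100 examples), the two numbers coincide. They can differ when fold sizes are unequal or when the metric is not a simple average over examples.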

Another common use for out-of-fold predictions is to use them in the development of an ensemble model.

An ensemble is a machine learning model that combines the predictions from two or more models prepared on the same training dataset.

This is a very common procedure to use when working on a machine learning competition.

The out-of-fold predictions in aggregate provide information about how the model performs on each example in the training dataset when not used to train the model. This information can be used to train a model to correct or improve upon those predictions.

First, the *k*-fold cross-validation procedure is performed on each base model of interest, and all of the out-of-fold predictions are collected. Importantly, the same split of the training data into *k* folds is used for each model. Now we have one aggregated group of out-of-fold predictions for each model, that is, one prediction for each example in the training dataset.

**Base-Models**: Models evaluated using *k*-fold cross-validation on the training dataset, with all out-of-fold predictions retained.

Next, a second higher-order model, called a meta-model, is trained on the predictions made by the other models. This meta-model may or may not also take the input data for each example as input when making predictions. The job of this model is to learn how to best combine and correct the predictions made by the other models using their out-of-fold predictions.

**Meta-Model**: Model that takes the out-of-fold predictions made by one or more models as input and learns how to best combine and correct the predictions.

For example, we may have a two-class classification predictive modeling problem and train a decision tree and a k-nearest neighbor model as the base models. Each model predicts a 0 or 1 for each example in the training dataset via out-of-fold predictions. These predictions, along with the input data, can then form a new input to the meta-model.

**Meta-Model Input**: Input portion of a given sample concatenated with the predictions made by each base model.

**Meta-Model Output**: Output portion of a given sample.

*Why use the out-of-fold predictions to train the meta-model?*

We could train each base model on the entire training dataset, then make a prediction for each example in the training dataset and use the predictions as input to the meta-model. The problem is the predictions will be optimistic because the samples were used in the training of each base model. This optimistic bias means that the predictions will be better than normal, and the meta-model will likely not learn what is required to combine and correct the predictions from the base models.
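The optimistic bias is easy to see with a model that can memorize its training data. The sketch below is an illustration under assumed settings (a smaller synthetic dataset, a fully grown decision tree, and fixed seeds); it compares accuracy on the training data itself with accuracy of the out-of-fold predictions.

```python
# compare in-sample predictions with out-of-fold predictions
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_blobs(n_samples=500, centers=2, n_features=100, cluster_std=20, random_state=1)
# in-sample: fit on all data, then predict the same data
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)
in_sample_acc = accuracy_score(y, model.predict(X))
# out-of-fold: every prediction comes from a model that never saw that example
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
oof_yhat = cross_val_predict(DecisionTreeClassifier(random_state=1), X, y, cv=kfold)
oof_acc = accuracy_score(y, oof_yhat)
print('In-sample: %.3f, Out-of-fold: %.3f' % (in_sample_acc, oof_acc))
```

An unpruned tree reproduces its training labels perfectly, so a meta-model trained on the in-sample predictions would learn to trust the tree far more than its behavior on unseen data warrants.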

By using out-of-fold predictions from the base model to train the meta-model, the meta-model can see and harness the expected behavior of each base model when operating on unseen data, as will be the case when the ensemble is used in practice to make predictions on new data.

Finally, each of the base models is trained on the entire training dataset, and these final models and the meta-model can be used to make predictions on new data. The performance of this ensemble can be evaluated on a separate holdout test dataset not used during training.

This procedure can be summarized as follows:

- 1. For each base model:
    - a. Use k-fold cross-validation and collect out-of-fold predictions.
- 2. Train the meta-model on the out-of-fold predictions from all models.
- 3. Train each base model on the entire training dataset.

This procedure is called stacked generalization, or stacking for short. Because it is common to use a linear weighted sum as the meta-model, this procedure is sometimes called **blending**.
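For comparison with the procedure summarized above, recent versions of scikit-learn provide a `StackingClassifier` that follows the same outline: out-of-fold predictions from each base model train the meta-model, and the base models are refit on all of the training data. The sketch below is an illustration and not the method developed by hand in this tutorial; the seeds and variable names are assumptions. Note that an integer `cv` uses stratified folds for classifiers.

```python
# stacking via the scikit-learn StackingClassifier (illustrative alternative)
from sklearn.datasets import make_blobs
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# base models whose out-of-fold predictions feed the meta-model
base_models = [('cart', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())]
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(solver='liblinear'),
    cv=10,                         # folds used to create the out-of-fold predictions
    stack_method='predict_proba',  # pass probabilities, not labels, to the meta-model
    passthrough=True)              # also give the meta-model the original inputs
stack.fit(X_train, y_train)
acc = stack.score(X_val, y_val)
print('Stacking Accuracy: %.3f' % acc)
```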

For more on the topic of stacking, see the tutorials:

- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python

We can make this procedure concrete with a worked example using the same dataset used in the previous section.

First, we will split the data into training and validation datasets. The training dataset will be used to fit the submodels and meta-model, and the validation dataset will be held back from training and used at the end to evaluate the meta-model and submodels.

```python
...
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.33)
```

In this example, we will use k-fold cross-validation to fit a DecisionTreeClassifier and a KNeighborsClassifier model on each cross-validation fold, and use the fit models to make out-of-fold predictions.

The models will make predictions of probabilities instead of class labels in an attempt to provide more useful input features for the meta-model. This is a good practice.
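The difference between the two kinds of output can be seen in a minimal sketch, assuming a small synthetic dataset and a simple train/test slice chosen for illustration. `predict()` returns one label per example, whereas `predict_proba()` returns one column per class, and column 0 holds the probability of class 0.

```python
# contrast class labels with predicted probabilities
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=2, n_features=10, cluster_std=5, random_state=1)
model = KNeighborsClassifier()
model.fit(X[:150], y[:150])
# hard labels: one value per example
labels = model.predict(X[150:])
# probabilities: one row per example, one column per class
probs = model.predict_proba(X[150:])
# probability of class 0, the column used as a meta-model feature
p_class0 = probs[:, 0]
print(labels.shape, probs.shape, p_class0.shape)
```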

We will also keep track of the input data (100 features) and output data (expected label) for the out-of-fold data.

```python
...
# collect out of sample predictions
data_x, data_y, knn_yhat, cart_yhat = list(), list(), list(), list()
kfold = KFold(n_splits=10, shuffle=True)
for train_ix, test_ix in kfold.split(X):
    # get data
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    data_x.extend(test_X)
    data_y.extend(test_y)
    # fit and make predictions with cart
    model1 = DecisionTreeClassifier()
    model1.fit(train_X, train_y)
    yhat1 = model1.predict_proba(test_X)[:, 0]
    cart_yhat.extend(yhat1)
    # fit and make predictions with knn
    model2 = KNeighborsClassifier()
    model2.fit(train_X, train_y)
    yhat2 = model2.predict_proba(test_X)[:, 0]
    knn_yhat.extend(yhat2)
```

At the end of the run, we can then construct a dataset for a meta classifier comprised of 100 input features for the input data and the two columns of predicted probabilities from the kNN and decision tree models.

The *create_meta_dataset()* function below implements this, taking the out-of-fold data and predictions across the folds as input and constructing the input dataset for the meta-model.

```python
# create a meta dataset
def create_meta_dataset(data_x, yhat1, yhat2):
    # convert to columns
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    # stack as separate columns
    meta_X = hstack((data_x, yhat1, yhat2))
    return meta_X
```

We can then call this function to prepare data for the meta-model.

```python
...
# construct meta dataset (decision tree column first, then kNN, matching stack_prediction())
meta_X = create_meta_dataset(data_x, cart_yhat, knn_yhat)
```
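As a quick check on the layout of the meta dataset, the sketch below repeats the same stacking steps on toy arrays. The six rows and the probability values are made up for illustration.

```python
# check the layout of a meta dataset built from toy values
from numpy import array
from numpy import hstack
from numpy import zeros

# six hypothetical examples with 100 input features each
data_x = zeros((6, 100))
# hypothetical out-of-fold probabilities from two base models
yhat1 = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]
yhat2 = [0.0, 1.0, 0.2, 0.8, 0.4, 0.6]
# convert each list of predictions to a single column
col1 = array(yhat1).reshape((len(yhat1), 1))
col2 = array(yhat2).reshape((len(yhat2), 1))
# append the two columns to the right of the input features
meta_X = hstack((data_x, col1, col2))
print(meta_X.shape)
```

The result has 102 columns: the 100 inputs followed by one column per base model. Whatever order is chosen for the prediction columns when fitting the meta-model must be used again when making predictions, otherwise the meta-model will read one model's output as the other's.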

We can then fit each of the submodels on the entire training dataset ready for making predictions on the validation dataset.

```python
...
# fit final submodels
model1 = DecisionTreeClassifier()
model1.fit(X, y)
model2 = KNeighborsClassifier()
model2.fit(X, y)
```

We can then fit the meta-model on the prepared dataset, in this case, a LogisticRegression model.

```python
...
# construct meta classifier
meta_model = LogisticRegression(solver='liblinear')
meta_model.fit(meta_X, data_y)
```

Finally, we can use the meta-model to make predictions on the holdout dataset.

This requires that the data first pass through the submodels, that their outputs be used to construct a dataset for the meta-model, and that the meta-model then be used to make a prediction. We will wrap all of this up into a function named *stack_prediction()* that takes the models and the data for which the prediction will be made.

```python
# make predictions with stacked model
def stack_prediction(model1, model2, meta_model, X):
    # make predictions
    yhat1 = model1.predict_proba(X)[:, 0]
    yhat2 = model2.predict_proba(X)[:, 0]
    # create input dataset
    meta_X = create_meta_dataset(X, yhat1, yhat2)
    # predict
    return meta_model.predict(meta_X)
```

We can then evaluate the submodels on the holdout dataset for reference, then use the meta-model to make a prediction on the holdout dataset and evaluate it.

We expect that the meta-model would achieve performance on the holdout dataset that is as good as or better than that of any single submodel. If this is not the case, alternate submodels or meta-models could be used on the problem instead.

```python
...
# evaluate sub models on hold out dataset
acc1 = accuracy_score(y_val, model1.predict(X_val))
acc2 = accuracy_score(y_val, model2.predict(X_val))
print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))
# evaluate meta model on hold out dataset
yhat = stack_prediction(model1, model2, meta_model, X_val)
acc = accuracy_score(y_val, yhat)
print('Meta Model Accuracy: %.3f' % (acc))
```

Tying this all together, the complete example is listed below.

```python
# example of a stacked model for binary classification
from numpy import hstack
from numpy import array
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# create a meta dataset
def create_meta_dataset(data_x, yhat1, yhat2):
    # convert to columns
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    # stack as separate columns
    meta_X = hstack((data_x, yhat1, yhat2))
    return meta_X

# make predictions with stacked model
def stack_prediction(model1, model2, meta_model, X):
    # make predictions
    yhat1 = model1.predict_proba(X)[:, 0]
    yhat2 = model2.predict_proba(X)[:, 0]
    # create input dataset
    meta_X = create_meta_dataset(X, yhat1, yhat2)
    # predict
    return meta_model.predict(meta_X)

# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20)
# split
X, X_val, y, y_val = train_test_split(X, y, test_size=0.33)
# collect out of sample predictions
data_x, data_y, knn_yhat, cart_yhat = list(), list(), list(), list()
kfold = KFold(n_splits=10, shuffle=True)
for train_ix, test_ix in kfold.split(X):
    # get data
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    data_x.extend(test_X)
    data_y.extend(test_y)
    # fit and make predictions with cart
    model1 = DecisionTreeClassifier()
    model1.fit(train_X, train_y)
    yhat1 = model1.predict_proba(test_X)[:, 0]
    cart_yhat.extend(yhat1)
    # fit and make predictions with knn
    model2 = KNeighborsClassifier()
    model2.fit(train_X, train_y)
    yhat2 = model2.predict_proba(test_X)[:, 0]
    knn_yhat.extend(yhat2)
# construct meta dataset (decision tree column first, then kNN, matching stack_prediction())
meta_X = create_meta_dataset(data_x, cart_yhat, knn_yhat)
# fit final submodels
model1 = DecisionTreeClassifier()
model1.fit(X, y)
model2 = KNeighborsClassifier()
model2.fit(X, y)
# construct meta classifier
meta_model = LogisticRegression(solver='liblinear')
meta_model.fit(meta_X, data_y)
# evaluate sub models on hold out dataset
acc1 = accuracy_score(y_val, model1.predict(X_val))
acc2 = accuracy_score(y_val, model2.predict(X_val))
print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))
# evaluate meta model on hold out dataset
yhat = stack_prediction(model1, model2, meta_model, X_val)
acc = accuracy_score(y_val, yhat)
print('Meta Model Accuracy: %.3f' % (acc))
```

Running the example first reports the accuracy of the decision tree and kNN model, then the performance of the meta-model on the holdout dataset, not seen during training.

Your specific results will vary given the stochastic nature of the data sample and learning algorithm. Try running the example a few times.

In this case, we can see that the meta-model has outperformed both submodels.

```
Model1 Accuracy: 0.670, Model2 Accuracy: 0.930
Meta Model Accuracy: 0.955
```

It might be interesting to try an ablation study: re-run the example with just model1, just model2, and neither model1 nor model2 as input to the meta-model, to confirm that the predictions from the submodels are actually adding value to the meta-model.
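One possible shape for such an ablation study is sketched below. It is an illustration under assumptions: out-of-fold probabilities are collected with `cross_val_predict()` for brevity, random seeds are fixed, and the names `oof`, `val`, and `results` are hypothetical helpers. Each variant of the meta-model sees the raw inputs plus a different subset of the submodel predictions.

```python
# ablation: which submodel predictions help the meta-model?
from numpy import column_stack
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_blobs(n_samples=1000, centers=2, n_features=100, cluster_std=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
models = {'cart': DecisionTreeClassifier(random_state=1), 'knn': KNeighborsClassifier()}
oof, val = dict(), dict()
for name, model in models.items():
    # out-of-fold probability of class 0 for every training example
    oof[name] = cross_val_predict(model, X_train, y_train, cv=kfold, method='predict_proba')[:, 0]
    # refit on all training data, then predict the holdout set
    model.fit(X_train, y_train)
    val[name] = model.predict_proba(X_val)[:, 0]
results = dict()
for subset in [(), ('cart',), ('knn',), ('cart', 'knn')]:
    # raw inputs plus the chosen prediction columns, in the same order for train and holdout
    meta_X = column_stack([X_train] + [oof[n] for n in subset])
    meta_X_val = column_stack([X_val] + [val[n] for n in subset])
    meta_model = LogisticRegression(solver='liblinear')
    meta_model.fit(meta_X, y_train)
    results[subset] = accuracy_score(y_val, meta_model.predict(meta_X_val))
    print(subset if subset else 'inputs only', '%.3f' % results[subset])
```

If the variants with submodel predictions do not beat the inputs-only variant over repeated runs, the submodels are probably not contributing much to the ensemble.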

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to k-fold Cross-Validation
- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Implement Stacked Generalization (Stacking) From Scratch With Python
- How to Create a Bagging Ensemble of Deep Learning Models in Keras
- Ensemble Learning Methods for Deep Learning Neural Networks

- Applied Predictive Modeling, 2013.

- sklearn.datasets.make_blobs API.
- sklearn.model_selection.KFold API.
- sklearn.neighbors.KNeighborsClassifier API.
- sklearn.tree.DecisionTreeClassifier API.
- sklearn.metrics.accuracy_score API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.model_selection.train_test_split API.

In this tutorial, you discovered out-of-fold predictions in machine learning.

Specifically, you learned:

- Out-of-fold predictions are a type of out-of-sample predictions made on data not used to train a model.
- Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
- Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.



The post A Gentle Introduction to the Bayes Optimal Classifier appeared first on Machine Learning Mastery.

The Bayes Optimal Classifier is a probabilistic model that makes the most probable prediction for a new example.

It is described using Bayes Theorem, which provides a principled way of calculating a conditional probability. It is also closely related to Maximum a Posteriori (MAP), a probabilistic framework that finds the most probable hypothesis for a training dataset.

In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead, simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.

In this post, you will discover the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

After reading this post, you will know:

- Bayes Theorem provides a principled way of calculating a conditional probability, called the posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

Let’s get started.

This tutorial is divided into three parts; they are:

- Bayes Theorem
- Maximum a Posteriori (MAP)
- Bayes Optimal Classifier

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant of *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of B given A multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

Or, dropping the normalizing constant and writing the result loosely as an equality:

- P(A | B) = P(B | A) * P(A)

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.
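A small numeric sketch shows why the proportional quantity is good enough when optimizing. The two candidate outcomes and all probability values below are made up for illustration; the point is that dividing by *P(B)* rescales every candidate equally and so cannot change which one is largest.

```python
# the unnormalized and normalized posteriors pick the same winner
# hypothetical priors P(A) for two candidate outcomes
priors = {'A1': 0.3, 'A2': 0.7}
# hypothetical likelihoods P(B | A) of the observed evidence B
likelihoods = {'A1': 0.8, 'A2': 0.2}
# proportional quantity: P(B | A) * P(A)
unnormalized = {a: likelihoods[a] * priors[a] for a in priors}
# normalizing constant P(B) is the sum over all candidates
p_b = sum(unnormalized.values())
# full posterior P(A | B)
posterior = {a: unnormalized[a] / p_b for a in priors}
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)
```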

For more on the topic of Bayes Theorem, see the post:

- A Gentle Introduction to Bayes Theorem for Machine Learning

Now that we are up to speed on Bayes Theorem, let’s also take a look at the Maximum a Posteriori framework.

Machine learning involves finding a model (hypothesis) that best explains the training data.

There are two probabilistic frameworks that underlie many different machine learning algorithms.

They are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

The objective of both of these frameworks in the context of machine learning is to locate the hypothesis that is most probable given the training dataset.

Specifically, they answer the question:

What is the most probable hypothesis given the training data?

Both approaches frame the problem of fitting a model as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

MLE is a frequentist approach, and MAP provides a Bayesian alternative.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.

Given the simplification of Bayes Theorem to a proportional quantity, we can use it to score candidate hypotheses and their parameters (*theta*) by how well they explain our dataset (*X*), stated as:

- P(theta | X) is proportional to P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem that locates the most probable value of the posterior probability (e.g. the mode of the distribution).

As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

- maximize P(X | theta) * P(theta)
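As a minimal worked example of this maximization, consider estimating the bias of a coin. The data (7 heads in 10 tosses), the Beta(2, 2) prior, and the grid search are all assumptions chosen for illustration; the sketch compares the maximum likelihood estimate with the MAP estimate.

```python
# grid search for the MLE and MAP estimates of a coin's bias
from numpy import argmax
from numpy import linspace

# hypothetical data: 7 heads observed in 10 tosses
heads, tosses = 7, 10
# hypothetical Beta(alpha, beta) prior that mildly favors a fair coin
alpha, beta = 2, 2
# candidate values of theta, the probability of heads
theta = linspace(0.001, 0.999, 999)
# P(X | theta), up to a constant that does not depend on theta
likelihood = theta**heads * (1 - theta)**(tosses - heads)
# P(theta), up to a constant that does not depend on theta
prior = theta**(alpha - 1) * (1 - theta)**(beta - 1)
# maximum likelihood ignores the prior
theta_mle = theta[argmax(likelihood)]
# maximum a posteriori maximizes P(X | theta) * P(theta)
theta_map = theta[argmax(likelihood * prior)]
print('MLE: %.3f, MAP: %.3f' % (theta_mle, theta_map))
```

The prior pulls the MAP estimate toward 0.5 relative to the MLE. For this particular prior family the answer is also available in closed form, (heads + alpha - 1) / (tosses + alpha + beta - 2), which the grid search approximates.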

For more on the topic of Maximum a Posteriori, see the post:

- A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning

Now that we are familiar with the MAP framework, we can take a closer look at the related concept of the Bayes optimal classifier.

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example, given the training dataset.

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

**Bayes Classifier**: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question:

What is the most probable classification of the new instance given the training data?

This is different from the MAP framework that seeks the most probable hypothesis (model). Instead, we are interested in making a specific prediction.

In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

— Page 175, Machine Learning, 1997.

The equation below demonstrates how to calculate the conditional probability of a candidate classification (*vj*) for a new instance, given the training data (*D*) and a space of hypotheses (*H*).

- P(vj | D) = sum {hi in H} P(vj | hi) * P(hi | D)

Where *vj* is a candidate classification for the new instance, drawn from the set of possible classifications *V*, *H* is the set of hypotheses for classifying the instance, *hi* is a given hypothesis, *P(vj | hi)* is the probability of classification *vj* given hypothesis *hi*, and *P(hi | D)* is the posterior probability of the hypothesis *hi* given the data *D*.

Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.

- argmax {vj in V} sum {hi in H} P(vj | hi) * P(hi | D)
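A tiny example makes the difference from MAP concrete. The sketch below assumes three hypotheses with made-up posterior probabilities of 0.4, 0.3, and 0.3, in the spirit of the illustration in Machine Learning, 1997, and assumes each hypothesis makes a deterministic prediction.

```python
# MAP prediction versus Bayes optimal classification on a toy problem
# hypothetical posterior probabilities P(hi | D)
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
# hypothetical prediction of each hypothesis for the new instance
predictions = {'h1': '+', 'h2': '-', 'h3': '-'}
classes = ['+', '-']
# MAP: take the single most probable hypothesis and use its prediction
h_map = max(posteriors, key=posteriors.get)
map_prediction = predictions[h_map]
# Bayes optimal: sum P(vj | hi) * P(hi | D) over all hypotheses,
# where P(vj | hi) is 1 if hi predicts vj and 0 otherwise
class_prob = {v: sum(p for h, p in posteriors.items() if predictions[h] == v) for v in classes}
bayes_optimal = max(class_prob, key=class_prob.get)
print('MAP hypothesis predicts:', map_prediction)
print('Bayes optimal classification:', bayes_optimal, class_prob)
```

The single most probable hypothesis predicts the positive class, but the hypotheses that predict the negative class carry more posterior probability in total, so the Bayes optimal classification is negative.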

Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique, on average.

Any system that classifies new instances according to [the equation] is called a Bayes optimal classifier, or Bayes optimal learner. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

— Page 175, Machine Learning, 1997.

We have to let that sink in.

It is a big deal.

It means that any other algorithm that operates on the same data, the same set of hypotheses, and same prior probabilities cannot outperform this approach, on average. Hence the name “*optimal classifier*.”

Although the classifier makes optimal predictions, it is not perfect given the uncertainty in the training data and incomplete coverage of the problem domain and hypothesis space. As such, the model will make errors. These errors are often referred to as Bayes errors.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. […] The Bayes error rate is analogous to the irreducible error …

— Page 38, An Introduction to Statistical Learning with Applications in R, 2017.

Because the Bayes classifier is optimal, the Bayes error is the minimum possible error that can be made.

**Bayes Error**: The minimum possible error that can be made when making predictions.

Further, the model is often described in terms of classification, e.g. the Bayes Classifier. Nevertheless, the principle applies just as well to regression: that is, predictive modeling problems where a numerical value is predicted instead of a class label.

It is a theoretical model, but it is held up as an ideal that we may wish to pursue.

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

Because of the computational cost of this optimal strategy, we instead can work with direct simplifications of the approach.

Two of the most commonly used simplifications are to sample hypotheses with an algorithm such as Gibbs sampling, or to adopt the simplifying assumptions of the Naive Bayes classifier.

- **Gibbs Algorithm**. Randomly sample hypotheses in proportion to their posterior probability.
- **Naive Bayes**. Assume that variables in the input data are conditionally independent.
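The difference between the optimal strategy and the Gibbs simplification can be sketched in a few lines. The three hypotheses, their posterior probabilities, and their predictions below are made-up values used only for illustration.

```python
# Bayes optimal prediction versus the Gibbs algorithm on a toy hypothesis space.
# The posteriors and per-hypothesis predictions are hypothetical values.
import numpy as np

rng = np.random.default_rng(1)

# posterior probability of each hypothesis given the training data
posterior = np.array([0.4, 0.3, 0.3])
# class label that each hypothesis predicts for one new instance
predictions = np.array([1, 0, 0])

# Bayes optimal: weight each hypothesis' vote by its posterior probability
p_class0 = posterior[predictions == 0].sum()
p_class1 = posterior[predictions == 1].sum()
bayes_optimal = int(p_class1 > p_class0)

# Gibbs: draw a single hypothesis according to the posterior and use its prediction
# (repeated many times here only to show how often each class is predicted)
sampled = rng.choice(len(posterior), size=10000, p=posterior)
gibbs_predictions = predictions[sampled]

print('Bayes optimal prediction: %d' % bayes_optimal)
print('Fraction of Gibbs predictions for class 1: %.3f' % gibbs_predictions.mean())
```

The most probable single hypothesis predicts class 1, yet the Bayes optimal prediction is class 0 because the other two hypotheses together carry more posterior weight. Gibbs avoids evaluating every hypothesis, at the cost of sometimes predicting the less probable class.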

For more on the topic of Naive Bayes, see the post:

Nevertheless, many nonlinear machine learning algorithms are able to make predictions that are close approximations of the Bayes classifier in practice.

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning
- A Gentle Introduction to Bayes Theorem for Machine Learning
- How to Develop a Naive Bayes Classifier from Scratch in Python

- Section 6.7 Bayes Optimal Classifier, Machine Learning, 1997.
- Section 2.4.2 Bayes error and noise, Foundations of Machine Learning, 2nd edition, 2018.
- Section 2.2.3 The Classification Setting, An Introduction to Statistical Learning with Applications in R, 2017.
- Information Theory, Inference and Learning Algorithms, 2003.

- The Multilayer Perceptron As An Approximation To A Bayes Optimal Discriminant Function, 1990.
- Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains, 2010.
- Restricted bayes optimal classifiers, 2000.
- Bayes Classifier And Bayes Error, 2013.

In this post, you discovered the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

Specifically, you learned:

- Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic framework that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Bayes Optimal Classifier appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Model Selection for Machine Learning appeared first on Machine Learning Mastery.

]]>Given easy-to-use machine learning libraries like scikit-learn and Keras, it is straightforward to fit many different machine learning models on a given predictive modeling dataset.

The challenge of applied machine learning, therefore, becomes how to choose among a range of different models that you can use for your problem.

Naively, you might believe that model performance is sufficient, but should you consider other concerns, such as how long the model takes to train or how easy it is to explain to project stakeholders? These concerns become more pressing if a chosen model must be used operationally for months or years.

Also, what are you choosing exactly: just the algorithm used to fit the model or the entire data preparation and model fitting pipeline?

In this post, you will discover the challenge of model selection for machine learning.

After reading this post, you will know:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Model Selection
- Considerations for Model Selection
- Model Selection Techniques

Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset.

Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM).

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one?

— Page 22, Machine Learning: A Probabilistic Perspective, 2012.

For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We cannot know beforehand which model will perform best on this problem. Therefore, we fit and evaluate a suite of different models on the problem.

**Model selection** is the process of choosing one of the models as the final model that addresses the problem.

Model selection is different from **model assessment**.

For example, we evaluate or assess candidate models in order to choose the best one, and this is model selection. Whereas once a model is chosen, it can be evaluated in order to communicate how well it is expected to perform in general; this is model assessment.

The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.

Fitting models is relatively straightforward, although selecting among them is the true challenge of applied machine learning.

Firstly, we need to get over the idea of a “*best*” model.

All models have some predictive error, given the statistical noise in the data, the incompleteness of the data sample, and the limitations of each different model type. Therefore, the notion of a perfect or best model is not useful. Instead, we must seek a model that is “*good enough*.”

**What do we care about when choosing a final model?**

The project stakeholders may have specific requirements, such as maintainability and limited model complexity. As such, a model that has lower skill but is simpler and easier to understand may be preferred.

Alternately, if model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved.

Therefore, a “*good enough*” model may refer to many things and is specific to your project, such as:

- A model that meets the requirements and constraints of project stakeholders.
- A model that is sufficiently skillful given the time and resources available.
- A model that is skillful as compared to naive models.
- A model that is skillful relative to other tested models.
- A model that is skillful relative to the state-of-the-art.

Next, we must consider what is being selected.

For example, we are not selecting a fit model, as all models will be discarded. This is because once we choose a model, we will fit a new final model on all available data and start using it to make predictions.

Therefore, are we choosing among algorithms used to fit the models on the training dataset?

Some algorithms require specialized data preparation in order to best expose the structure of the problem to the learning algorithm. Therefore, we must go one step further and consider **model selection as the process of selecting among model development pipelines**.

Each pipeline may take in the same raw training dataset and output a model that can be evaluated in the same manner, but may require different or overlapping computational steps, such as:

- Data filtering.
- Data transformation.
- Feature selection.
- Feature engineering.
- And more…

The closer you look at the challenge of model selection, the more nuance you will discover.

Now that we are familiar with some considerations involved in model selection, let’s review some common methods for selecting a model.

The best approach to model selection requires “*sufficient*” data, which may be nearly infinite depending on the complexity of the problem.

In this ideal situation, we would split the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and select them on the validation set, and report the performance of the final model on the test set.

If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
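The ideal data-rich procedure can be sketched with scikit-learn as follows. The synthetic dataset, the 60/20/20 split, and the two candidate models are assumptions made for illustration, not a recommendation.

```python
# train/validation/test model selection on a synthetic dataset
# the split sizes and candidate models are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# generate dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
# split into 60% train and 40% held back
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
# split the held back data evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

# fit each candidate on the training set and score it on the validation set
candidates = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(random_state=1),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))
    print('%s validation accuracy: %.3f' % (name, val_scores[name]))

# select on validation accuracy, then report the chosen model once on the test set
best_name = max(val_scores, key=val_scores.get)
test_score = accuracy_score(y_test, candidates[best_name].predict(X_test))
print('Selected %s, test accuracy: %.3f' % (best_name, test_score))
```

The test set is touched only once, after the choice has been made, so the reported score is not biased by the selection process.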

This is impractical on most predictive modeling problems given that we rarely have sufficient data, or are able to even judge what would be sufficient.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance.

– Page 32, Pattern Recognition and Machine Learning, 2006.

Instead, there are two main classes of techniques to approximate the ideal case of model selection; they are:

- **Probabilistic Measures**: Choose a model via in-sample error and complexity.
- **Resampling Methods**: Choose a model via estimated out-of-sample error.

Let’s take a closer look at each in turn.

Probabilistic measures involve analytically scoring a candidate model using both its performance on the training dataset and the complexity of the model.

It is known that training error is optimistically biased, and therefore is not a good basis for choosing a model. The performance can be penalized based on how optimistic the training error is believed to be. This is typically achieved using algorithm-specific methods, often linear, that penalize the score based on the complexity of the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

– Page 33, Pattern Recognition and Machine Learning, 2006.

A model with fewer parameters is less complex and, because of this, is preferred as it is likely to generalize better on average.

Four commonly used probabilistic model selection measures include:

- Akaike Information Criterion (AIC).
- Bayesian Information Criterion (BIC).
- Minimum Description Length (MDL).
- Structural Risk Minimization (SRM).

Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. the in-sample bias) is known and tractable.
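As a rough sketch of how such a penalty works, the example below scores polynomial regression models of increasing degree with AIC and BIC. It assumes the common least-squares forms AIC = n * ln(MSE) + 2k and BIC = n * ln(MSE) + k * ln(n), which drop constant terms and assume Gaussian errors; the data and the helper function aic_bic() are made up for illustration.

```python
# score polynomial models of increasing complexity with AIC and BIC
# assumes the least-squares forms of each criterion (constant terms dropped)
import numpy as np

rng = np.random.default_rng(1)

# synthetic data generated from a quadratic function plus Gaussian noise
x = rng.uniform(-3, 3, 100)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(0, 1.0, 100)

def aic_bic(x, y, degree):
    # fit the polynomial and calculate the in-sample mean squared error
    coefs = np.polyfit(x, y, degree)
    mse = np.mean((y - np.polyval(coefs, x)) ** 2)
    n = len(x)
    k = degree + 1  # number of fitted coefficients
    aic = n * np.log(mse) + 2 * k
    bic = n * np.log(mse) + k * np.log(n)
    return aic, bic

# lower scores are better for both criteria
scores = {degree: aic_bic(x, y, degree) for degree in range(1, 6)}
for degree, (aic, bic) in scores.items():
    print('degree=%d AIC=%.2f BIC=%.2f' % (degree, aic, bic))
```

Training error alone can only go down as the degree grows. The penalty term is what allows a simpler model to win, and BIC penalizes each extra parameter more heavily than AIC once the sample has more than about seven observations.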

Resampling methods seek to estimate the performance of a model (or more precisely, the model development process) on out-of-sample data.

This is achieved by splitting the training dataset into sub train and test sets, fitting a model on the sub train set, and evaluating it on the test set. This process may then be repeated multiple times and the mean performance across each trial is reported.

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.

Three common resampling model selection methods include:

- Random train/test splits.
- Cross-Validation (k-fold, LOOCV, etc.).
- Bootstrap.

Most of the time, probabilistic measures (described in the previous section) are not available, so resampling methods are used.

By far the most popular is the cross-validation family of methods that includes many subtypes.

Probably the simplest and most widely used method for estimating prediction error is cross-validation.

— Page 241, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.

An example is the widely used k-fold cross-validation that splits the training dataset into k folds where each example appears in a test set only once.

Another is leave-one-out cross-validation (LOOCV), where the test set is comprised of a single sample and each sample is given an opportunity to be the test set, requiring N models to be constructed and evaluated, where N is the number of samples in the training set.
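Both procedures are available in scikit-learn. The sketch below evaluates the same model with 10-fold cross-validation and with LOOCV; the synthetic dataset and the choice of logistic regression are assumptions for illustration.

```python
# compare k-fold cross-validation and LOOCV for the same model
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# generate dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=1)
model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation: 10 models, each tested on 10 held-out examples
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
kfold_scores = cross_val_score(model, X, y, scoring='accuracy', cv=kfold)
print('k-fold: %d models, mean accuracy %.3f' % (len(kfold_scores), kfold_scores.mean()))

# LOOCV: one model per example, each tested on a single held-out example
loocv_scores = cross_val_score(model, X, y, scoring='accuracy', cv=LeaveOneOut())
print('LOOCV: %d models, mean accuracy %.3f' % (len(loocv_scores), loocv_scores.mean()))
```

LOOCV fits as many models as there are examples, so it can become expensive on larger datasets or slower models.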

This section provides more resources on the topic if you are looking to go deeper.

- Probabilistic Model Selection with AIC, BIC, and MDL
- A Gentle Introduction to Statistical Sampling and Resampling
- A Gentle Introduction to Monte Carlo Sampling for Probability
- A Gentle Introduction to k-fold Cross-Validation
- What is the Difference Between Test and Validation Datasets?

- Applied Predictive Modeling, 2013.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
- An Introduction to Statistical Learning: with Applications in R, 2017.
- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.

In this post, you discovered the challenge of model selection for machine learning.

Specifically, you learned:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Model Selection for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Use an Empirical Distribution Function in Python appeared first on Machine Learning Mastery.

]]>An empirical distribution function provides a way to model and sample cumulative probabilities for a data sample that does not fit a standard probability distribution.

As such, it is sometimes called the **empirical cumulative distribution function**, or ECDF for short.

In this tutorial, you will discover the empirical probability distribution function.

After completing this tutorial, you will know:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

Let’s get started.

This tutorial is divided into three parts; they are:

- Empirical Distribution Function
- Bimodal Data Distribution
- Sampling Empirical Distribution

Typically, the distribution of observations for a data sample fits a well-known probability distribution.

For example, the heights of humans will fit the normal (Gaussian) probability distribution.

This is not always the case. Sometimes the observations in a collected data sample do not fit any known probability distribution and cannot be easily forced into an existing distribution by data transforms or parameterization of the distribution function.

Instead, an empirical probability distribution must be used.

There are two main types of probability distribution functions we may need to sample; they are:

- Probability Density Function (PDF).
- Cumulative Distribution Function (CDF).

The PDF returns the probability density (the relative likelihood) of observing a value. For discrete data, the PDF is referred to as a Probability Mass Function (PMF). The CDF returns the probability of observing a value less than or equal to a given value.

An empirical probability density function can be fit and used for a data sample using a nonparametric density estimation method, such as Kernel Density Estimation (KDE).

An empirical cumulative distribution function is called the Empirical Distribution Function, or EDF for short. It is also referred to as the Empirical Cumulative Distribution Function, or ECDF.

The EDF is calculated by ordering all of the unique observations in the data sample and calculating the cumulative probability for each as the number of observations less than or equal to a given observation divided by the total number of observations.

As follows:

- EDF(x) = (number of observations <= x) / n

Like other cumulative distribution functions, the sum of probabilities will proceed from 0.0 to 1.0 as the observations in the domain are enumerated from smallest to largest.
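The calculation can be sketched directly with NumPy before reaching for a library. The small data sample and the helper function edf() are hypothetical and used only to show the arithmetic.

```python
# calculate an empirical distribution function by hand
# the data sample and the edf() helper are illustrative
import numpy as np

sample = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

def edf(sample, x):
    # number of observations less than or equal to x, divided by the sample size
    return np.sum(sample <= x) / len(sample)

# cumulative probability for a few values
for value in [0.0, 1.0, 4.0, 9.0]:
    print('EDF(%.1f) = %.3f' % (value, edf(sample, value)))

# cumulative probability for each unique observation, from smallest to largest
unique, counts = np.unique(sample, return_counts=True)
cumulative = np.cumsum(counts) / len(sample)
print(unique)
print(cumulative)
```

The cumulative probabilities step up at each unique observation and reach 1.0 at the largest value in the sample.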

To make the empirical distribution function concrete, let’s look at an example with a dataset that clearly does not fit a known probability distribution.

We can define a dataset that clearly does not match a standard probability distribution function.

A common example is when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution).

We can construct a bimodal distribution by combining samples from two different normal distributions. Specifically, 300 examples with a mean of 20 and a standard deviation of five (the smaller peak), and 700 examples with a mean of 40 and a standard deviation of five (the larger peak).

The means were chosen close together to ensure the distributions overlap in the combined sample.

The complete example of creating this sample with a bimodal probability distribution and plotting the histogram is listed below.

    # example of a bimodal data sample
    from matplotlib import pyplot
    from numpy.random import normal
    from numpy import hstack
    # generate a sample
    sample1 = normal(loc=20, scale=5, size=300)
    sample2 = normal(loc=40, scale=5, size=700)
    sample = hstack((sample1, sample2))
    # plot the histogram
    pyplot.hist(sample, bins=50)
    pyplot.show()

Running the example creates the data sample and plots the histogram.

Note that your results will differ given the random nature of the data sample. Try running the example a few times.

We have fewer samples with a mean of 20 than samples with a mean of 40, which we can see reflected in the histogram with a larger density of samples around 40 than around 20.

Data with this distribution does not nicely fit into a common probability distribution by design.

Below is a plot of the probability density function (PDF) of this data sample.

It is a good case for using an empirical distribution function.

An empirical distribution function can be fit for a data sample in Python.

The statsmodels Python library provides the ECDF class for fitting an empirical cumulative distribution function and calculating the cumulative probabilities for specific observations from the domain.

The distribution is fit by calling ECDF() and passing in the raw data sample.

    ...
    # fit a cdf
    ecdf = ECDF(sample)

Once fit, the function can be called to calculate the cumulative probability for a given observation.

    ...
    # get cumulative probability for values
    print('P(x<20): %.3f' % ecdf(20))
    print('P(x<40): %.3f' % ecdf(40))
    print('P(x<60): %.3f' % ecdf(60))

The class also provides an ordered list of unique observations in the data (the *.x* attribute) and their associated probabilities (*.y* attribute). We can access these attributes and plot the CDF function directly.

    ...
    # plot the cdf
    pyplot.plot(ecdf.x, ecdf.y)
    pyplot.show()

Tying this together, the complete example of fitting an empirical distribution function for the bimodal data sample is below.

    # fit an empirical cdf to a bimodal dataset
    from matplotlib import pyplot
    from numpy.random import normal
    from numpy import hstack
    from statsmodels.distributions.empirical_distribution import ECDF
    # generate a sample
    sample1 = normal(loc=20, scale=5, size=300)
    sample2 = normal(loc=40, scale=5, size=700)
    sample = hstack((sample1, sample2))
    # fit a cdf
    ecdf = ECDF(sample)
    # get cumulative probability for values
    print('P(x<20): %.3f' % ecdf(20))
    print('P(x<40): %.3f' % ecdf(40))
    print('P(x<60): %.3f' % ecdf(60))
    # plot the cdf
    pyplot.plot(ecdf.x, ecdf.y)
    pyplot.show()

Running the example fits the empirical CDF to the data sample, then prints the cumulative probability for observing three values.

Your specific results will vary given the stochastic nature of the data sample. Try running the example a few times.

    P(x<20): 0.149
    P(x<40): 0.654
    P(x<60): 1.000

Then the cumulative probability for the entire domain is calculated and shown as a line plot.

We can see the familiar S-shaped curve seen for most cumulative distribution functions, in this case with bumps around the mean of each peak of the bimodal distribution.

This section provides more resources on the topic if you are looking to go deeper.

- Section 2.3.4 The empirical distribution, Machine Learning: A Probabilistic Perspective, 2012.
- Section 3.9.5 The Dirac Distribution and Empirical Distribution, Deep Learning, 2016.

- Empirical distribution function, Wikipedia.
- Cumulative distribution function, Wikipedia.
- Probability Density Function, Wikipedia.
- Kernel density estimation, Wikipedia.

In this tutorial, you discovered the empirical probability distribution function.

Specifically, you learned:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Use an Empirical Distribution Function in Python appeared first on Machine Learning Mastery.

]]>The post How to Choose a Feature Selection Method For Machine Learning appeared first on Machine Learning Mastery.

]]>Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Filter-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

In this post, you will discover how to choose statistical measures for filter-based feature selection with numerical and categorical data.

After reading this post, you will know:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Let’s get started.

**Update Nov/2019**: Added some worked examples for classification and regression.

This tutorial is divided into 4 parts; they are:

- Feature Selection Methods
- Statistics for Filter Feature Selection Methods
- Tips and Tricks for Feature Selection
- Worked Examples

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

Some predictive modeling problems have a large number of variables that can slow the development and training of models and require a large amount of system memory. Additionally, the performance of some models can degrade when including input variables that are not relevant to the target variable.

There are two main types of feature selection algorithms: wrapper methods and filter methods.

- Wrapper Feature Selection Methods.
- Filter Feature Selection Methods.

**Wrapper feature selection methods** create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

— Page 490, Applied Predictive Modeling, 2013.

**Filter feature selection methods** use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.

— Page 490, Applied Predictive Modeling, 2013.

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection. As such, the choice of statistical measures is highly dependent upon the variable data types.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

- **Numerical Variables**
  - Integer Variables.
  - Floating Point Variables.
- **Categorical Variables**
  - Boolean Variables (dichotomous).
  - Ordinal Variables.
  - Nominal Variables.

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method.

In the next section, we will review some of the statistical measures that may be used for filter-based feature selection with different input and output variable data types.

In this section, we will consider two broad categories of variable types, numerical and categorical, and two main groups of variables, input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

- **Numerical Output**: Regression predictive modeling problem.
- **Categorical Output**: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

— Page 499, Applied Predictive Modeling, 2013.

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

This is a regression predictive modeling problem with numerical input variables.

The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear).
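The difference between the two coefficients can be seen on synthetic data. The sketch below uses pearsonr() and spearmanr() from SciPy on one linear and one nonlinear but monotonic relationship; the data-generating functions are assumptions chosen to make the contrast visible.

```python
# compare Pearson's and Spearman's coefficients on linear and nonlinear relationships
import numpy as np
from scipy.stats import pearsonr
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# one input variable and two different numerical targets
x = rng.uniform(0, 5, 200)
y_linear = 2.0 * x + rng.normal(0, 0.5, 200)
y_monotonic = np.exp(x) + rng.normal(0, 0.5, 200)

# Pearson's captures the linear relationship well
pearson_linear, _ = pearsonr(x, y_linear)
print('Pearson (linear target): %.3f' % pearson_linear)

# on the nonlinear target, the rank-based measure scores the relationship higher
pearson_monotonic, _ = pearsonr(x, y_monotonic)
spearman_monotonic, _ = spearmanr(x, y_monotonic)
print('Pearson (nonlinear target): %.3f' % pearson_monotonic)
print('Spearman (nonlinear target): %.3f' % spearman_monotonic)
```

Spearman's coefficient works on the ranks of the values, so it rewards any consistently increasing or decreasing relationship, not only a straight line.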

This is a classification predictive modeling problem with numerical input variables.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation based, although in this case, they must take the categorical target into account.

- ANOVA correlation coefficient (linear).
- Kendall’s rank coefficient (nonlinear).

Kendall does assume that the categorical variable is ordinal.

This is a regression predictive modeling problem with categorical input variables.

This is a strange example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “*Numerical Input, Categorical Output*” methods (described above), but in reverse.

This is a classification predictive modeling problem with categorical input variables.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

- Chi-Squared test (contingency tables).
- Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, i.e. it is agnostic to the data types.
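Both measures are available in scikit-learn. The sketch below scores one informative and one uninformative categorical input against a binary target. The synthetic data is an assumption for illustration, and the categories are stored as non-negative integer codes because the chi2() function requires non-negative input values.

```python
# score categorical inputs against a categorical target with chi-squared and mutual information
# the synthetic data is illustrative; categories are stored as non-negative integer codes
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
n = 500

# binary class label
y = rng.integers(0, 2, n)
# informative input: matches the class label about 80 percent of the time
informative = np.where(rng.random(n) < 0.8, y, 1 - y)
# uninformative input: three categories assigned at random
noise = rng.integers(0, 3, n)
X = np.column_stack((informative, noise))

# larger scores suggest a stronger dependence on the target
chi2_scores, p_values = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=1)

print('Chi-squared scores:', chi2_scores)
print('Mutual information scores:', mi_scores)
```

Both measures should rank the first input well above the second, which is the behavior a filter method relies on.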

This section provides some additional considerations when using filter-based feature selection.

The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:

- Pearson’s Correlation Coefficient: f_regression()
- ANOVA: f_classif()
- Chi-Squared: chi2()
- Mutual Information: mutual_info_classif() and mutual_info_regression()

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

- Select the top k variables: SelectKBest
- Select the top percentile variables: SelectPercentile

I often use *SelectKBest* myself.

Consider transforming the variables in order to access different statistical methods.

For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.

You can also make a numerical variable discrete (e.g. by binning it) and try categorical-based measures.
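As a sketch of this idea, the example below bins numerical inputs with scikit-learn's KBinsDiscretizer and then applies the chi-squared measure to the binned values. The number of bins, the binning strategy, and the number of selected features are arbitrary choices for illustration.

```python
# discretize numerical inputs, then apply a categorical-based measure
# the number of bins and the value of k are arbitrary illustrative choices
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# generate dataset with numerical inputs
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=1)

# map each numerical variable onto five ordinal bins (coded 0 to 4)
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)

# select features using the chi-squared measure on the binned values
fs = SelectKBest(score_func=chi2, k=3)
X_selected = fs.fit_transform(X_binned, y)
print(X_selected.shape)
```

It is worth comparing the chosen features with those selected by a numerical measure such as f_classif() on the raw inputs, as the two may disagree.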

Some statistical measures assume properties of the variables, such as Pearson’s, which assumes a Gaussian probability distribution for the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of the expectations, and compare the results.

There is no best feature selection method.

Just like there is no best set of input variables or best machine learning algorithm. At least not universally.

Instead, you must discover what works best for your specific problem using careful systematic experimentation.

Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.
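One way to organize this kind of experiment is a grid search over a pipeline that contains both the feature selection step and the model. The sketch below is one possible setup, not a prescription; the dataset, the model, and the grid of values are assumptions for illustration.

```python
# grid search over the statistical measure and the number of selected features
# the dataset, model, and grid values are illustrative assumptions
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# generate dataset
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=1)

# feature selection is part of the pipeline so it is refit within each fold
pipeline = Pipeline([
    ('select', SelectKBest()),
    ('model', LogisticRegression(max_iter=1000)),
])

# try each statistical measure with several subset sizes
grid = {
    'select__score_func': [f_classif, mutual_info_classif],
    'select__k': [2, 5, 10, 20],
}
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=5)
search.fit(X, y)

print('Best k: %d' % search.best_params_['select__k'])
print('Best measure: %s' % search.best_params_['select__score_func'].__name__)
print('Best mean accuracy: %.3f' % search.best_score_)
```

Placing the selection step inside the pipeline keeps the held-out fold from influencing which features are chosen, which would otherwise make the scores optimistic.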

It can be helpful to have some worked examples that you can copy-and-paste and adapt for your own project.

This section provides worked examples of feature selection cases that you can use as a starting point.

This section demonstrates feature selection for a regression problem that has numerical inputs and numerical outputs.

A test regression problem is prepared using the make_regression() function.

Feature selection is performed using Pearson’s Correlation Coefficient via the f_regression() function.

    # pearson's correlation feature selection for numeric input and numeric output
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    # generate dataset
    X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
    # define feature selection
    fs = SelectKBest(score_func=f_regression, k=10)
    # apply feature selection
    X_selected = fs.fit_transform(X, y)
    print(X_selected.shape)

Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 10)

This section demonstrates feature selection for a classification problem that has numerical inputs and categorical outputs.

A test classification problem is prepared using the make_classification() function.

Feature selection is performed using ANOVA F measure via the f_classif() function.

    # ANOVA feature selection for numeric input and categorical output
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_classif
    # generate dataset
    X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
    # define feature selection
    fs = SelectKBest(score_func=f_classif, k=2)
    # apply feature selection
    X_selected = fs.fit_transform(X, y)
    print(X_selected.shape)

Running the example first creates the classification dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 2)

For examples of feature selection with categorical inputs and categorical outputs, see the tutorial:

This section provides more resources on the topic if you are looking to go deeper.

- How to Calculate Nonparametric Rank Correlation in Python
- How to Calculate Correlation Between Variables in Python
- Feature Selection For Machine Learning in Python
- An Introduction to Feature Selection

- Feature selection, scikit-learn API.
- What are the feature selection options for categorical data? Quora.

In this post, you discovered how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Specifically, you learned:

- There are two main types of feature selection techniques: wrapper and filter methods.
- Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.
- Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Choose a Feature Selection Method For Machine Learning appeared first on Machine Learning Mastery.

The post How to Perform Feature Selection with Categorical Data appeared first on Machine Learning Mastery.

Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

Feature selection is often straightforward when working with real-valued data, such as using the Pearson’s correlation coefficient, but can be challenging when working with categorical data.

The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the chi-squared statistic and the mutual information statistic.

In this tutorial, you will discover how to perform feature selection with categorical input data.

After completing this tutorial, you will know:

- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Breast Cancer Categorical Dataset
- Categorical Feature Selection
- Modeling With Selected Features

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied as a machine learning dataset since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A naive model can achieve an accuracy of 70% on this dataset. A good score is about 76% +/- 3%. We will aim for this region, but note that the models in this tutorial are not optimized; they are designed to demonstrate feature selection methods.

You can download the dataset and save the file as “*breast-cancer.csv*” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

    '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
    '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
    '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
    '40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
    '40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
    ...

We can load this dataset into memory using the Pandas library.

    ...
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values

Once loaded, we can split the columns into input (*X*) and output (*y*) variables for modeling.

    ...
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]

Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).

    ...
    # format all fields as string
    X = X.astype(str)
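The effect of this conversion can be seen on a tiny in-memory file. The three-row CSV text below is made up for illustration; its second column looks numeric, so Pandas parses it as integers until the array is cast to strings.

```python
# show pandas type inference and the cast back to strings (illustrative sketch)
from io import StringIO
from pandas import read_csv

# hypothetical two-column file: a label and a number-like category
csv_text = "a,3\nb,1\nc,2\n"
data = read_csv(StringIO(csv_text), header=None)

# the second column was inferred as an integer type
print(data.dtypes)

# casting the array makes every field a string again
X = data.values
X_str = X.astype(str)
print(X_str)
```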

We can tie all of this together into a helpful function that we can reuse later.

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

    ...
    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

    # load and summarize the dataset
    from pandas import read_csv
    from sklearn.model_selection import train_test_split

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # summarize
    print('Train', X_train.shape, y_train.shape)
    print('Test', X_test.shape, y_test.shape)

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.

    Train (191, 9) (191,)
    Test (95, 9) (95,)

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

**Note**: I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
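As a starting point for that exercise, the sketch below shows how an explicit order can be passed through the *categories* argument. The single column of size ranges and its ordering are hypothetical values chosen for illustration; without an explicit order the encoder sorts the labels as strings.

```python
# ordinal encoding with and without an explicit category order (illustrative sketch)
from numpy import array
from sklearn.preprocessing import OrdinalEncoder

# one hypothetical ordinal column of size ranges
X = array([['15-19'], ['0-4'], ['35-39'], ['5-9']])

# default behavior: categories are sorted as strings
default_codes = OrdinalEncoder().fit_transform(X).ravel()

# explicit behavior: one list of ordered labels per input column
order = ['0-4', '5-9', '15-19', '35-39']
ordered_codes = OrdinalEncoder(categories=[order]).fit_transform(X).ravel()

print(default_codes)
print(ordered_codes)
```

With the default string sort, '5-9' is placed after '35-39'; the explicit list restores the natural order.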

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below named *prepare_inputs()* takes the input data for the train and test sets and encodes it using an ordinal encoding.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc
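Fitting on the training set alone means the test set may contain a label the encoder has never seen, which by default raises an error. The sketch below shows one possible way to handle this, assuming a scikit-learn version that supports the *handle_unknown* and *unknown_value* arguments (0.24 or later); the tiny arrays are made up for illustration. Note that a negative placeholder such as -1 would not be accepted by *chi2()*, which requires non-negative inputs.

```python
# encode labels that were not seen during fitting (illustrative sketch)
from numpy import array
from sklearn.preprocessing import OrdinalEncoder

# hypothetical train and test columns; 'central' only appears in the test set
X_train = array([['left'], ['right'], ['left']])
X_test = array([['right'], ['central']])

# map unseen labels to a placeholder value instead of raising an error
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
oe.fit(X_train)
X_test_enc = oe.transform(X_test).ravel()
print(X_test_enc)
```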

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the *OrdinalEncoder* and achieve the same result, although the *LabelEncoder* is designed for encoding a single variable.

The *prepare_targets()* function integer encodes the output data for the train and test sets.

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

We can call these functions to prepare our data.

    ...
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

    # example of loading and preparing the breast cancer dataset
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

Now that we have loaded and prepared the breast cancer dataset, we can explore feature selection.

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

- Chi-Squared Statistic.
- Mutual Information Statistic.

Let’s take a closer look at each in turn.

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.
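To make the idea of a test for independence concrete, the sketch below applies SciPy’s *chi2_contingency()* to a small table of counts. The counts are invented for illustration, and this classical contingency-table calculation is shown only to convey the intuition; it is not the code used in the rest of this tutorial.

```python
# chi-squared test of independence on a contingency table (illustrative sketch)
from scipy.stats import chi2_contingency

# hypothetical counts: rows are two values of a feature, columns are the two classes
table = [[30, 10],
         [20, 40]]

# returns the statistic, p-value, degrees of freedom and expected counts
stat, p, dof, expected = chi2_contingency(table)
print('stat=%.3f, p=%.5f, dof=%d' % (stat, p, dof))

# a small p-value suggests the feature and the class are not independent
if p < 0.05:
    print('Dependent: the feature may be worth keeping')
else:
    print('Independent: the feature is a candidate for removal')
```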

The scikit-learn machine learning library provides an implementation of the chi-squared test in the chi2() function. This function can be used in a feature selection strategy, such as selecting the top *k* most relevant features (largest values) via the SelectKBest class.

For example, we can define the *SelectKBest* class to use the *chi2()* function and select all features, then transform the train and test sets.

    ...
    fs = SelectKBest(score_func=chi2, k='all')
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)

We can then print the scores for each variable (largest is better), and plot the scores for each variable as a bar graph to get an idea of how many features we should select.

    ...
    # what are scores for the features
    for i in range(len(fs.scores_)):
        print('Feature %d: %f' % (i, fs.scores_[i]))
    # plot the scores
    pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
    pyplot.show()

Tying this together with the data preparation for the breast cancer dataset in the previous section, the complete example is listed below.

    # example of chi squared feature selection for categorical data
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from matplotlib import pyplot

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=chi2, k='all')
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs, fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # what are scores for the features
    for i in range(len(fs.scores_)):
        print('Feature %d: %f' % (i, fs.scores_[i]))
    # plot the scores
    pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
    pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.

In this case, we can see the scores are small and it is hard to get an idea from the number alone as to which features are more relevant.

Perhaps features 3, 4, 5, and 8 are most relevant.

    Feature 0: 0.472553
    Feature 1: 0.029193
    Feature 2: 2.137658
    Feature 3: 29.381059
    Feature 4: 8.222601
    Feature 5: 8.100183
    Feature 6: 1.273822
    Feature 7: 0.950682
    Feature 8: 3.699989

A bar chart of the feature importance scores for each input feature is created.

This clearly shows that feature 3 might be the most relevant (according to chi-squared) and that perhaps four of the nine input features are the most relevant.

We could set k=4 when configuring the *SelectKBest* to select these top four features.
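Once a value of *k* is chosen, it can be useful to check which columns were actually kept. The sketch below uses the *get_support()* method for this on a small synthetic dataset of integer-coded categories; the data, and the fact that the target is derived from column 3, are assumptions made so the example is self-contained.

```python
# report which input columns SelectKBest retained (illustrative sketch)
from numpy.random import default_rng
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# synthetic integer-coded categorical inputs: 9 columns with 4 categories each
rng = default_rng(1)
X = rng.integers(0, 4, size=(200, 9))
# hypothetical target that depends only on column 3
y = (X[:, 3] >= 2).astype(int)

# keep the four highest scoring features
fs = SelectKBest(score_func=chi2, k=4)
X_fs = fs.fit_transform(X, y)

# indices of the retained columns in the original input
selected = fs.get_support(indices=True)
print(selected)
```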

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
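For two discrete variables, this quantity can be computed directly from the joint and marginal frequencies. The sketch below does so by hand for two short made-up sequences and compares the result with scikit-learn’s *mutual_info_score()*, which also uses the natural logarithm.

```python
# mutual information between two discrete variables, computed by hand (illustrative sketch)
from math import log
from collections import Counter
from sklearn.metrics import mutual_info_score

# two hypothetical discrete variables with eight observations each
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 1, 1, 0]
n = len(x)

# joint and marginal counts
joint = Counter(zip(x, y))
px = Counter(x)
py = Counter(y)

# sum of p(a,b) * log(p(a,b) / (p(a) * p(b))) over observed pairs
mi = 0.0
for (a, b), count in joint.items():
    p_ab = count / n
    mi += p_ab * log(p_ab / ((px[a] / n) * (py[b] / n)))

print('manual: %.6f' % mi)
print('sklearn: %.6f' % mutual_info_score(x, y))
```

A value of zero would indicate the two variables are independent; larger values indicate that knowing one variable reduces uncertainty about the other.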

You can learn more about mutual information in the following tutorial.

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the mutual_info_classif() function.

Like *chi2()*, it can be used in the *SelectKBest* feature selection strategy (and other strategies).

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k='all')
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs, fs
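One detail that may be worth experimenting with: *mutual_info_classif()* has a *discrete_features* argument, and for dense arrays its default setting treats the inputs as continuous. The sketch below shows one way to declare integer-coded inputs as discrete by wrapping the function with *functools.partial*. The synthetic data and the fixed *random_state* are assumptions for illustration, and this variation is not part of the tutorial’s own code.

```python
# mutual information with inputs declared as discrete (illustrative sketch)
from functools import partial
from numpy.random import default_rng
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# synthetic integer-coded categorical inputs: 6 columns with 4 categories each
rng = default_rng(1)
X = rng.integers(0, 4, size=(200, 6))
# hypothetical target that depends only on column 2
y = (X[:, 2] >= 2).astype(int)

# wrap the scoring function so extra arguments are passed through SelectKBest
score_func = partial(mutual_info_classif, discrete_features=True, random_state=1)
fs = SelectKBest(score_func=score_func, k=2)
fs.fit(X, y)

# report the score for each input column
for i, score in enumerate(fs.scores_):
    print('Feature %d: %f' % (i, score))
```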

We can perform feature selection using mutual information on the breast cancer set and print and plot the scores (larger is better) as we did in the previous section.

The complete example of using mutual information for categorical feature selection is listed below.

    # example of mutual information feature selection for categorical data
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import mutual_info_classif
    from matplotlib import pyplot

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k='all')
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs, fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # what are scores for the features
    for i in range(len(fs.scores_)):
        print('Feature %d: %f' % (i, fs.scores_[i]))
    # plot the scores
    pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
    pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

**Note**: your specific results may differ. Try running the example a few times.

In this case, we can see that some of the features have a very low score, suggesting that perhaps they can be removed.

Perhaps features 3, 6, 2, and 5 are most relevant.

    Feature 0: 0.003588
    Feature 1: 0.000000
    Feature 2: 0.025934
    Feature 3: 0.071461
    Feature 4: 0.000000
    Feature 5: 0.038973
    Feature 6: 0.064759
    Feature 7: 0.003068
    Feature 8: 0.000000

A bar chart of the feature importance scores for each input feature is created.

Importantly, a different mixture of features is promoted.

Now that we know how to perform feature selection on categorical data for a classification predictive modeling problem, we can try developing a model using the selected features and compare the results.

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Logistic Regression model with all features compared to a model built from features selected by chi-squared and those features selected via mutual information.

Logistic regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

As a first step, we will evaluate a LogisticRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.

The complete example is listed below.

    # evaluation of a model using all input features
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_enc, y_train_enc)
    # evaluate the model
    yhat = model.predict(X_test_enc)
    # evaluate predictions
    accuracy = accuracy_score(y_test_enc, yhat)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example prints the accuracy of the model on the test dataset.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves a classification accuracy of about 75%.

We would prefer to use a subset of features that achieves a classification accuracy that is as good or better than this.

Accuracy: 75.79

We can use the chi-squared test to score the features and select the four most relevant features.

The *select_features()* function below is updated to achieve this.

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=chi2, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

The complete example of evaluating a logistic regression model fit and evaluated on data using this feature selection method is listed below.

    # evaluation of a model fit using chi squared input features
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=chi2, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_fs, y_train_enc)
    # evaluate the model
    yhat = model.predict(X_test_fs)
    # evaluate predictions
    accuracy = accuracy_score(y_test_enc, yhat)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example reports the performance of the model on just four of the nine input features selected using the chi-squared statistic.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an accuracy of about 74%, a slight drop in performance.

It is possible that some of the features removed are, in fact, adding value directly or in concert with the selected features.

At this stage, we would probably prefer to use all of the input features.

Accuracy: 74.74

We can repeat the experiment and select the top four features using a mutual information statistic.

The updated version of the *select_features()* function to achieve this is listed below.

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

The complete example of using mutual information for feature selection to fit a logistic regression model is listed below.

    # evaluation of a model fit using mutual information input features
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import mutual_info_classif
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # feature selection
    def select_features(X_train, y_train, X_test):
        fs = SelectKBest(score_func=mutual_info_classif, k=4)
        fs.fit(X_train, y_train)
        X_train_fs = fs.transform(X_train)
        X_test_fs = fs.transform(X_test)
        return X_train_fs, X_test_fs

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # feature selection
    X_train_fs, X_test_fs = select_features(X_train_enc, y_train_enc, X_test_enc)
    # fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_fs, y_train_enc)
    # evaluate the model
    yhat = model.predict(X_test_fs)
    # evaluate predictions
    accuracy = accuracy_score(y_test_enc, yhat)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example fits the model on the four top selected features chosen using mutual information.

**Note**: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a small lift in classification accuracy to 76%.

To be sure that the effect is real, it would be a good idea to repeat each experiment multiple times and compare the mean performance. It may also be a good idea to explore using k-fold cross-validation instead of a simple train/test split.

Accuracy: 76.84
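A sketch of what such a repeated evaluation might look like is given below. It compares the three options with repeated stratified k-fold cross-validation, placing the selection step inside a pipeline so that it is refit on each training fold. Because the breast cancer file is not bundled here, the example uses synthetic integer-coded data; the data-generating rule, the number of folds and repeats, and *max_iter* are all assumptions for illustration.

```python
# compare feature selection methods with repeated k-fold cross-validation (illustrative sketch)
from numpy import mean
from numpy import std
from numpy.random import default_rng
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic stand-in for ordinal encoded inputs: 286 rows, 9 columns
rng = default_rng(1)
X = rng.integers(0, 5, size=(286, 9))
# hypothetical target driven by columns 3 and 5 plus noise
signal = X[:, 3] + X[:, 5] + rng.normal(0, 1.5, size=286)
y = (signal > 4).astype(int)

# the three options compared in this section
candidates = {
    'all': 'passthrough',
    'chi2': SelectKBest(score_func=chi2, k=4),
    'mutual_info': SelectKBest(score_func=mutual_info_classif, k=4),
}

# same evaluation procedure for every option
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
results = {}
for name, selector in candidates.items():
    pipeline = Pipeline([('select', selector), ('model', LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
    results[name] = (mean(scores), std(scores))
    print('%s: %.3f (%.3f)' % (name, results[name][0], results[name][1]))
```

Reporting the mean and standard deviation for each option makes it easier to judge whether a difference of one or two percentage points is meaningful.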

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to the Chi-Squared Test for Machine Learning
- An Introduction to Feature Selection
- Feature Selection For Machine Learning in Python
- What is Information Gain and Mutual Information for Machine Learning

- sklearn.model_selection.train_test_split API.
- sklearn.preprocessing.OrdinalEncoder API.
- sklearn.preprocessing.LabelEncoder API.
- sklearn.feature_selection.chi2 API
- sklearn.feature_selection.SelectKBest API
- sklearn.feature_selection.mutual_info_classif API.
- sklearn.linear_model.LogisticRegression API.

- Breast Cancer Data Set, UCI Machine Learning Repository.
- Breast Cancer Raw Dataset
- Breast Cancer Description

In this tutorial, you discovered how to perform feature selection with categorical input data.

Specifically, you learned:

- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
- How to perform feature selection for categorical data when fitting and evaluating a classification model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Perform Feature Selection with Categorical Data appeared first on Machine Learning Mastery.

The post 3 Ways to Encode Categorical Variables for Deep Learning appeared first on Machine Learning Mastery.

Machine learning and deep learning models, like those in Keras, require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an **integer encoding** and a **one hot encoding**, although a newer technique called **learned embedding** may provide a useful middle ground between these two methods.

In this tutorial, you will discover how to encode categorical data when developing neural network models in Keras.

After completing this tutorial, you will know:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Challenge With Categorical Data
- Breast Cancer Categorical Dataset
- How to Ordinal Encode Categorical Data
- How to One Hot Encode Categorical Data
- How to Use a Learned Embedding for Categorical Data

A categorical variable is a variable whose values are labels.

For example, the variable may be “*color*” and may take on the values “*red*,” “*green*,” and “*blue*.”

Sometimes, the categorical data may have an ordered relationship between the categories, such as “*first*,” “*second*,” and “*third*.” This type of categorical data is referred to as ordinal and the additional ordering information can be useful.

Machine learning algorithms and deep learning neural networks require that input and output variables are numbers.

This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model.

There are many ways to encode categorical variables for modeling, although the three most common are as follows:

- **Integer Encoding**: Where each unique label is mapped to an integer.
- **One Hot Encoding**: Where each label is mapped to a binary vector.
- **Learned Embedding**: Where a distributed representation of the categories is learned.

We will take a closer look at how to encode categorical data for training a deep learning neural network in Keras using each one of these methods.

As the basis of this tutorial, we will use the so-called “Breast cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68% and 73%. We will aim for this region, but note that the models in this tutorial are not optimized: *they are designed to demonstrate encoding schemes*.

You can download the dataset and save the file as “*breast-cancer.csv*” in your current working directory.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings; some are ordinal and some are not.

    '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
    '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
    '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
    '40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
    '40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
    ...

We can load this dataset into memory using the Pandas library.

    ...
    # load the dataset as a pandas DataFrame
    data = read_csv(filename, header=None)
    # retrieve numpy array
    dataset = data.values

Once loaded, we can split the columns into input (*X*) and output (*y*) for modeling.

    ...
    # split into input (X) and output (y) variables
    X = dataset[:, :-1]
    y = dataset[:,-1]

Finally, we can force all fields in the input data to be string, just in case Pandas tried to map some automatically to numbers (it does try).

We can also reshape the output variable to be one column (e.g. a 2D shape).

    ...
    # format all fields as string
    X = X.astype(str)
    # reshape target to be a 2d array
    y = y.reshape((len(y), 1))

We can tie all of this together into a helpful function that we can reuse later.

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

Once loaded, we can split the data into training and test sets so that we can fit and evaluate a deep learning model.

We will use the train_test_split() function from scikit-learn and use 67% of the data for training and 33% for testing.

    ...
    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

    # load and summarize the dataset
    from pandas import read_csv
    from sklearn.model_selection import train_test_split

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # summarize
    print('Train', X_train.shape, y_train.shape)
    print('Test', X_test.shape, y_test.shape)

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 191 examples for training and 95 for testing.

    Train (191, 9) (191, 1)
    Test (95, 9) (95, 1)

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

An ordinal encoding involves mapping each unique label to an integer value.

As such, it is sometimes referred to simply as an integer encoding.

This type of encoding is really only appropriate if there is a known relationship between the categories.

This relationship does exist for some of the variables in the dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder() from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

**Note**: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
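As a starting point for that exercise, the sketch below shows how a category order might be passed to the *OrdinalEncoder* through its *categories* argument. The labels "low," "medium," and "high" are hypothetical and are not taken from the breast cancer dataset; they were chosen only because their natural order differs from alphabetical order.

```python
# sketch: ordinal encoding with an explicit category order (illustrative labels only)
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ordered labels, not taken from the breast cancer dataset
order = ['low', 'medium', 'high']
data = asarray([['medium'], ['low'], ['high'], ['low']])

# default behavior: categories are sorted, so 'high' gets the smallest integer
default_codes = OrdinalEncoder().fit_transform(data)
# explicit order: one list of categories is given per input column
ordered_codes = OrdinalEncoder(categories=[order]).fit_transform(data)

print(default_codes.ravel())
print(ordered_codes.ravel())
```

With the explicit order, the integers follow the meaning of the labels rather than their spelling, which is the information an ordinal encoding is meant to carry.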

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

The function below, named *prepare_inputs()*, takes the input data for the train and test sets and encodes it using an ordinal encoding.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

We also need to prepare the target variable.

It is a binary classification problem, so we need to map the two class labels to 0 and 1.

This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the *LabelEncoder* is designed for encoding a single variable.

The *prepare_targets()* function below integer encodes the output data for the train and test sets.

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

We can call these functions to prepare our data.

    ...
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

We can now define a neural network model.

We will use the same general model in all of these examples. Specifically, a MultiLayer Perceptron (MLP) neural network with one hidden layer with 10 nodes, and one node in the output layer for making binary classifications.

Without going into too much detail, the code below defines the model, fits it on the training dataset, and then evaluates it on the test dataset.

    ...
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

If you are new to developing neural networks in Keras, I recommend the tutorial Develop Your First Neural Network in Python Step-By-Step.

Tying all of this together, the complete example of preparing the data with an ordinal encoding and fitting and evaluating a neural network on the data is listed below.

    # example of ordinal encoding for a neural network
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OrdinalEncoder
    from keras.models import Sequential
    from keras.layers import Dense

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        oe = OrdinalEncoder()
        oe.fit(X_train)
        X_train_enc = oe.transform(X_train)
        X_test_enc = oe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example will fit the model in just a few seconds on any modern hardware (no GPU required).

The loss and the accuracy of the model are reported at the end of each training epoch, and finally, the accuracy of the model on the test dataset is reported.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 70% on the test dataset.

Not bad, given that an ordinal relationship only exists for some of the input variables, and for those where it does, it was not honored in the encoding.

    ...
    Epoch 95/100
     - 0s - loss: 0.5349 - acc: 0.7696
    Epoch 96/100
     - 0s - loss: 0.5330 - acc: 0.7539
    Epoch 97/100
     - 0s - loss: 0.5316 - acc: 0.7592
    Epoch 98/100
     - 0s - loss: 0.5302 - acc: 0.7696
    Epoch 99/100
     - 0s - loss: 0.5291 - acc: 0.7644
    Epoch 100/100
     - 0s - loss: 0.5277 - acc: 0.7644
    Accuracy: 70.53

This provides a good starting point when working with categorical data.

A better and more general approach is to use a one hot encoding.

A one hot encoding is appropriate for categorical data where no relationship exists between categories.

It involves representing each categorical variable with a binary vector that has one element for each unique label and marking the class label with a 1 and all other elements 0.

For example, if our variable was “*color*” and the labels were “*red*,” “*green*,” and “*blue*,” we would encode each of these labels as a three-element binary vector as follows:

- Red: [1, 0, 0]
- Green: [0, 1, 0]
- Blue: [0, 0, 1]

Then each label in the dataset would be replaced with a vector (one column becomes three). This is done for all categorical variables so that our nine input variables or columns become 43 in the case of the breast cancer dataset.

The scikit-learn library provides the OneHotEncoder to automatically one hot encode one or more variables.
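To make the color example concrete, here is a minimal sketch that one hot encodes the three labels with the *OneHotEncoder*. The small array of colors is made up for illustration, and the *categories* argument is used only to fix the column order to red, green, blue so that the result lines up with the vectors listed above.

```python
# sketch: one hot encode a single "color" variable (illustrative data)
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# one column, four rows
colors = asarray([['red'], ['green'], ['blue'], ['green']])
# fix the column order to red, green, blue instead of the default sorted order
encoder = OneHotEncoder(categories=[['red', 'green', 'blue']])
# the encoder returns a sparse matrix by default, so convert it for display
onehot = encoder.fit_transform(colors).toarray()
print(onehot)
```

Each row contains a single 1 in the column for its label, and one input column has become three.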

The *prepare_inputs()* function below provides a drop-in replacement function for the example in the previous section. Instead of using an *OrdinalEncoder*, it uses a *OneHotEncoder*.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        ohe = OneHotEncoder()
        ohe.fit(X_train)
        X_train_enc = ohe.transform(X_train)
        X_test_enc = ohe.transform(X_test)
        return X_train_enc, X_test_enc
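One practical consequence of fitting the encoder on the training set only is that the test set may contain a label the encoder has never seen, in which case the default configuration raises an error. The sketch below shows one possible way to handle this, using the *handle_unknown='ignore'* argument; the tiny arrays and the "purple" label are invented for illustration and this option is not used in the tutorial's own examples.

```python
# sketch: tolerate labels at transform time that were not seen during fit
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# illustrative data: 'purple' never appears in the training set
X_train_demo = asarray([['red'], ['green'], ['blue']])
X_test_demo = asarray([['green'], ['purple']])

# unknown labels are encoded as a row of all zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train_demo)
X_test_demo_enc = ohe.transform(X_test_demo).toarray()
print(ohe.categories_)
print(X_test_demo_enc)
```

Whether an all-zero row is an acceptable representation for an unseen label depends on the project; raising an error may be the safer choice when unseen labels indicate a data problem.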

Tying this together, the complete example of one hot encoding the breast cancer categorical dataset and modeling it with a neural network is listed below.

    # example of one hot encoding for a neural network
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from keras.models import Sequential
    from keras.layers import Dense

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        ohe = OneHotEncoder()
        ohe.fit(X_train)
        X_train_enc = ohe.transform(X_train)
        X_test_enc = ohe.transform(X_test)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # define the model
    model = Sequential()
    model.add(Dense(10, input_dim=X_train_enc.shape[1], activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1, activation='sigmoid'))
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=100, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

The example one hot encodes the input categorical data, and also label encodes the target variable as we did in the previous section. The same neural network model is then fit on the prepared dataset.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, achieving an accuracy of about 72%, close to what was seen in the previous section.

A fairer comparison would be to run each configuration 10 or 30 times and compare performance using the mean accuracy. Recall that we are more focused on how to encode categorical data in this tutorial than on getting the best score on this specific dataset.

    ...
    Epoch 95/100
     - 0s - loss: 0.3837 - acc: 0.8272
    Epoch 96/100
     - 0s - loss: 0.3823 - acc: 0.8325
    Epoch 97/100
     - 0s - loss: 0.3814 - acc: 0.8325
    Epoch 98/100
     - 0s - loss: 0.3795 - acc: 0.8325
    Epoch 99/100
     - 0s - loss: 0.3788 - acc: 0.8325
    Epoch 100/100
     - 0s - loss: 0.3773 - acc: 0.8325
    Accuracy: 72.63
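The repeated-run comparison suggested above could be organized along the following lines. This is only a sketch: *summarize_runs()* and *fake_evaluate()* are hypothetical helpers, and *fake_evaluate()* is a stand-in that returns made-up scores so the example runs quickly. In practice it would be replaced with a function that fits the neural network and returns its test accuracy.

```python
# sketch: repeat an evaluation several times and summarize the scores
from numpy import mean, std

def summarize_runs(evaluate_once, n_repeats=10):
    # call the evaluation function once per repeat, passing the repeat index as a seed
    scores = [evaluate_once(seed) for seed in range(n_repeats)]
    return mean(scores), std(scores), scores

# stand-in for "fit the model and return test accuracy" (values are made up)
def fake_evaluate(seed):
    return 0.70 + 0.01 * (seed % 3)

avg, spread, scores = summarize_runs(fake_evaluate, n_repeats=30)
print('Mean accuracy: %.3f (std %.3f) over %d runs' % (avg, spread, len(scores)))
```

Reporting the standard deviation alongside the mean gives a sense of whether a difference between two encodings is larger than the run-to-run noise.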

Ordinal and one hot encoding are perhaps the two most popular methods.

A newer technique, called a learned embedding, is similar to one hot encoding and was designed for use with neural networks.

A learned embedding, or simply an “*embedding*,” is a distributed representation for categorical data.

Each category is mapped to a distinct vector, and the properties of the vector are adapted or learned while training a neural network. The vector space provides a projection of the categories, allowing those categories that are close or related to cluster together naturally.
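Mechanically, an embedding can be thought of as a lookup table with one row per category. The NumPy sketch below illustrates the idea under simplifying assumptions: the table is filled with random numbers rather than learned values, and the sizes (three categories, four dimensions) are arbitrary. It also shows that the lookup gives the same result as multiplying a one hot vector by the table.

```python
# sketch: an embedding as a lookup table indexed by integer encoded categories
from numpy import allclose, eye
from numpy.random import default_rng

n_categories, n_dims = 3, 4
rng = default_rng(1)
# one row per category; in a real model these values are learned during training
embedding_table = rng.normal(size=(n_categories, n_dims))

# integer encoded categories for three examples
codes = [2, 0, 2]
# the embedding is a row lookup by integer code
vectors = embedding_table[codes]
# equivalent view: one hot vectors multiplied by the table
onehot = eye(n_categories)[codes]
same = onehot @ embedding_table

print(vectors.shape)
print(allclose(vectors, same))
```

The first and third examples share a category, so they receive identical vectors; training adjusts the rows of the table so that related categories end up close together.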

This provides both the benefits of an ordinal relationship by allowing any such relationships to be learned from data, and a one hot encoding in providing a vector representation for each category. Unlike one hot encoding, the input vectors are not sparse (do not have lots of zeros). The downside is that it requires learning as part of the model and the creation of many more input variables (columns).

The technique was originally developed to provide a distributed representation for words, e.g. allowing similar words to have similar vector representations. As such, the technique is often referred to as a word embedding, and in the case of text data, algorithms have been developed to learn a representation independent of a neural network. For more on this topic, see the post What Are Word Embeddings for Text?

An additional benefit of using an embedding is that the learned vectors that each category is mapped to can be fit in a model that has modest skill, but the vectors can be extracted and used generally as input for the category on a range of different models and applications. That is, they can be learned and reused.

Embeddings can be used in Keras via the *Embedding* layer.

For an example of learning word embeddings for text data in Keras, see the post How to Use Word Embedding Layers for Deep Learning with Keras.

One embedding layer is required for each categorical variable, and the embedding expects the categories to be ordinal encoded, although no relationship between the categories is assumed.

Each embedding also requires the number of dimensions to use for the distributed representation (vector space). It is common in natural language applications to use 50, 100, or 300 dimensions. For our small example, we will fix the number of dimensions at 10, but this is arbitrary; you should experiment with other values.

First, we can prepare the input data using an ordinal encoding.

The model we will develop will have one separate embedding for each input variable. Therefore, the model will take nine different input datasets. As such, we will split the input variables and ordinal encode (integer encoding) each separately using the *LabelEncoder* and return a list of separate prepared train and test input datasets.

The *prepare_inputs()* function below implements this, enumerating over each input variable, integer encoding each correctly using best practices, and returning lists of encoded train and test variables (or one-variable datasets) that can be used as input for our model later.

    # prepare input data
    def prepare_inputs(X_train, X_test):
        X_train_enc, X_test_enc = list(), list()
        # label encode each column
        for i in range(X_train.shape[1]):
            le = LabelEncoder()
            le.fit(X_train[:, i])
            # encode
            train_enc = le.transform(X_train[:, i])
            test_enc = le.transform(X_test[:, i])
            # store
            X_train_enc.append(train_enc)
            X_test_enc.append(test_enc)
        return X_train_enc, X_test_enc

Now we can construct the model.

We must construct the model differently in this case because we will have nine input layers, with nine embeddings the outputs of which (the nine different 10-element vectors) need to be concatenated into one long vector before being passed as input to the dense layers.

We can achieve this using the functional Keras API. If you are new to the Keras functional API, see the post How to Use the Keras Functional API for Deep Learning.

First, we can enumerate each variable and construct an input layer and connect it to an embedding layer, and store both layers in lists. We need a reference to all of the input layers when defining the model, and we need a reference to each embedding layer to concatenate them with a merge layer.

    ...
    # prepare each input head
    in_layers = list()
    em_layers = list()
    for i in range(len(X_train_enc)):
        # calculate the number of unique inputs
        n_labels = len(unique(X_train_enc[i]))
        # define input layer
        in_layer = Input(shape=(1,))
        # define embedding layer
        em_layer = Embedding(n_labels, 10)(in_layer)
        # store layers
        in_layers.append(in_layer)
        em_layers.append(em_layer)

We can then merge all of the embedding layers, define the hidden layer and output layer, then define the model.

    ...
    # concat all embeddings
    merge = concatenate(em_layers)
    dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
    output = Dense(1, activation='sigmoid')(dense)
    model = Model(inputs=in_layers, outputs=output)

When using a model with multiple inputs, we will need to specify a list that has one dataset for each input, e.g. a list of nine arrays each with one column in the case of our dataset. Thankfully, this is the format we returned from our *prepare_inputs()* function.

Therefore, fitting and evaluating the model looks like it does in the previous section.

Additionally, we will plot the model by calling the *plot_model()* function and save it to file. This requires that pygraphviz and pydot are installed, which can be a pain on some systems. **If you have trouble**, just comment out the import statement and call to *plot_model()*.

    ...
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # plot graph
    plot_model(model, show_shapes=True, to_file='embeddings.png')
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Tying this all together, the complete example of using a separate embedding for each categorical input variable in a multi-input layer model is listed below.

    # example of learned embedding encoding for a neural network
    from numpy import unique
    from pandas import read_csv
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder
    from keras.models import Model
    from keras.layers import Input
    from keras.layers import Dense
    from keras.layers import Embedding
    from keras.layers.merge import concatenate
    from keras.utils import plot_model

    # load the dataset
    def load_dataset(filename):
        # load the dataset as a pandas DataFrame
        data = read_csv(filename, header=None)
        # retrieve numpy array
        dataset = data.values
        # split into input (X) and output (y) variables
        X = dataset[:, :-1]
        y = dataset[:,-1]
        # format all fields as string
        X = X.astype(str)
        # reshape target to be a 2d array
        y = y.reshape((len(y), 1))
        return X, y

    # prepare input data
    def prepare_inputs(X_train, X_test):
        X_train_enc, X_test_enc = list(), list()
        # label encode each column
        for i in range(X_train.shape[1]):
            le = LabelEncoder()
            le.fit(X_train[:, i])
            # encode
            train_enc = le.transform(X_train[:, i])
            test_enc = le.transform(X_test[:, i])
            # store
            X_train_enc.append(train_enc)
            X_test_enc.append(test_enc)
        return X_train_enc, X_test_enc

    # prepare target
    def prepare_targets(y_train, y_test):
        le = LabelEncoder()
        le.fit(y_train)
        y_train_enc = le.transform(y_train)
        y_test_enc = le.transform(y_test)
        return y_train_enc, y_test_enc

    # load the dataset
    X, y = load_dataset('breast-cancer.csv')
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # prepare input data
    X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
    # prepare output data
    y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
    # make output 3d
    y_train_enc = y_train_enc.reshape((len(y_train_enc), 1, 1))
    y_test_enc = y_test_enc.reshape((len(y_test_enc), 1, 1))
    # prepare each input head
    in_layers = list()
    em_layers = list()
    for i in range(len(X_train_enc)):
        # calculate the number of unique inputs
        n_labels = len(unique(X_train_enc[i]))
        # define input layer
        in_layer = Input(shape=(1,))
        # define embedding layer
        em_layer = Embedding(n_labels, 10)(in_layer)
        # store layers
        in_layers.append(in_layer)
        em_layers.append(em_layer)
    # concat all embeddings
    merge = concatenate(em_layers)
    dense = Dense(10, activation='relu', kernel_initializer='he_normal')(merge)
    output = Dense(1, activation='sigmoid')(dense)
    model = Model(inputs=in_layers, outputs=output)
    # compile the keras model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # plot graph
    plot_model(model, show_shapes=True, to_file='embeddings.png')
    # fit the keras model on the dataset
    model.fit(X_train_enc, y_train_enc, epochs=20, batch_size=16, verbose=2)
    # evaluate the keras model
    _, accuracy = model.evaluate(X_test_enc, y_test_enc, verbose=0)
    print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the data as described above, fits the model, and reports the performance.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model performs reasonably well, matching what we saw for the one hot encoding in the previous section.

As the learned vectors were trained in a skilled model, it is possible to save them and use them as a general representation for these variables in other models that operate on the same data. This is a useful and compelling reason to explore this encoding.

    ...
    Epoch 15/20
     - 0s - loss: 0.4891 - acc: 0.7696
    Epoch 16/20
     - 0s - loss: 0.4845 - acc: 0.7749
    Epoch 17/20
     - 0s - loss: 0.4783 - acc: 0.7749
    Epoch 18/20
     - 0s - loss: 0.4763 - acc: 0.7906
    Epoch 19/20
     - 0s - loss: 0.4696 - acc: 0.7906
    Epoch 20/20
     - 0s - loss: 0.4660 - acc: 0.7958
    Accuracy: 72.63

To confirm our understanding of the model, a plot is created and saved to the file embeddings.png in the current working directory.

The plot shows the nine inputs each mapped to a 10 element vector, meaning that the actual input to the model is a 90 element vector.


This section lists some common questions and answers when encoding categorical data.

What if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
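A minimal sketch of that encode-then-concatenate approach is shown below. The two-column array is hypothetical: the first column ("size") is treated as ordinal and the second ("color") as having no order. Neither variable comes from the breast cancer dataset.

```python
# sketch: encode each column with a different scheme, then join the results
from numpy import asarray, hstack
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# hypothetical data: column 0 is ordinal (size), column 1 has no order (color)
X_demo = asarray([['small', 'red'], ['large', 'blue'], ['medium', 'red']])

# ordinal encode the size column with an explicit order
size_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes = size_enc.fit_transform(X_demo[:, [0]])
# one hot encode the color column (columns follow sorted order: blue, red)
color_enc = OneHotEncoder()
colors = color_enc.fit_transform(X_demo[:, [1]]).toarray()

# concatenate the prepared columns back into a single input array
X_demo_enc = hstack((sizes, colors))
print(X_demo_enc)
```

As with the other examples in this tutorial, each encoder would be fit on the training set and then applied to both the train and test sets.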

What if I concatenate many one hot encoded vectors to create a many thousand element input vector?

You can use a one hot encoding up to thousands and tens of thousands of categories. Also, having large vectors as input sounds intimidating, but the models can generally handle it.

Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.

Which encoding technique works best is unknowable in advance.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.

This section provides more resources on the topic if you are looking to go deeper.

- Develop Your First Neural Network in Python Step-By-Step
- Why One-Hot Encode Data in Machine Learning?
- Data Preparation for Gradient Boosting with XGBoost in Python
- What Are Word Embeddings for Text?
- How to Use Word Embedding Layers for Deep Learning with Keras
- How to Use the Keras Functional API for Deep Learning

- sklearn.model_selection.train_test_split API.
- sklearn.preprocessing.OrdinalEncoder API.
- sklearn.preprocessing.LabelEncoder API.
- Embedding Keras API.
- Visualization Keras API.

- Breast Cancer Data Set, UCI Machine Learning Repository.
- Breast Cancer Raw Dataset
- Breast Cancer Description

In this tutorial, you discovered how to encode categorical data when developing neural network models in Keras.

Specifically, you learned:

- The challenge of working with categorical data when using machine learning and deep learning models.
- How to integer encode and one hot encode categorical variables for modeling.
- How to learn an embedding distributed representation as part of a neural network for categorical variables.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 3 Ways to Encode Categorical Variables for Deep Learning appeared first on Machine Learning Mastery.

The post How to Save and Reuse Data Preparation Objects in Scikit-Learn appeared first on Machine Learning Mastery.

It is critical that any data preparation performed on a training dataset is also performed on a new dataset in the future.

This may include a test dataset when evaluating a model or new data from the domain when using a model to make predictions.

Typically, the model fit on the training dataset is saved for later use. The correct solution to preparing new data for the model in the future is to also save any data preparation objects, like data scaling methods, to file along with the model.

In this tutorial, you will discover how to save a model and data preparation object to file for later use.

After completing this tutorial, you will know:

- The challenge of correctly preparing test data and new data for a machine learning model.
- The solution of saving the model and data preparation objects to file for later use.
- How to save and later load and use a machine learning model and data preparation model on new data.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Preparing New Data for a Model
- Save Data Preparation Objects
- How to Save and Later Use a Data Preparation Object

Each input variable in a dataset may have different units.

For example, one variable may be in inches, another in miles, another in days, and so on.

As such, it is often important to scale data prior to fitting a model.

This is particularly important for models that use a weighted sum of the input or distance measures like logistic regression, neural networks, and k-nearest neighbors. This is because variables with larger values or ranges may dominate or wash out the effects of variables with smaller values or ranges.

Scaling techniques, such as normalization or standardization, have the effect of transforming the distribution of each input variable to be the same, such as the same minimum and maximum in the case of normalization or the same mean and standard deviation in the case of standardization.

A scaling technique must be fit, which just means it needs to calculate coefficients from data, such as the observed min and max, or the observed mean and standard deviation. These values can also be set by domain experts.

The best practice when using scaling techniques for evaluating models is to fit them on the training dataset, then apply them to the training and test datasets.

Or, when working with a final model, to fit the scaling method on the training dataset and apply the transform to the training dataset and any new dataset in the future.

It is critical that any data preparation or transformation applied to the training dataset is also applied to the test or other dataset in the future.
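The fit-on-train, apply-everywhere pattern can be sketched with the *MinMaxScaler* from scikit-learn. The arrays below are made up for illustration; note that the second test row falls outside the range seen in training, so its scaled values fall outside 0 to 1.

```python
# sketch: fit a scaler on the training data only, then apply it to train and test data
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

# illustrative data with two input variables on very different scales
X_train_demo = asarray([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_demo = asarray([[2.0, 200.0], [4.0, 600.0]])

# the min and max of each variable are calculated from the training data only
scaler = MinMaxScaler()
scaler.fit(X_train_demo)
# the same coefficients are then used for both datasets
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled)
```

Because the test data plays no part in calculating the coefficients, the evaluation does not leak information about the test set into the data preparation.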

This is straightforward when all of the data and the model are in memory.

This is challenging when a model is saved and used later.

What is the best practice to scale data when saving a fit model for later use, such as a final model?

The solution is to save the data preparation object to file along with the model.

For example, it is common to use the pickle framework (built into Python) for saving machine learning models for later use, such as saving a final model.

This same framework can be used to save the object that was used for data preparation.

Later, the model and the data preparation object can be loaded and used.

It is convenient to save the entire objects to file, such as the model object and the data preparation object. Nevertheless, experts may prefer to save just the model parameters to file, then load them later and set them into a new model object. This approach can also be used with the coefficients used for scaling the data, such as the min and max values for each variable, or the mean and standard deviation for each variable.
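As a rough sketch of the coefficients-only approach, the example below calculates the min and max of each variable, saves them to a small JSON file, then loads them and normalizes a new row by hand. The file name *scaler_coefs.json*, the dictionary keys, and the data values are all hypothetical.

```python
# sketch: save only the scaling coefficients rather than the whole object
# (the file name, dictionary keys, and data values are hypothetical)
import json
import os
import tempfile
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# calculate the coefficients from the training data
coefs = {'min': X_train.min(axis=0).tolist(), 'max': X_train.max(axis=0).tolist()}

# save the coefficients to a small text file
path = os.path.join(tempfile.gettempdir(), 'scaler_coefs.json')
with open(path, 'w') as f:
    json.dump(coefs, f)

# later: load the coefficients and apply the same normalization to new data
with open(path) as f:
    loaded = json.load(f)
col_min, col_max = np.array(loaded['min']), np.array(loaded['max'])
X_new = np.array([[2.5, 200.0]])
X_new_scaled = (X_new - col_min) / (col_max - col_min)
print(X_new_scaled)
```

The trade-off is that you must re-implement the transform yourself, which is why saving the whole object is usually simpler.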

The choice of which approach is appropriate for your project is up to you, but I recommend saving the model and data preparation object (or objects) to file directly for later use.

To make the idea of saving the model and data transform object to file concrete, let’s look at a worked example.

In this section, we will demonstrate preparing a dataset, fitting a model on the dataset, saving the model and data transform object to file, and later loading the model and transform and using them on new data.

First, we need a dataset.

We will use a test dataset from the scikit-learn library, specifically a binary classification problem with two input variables created randomly via the make_blobs() function.

The example below creates a test dataset with 100 examples, two input features, and two class labels (0 and 1). The dataset is then split into training and test sets and the min and max values of each variable are then reported.

Importantly, the *random_state* is set when creating the dataset and when splitting the data so that the same dataset is created and the same split of data is performed each time that the code is run.

    # example of creating a test dataset and splitting it into train and test sets
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    # prepare dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # summarize the min and max of each input variable (column)
    for i in range(X.shape[1]):
        print('Train', i, X_train[:, i].min(), X_train[:, i].max())
        print('Test', i, X_test[:, i].min(), X_test[:, i].max())

Running the example reports the min and max values for each variable in both the train and test datasets. The output has one *Train* line and one *Test* line per variable, each listing the variable index followed by its min and max.

Each variable has its own range, and the observed min and max of a variable differ between the train and test datasets because each is a different sample of the data. This is a realistic scenario that we may encounter with a real dataset.

Next, we can scale the dataset.

We will use the MinMaxScaler to scale each input variable to the range [0, 1]. The best practice use of this scaler is to fit it on the training dataset and then apply the transform to the training dataset, and other datasets: in this case, the test dataset.

The complete example of scaling the data and summarizing the effects is listed below.

    # example of scaling the dataset
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    # prepare dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # define scaler
    scaler = MinMaxScaler()
    # fit scaler on the training dataset
    scaler.fit(X_train)
    # transform both datasets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    # summarize the min and max of each scaled input variable (column)
    for i in range(X.shape[1]):
        print('Train', i, X_train_scaled[:, i].min(), X_train_scaled[:, i].max())
        print('Test', i, X_test_scaled[:, i].min(), X_test_scaled[:, i].max())

Running the example prints the effect of the scaled data showing the min and max values for each variable in the train and test datasets.

Each variable in the training dataset now spans exactly 0 to 1, because the scaler was fit on that data. The test dataset is scaled with the same coefficients, so its values land in approximately the same range, and can fall slightly below 0 or above 1 wherever a test value lies outside the min and max observed in the training data.

Next, we can fit a model on the training dataset and save both the model and the scaler object to file.

We will use a LogisticRegression model because the problem is a simple binary classification task.

The training dataset is scaled as before, and in this case, we will assume the test dataset is currently not available. Once scaled, the dataset is used to fit a logistic regression model.

We will use the pickle framework to save the *LogisticRegression* model to one file, and the *MinMaxScaler* to another file.

The complete example is listed below.

    # example of fitting a model on the scaled dataset
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LogisticRegression
    from pickle import dump
    # prepare dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # split data into train and test sets
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.33, random_state=1)
    # define scaler
    scaler = MinMaxScaler()
    # fit scaler on the training dataset
    scaler.fit(X_train)
    # transform the training dataset
    X_train_scaled = scaler.transform(X_train)
    # define and fit the model
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train_scaled, y_train)
    # save the model
    with open('model.pkl', 'wb') as f:
        dump(model, f)
    # save the scaler
    with open('scaler.pkl', 'wb') as f:
        dump(scaler, f)

Running the example scales the data, fits the model, and saves the model and scaler to files using pickle.

You should have two files in your current working directory:

- *model.pkl*
- *scaler.pkl*

Finally, we can load the model and the scaler object and make use of them.

In this case, we will assume that the training dataset is not available, and that only new data or the test dataset is available.

We will load the model and the scaler, then use the scaler to prepare the new data and use the model to make predictions. Because it is a test dataset, we have the expected target values, so we will compare the predictions to the expected target values and calculate the accuracy of the model.

The complete example is listed below.

    # load model and scaler and make predictions on new data
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from pickle import load
    # prepare dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # split data into train and test sets
    _, X_test, _, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    # load the model
    with open('model.pkl', 'rb') as f:
        model = load(f)
    # load the scaler
    with open('scaler.pkl', 'rb') as f:
        scaler = load(f)
    # transform the test dataset
    X_test_scaled = scaler.transform(X_test)
    # make predictions on the test set
    yhat = model.predict(X_test_scaled)
    # evaluate accuracy
    acc = accuracy_score(y_test, yhat)
    print('Test Accuracy:', acc)

Running the example loads the model and scaler, then uses the scaler to prepare the test dataset correctly for the model, meeting the expectations of the model when it was trained.

The model then makes a prediction for the examples in the test set and the classification accuracy is calculated. In this case, the model achieved 100% accuracy on the test set because the test problem is trivial.

Test Accuracy: 1.0

This provides a template that you can use to save both your model and scaler object (or objects) to file on your own projects.
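If you would rather manage one file than two, one possible variation is to pickle both objects together in a single dictionary. This is only a sketch; the file name *bundle.pkl* and the dictionary keys are hypothetical, and the scaler is fit on the whole dataset here purely to keep the example short.

```python
# sketch: keep the model and scaler together in a single pickle file
# (the file name 'bundle.pkl' and the dictionary keys are hypothetical)
import os
import tempfile
from pickle import dump, load
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# prepare dataset, scaler, and model
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
scaler = MinMaxScaler().fit(X)
model = LogisticRegression(solver='lbfgs').fit(scaler.transform(X), y)

# save both objects in one file
path = os.path.join(tempfile.gettempdir(), 'bundle.pkl')
with open(path, 'wb') as f:
    dump({'model': model, 'scaler': scaler}, f)

# later: load both objects and use them together on new data
with open(path, 'rb') as f:
    bundle = load(f)
yhat = bundle['model'].predict(bundle['scaler'].transform(X[:5]))
print(yhat)
```

Keeping the two objects in one file makes it harder to accidentally pair a model with the wrong scaler.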

This section provides more resources on the topic if you are looking to go deeper.

- Save and Load Machine Learning Models in Python with scikit-learn
- How to Train a Final Machine Learning Model

- sklearn.datasets.make_blobs API.
- sklearn.model_selection.train_test_split API.
- sklearn.preprocessing.MinMaxScaler API.
- sklearn.metrics.accuracy_score API.
- sklearn.linear_model.LogisticRegression API.
- pickle API.

In this tutorial, you discovered how to save a model and data preparation object to file for later use.

Specifically, you learned:

- The challenge of correctly preparing test data and new data for a machine learning model.
- The solution of saving the model and data preparation objects to file for later use.
- How to save and later load and use a machine learning model and data preparation model on new data.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Save and Reuse Data Preparation Objects in Scikit-Learn appeared first on Machine Learning Mastery.

The post What Does Stochastic Mean in Machine Learning? appeared first on Machine Learning Mastery.

The behavior and performance of many machine learning algorithms are referred to as stochastic.

Stochastic refers to a variable process where the outcome involves some randomness and has some uncertainty. It is a mathematical term and is closely related to “*randomness*” and “*probabilistic*” and can be contrasted to the idea of “*deterministic*.”

The stochastic nature of machine learning algorithms is an important foundational concept in machine learning and must be understood in order to interpret the behavior of many predictive models.

In this post, you will discover a gentle introduction to stochasticity in machine learning.

After reading this post, you will know:

- A variable or process is stochastic if there is uncertainty or randomness involved in the outcomes.
- Stochastic is a synonym for random and probabilistic, although it is different from non-deterministic.
- Many machine learning algorithms are stochastic because they explicitly use randomness during optimization or learning.


Let’s get started.

This tutorial is divided into three parts; they are:

- What Does “*Stochastic*” Mean?
- Stochastic vs. Random, Probabilistic, and Nondeterministic
- Stochastic in Machine Learning

A variable is stochastic if the occurrence of events or outcomes involves randomness or uncertainty.

… “stochastic” means that the model has some kind of randomness in it

— Page 66, Think Bayes.

A process is stochastic if it governs one or more stochastic variables.

Games are stochastic because they include an element of randomness, such as the shuffling of cards in card games or the rolling of dice in board games.

In real life, many unpredictable external events can put us into unforeseen situations. Many games mirror this unpredictability by including a random element, such as the throwing of dice. We call these stochastic games.

— Page 177, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Stochastic is commonly used to describe mathematical processes that use or harness randomness. Common examples include Brownian motion, Markov Processes, Monte Carlo Sampling, and more.

Now that we have some definitions, let’s try and add some more context by comparing stochastic with other notions of uncertainty.


In this section, we’ll try to better understand the idea of a variable or process being stochastic by comparing it to the related terms of “*random*,” “*probabilistic*,” and “*non-deterministic*.”

In statistics and probability, a variable is called a “random variable” and can take on one or more outcomes or events.

It is the common name used for a thing that can be measured.

In general, stochastic is a synonym for random.

For example, a stochastic variable is a random variable. A stochastic process is a random process.

Typically, random is used to refer to a lack of dependence between observations in a sequence. For example, the rolls of a fair die are random, so are the flips of a fair coin.

Strictly speaking, a random variable or a random sequence can still be summarized using a probability distribution; it just may be a uniform distribution.
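As a small illustration of this point, the sketch below simulates rolls of a fair die with Python's standard library and summarizes them as relative frequencies, which should each sit close to 1/6. The seed and the number of rolls are arbitrary choices.

```python
# sketch: rolls of a fair die summarized as an (approximately) uniform distribution
from collections import Counter
from random import Random

# seeded generator so the example is repeatable (seed value is arbitrary)
rng = Random(1)
n_rolls = 60000
rolls = [rng.randint(1, 6) for _ in range(n_rolls)]

# relative frequency of each face
counts = Counter(rolls)
freqs = {face: counts[face] / n_rolls for face in range(1, 7)}
for face, freq in freqs.items():
    print(face, round(freq, 3))
```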

We may choose to describe something as stochastic over random if we are interested in focusing on the probabilistic nature of the variable, such as a partial dependence of the next event on the current event. We may choose random over stochastic if we wish to focus attention on the independence of the events.

In general, stochastic is a synonym for probabilistic.

For example, a stochastic variable or process is probabilistic. It can be summarized and analyzed using the tools of probability.

Most notably, the distribution of events or the next event in a sequence can be described in terms of a probability distribution.

We may choose to describe a variable or process as probabilistic over stochastic if we wish to emphasize the dependence, such as if we are using a parametric model or known probability distribution to summarize the variable or sequence.

A variable or process is deterministic if the next event in the sequence can be determined exactly from the current event.

For example, a deterministic algorithm will always give the same outcome given the same input. Conversely, a non-deterministic algorithm may give different outcomes for the same input.

A stochastic variable or process is not deterministic because there is uncertainty associated with the outcome.

Nevertheless, a stochastic variable or process is also not non-deterministic because non-determinism only describes the possibility of outcomes, rather than probability.

Describing something as stochastic is a stronger claim than describing it as non-deterministic because we can use the tools of probability in analysis, such as expected outcome and variance.

… “stochastic” generally implies that uncertainty about outcomes is quantified in terms of probabilities; a nondeterministic environment is one in which actions are characterized by their possible outcomes, but no probabilities are attached to them.

— Page 43, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
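The point about expected outcome and variance can be made concrete with a fair six-sided die, used here only as a convenient example. Because each outcome has a known probability, both quantities can be calculated exactly.

```python
# sketch: expected value and variance of a fair six-sided die
from fractions import Fraction

outcomes = range(1, 7)
p = Fraction(1, 6)  # each face is equally likely

# expected outcome: probability-weighted average of the outcomes
expected = sum(p * x for x in outcomes)
# variance: probability-weighted average squared distance from the expected outcome
variance = sum(p * (x - expected) ** 2 for x in outcomes)
print(expected, variance)  # 7/2 35/12
```

A merely non-deterministic description of the die would list the six possible outcomes but give us nothing to compute with.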

Many machine learning algorithms and models are described in terms of being stochastic.

This is because many optimization and learning algorithms must operate in stochastic domains, and because some algorithms make use of randomness or probabilistic decisions.

Let’s take a closer look at the source of uncertainty and the nature of stochastic algorithms in machine learning.

Stochastic domains are those that involve uncertainty.

… machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources.

— Page 54, Deep Learning, 2016.

This uncertainty can come from a target or objective function that is subjected to statistical noise or random errors.

It can also come from the fact that the data used to fit a model is an incomplete sample from a broader population.

Finally, the models chosen are rarely able to capture all of the aspects of the domain, and instead must generalize to unseen circumstances and lose some fidelity.

Stochastic optimization refers to a field of optimization algorithms that explicitly use randomness to find the optima of an objective function, or optimize an objective function that itself has randomness (statistical noise).

Most commonly, stochastic optimization algorithms seek a balance between exploring the search space and exploiting what has already been learned about the search space in order to home in on the optima. The next locations in the search space are chosen stochastically, that is, probabilistically based on what areas have been searched recently.

Stochastic hill climbing chooses at random from among the uphill moves; the probability of selection can vary with the steepness of the uphill move.

— Page 124, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
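A minimal sketch of stochastic hill climbing is shown below. The objective function, the integer grid, and the starting point are all made up for illustration, and for simplicity the uphill moves are chosen with equal probability rather than weighted by steepness.

```python
# sketch: stochastic hill climbing on a made-up objective over an integer grid
from random import Random

def objective(x, y):
    # single peak at (3, -2); larger values are better
    return -((x - 3) ** 2) - ((y + 2) ** 2)

def stochastic_hill_climb(start, seed):
    rng = Random(seed)
    current = start
    path = [current]
    while True:
        x, y = current
        # the eight surrounding grid points
        neighbors = [(x + dx, y + dy)
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0)]
        # keep only the moves that improve the objective
        uphill = [n for n in neighbors if objective(*n) > objective(x, y)]
        if not uphill:
            return current, path
        # choose at random from among the uphill moves (uniformly, as a simplification)
        current = rng.choice(uphill)
        path.append(current)

# two runs from the same start with different seeds
best_a, path_a = stochastic_hill_climb((-5, 6), seed=1)
best_b, path_b = stochastic_hill_climb((-5, 6), seed=2)
print(best_a, len(path_a))
print(best_b, len(path_b))
```

Both runs end at the same peak because this objective has only one, but the route taken to get there depends on the random choices made along the way.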

Popular examples of stochastic optimization algorithms are:

- Simulated Annealing
- Genetic Algorithm
- Particle Swarm Optimization

Particle swarm optimization (PSO) is a stochastic optimization approach, modeled on the social behavior of bird flocks.

— Page 9, Computational Intelligence: An Introduction.

Many machine learning algorithms are stochastic because they make use of randomness during learning.

Using randomness is a feature, not a bug. It allows the algorithms to avoid getting stuck and achieve results that deterministic (non-stochastic) algorithms cannot achieve.

For example, some machine learning algorithms even include “*stochastic*” in their name such as:

- Stochastic Gradient Descent (optimization algorithm).
- Stochastic Gradient Boosting (ensemble algorithm).

Stochastic gradient descent optimizes the parameters of a model, such as an artificial neural network. It involves randomly shuffling the training dataset before each iteration, which causes the updates to the model parameters to be made in a different order each time. In addition, model weights in a neural network are often initialized to a random starting point.

Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent.

— Page 98, Deep Learning, 2016.
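The two sources of randomness described above, random initial weights and a shuffled order of updates, can be seen in a bare-bones sketch of stochastic gradient descent for a one-variable linear model. The synthetic data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
# sketch: stochastic gradient descent for a one-variable linear model
import numpy as np

# fixed synthetic dataset: y = 2x + 1 plus a little noise (made-up values)
data_rng = np.random.default_rng(0)
X = data_rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0 + data_rng.normal(0.0, 0.1, size=100)

def fit_sgd(seed, n_epochs=50, lr=0.05):
    rng = np.random.default_rng(seed)
    # random starting point for the slope and intercept
    w, b = rng.normal(size=2)
    for _ in range(n_epochs):
        # visit the training examples in a new random order each epoch
        for i in rng.permutation(len(X)):
            error = (w * X[i] + b) - y[i]
            w -= lr * error * X[i]
            b -= lr * error
    return w, b

# same data, same algorithm, different random seeds
w1, b1 = fit_sgd(seed=1)
w2, b2 = fit_sgd(seed=2)
print(w1, b1)
print(w2, b2)
```

Both runs should land near a slope of 2 and an intercept of 1, but the fitted values are not identical because each run started from a different point and processed the examples in a different order.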

Stochastic gradient boosting is an ensemble algorithm built from decision trees. The stochastic aspect refers to the random subset of rows chosen from the training dataset used to construct each tree, specifically to choose the split points of the trees.
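In scikit-learn, row subsampling for gradient boosting is controlled by the *subsample* argument of *GradientBoostingClassifier*. The sketch below fits the same configuration with two different seeds on a synthetic dataset; the dataset and the hyperparameter values are assumptions for illustration.

```python
# sketch: stochastic gradient boosting via row subsampling in scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic binary classification dataset (made up for the example)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

probas = []
for seed in (1, 2):
    # subsample=0.5: each tree is fit on a random half of the training rows
    model = GradientBoostingClassifier(n_estimators=20, subsample=0.5, random_state=seed)
    model.fit(X, y)
    # predicted class probabilities for the first row
    probas.append(model.predict_proba(X[:1])[0])
    print(seed, probas[-1])
```

Any difference between the two printed rows comes from the different random rows drawn for each tree.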

Because many machine learning algorithms make use of randomness, their nature (e.g. behavior and performance) is also stochastic.

The stochastic nature of machine learning algorithms is most commonly seen on complex and nonlinear methods used for classification and regression predictive modeling problems.

These algorithms make use of randomness during the process of constructing a model from the training data, which has the effect of fitting a different model each time the same algorithm is run on the same data. In turn, the slightly different models have different performance when evaluated on a holdout test dataset.

This stochastic behavior of nonlinear machine learning algorithms is challenging for beginners who assume that learning algorithms will be deterministic, e.g. fit the same model when the algorithm is run on the same data.

Because of this stochastic behavior, the performance of the model must be summarized using summary statistics that describe the mean or expected performance of the model, rather than the performance of the model from any single training run.
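One way to do this is to fit the same model several times with different random seeds and report the mean and standard deviation of the scores. The sketch below uses scikit-learn's *SGDClassifier* on a synthetic dataset; the dataset, the choice of model, and the number of repeats are assumptions for illustration.

```python
# sketch: summarize a stochastic algorithm over repeated training runs
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# fixed dataset and fixed train/test split
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit and evaluate the same model with different random seeds
scores = []
for seed in range(10):
    model = SGDClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

# report the expected performance and its spread
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

The standard deviation gives a sense of how much a single run can differ from the expected performance.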


This section provides more resources on the topic if you are looking to go deeper.

- How to Generate Random Numbers in Python
- Introduction to Random Number Generators for Machine Learning in Python
- Embrace Randomness in Machine Learning
- Why Initialize a Neural Network with Random Weights?
- Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

- Random variable, Wikipedia.
- Statistical randomness, Wikipedia.
- Stochastic, Wikipedia.
- Stochastic process, Wikipedia.
- Stochastic optimization.

In this post, you discovered a gentle introduction to stochasticity in machine learning.

Specifically, you learned:

- A variable or process is stochastic if there is uncertainty or randomness involved in the outcomes.
- Stochastic is a synonym for random and probabilistic, although it is different from non-deterministic.
- Many machine learning algorithms are stochastic because they explicitly use randomness during optimization or learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Connect Model Input Data With Predictions for Machine Learning appeared first on Machine Learning Mastery.

Fitting a model to a training dataset is so easy today with libraries like scikit-learn.

A model can be fit and evaluated on a dataset in just a few lines of code. It is so easy that it has become a problem.

The same few lines of code are repeated again and again and it may not be obvious how to actually use the model to make a prediction. Or, if a prediction is made, how to relate the predicted values to the actual input values.

I know that this is the case because I get many emails with the question:

How do I connect the predicted values with the input data?

This is a common problem.

In this tutorial, you will discover how to relate the predicted values with the inputs to a machine learning model.

After completing this tutorial, you will know:

- How to fit and evaluate the model on a training dataset.
- How to use the fit model to make predictions one at a time and in batches.
- How to connect the predicted values with the inputs to the model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Prepare a Training Dataset
- How to Fit a Model on the Training Dataset
- How to Connect Predictions With Inputs to the Model

Let’s start off by defining a dataset that we can use with our model.

You may have your own dataset in a CSV file or in a NumPy array in memory.

In this case, we will use a simple two-class or binary classification problem with two numerical input variables.

- **Inputs**: Two numerical input variables.
- **Outputs**: A class label as either a 0 or 1.

We can use the make_blobs() scikit-learn function to create this dataset with 1,000 examples.

The example below creates the dataset with separate arrays for the input (*X*) and outputs (*y*).

    # example of creating a test dataset
    from sklearn.datasets import make_blobs
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
    # summarize the shape of the arrays
    print(X.shape, y.shape)

Running the example creates the dataset and prints the shape of each of the arrays.

We can see that there are 1,000 rows for the 1,000 samples in the dataset. We can also see that the input data has two columns for the two input variables and that the output array is one long array of class labels for each of the rows in the input data.

(1000, 2) (1000,)

Next, we will fit a model on this training dataset.

Now that we have a training dataset, we can fit a model on the data.

This means that we will provide all of the training data to a learning algorithm and let the learning algorithm discover the mapping between the inputs and the output class label that minimizes the prediction error.

In this case, because it is a two-class problem, we will try the logistic regression classification algorithm.

This can be achieved using the LogisticRegression class from scikit-learn.

First, the model must be defined with any specific configuration we require. In this case, we will use the efficient ‘*lbfgs*’ solver.

Next, the model is fit on the training dataset by calling the *fit()* function and passing in the training dataset.

Finally, we can evaluate the model by first using it to make predictions on the training dataset by calling *predict()* and then comparing the predictions to the expected class labels and calculating the accuracy.

The complete example is listed below.

    # fit a logistic regression on the training dataset
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_blobs
    from sklearn.metrics import accuracy_score
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
    # define model
    model = LogisticRegression(solver='lbfgs')
    # fit model
    model.fit(X, y)
    # make predictions
    yhat = model.predict(X)
    # evaluate predictions
    acc = accuracy_score(y, yhat)
    print(acc)

Running the example fits the model on the training dataset and then prints the classification accuracy.

In this case, we can see that the model has a 100% classification accuracy on the training dataset.

1.0

Now that we know how to fit and evaluate a model on the training dataset, let’s get to the root of the question.

*How do you connect inputs of the model to the outputs?*

A fit machine learning model takes inputs and makes a prediction.

This could be one row of data at a time; for example:

- **Input**: 2.12309797 -1.41131072
- **Output**: 1

This is straightforward with our model.

For example, we can make a prediction with an array input and get one output and we know that the two are directly connected.

The input must be defined as an array of numbers, specifically 1 row with 2 columns. We can achieve this by defining the example as a list of rows with a list of columns for each row; for example:

    ...
    # define input
    new_input = [[2.12309797, -1.41131072]]

We can then provide this as input to the model and make a prediction.

    ...
    # get prediction for new input
    new_output = model.predict(new_input)

Tying this together with fitting the model from the previous section, the complete example is listed below.

    # make a single prediction with the model
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_blobs
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
    # define model
    model = LogisticRegression(solver='lbfgs')
    # fit model
    model.fit(X, y)
    # define input
    new_input = [[2.12309797, -1.41131072]]
    # get prediction for new input
    new_output = model.predict(new_input)
    # summarize input and output
    print(new_input, new_output)

Running the example defines the new input and makes a prediction, then prints both the input and the output.

We can see that in this case, the model predicts class label 1 for the inputs.

[[2.12309797, -1.41131072]] [1]

If we were using the model in our own application, this usage of the model would allow us to directly relate the inputs and outputs for each prediction made.

If we needed to replace the labels 0 and 1 with something meaningful like “*spam*” and “*not spam*”, we could do that with a simple if-statement.
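A sketch of that idea is below. Which integer stands for which label depends on how your own data was encoded, so the mapping used here (1 for spam, 0 for not spam) and the example predictions are assumptions.

```python
# sketch: replace integer class labels with meaningful names
# (the mapping of 1 to 'spam' and 0 to 'not spam' is an assumption)
predictions = [1, 0, 1]

# with a simple if-statement
for label in predictions:
    if label == 1:
        name = 'spam'
    else:
        name = 'not spam'
    print(label, name)

# or with a lookup table, which scales better to more than two classes
names = {0: 'not spam', 1: 'spam'}
named = [names[label] for label in predictions]
print(named)
```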

So far so good.

**What happens when the model is used to make multiple predictions at once?**

That is, how do we relate the predictions to the inputs when multiple rows or multiple samples are provided to the model at once?

For example, we could make a prediction for each of the 1,000 examples in the training dataset as we did in the previous section when evaluating the model. In this case, the model would make 1,000 distinct predictions and return an array of 1,000 integer values. One prediction for each of the 1,000 input rows of data.

Importantly, the order of the predictions in the output array matches the order of rows provided as input to the model when making a prediction. This means that the input row at index 0 matches the prediction at index 0; the same is true for index 1, index 2, all the way to index 999.

Therefore, we can relate the inputs and outputs directly based on their index, with the knowledge that the order is preserved when making a prediction on many rows of inputs.

Let’s make this concrete with an example.

First, we can make a prediction for each row of input in the training dataset:

    ...
    # make predictions on the entire training dataset
    yhat = model.predict(X)

We can then step through the indexes and access the input and the predicted output for each.

This shows precisely how to connect the predictions with the input rows. For example, the input at row 0 and the prediction at index 0:

    ...
    print(X[0], yhat[0])

In this case, we will just look at the first 10 rows and their predictions.

    ...
    # connect predictions with inputs
    for i in range(10):
        print(X[i], yhat[i])

Tying this together, the complete example of making a prediction for each row in the training data and connecting the predictions with the inputs is listed below.

    # make predictions for many rows and connect them with the inputs
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_blobs
    # create the inputs and outputs
    X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
    # define model
    model = LogisticRegression(solver='lbfgs')
    # fit model
    model.fit(X, y)
    # make predictions on the entire training dataset
    yhat = model.predict(X)
    # connect predictions with inputs
    for i in range(10):
        print(X[i], yhat[i])

Running the example, the model makes 1,000 predictions for the 1,000 rows in the training dataset, then connects the inputs to the predicted values for the first 10 examples.

This provides a template that you can use and adapt for your own predictive modeling projects to connect predictions to the input rows via their row index.

    [ 1.23839154 -2.8475005 ] 1
    [-1.25884111 -8.57055785] 0
    [ -0.86599821 -10.50446358] 0
    [ 0.59831673 -1.06451727] 1
    [ 2.12309797 -1.41131072] 1
    [-1.53722693 -9.61845366] 0
    [ 0.92194131 -0.68709327] 1
    [-1.31478732 -8.78528161] 0
    [ 1.57989896 -1.462412 ] 1
    [ 1.36989667 -1.3964704 ] 1
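Because the order is preserved, another option is to attach the predictions to the inputs as an extra column, so each row carries its own prediction. The sketch below does this with NumPy's *column_stack()*; the variable name *results* is just an illustrative choice.

```python
# sketch: store each input row alongside its prediction in a single array
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define and fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
# make predictions on the entire training dataset
yhat = model.predict(X)

# one row per example: input 1, input 2, predicted label
results = np.column_stack((X, yhat))
print(results.shape)
print(results[:3])
```

The combined array can then be sorted, filtered, or saved to file without losing the link between each input row and its prediction.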

This section provides more resources on the topic if you are looking to go deeper.

- Your First Machine Learning Project in Python Step-By-Step
- How to Make Predictions with scikit-learn

- sklearn.datasets.make_blobs API
- sklearn.metrics.accuracy_score API
- sklearn.linear_model.LogisticRegression API

In this tutorial, you discovered how to relate the predicted values with the inputs to a machine learning model.

Specifically, you learned:

- How to fit and evaluate the model on a training dataset.
- How to use the fit model to make predictions one at a time and in batches.
- How to connect the predicted values with the inputs to the model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
