The post Stacking Ensemble Machine Learning With Python appeared first on Machine Learning Mastery.

Stacking or Stacked Generalization is an ensemble machine learning algorithm.

It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms.

The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions that have better performance than any single model in the ensemble.

In this tutorial, you will discover the stacked generalization ensemble or stacking in Python.

After completing this tutorial, you will know:

- Stacking is an ensemble machine learning algorithm that learns how to best combine the predictions from multiple well-performing machine learning models.
- The scikit-learn library provides a standard implementation of the stacking ensemble in Python.
- How to use stacking ensembles for regression and classification predictive modeling.

Let’s get started.

This tutorial is divided into four parts; they are:

- Stacked Generalization
- Stacking Scikit-Learn API
- Stacking for Classification
- Stacking for Regression

Stacked Generalization or “*Stacking*” for short is an ensemble machine learning algorithm.

It involves combining the predictions from multiple machine learning models on the same dataset, like bagging and boosting.

Stacking addresses the question:

- Given multiple machine learning models that are skillful on a problem, but in different ways, how do you choose which model to use (trust)?

The approach to this question is to use another machine learning model that learns when to use or trust each model in the ensemble.

- Unlike bagging, in stacking, the models are typically different (e.g. not all decision trees) and fit on the same dataset (e.g. instead of samples of the training dataset).
- Unlike boosting, in stacking, a single model is used to learn how to best combine the predictions from the contributing models (e.g. instead of a sequence of models that correct the predictions of prior models).

The architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model.

- **Level-0 Models (*Base-Models*)**: Models fit on the training data and whose predictions are compiled.
- **Level-1 Model (*Meta-Model*)**: Model that learns how to best combine the predictions of the base models.

The meta-model is trained on the predictions made by base models on out-of-sample data. That is, data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model.

The outputs from the base models used as input to the meta-model may be real values in the case of regression, and probability values, probability-like values, or class labels in the case of classification.
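To make the difference between these output types concrete, the sketch below compares the shape of the meta-model's input when base models pass class probabilities versus class labels. It assumes scikit-learn's StackingClassifier and its "*stack_method*" argument; the three-class dataset and the *meta_feature_shape()* helper are illustrative choices, not part of the worked examples in this tutorial.

```python
# sketch: how the base-model output type changes the meta-model's input
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# illustrative three-class dataset (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1)

# hypothetical helper: fit a small stack and report the meta-feature shape
def meta_feature_shape(stack_method):
    base = [('lr', LogisticRegression(max_iter=1000)), ('bayes', GaussianNB())]
    model = StackingClassifier(estimators=base, final_estimator=LogisticRegression(max_iter=1000), stack_method=stack_method, cv=5)
    model.fit(X, y)
    # transform() returns the base-model outputs that feed the meta-model
    return model.transform(X).shape

# probabilities: one column per class per base model
proba_shape = meta_feature_shape('predict_proba')
# class labels: one column per base model
label_shape = meta_feature_shape('predict')
print(proba_shape, label_shape)
```

With two base models and three classes, probabilities give the meta-model six input columns, whereas class labels give it only two.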

The most common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model.

The training data for the meta-model may also include the inputs to the base models, e.g. input elements of the training data. This can provide additional context to the meta-model as to how to best combine the predictions from the base models.

Once the training dataset is prepared for the meta-model, the meta-model can be trained in isolation on this dataset, and the base-models can be trained on the entire original training dataset.
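The procedure described above can be sketched by hand. This is a minimal illustration, assuming scikit-learn's *cross_val_predict()* function to collect out-of-fold predictions; the dataset, the choice of two base models, and the *stacked_predict()* helper are assumptions made for the sketch rather than a definitive implementation.

```python
# sketch: prepare a meta-model training set from out-of-fold predictions
from numpy import column_stack
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# illustrative dataset (an assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=1)

# level-0 models
base_models = [KNeighborsRegressor(), DecisionTreeRegressor(random_state=1)]

# each row is predicted by a copy of the model that did not see it in training
oof = [cross_val_predict(m, X, y, cv=5) for m in base_models]
meta_X = column_stack(oof)

# level-1 model is fit in isolation on the out-of-fold predictions
meta_model = LinearRegression().fit(meta_X, y)

# base models are then fit on the entire training dataset
for m in base_models:
    m.fit(X, y)

# hypothetical helper: base predictions in, combined prediction out
def stacked_predict(rows):
    level0 = column_stack([m.predict(rows) for m in base_models])
    return meta_model.predict(level0)

yhat = stacked_predict(X[:3])
print(meta_X.shape, yhat.shape)
```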

Stacking is appropriate when multiple different machine learning models have skill on a dataset, but have skill in different ways. Another way to say this is that the predictions made by the models or the errors in predictions made by the models are uncorrelated or have a low correlation.
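One rough way to check this is to measure the correlation between the out-of-fold errors of two candidate base models. The sketch below is an assumption-laden illustration (the dataset and the pair of models are arbitrary choices), and a single correlation coefficient is only a coarse summary of how differently two models behave.

```python
# sketch: correlation between the out-of-fold errors of two models
from numpy import corrcoef
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# illustrative dataset (an assumption for this sketch)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=1)

# residuals on data each model did not see during training
knn_err = y - cross_val_predict(KNeighborsRegressor(), X, y, cv=5)
cart_err = y - cross_val_predict(DecisionTreeRegressor(random_state=1), X, y, cv=5)

# Pearson correlation between the two error vectors
corr = corrcoef(knn_err, cart_err)[0, 1]
print('Error correlation: %.3f' % corr)
```

A value near 1 suggests the two models make similar mistakes, leaving less for a meta-model to exploit.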

Base-models are often complex and diverse. As such, it is often a good idea to use a range of models that make very different assumptions about how to solve the predictive modeling task, such as linear models, decision trees, support vector machines, neural networks, and more. Other ensemble algorithms may also be used as base-models, such as random forests.

**Base-Models**: Use a diverse range of models that make different assumptions about the prediction task.

The meta-model is often simple, providing a smooth interpretation of the predictions made by the base models. As such, linear models are often used as the meta-model, such as linear regression for regression tasks (predicting a numeric value) and logistic regression for classification tasks (predicting a class label). Although this is common, it is not required.

- **Regression Meta-Model**: Linear Regression.
- **Classification Meta-Model**: Logistic Regression.

The use of a simple linear model as the meta-model often gives stacking the colloquial name “*blending*,” because the prediction is a weighted average, or blend, of the predictions made by the base models.
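The weighted-average view can be illustrated with a few lines of NumPy. The predictions below are made-up numbers, and the targets are constructed as an exact 0.7 / 0.3 blend so that the recovered weights are easy to verify. The sketch also omits the intercept that a linear regression meta-model would normally fit.

```python
# sketch: a linear meta-model as a weighted blend of base predictions
from numpy import array, column_stack
from numpy.linalg import lstsq

# hypothetical predictions from two base models on five examples
pred_a = array([1.0, 2.0, 3.0, 4.0, 5.0])
pred_b = array([2.0, 1.0, 4.0, 3.0, 6.0])

# targets built as an exact blend so the weights are recoverable
y = 0.7 * pred_a + 0.3 * pred_b

# least squares finds the weight given to each base model
P = column_stack([pred_a, pred_b])
weights, _, _, _ = lstsq(P, y, rcond=None)

# the blended prediction is the weighted sum of the base predictions
blended = P @ weights
print(weights)
```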

The super learner may be considered a specialized type of stacking.

Stacking is designed to improve modeling performance, although it is not guaranteed to result in an improvement in all cases.

Achieving an improvement in performance depends on the complexity of the problem: it must be sufficiently well represented by the training data, yet complex enough that there is more to learn by combining predictions. It also depends on the choice of base models and whether they are sufficiently skillful and sufficiently uncorrelated in their predictions (or errors).

If a base-model performs as well as or better than the stacking ensemble, the base model should be used instead, given its lower complexity (e.g. it’s simpler to describe, train and maintain).

Stacking can be implemented from scratch, although this can be challenging for beginners.

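For an example of implementing stacking from scratch in Python, see the tutorial *How to Implement Stacked Generalization (Stacking) From Scratch With Python*.

For an example of implementing stacking from scratch for deep learning, see the tutorial *How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras*.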

The scikit-learn Python machine learning library provides an implementation of stacking for machine learning.

It is available in version 0.22 of the library and higher.

First, confirm that you are using a modern version of the library by running the following script:

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

0.22.1

Stacking is provided via the StackingRegressor and StackingClassifier classes.

Both models operate the same way and take largely the same arguments. Using the model requires that you specify a list of estimators (level-0 models), and a final estimator (level-1 or meta-model).

A list of level-0 models or base models is provided via the “*estimators*” argument. This is a Python list where each element in the list is a tuple with the name of the model and the configured model instance.

For example, below defines two level-0 models:

    ...
    # define the level-0 models as (name, model) tuples
    models = [('lr', LogisticRegression()), ('svm', SVC())]
    stacking = StackingClassifier(estimators=models)

Each model in the list may also be a Pipeline, including any data preparation required by the model prior to fitting the model on the training dataset. For example:

    ...
    # a level-0 model may be a pipeline that includes its own data preparation
    models = [('lr', LogisticRegression()), ('svm', make_pipeline(StandardScaler(), SVC()))]
    stacking = StackingClassifier(estimators=models)

The level-1 model or meta-model is provided via the “*final_estimator*” argument. By default, this is set to *RidgeCV* (a regularized linear regression) for regression and *LogisticRegression* for classification, and these are sensible defaults that you probably do not want to change.

The dataset for the meta-model is prepared using cross-validation. By default, 5-fold cross-validation is used, although this can be changed via the “*cv*” argument and set to either a number (e.g. 10 for 10-fold cross-validation) or a cross-validation object (e.g. *StratifiedKFold*).
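Both ways of setting the “*cv*” argument can be sketched as follows. The small dataset and the pair of base models are assumptions made for illustration; the shuffled *StratifiedKFold* configuration is just one possible cross-validation object.

```python
# sketch: setting cv to a number or to a cross-validation object
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# illustrative dataset (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
models = [('lr', LogisticRegression()), ('cart', DecisionTreeClassifier(random_state=1))]

# cv as a number: 10-fold cross-validation
stacking_int = StackingClassifier(estimators=models, final_estimator=LogisticRegression(), cv=10)

# cv as an object: shuffled, stratified 10-fold cross-validation
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
stacking_obj = StackingClassifier(estimators=models, final_estimator=LogisticRegression(), cv=folds)

# both configurations are fit and used in the same way
stacking_int.fit(X, y)
stacking_obj.fit(X, y)
print(stacking_obj.predict(X[:5]))
```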

Sometimes, better performance can be achieved if the dataset prepared for the meta-model also includes inputs to the level-0 models, e.g. the input training data. This can be achieved by setting the “*passthrough*” argument to True and is not enabled by default.
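The effect of “*passthrough*” on the meta-model's input can be seen by comparing the shape returned by *transform()* with and without it. This is a sketch on an assumed binary dataset with two base models; the *meta_inputs()* helper is hypothetical.

```python
# sketch: passthrough adds the original inputs to the meta-model's features
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# illustrative binary dataset with 20 input features
X, y = make_classification(n_samples=200, n_features=20, n_informative=15, n_redundant=5, random_state=1)

# hypothetical helper: fit a stack and report the shape of the meta-model's input
def meta_inputs(passthrough):
    models = [('lr', LogisticRegression(max_iter=1000)), ('svm', SVC())]
    model = StackingClassifier(estimators=models, final_estimator=LogisticRegression(max_iter=1000), passthrough=passthrough)
    model.fit(X, y)
    return model.transform(X).shape

# one column per base model
without_inputs = meta_inputs(False)
# one column per base model plus the 20 original input features
with_inputs = meta_inputs(True)
print(without_inputs, with_inputs)
```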

Now that we are familiar with the stacking API in scikit-learn, let’s look at some worked examples.

In this section, we will look at using stacking for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a suite of different machine learning models on the dataset.

Specifically, we will evaluate the following five algorithms:

- Logistic Regression.
- k-Nearest Neighbors.
- Decision Tree.
- Support Vector Machine.
- Naive Bayes.

Each algorithm will be evaluated using default model hyperparameters. The function *get_models()* below creates the models we wish to evaluate.

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['lr'] = LogisticRegression()
        models['knn'] = KNeighborsClassifier()
        models['cart'] = DecisionTreeClassifier()
        models['svm'] = SVC()
        models['bayes'] = GaussianNB()
        return models

Each model will be evaluated using repeated k-fold cross-validation.

The *evaluate_model()* function below takes a model instance and returns a list of scores from three repeats of stratified 10-fold cross-validation.

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
        return scores

We can then report the mean performance of each algorithm and also create a box and whisker plot to compare the distribution of accuracy scores for each algorithm.

Tying this together, the complete example is listed below.

    # compare standalone models for binary classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
        return X, y

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['lr'] = LogisticRegression()
        models['knn'] = KNeighborsClassifier()
        models['cart'] = DecisionTreeClassifier()
        models['svm'] = SVC()
        models['bayes'] = GaussianNB()
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        scores = evaluate_model(model)
        results.append(scores)
        names.append(name)
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example first reports the mean and standard deviation accuracy for each model.

We can see that in this case, SVM performs the best with about 95.7 percent mean accuracy.

    >lr 0.866 (0.029)
    >knn 0.931 (0.025)
    >cart 0.821 (0.050)
    >svm 0.957 (0.020)
    >bayes 0.833 (0.031)

A box-and-whisker plot is then created comparing the distribution of accuracy scores for each model, allowing us to clearly see that KNN and SVM perform better on average than LR, CART, and Bayes.

Here we have five different algorithms that perform well, presumably in different ways on this dataset.

Next, we can try to combine these five models into a single ensemble model using stacking.

We can use a logistic regression model to learn how to best combine the predictions from each of the separate five models.

The *get_stacking()* function below defines the StackingClassifier model by first defining a list of tuples for the five base models, then defining the logistic regression meta-model to combine the predictions from the base models using 5-fold cross-validation.

    # get a stacking ensemble of models
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('lr', LogisticRegression()))
        level0.append(('knn', KNeighborsClassifier()))
        level0.append(('cart', DecisionTreeClassifier()))
        level0.append(('svm', SVC()))
        level0.append(('bayes', GaussianNB()))
        # define meta learner model
        level1 = LogisticRegression()
        # define the stacking ensemble
        model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
        return model

We can include the stacking ensemble in the list of models to evaluate, along with the standalone models.

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['lr'] = LogisticRegression()
        models['knn'] = KNeighborsClassifier()
        models['cart'] = DecisionTreeClassifier()
        models['svm'] = SVC()
        models['bayes'] = GaussianNB()
        models['stacking'] = get_stacking()
        return models

Our expectation is that the stacking ensemble will perform better than any single base model.

This is not always the case and if it is not the case, then the base model should be used in favor of the ensemble model.

The complete example of evaluating the stacking ensemble model alongside the standalone models is listed below.

    # compare ensemble to each baseline classifier
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import StackingClassifier
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
        return X, y

    # get a stacking ensemble of models
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('lr', LogisticRegression()))
        level0.append(('knn', KNeighborsClassifier()))
        level0.append(('cart', DecisionTreeClassifier()))
        level0.append(('svm', SVC()))
        level0.append(('bayes', GaussianNB()))
        # define meta learner model
        level1 = LogisticRegression()
        # define the stacking ensemble
        model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
        return model

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['lr'] = LogisticRegression()
        models['knn'] = KNeighborsClassifier()
        models['cart'] = DecisionTreeClassifier()
        models['svm'] = SVC()
        models['bayes'] = GaussianNB()
        models['stacking'] = get_stacking()
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        scores = evaluate_model(model)
        results.append(scores)
        names.append(name)
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example first reports the performance of each model.

This includes the performance of each base model, then the stacking ensemble.

In this case, we can see that the stacking ensemble appears to perform better than any single model on average, achieving an accuracy of about 96.4 percent.

    >lr 0.866 (0.029)
    >knn 0.931 (0.025)
    >cart 0.820 (0.044)
    >svm 0.957 (0.020)
    >bayes 0.833 (0.031)
    >stacking 0.964 (0.019)

A box plot is created showing the distribution of model classification accuracies.

Here, we can see that the mean and median accuracy for the stacking model sit slightly higher than those of the SVM model.

If we choose a stacking ensemble as our final model, we can fit and use it to make predictions on new data just like any other model.

First, the stacking ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

    # make a prediction with a stacking ensemble
    from sklearn.datasets import make_classification
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define the base models
    level0 = list()
    level0.append(('lr', LogisticRegression()))
    level0.append(('knn', KNeighborsClassifier()))
    level0.append(('cart', DecisionTreeClassifier()))
    level0.append(('svm', SVC()))
    level0.append(('bayes', GaussianNB()))
    # define meta learner model
    level1 = LogisticRegression()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=5)
    # fit the model on all available data
    model.fit(X, y)
    # make a prediction for one example
    data = [[2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]]
    yhat = model.predict(data)
    # predict() returns an array, so report its single element
    print('Predicted Class: %d' % yhat[0])

Running the example fits the stacking ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0
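Once the ensemble is fit, it can be informative to look at what the meta-model learned. The sketch below assumes the fitted meta-model is exposed as the *final_estimator_* attribute and that, for this binary problem, each base model contributes a single input column, so there is one logistic regression coefficient per base model. Treat the coefficients as a rough indication only, since the base-model outputs are on different scales (e.g. probabilities versus SVM decision values).

```python
# sketch: inspect the weights the meta-model assigns to each base model
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define the base models
level0 = [
    ('lr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('cart', DecisionTreeClassifier()),
    ('svm', SVC()),
    ('bayes', GaussianNB()),
]
# define and fit the stacking ensemble
model = StackingClassifier(estimators=level0, final_estimator=LogisticRegression(), cv=5)
model.fit(X, y)

# one coefficient per base model for a binary classification task
names = [name for name, _ in level0]
coefs = model.final_estimator_.coef_[0]
for name, coef in zip(names, coefs):
    print('%s: %.3f' % (name, coef))
```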

In this section, we will look at using stacking for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(1000, 20) (1000,)

Next, we can evaluate a suite of different machine learning models on the dataset.

Specifically, we will evaluate the following three algorithms:

- k-Nearest Neighbors.
- Decision Tree.
- Support Vector Regression.

**Note**: The test dataset can be trivially solved using a linear regression model as the dataset was created using a linear model under the covers. As such, we will leave this model out of the example so we can demonstrate the benefit of the stacking ensemble method.
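The claim in the note can be checked with a quick sketch that evaluates a linear regression model on the same synthetic dataset, using the same test harness as the rest of this section. The variable name *lr_mae* is an illustrative choice.

```python
# sketch: confirm that linear regression nearly solves the synthetic dataset
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LinearRegression

# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
# evaluate with three repeats of 10-fold cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, scoring='neg_mean_absolute_error', cv=cv)
# flip the sign to report a positive error
lr_mae = -mean(scores)
print('Linear regression MAE: %.3f' % lr_mae)
```

The error should be close to the level of the added noise, far below what the nonlinear models achieve on this dataset.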

Each algorithm will be evaluated using the default model hyperparameters. The function *get_models()* below creates the models we wish to evaluate.

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['knn'] = KNeighborsRegressor()
        models['cart'] = DecisionTreeRegressor()
        models['svm'] = SVR()
        return models

Each model will be evaluated using repeated k-fold cross-validation. The *evaluate_model()* function below takes a model instance and returns a list of scores from three repeats of 10-fold cross-validation.

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
        return scores

We can then report the mean performance of each algorithm and also create a box and whisker plot to compare the distribution of error scores for each algorithm.

In this case, model performance will be reported using the mean absolute error (MAE). The scikit-learn library negates this error so that larger scores are better, meaning scores range from negative infinity to 0, where 0 is the best possible score.
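If a positive error is easier to read, the sign can simply be flipped back after scoring. The sketch below shows this for a single model; the use of plain 10-fold cross-validation here is a simplification for illustration.

```python
# sketch: convert negated MAE scores back to positive errors
from numpy import absolute
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
# scores are negated MAE values, so each is zero or negative
scores = cross_val_score(KNeighborsRegressor(), X, y, scoring='neg_mean_absolute_error', cv=10)
# take the absolute value to recover the MAE
mae = absolute(scores)
print('Negative MAE: %.3f, MAE: %.3f' % (mean(scores), mean(mae)))
```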

Tying this together, the complete example is listed below.

    # compare machine learning models for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
        return X, y

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['knn'] = KNeighborsRegressor()
        models['cart'] = DecisionTreeRegressor()
        models['svm'] = SVR()
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        scores = evaluate_model(model)
        results.append(scores)
        names.append(name)
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example first reports the mean and standard deviation MAE for each model.

We can see that in this case, KNN performs the best with a mean negative MAE of about -101.

    >knn -101.019 (7.161)
    >cart -148.100 (11.039)
    >svm -162.419 (12.565)

A box-and-whisker plot is then created comparing the distribution of negative MAE scores for each model.

Here we have three different algorithms that perform well, presumably in different ways on this dataset.

Next, we can try to combine these three models into a single ensemble model using stacking.

We can use a linear regression model to learn how to best combine the predictions from each of the separate three models.

The *get_stacking()* function below defines the StackingRegressor model by first defining a list of tuples for the three base models, then defining the linear regression meta-model to combine the predictions from the base models using 5-fold cross-validation.

    # get a stacking ensemble of models
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('knn', KNeighborsRegressor()))
        level0.append(('cart', DecisionTreeRegressor()))
        level0.append(('svm', SVR()))
        # define meta learner model
        level1 = LinearRegression()
        # define the stacking ensemble
        model = StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
        return model

We can include the stacking ensemble in the list of models to evaluate, along with the standalone models.

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['knn'] = KNeighborsRegressor()
        models['cart'] = DecisionTreeRegressor()
        models['svm'] = SVR()
        models['stacking'] = get_stacking()
        return models

Our expectation is that the stacking ensemble will perform better than any single base model.

This is not always the case, and if it is not the case, then the base model should be used in favor of the ensemble model.

The complete example of evaluating the stacking ensemble model alongside the standalone models is listed below.

    # compare ensemble to the standalone models for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import StackingRegressor
    from matplotlib import pyplot

    # get the dataset
    def get_dataset():
        X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
        return X, y

    # get a stacking ensemble of models
    def get_stacking():
        # define the base models
        level0 = list()
        level0.append(('knn', KNeighborsRegressor()))
        level0.append(('cart', DecisionTreeRegressor()))
        level0.append(('svm', SVR()))
        # define meta learner model
        level1 = LinearRegression()
        # define the stacking ensemble
        model = StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
        return model

    # get a list of models to evaluate
    def get_models():
        models = dict()
        models['knn'] = KNeighborsRegressor()
        models['cart'] = DecisionTreeRegressor()
        models['svm'] = SVR()
        models['stacking'] = get_stacking()
        return models

    # evaluate a given model using cross-validation
    def evaluate_model(model):
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
        return scores

    # define dataset
    X, y = get_dataset()
    # get the models to evaluate
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        scores = evaluate_model(model)
        results.append(scores)
        names.append(name)
        print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    # plot model performance for comparison
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example first reports the performance of each model. This includes the performance of each base model, then the stacking ensemble.

In this case, we can see that the stacking ensemble appears to perform better than any single model on average, achieving a mean negative MAE of about -57.

    >knn -101.019 (7.161)
    >cart -148.017 (10.635)
    >svm -162.419 (12.565)
    >stacking -56.893 (5.253)

A box plot is created showing the distribution of model error scores. Here, we can see that the mean and median scores for the stacking model sit much higher than those of any individual model.

If we choose a stacking ensemble as our final model, we can fit and use it to make predictions on new data just like any other model.

First, the stacking ensemble is fit on all available data, then the *predict()* function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

    # make a prediction with a stacking ensemble
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import StackingRegressor
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=1)
    # define the base models
    level0 = list()
    level0.append(('knn', KNeighborsRegressor()))
    level0.append(('cart', DecisionTreeRegressor()))
    level0.append(('svm', SVR()))
    # define meta learner model
    level1 = LinearRegression()
    # define the stacking ensemble
    model = StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
    # fit the model on all available data
    model.fit(X, y)
    # make a prediction for one example
    data = [[0.59332206,-0.56637507,1.34808718,-0.57054047,-0.72480487,1.05648449,0.77744852,0.07361796,0.88398267,2.02843157,1.01902732,0.11227799,0.94218853,0.26741783,0.91458143,-0.72759572,1.08842814,-0.61450942,-0.69387293,1.69169009]]
    yhat = model.predict(data)
    # predict() returns an array, so report its single element
    print('Predicted Value: %.3f' % yhat[0])

Running the example fits the stacking ensemble model on the entire dataset and then uses it to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Value: 556.264

This section provides more resources on the topic if you are looking to go deeper.

- How to Implement Stacked Generalization (Stacking) From Scratch With Python
- How to Develop a Stacking Ensemble for Deep Learning Neural Networks in Python With Keras
- How to Develop Super Learner Ensembles in Python
- How to Use Out-of-Fold Predictions in Machine Learning
- A Gentle Introduction to k-fold Cross-Validation

- Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- The Elements of Statistical Learning, 2017.
- Machine Learning: A Probabilistic Perspective, 2012.

- sklearn.ensemble.StackingClassifier API.
- sklearn.ensemble.StackingRegressor API.
- sklearn.datasets.make_classification API.
- sklearn.datasets.make_regression API.

In this tutorial, you discovered the stacked generalization ensemble or stacking in Python.

Specifically, you learned:

- Stacking is an ensemble machine learning algorithm that learns how to best combine the predictions from multiple well-performing machine learning models.
- The scikit-learn library provides a standard implementation of the stacking ensemble in Python.
- How to use stacking ensembles for regression and classification predictive modeling.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post 4 Types of Classification Tasks in Machine Learning appeared first on Machine Learning Mastery.

Machine learning is a field of study and is concerned with algorithms that learn from examples.

Classification is a task that requires the use of machine learning algorithms that learn how to assign a class label to examples from the problem domain. An easy-to-understand example is classifying emails as “*spam*” or “*not spam*.”

There are many different types of classification tasks that you may encounter in machine learning and specialized approaches to modeling that may be used for each.

In this tutorial, you will discover different types of classification predictive modeling in machine learning.

After completing this tutorial, you will know:

- Classification predictive modeling involves assigning a class label to input examples.
- Binary classification refers to predicting one of two classes and multi-class classification involves predicting one of more than two classes.
- Multi-label classification involves predicting one or more classes for each example and imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal.

Let’s get started.

This tutorial is divided into five parts; they are:

- Classification Predictive Modeling
- Binary Classification
- Multi-Class Classification
- Multi-Label Classification
- Imbalanced Classification

In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data.

Examples of classification problems include:

- Given an email, classify if it is spam or not.
- Given a handwritten character, classify it as one of the known characters.
- Given recent user behavior, classify as churn or not.

From a modeling perspective, classification requires a training dataset with many examples of inputs and outputs from which to learn.

A model will use the training dataset and will calculate how to best map examples of input data to specific class labels. As such, the training dataset must be sufficiently representative of the problem and have many examples of each class label.

Class labels are often string values, e.g. “*spam*,” “*not spam*,” and must be mapped to numeric values before being provided to an algorithm for modeling. This is often referred to as label encoding, where a unique integer is assigned to each class label, e.g. “*not spam*” = 0, “*spam*” = 1.
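As a minimal sketch of label encoding, the snippet below uses scikit-learn's `LabelEncoder` on a handful of made-up labels (the label list is an illustrative assumption). Note that `LabelEncoder` assigns integers in sorted order of the class names, so the mapping is determined by the strings rather than chosen by hand.

```python
# sketch: map string class labels to integers with LabelEncoder
from sklearn.preprocessing import LabelEncoder

# made-up labels for illustration only
labels = ["spam", "not spam", "not spam", "spam", "not spam"]

# fit the encoder and transform the labels in one step
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes are stored in sorted order; the index is the integer code
print(list(encoder.classes_))
print(encoded)

# the mapping can be reversed to recover the original strings
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

If a specific mapping is required, a plain dictionary lookup may be the simpler choice.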

There are many different types of classification algorithms for modeling classification predictive modeling problems.

There is no good theory on how to map algorithms onto problem types; instead, it is generally recommended that a practitioner use controlled experiments and discover which algorithm and algorithm configuration results in the best performance for a given classification task.

Classification predictive modeling algorithms are evaluated based on their results. Classification accuracy is a popular metric used to evaluate the performance of a model based on the predicted class labels. Classification accuracy is not perfect but is a good starting point for many classification tasks.

Instead of class labels, some tasks may require the prediction of a probability of class membership for each example. This conveys the uncertainty in the prediction, which an application or user can then interpret. A popular diagnostic for evaluating predicted probabilities is the ROC Curve.
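To make the two evaluation styles concrete, here is a small sketch that scores crisp class labels with classification accuracy and predicted probabilities with the area under the ROC curve. The four labels and probabilities are invented for illustration, and the 0.5 threshold is an assumption.

```python
# sketch: evaluate class labels with accuracy and probabilities with ROC AUC
from sklearn.metrics import accuracy_score, roc_auc_score

# invented true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# convert probabilities to crisp labels with an assumed 0.5 threshold
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

# accuracy uses the crisp labels
acc = accuracy_score(y_true, y_pred)
# ROC AUC uses the probabilities directly
auc = roc_auc_score(y_true, y_prob)
print(acc, auc)
```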

There are perhaps four main types of classification tasks that you may encounter; they are:

- Binary Classification
- Multi-Class Classification
- Multi-Label Classification
- Imbalanced Classification

Let’s take a closer look at each in turn.

Binary classification refers to those classification tasks that have two class labels.

Examples include:

- Email spam detection (spam or not).
- Churn prediction (churn or not).
- Conversion prediction (buy or not).

Typically, binary classification tasks involve one class that is the normal state and another class that is the abnormal state.

For example, “*not spam*” is the normal state and “*spam*” is the abnormal state. As another example, in a task that involves a medical test, “*cancer not detected*” is the normal state and “*cancer detected*” is the abnormal state.

The class for the normal state is assigned the class label 0 and the class with the abnormal state is assigned the class label 1.

It is common to model a binary classification task with a model that predicts a Bernoulli probability distribution for each example.

The Bernoulli distribution is a discrete probability distribution that covers a case where an event will have a binary outcome as either a 0 or 1. For classification, this means that the model predicts a probability of an example belonging to class 1, or the abnormal state.
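A brief sketch of what this looks like in practice: a logistic regression model (chosen here only as an example) is fit on a synthetic two-class dataset and asked for probabilities. The second column returned by `predict_proba()` is the predicted probability of class 1.

```python
# sketch: predict the probability of class 1 for a binary task
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# synthetic two-class dataset
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)

# fit a model that outputs class membership probabilities
model = LogisticRegression()
model.fit(X, y)

# one row per example, one column per class (class 0, class 1)
probs = model.predict_proba(X[:5])
# probability of belonging to class 1
p_class1 = probs[:, 1]
print(p_class1)
```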

Popular algorithms that can be used for binary classification include:

- Logistic Regression
- k-Nearest Neighbors
- Decision Trees
- Support Vector Machine
- Naive Bayes

Some algorithms are specifically designed for binary classification and do not natively support more than two classes; examples include Logistic Regression and Support Vector Machines.

Next, let’s take a closer look at a dataset to develop an intuition for binary classification problems.

We can use the make_blobs() function to generate a synthetic binary classification dataset.

The example below generates a dataset with 1,000 examples that belong to one of two classes, each with two input features.

```python
# example of binary classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset
X, y = make_blobs(n_samples=1000, centers=2, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color the points by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first summarizes the created dataset showing the 1,000 examples divided into input (*X*) and output (*y*) elements.

The distribution of the class labels is then summarized, showing that instances belong to either class 0 or class 1 and that there are 500 examples in each class.

Next, the first 10 examples in the dataset are summarized, showing the input values are numeric and the target values are integers that represent the class membership.

```
(1000, 2) (1000,)
Counter({0: 500, 1: 500})
[-3.05837272 4.48825769] 0
[-8.60973869 -3.72714879] 1
[1.37129721 5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-11.57178593 -3.85275513] 1
[-11.42257341 -4.85679127] 1
[-10.44518578 -3.76476563] 1
[-10.44603561 -3.26065964] 1
[-0.61947075 3.48804983] 0
[-10.91115591 -4.5772537 ] 1
```

Finally, a scatter plot is created for the input variables in the dataset and the points are colored based on their class value.

We can see two distinct clusters that we might expect would be easy to discriminate.

Multi-class classification refers to those classification tasks that have more than two class labels.

Examples include:

- Face classification.
- Plant species classification.
- Optical character recognition.

Unlike binary classification, multi-class classification does not have the notion of normal and abnormal outcomes. Instead, examples are classified as belonging to one among a range of known classes.

The number of class labels may be very large on some problems. For example, a model may predict a photo as belonging to one among thousands or tens of thousands of faces in a face recognition system.

Problems that involve predicting a sequence of words, such as text translation models, may also be considered a special type of multi-class classification. Each word in the sequence of words to be predicted involves a multi-class classification where the size of the vocabulary defines the number of possible classes that may be predicted and could be tens or hundreds of thousands of words in size.

It is common to model a multi-class classification task with a model that predicts a Multinoulli probability distribution for each example.

The Multinoulli distribution is a discrete probability distribution that covers a case where an event will have a categorical outcome, e.g. *k* in {1, 2, 3, …, *K*}. For classification, this means that the model predicts the probability of an example belonging to each class label.
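The following sketch illustrates the idea with a k-nearest neighbors classifier (an arbitrary choice for the example) on a three-class dataset. Each row returned by `predict_proba()` holds one probability per class, and the probabilities in a row sum to one.

```python
# sketch: predict a probability for each class label in a multi-class task
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# synthetic three-class dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)

# fit a classifier that supports more than two classes natively
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)

# one row per example, one column per class label
probs = model.predict_proba(X[:5])
print(probs)
# each row is a categorical distribution over the classes
print(probs.sum(axis=1))
```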

Many algorithms used for binary classification can be used for multi-class classification.

Popular algorithms that can be used for multi-class classification include:

- k-Nearest Neighbors.
- Decision Trees.
- Naive Bayes.
- Random Forest.
- Gradient Boosting.

Algorithms that are designed for binary classification can be adapted for use for multi-class problems.

This involves using a strategy of fitting multiple binary classification models for each class vs. all other classes (called one-vs-rest) or one model for each pair of classes (called one-vs-one).

- **One-vs-Rest**: Fit one binary classification model for each class vs. all other classes.
- **One-vs-One**: Fit one binary classification model for each pair of classes.

Binary classification algorithms that can use these strategies for multi-class classification include:

- Logistic Regression.
- Support Vector Machine.
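As a rough sketch, scikit-learn exposes both strategies as wrapper classes in `sklearn.multiclass`. The example below uses a four-class dataset (an assumption made so that the two strategies fit a different number of models) and counts the binary models each wrapper trains.

```python
# sketch: one-vs-rest and one-vs-one wrappers around binary classifiers
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# synthetic dataset with four classes
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)

# one binary model per class (class vs. all other classes)
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

# one binary model per pair of classes
ovo = OneVsOneClassifier(SVC())
ovo.fit(X, y)

# 4 classes give 4 one-vs-rest models and 4*3/2 = 6 one-vs-one models
print(len(ovr.estimators_), len(ovo.estimators_))
```

One-vs-one fits more models as the number of classes grows, but each model sees only the examples from two classes.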

Next, let’s take a closer look at a dataset to develop an intuition for multi-class classification problems.

We can use the make_blobs() function to generate a synthetic multi-class classification dataset.

The example below generates a dataset with 1,000 examples that belong to one of three classes, each with two input features.

```python
# example of multi-class classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
# define dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color the points by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first summarizes the created dataset showing the 1,000 examples divided into input (*X*) and output (*y*) elements.

The distribution of the class labels is then summarized, showing that instances belong to class 0, class 1, or class 2 and that there are approximately 333 examples in each class.

Next, the first 10 examples in the dataset are summarized showing the input values are numeric and the target values are integers that represent the class membership.

```
(1000, 2) (1000,)
Counter({0: 334, 1: 333, 2: 333})
[-3.05837272 4.48825769] 0
[-8.60973869 -3.72714879] 1
[1.37129721 5.23107449] 0
[-9.33917563 -2.9544469 ] 1
[-8.63895561 -8.05263469] 2
[-8.48974309 -9.05667083] 2
[-7.51235546 -7.96464519] 2
[-7.51320529 -7.46053919] 2
[-0.61947075 3.48804983] 0
[-10.91115591 -4.5772537 ] 1
```

Finally, a scatter plot is created for the input variables in the dataset and the points are colored based on their class value.

We can see three distinct clusters that we might expect would be easy to discriminate.

Multi-label classification refers to those classification tasks that have two or more class labels, where one or more class labels may be predicted for each example.

Consider the example of photo classification, where a given photo may have multiple objects in the scene and a model may predict the presence of multiple known objects in the photo, such as “*bicycle*,” “*apple*,” “*person*,” etc.

This is unlike binary classification and multi-class classification, where a single class label is predicted for each example.

It is common to model multi-label classification tasks with a model that predicts multiple outputs, with each output predicted as a Bernoulli probability distribution. This is essentially a model that makes multiple binary classification predictions for each example.

Classification algorithms used for binary or multi-class classification cannot be used directly for multi-label classification. Specialized versions of standard classification algorithms can be used, so-called multi-label versions of the algorithms, including:

- Multi-label Decision Trees
- Multi-label Random Forests
- Multi-label Gradient Boosting

Another approach is to use a separate classification algorithm to predict the labels for each class.
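One way to sketch this second approach is with scikit-learn's `MultiOutputClassifier`, which fits one copy of a base classifier per label. Logistic regression is used as the base model purely for illustration.

```python
# sketch: fit one binary classifier per label for a multi-label task
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# synthetic multi-label dataset with three labels per example
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1)

# wrap a binary classifier so a separate model is fit for each label
model = MultiOutputClassifier(LogisticRegression())
model.fit(X, y)

# each prediction is a vector of 0/1 values, one per label
yhat = model.predict(X[:5])
print(len(model.estimators_))
print(yhat)
```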

Next, let’s take a closer look at a dataset to develop an intuition for multi-label classification problems.

We can use the make_multilabel_classification() function to generate a synthetic multi-label classification dataset.

The example below generates a dataset with 1,000 examples, each with two input features. There are three classes, each of which may take on one of two labels (0 or 1).

```python
# example of a multi-label classification task
from sklearn.datasets import make_multilabel_classification
# define dataset
X, y = make_multilabel_classification(n_samples=1000, n_features=2, n_classes=3, n_labels=2, random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
```

Running the example first summarizes the created dataset showing the 1,000 examples divided into input (*X*) and output (*y*) elements.

Next, the first 10 examples in the dataset are summarized showing the input values are numeric and the target values are integers that represent the class label membership.

```
(1000, 2) (1000, 3)
[18. 35.] [1 1 1]
[22. 33.] [1 1 1]
[26. 36.] [1 1 1]
[24. 28.] [1 1 0]
[23. 27.] [1 1 0]
[15. 31.] [0 1 0]
[20. 37.] [0 1 0]
[18. 31.] [1 1 1]
[29. 27.] [1 0 0]
[29. 28.] [1 1 0]
```

Imbalanced classification refers to classification tasks where the number of examples in each class is unequally distributed.

Typically, imbalanced classification tasks are binary classification tasks where the majority of examples in the training dataset belong to the normal class and a minority of examples belong to the abnormal class.

Examples include:

- Fraud detection.
- Outlier detection.
- Medical diagnostic tests.

These problems are modeled as binary classification tasks, although they may require specialized techniques.

Specialized techniques may be used to change the composition of samples in the training dataset by undersampling the majority class or oversampling the minority class.

Examples include:
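As a rough sketch of the resampling idea, the snippet below randomly duplicates minority-class examples until both classes are the same size, using `sklearn.utils.resample`. This is plain random oversampling on a synthetic dataset, shown only to illustrate how the class composition changes; the variable names are illustrative.

```python
# sketch: random oversampling of the minority class
from collections import Counter
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.utils import resample

# synthetic imbalanced dataset (about 99% class 0, 1% class 1)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)

# split the rows by class
X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# sample minority rows with replacement up to the majority class size
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=1)

# combine into a balanced training set
X_bal = concatenate([X_maj, X_up])
y_bal = concatenate([y_maj, y_up])
counter = Counter(y_bal)
print(counter)
```

Resampling should be applied to the training data only, so that evaluation still reflects the original class distribution.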

Specialized modeling algorithms may be used that pay more attention to the minority class when fitting the model on the training dataset, such as cost-sensitive machine learning algorithms.

Examples include:

- Cost-sensitive Logistic Regression.
- Cost-sensitive Decision Trees.
- Cost-sensitive Support Vector Machines.
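In scikit-learn, one common way to approximate cost-sensitive learning is the `class_weight` argument that many classifiers accept. The sketch below shows the weights that the "balanced" heuristic would assign and fits a weighted logistic regression; treating this as the cost-sensitive variant is an assumption for illustration.

```python
# sketch: cost-sensitive logistic regression via class weights
from numpy import unique
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99, 0.01], random_state=1)

# weights inversely proportional to class frequency
classes = unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))

# errors on the minority class are penalized more heavily during fitting
model = LogisticRegression(class_weight="balanced")
model.fit(X, y)
# number of examples predicted as the minority class
print(int(model.predict(X).sum()))
```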

Finally, alternative performance metrics may be required as reporting the classification accuracy may be misleading.

Examples include:

- Precision.
- Recall.
- F-Measure.
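A small worked example shows why accuracy can mislead. The labels below are invented: 95 negative and 5 positive examples, with a model that finds only one of the five positives.

```python
# sketch: accuracy vs. precision, recall and F-measure on imbalanced labels
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# invented labels: 95 examples of class 0 and 5 examples of class 1
y_true = [0] * 95 + [1] * 5
# invented predictions: only one of the five positives is detected
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

# accuracy looks strong because class 0 dominates
acc = accuracy_score(y_true, y_pred)
# precision: fraction of predicted positives that are correct
precision = precision_score(y_true, y_pred)
# recall: fraction of actual positives that were found
recall = recall_score(y_true, y_pred)
# F-measure: harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)
print(acc, precision, recall, f1)
```

Accuracy is 0.96 here, yet recall is only 0.2, which is the number that matters if the positive class is the one of interest.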

Next, let’s take a closer look at a dataset to develop an intuition for imbalanced classification problems.

We can use the make_classification() function to generate a synthetic imbalanced binary classification dataset.

The example below generates a dataset with 1,000 examples that belong to one of two classes, each with two input features.

```python
# example of an imbalanced binary classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99,0.01], random_state=1)
# summarize dataset shape
print(X.shape, y.shape)
# summarize observations by class label
counter = Counter(y)
print(counter)
# summarize first few examples
for i in range(10):
    print(X[i], y[i])
# plot the dataset and color the points by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

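Running the example first summarizes the created dataset showing the 1,000 examples divided into input (*X*) and output (*y*) elements.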

The distribution of the class labels is then summarized, showing the severe class imbalance with about 980 examples belonging to class 0 and about 20 examples belonging to class 1.

Next, the first 10 examples in the dataset are summarized showing the input values are numeric and the target values are integers that represent the class membership. In this case, we can see that most examples belong to class 0, as we expect.

```
(1000, 2) (1000,)
Counter({0: 983, 1: 17})
[0.86924745 1.18613612] 0
[1.55110839 1.81032905] 0
[1.29361936 1.01094607] 0
[1.11988947 1.63251786] 0
[1.04235568 1.12152929] 0
[1.18114858 0.92397607] 0
[1.1365562 1.17652556] 0
[0.46291729 0.72924998] 0
[0.18315826 1.07141766] 0
[0.32411648 0.53515376] 0
```

Finally, a scatter plot is created for the input variables in the dataset and the points are colored based on their class value.

We can see one main cluster for examples that belong to class 0 and a few scattered examples that belong to class 1. The intuition is that datasets with this property of imbalanced class labels are more challenging to model.

This section provides more resources on the topic if you are looking to go deeper.

- Statistical classification, Wikipedia.
- Binary classification, Wikipedia.
- Multiclass classification, Wikipedia.
- Multi-label classification, Wikipedia.
- Multiclass and multilabel algorithms, scikit-learn API.

In this tutorial, you discovered different types of classification predictive modeling in machine learning.

Specifically, you learned:

- Classification predictive modeling involves assigning a class label to input examples.
- Binary classification refers to predicting one of two classes and multi-class classification involves predicting one of more than two classes.
- Multi-label classification involves predicting one or more classes for each example and imbalanced classification refers to classification tasks where the distribution of examples across the classes is not equal.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 10 Clustering Algorithms With Python appeared first on Machine Learning Mastery.

Clustering or cluster analysis is an unsupervised learning problem.

It is often used as a data analysis technique for discovering interesting patterns in data, such as groups of customers based on their behavior.

There are many clustering algorithms to choose from and no single best clustering algorithm for all cases. Instead, it is a good idea to explore a range of clustering algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and use top clustering algorithms in Python.

After completing this tutorial, you will know:

- Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
- There are many different clustering algorithms and no single best method for all datasets.
- How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Let’s get started.

This tutorial is divided into three parts; they are:

- Clustering
- Clustering Algorithms
- Examples of Clustering Algorithms
    - Library Installation
    - Clustering Dataset
    - Affinity Propagation
    - Agglomerative Clustering
    - BIRCH
    - DBSCAN
    - K-Means
    - Mini-Batch K-Means
    - Mean Shift
    - OPTICS
    - Spectral Clustering
    - Gaussian Mixture Model

Cluster analysis, or clustering, is an unsupervised machine learning task.

It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.

Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups.

— Page 141, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to the cluster than to other clusters. The cluster may have a center (the centroid) that is a sample or a point in feature space and may have a boundary or extent.

These clusters presumably reflect some mechanism at work in the domain from which instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances.

— Pages 141-142, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery.

For example:

- The phylogenetic tree could be considered the result of a manual clustering analysis.
- Separating normal data from outliers or anomalies may be considered a clustering problem.
- Separating customers into groups based on their natural behavior is a clustering problem, referred to as market segmentation.

Clustering can also be useful as a type of feature engineering, where existing and new examples can be mapped and labeled as belonging to one of the identified clusters in the data.
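A minimal sketch of this idea, assuming k-means as the clustering model: the cluster assigned to each example is appended as an extra input column, and a new example can be labeled with the same fitted model. The new point `[0.0, 0.0]` is a made-up value.

```python
# sketch: use the assigned cluster as an engineered feature
from numpy import column_stack
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# synthetic dataset with two input features
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

# fit a clustering model on the existing examples
model = KMeans(n_clusters=2, n_init=10, random_state=1)
model.fit(X)

# append the cluster label as a third column
X_aug = column_stack([X, model.predict(X)])
print(X_aug.shape)

# a new example is mapped to one of the identified clusters
new_label = model.predict([[0.0, 0.0]])
print(new_label)
```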

Evaluation of identified clusters is subjective and may require a domain expert, although many clustering-specific quantitative measures do exist. Typically, clustering algorithms are compared academically on synthetic datasets with pre-defined clusters, which an algorithm is expected to discover.

Clustering is an unsupervised learning technique, so it is hard to evaluate the quality of the output of any given method.

— Page 534, Machine Learning: A Probabilistic Perspective, 2012.
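As one example of a quantitative measure, the silhouette coefficient scores how well each example sits within its assigned cluster, with values between -1 and 1 and higher being better. The sketch below compares a few values of k for k-means on a synthetic dataset; the choice of metric, model and range of k are assumptions for illustration.

```python
# sketch: compare clusterings with the silhouette coefficient
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic dataset with three pre-defined clusters
X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

# score a clustering for each candidate number of clusters
scores = dict()
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# report the scores; higher suggests better-separated clusters
for k, score in scores.items():
    print(k, round(score, 3))
```

Such scores are a guide rather than ground truth and are best read alongside a plot or domain knowledge.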

There are many types of clustering algorithms.

Many algorithms use similarity or distance measures between examples in the feature space in an effort to discover dense regions of observations. As such, it is often good practice to scale data prior to using clustering algorithms.
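A short sketch of scaling before clustering, assuming standardization and k-means: placing both steps in a scikit-learn `Pipeline` ensures the same scaling is applied whenever the clustering model is fit or used.

```python
# sketch: standardize inputs before a distance-based clustering algorithm
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic dataset with two input features
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)

# scale each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=1))

# fit the pipeline and assign a cluster to each example
yhat = pipeline.fit_predict(X)
print(yhat[:10])
```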

Central to all of the goals of cluster analysis is the notion of the degree of similarity (or dissimilarity) between the individual objects being clustered. A clustering method attempts to group the objects based on the definition of similarity supplied to it.

— Page 502, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

Some clustering algorithms require you to specify or guess at the number of clusters to discover in the data, whereas others require the specification of some minimum distance between observations in which examples may be considered “*close*” or “*connected*.”

As such, cluster analysis is an iterative process where subjective evaluation of the identified clusters is fed back into changes to algorithm configuration until a desired or appropriate result is achieved.

The scikit-learn library provides a suite of different clustering algorithms to choose from.

A list of 10 of the more popular algorithms is as follows:

- Affinity Propagation
- Agglomerative Clustering
- BIRCH
- DBSCAN
- K-Means
- Mini-Batch K-Means
- Mean Shift
- OPTICS
- Spectral Clustering
- Mixture of Gaussians

Each algorithm offers a different approach to the challenge of discovering natural groups in data.

There is no best clustering algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use each of these 10 popular clustering algorithms from the scikit-learn library.

The examples provide a basis for you to copy-paste and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly. For a good starting point on this topic, see:

Let’s dive in.

In this section, we will review how to use 10 popular clustering algorithms in scikit-learn.

This includes an example of fitting the model and an example of visualizing the result.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

```python
# check scikit-learn version
import sklearn
print(sklearn.__version__)
```

Running the example, you should see the following version number or higher.

0.22.1

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with two input features and one cluster per class. The clusters are visually obvious in two dimensions so that we can plot the data with a scatter plot and color the points in the plot by the assigned cluster. This will help to see, at least on the test problem, how “well” the clusters were identified.

The clusters in this test problem are based on a multivariate Gaussian, and not all clustering algorithms will be effective at identifying these types of clusters. As such, the results in this tutorial should not be used as the basis for comparing the methods generally.

An example of creating and summarizing the synthetic clustering dataset is listed below.

```python
# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = where(y == class_value)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
```

Running the example creates the synthetic clustering dataset, then creates a scatter plot of the input data with points colored by class label (idealized clusters).

We can clearly see two distinct groups of data in two dimensions and the hope would be that an automatic clustering algorithm can detect these groupings.

Next, we can start looking at examples of clustering algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset.

**Can you get a better result for one of the algorithms?**

Let me know in the comments below.

Affinity Propagation involves finding a set of exemplars that best summarize the data.

We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges

— Clustering by Passing Messages Between Data Points, 2007.

The technique is described in the paper:

It is implemented via the AffinityPropagation class and the main configuration to tune is the “*damping*” hyperparameter, set between 0.5 and 1, and perhaps “*preference*.”

The complete example is listed below.

```python
# affinity propagation clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AffinityPropagation(damping=0.9)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
```

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, I could not achieve a good result.

Agglomerative clustering involves merging examples until the desired number of clusters is achieved.

It is a part of a broader class of hierarchical clustering methods and you can learn more here:

It is implemented via the AgglomerativeClustering class and the main configuration to tune is the “*n_clusters*” hyperparameter, set to an estimate of the number of clusters in the data, e.g. 2.

The complete example is listed below.

```python
# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
```

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, a reasonable grouping is found.

BIRCH Clustering (BIRCH is short for Balanced Iterative Reducing and Clustering using Hierarchies) involves constructing a tree structure from which cluster centroids are extracted.

BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints).

— BIRCH: An efficient data clustering method for large databases, 1996.

The technique is described in the paper:

It is implemented via the Birch class and the main configurations to tune are the “*threshold*” and “*n_clusters*” hyperparameters, the latter of which provides an estimate of the number of clusters.

The complete example is listed below.

```python
# birch clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import Birch
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = Birch(threshold=0.01, n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
```

Running the example fits the model on the training dataset and predicts a cluster for each example in the dataset. A scatter plot is then created with points colored by their assigned cluster.

In this case, an excellent grouping is found.

DBSCAN Clustering (where DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise) involves finding high-density areas in the domain and expanding those areas of the feature space around them as clusters.

… we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it

— A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.

The technique is described in the paper:

It is implemented via the DBSCAN class and the main configurations to tune are the “*eps*” and “*min_samples*” hyperparameters.

The complete example is listed below.

```python
# dbscan clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import DBSCAN
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model
model = DBSCAN(eps=0.30, min_samples=9)
# fit model and predict clusters
yhat = model.fit_predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
```

In this case, a reasonable grouping is found, although more tuning is required.
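
One detail worth knowing when reading a DBSCAN plot is that the class assigns the label -1 to samples it treats as noise, so one of the plotted groups may not be a cluster at all. The sketch below uses a small made-up dataset to show how the noise label could be separated from the cluster count. The points and the *eps* and *min_samples* values are assumptions chosen for illustration, not tuned settings.

```python
# sketch: separating DBSCAN noise points (label -1) from real clusters
from numpy import asarray
from sklearn.cluster import DBSCAN
# toy data (hypothetical): two tight groups and one isolated point
X = asarray([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
             [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
             [10.0, 10.0]])
# illustrative settings, not tuned values
model = DBSCAN(eps=0.5, min_samples=2)
labels = model.fit_predict(X)
# count clusters without the noise label
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# count samples flagged as noise
n_noise = int((labels == -1).sum())
print(labels)
print('clusters: %d, noise points: %d' % (n_clusters, n_noise))
```

Here the isolated point has no neighbors within *eps*, so it is reported as noise rather than as a third cluster.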

K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to clusters in an effort to minimize the variance within each cluster.

The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably efficient in the sense of within-class variance.

— Some methods for classification and analysis of multivariate observations, 1967.

The technique is described here:

It is implemented via the KMeans class and the main configuration to tune is the “*n_clusters*” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

    # k-means clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.cluster import KMeans
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = KMeans(n_clusters=2)
    # fit the model
    model.fit(X)
    # assign a cluster to each example
    yhat = model.predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, a reasonable grouping is found, although the unequal variance in each dimension makes the method less suited to this dataset.
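
To make the idea of minimizing the variance within each cluster more concrete, the sketch below computes the sum of squared distances from each sample to its assigned centroid by hand and compares it to the *inertia_* attribute reported by the fitted KMeans model. The *n_init* and *random_state* values are assumptions added here for repeatability and are not part of the example above.

```python
# sketch: the quantity k-means tries to minimize, computed by hand
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define and fit the model (n_init and random_state are illustrative choices)
model = KMeans(n_clusters=2, n_init=10, random_state=1)
yhat = model.fit_predict(X)
# look up the centroid assigned to each sample
centroids = model.cluster_centers_[yhat]
# sum of squared distances from each sample to its centroid
wcss = ((X - centroids) ** 2).sum()
print('manual: %.3f, inertia_: %.3f' % (wcss, model.inertia_))
```

The two numbers should agree up to floating point error, which is a quick way to confirm what the model is optimizing.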

Mini-Batch K-Means is a modified version of k-means that makes updates to the cluster centroids using mini-batches of samples rather than the entire dataset, which can make it faster for large datasets, and perhaps more robust to statistical noise.

… we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent.

— Web-Scale K-Means Clustering, 2010.

The technique is described in the paper:

- Web-Scale K-Means Clustering, 2010.

It is implemented via the MiniBatchKMeans class and the main configuration to tune is the “*n_clusters*” hyperparameter set to the estimated number of clusters in the data.

The complete example is listed below.

    # mini-batch k-means clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.cluster import MiniBatchKMeans
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = MiniBatchKMeans(n_clusters=2)
    # fit the model
    model.fit(X)
    # assign a cluster to each example
    yhat = model.predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, a result equivalent to the standard k-means algorithm is found.
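
As a rough illustration of the mini-batch idea, the sketch below feeds the data to the model in chunks via *partial_fit()* rather than calling *fit()* once. The chunk size of 100 and the *random_state* value are arbitrary choices for the demonstration. In practice, calling *fit()* and letting the class manage its own batches is likely the simpler option.

```python
# sketch: updating mini-batch k-means one chunk of samples at a time
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define the model (random_state is an illustrative choice)
model = MiniBatchKMeans(n_clusters=2, random_state=1)
# update the centroids using one chunk of rows per call
batch_size = 100
for start in range(0, X.shape[0], batch_size):
    model.partial_fit(X[start:start + batch_size])
# assign a cluster to each example
yhat = model.predict(X)
print(model.cluster_centers_.shape)
```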

Mean shift clustering involves finding and adapting centroids based on the density of examples in the feature space.

We prove for discrete data the convergence of a recursive mean shift procedure to the nearest stationary point of the underlying density function and thus its utility in detecting the modes of the density.

— Mean Shift: A robust approach toward feature space analysis, 2002.

The technique is described in the paper:

- Mean Shift: A robust approach toward feature space analysis, 2002.

It is implemented via the MeanShift class and the main configuration to tune is the “*bandwidth*” hyperparameter.

The complete example is listed below.

    # mean shift clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.cluster import MeanShift
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = MeanShift()
    # fit model and predict clusters
    yhat = model.fit_predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, a reasonable set of clusters is found in the data.
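
The example above leaves the “*bandwidth*” hyperparameter at its default. One way to explore it, sketched below, is to compute a bandwidth explicitly with the *estimate_bandwidth()* function from scikit-learn and pass it to the model. The *quantile* value of 0.2 is an assumption picked for illustration. Different values will generally change the number of clusters found.

```python
# sketch: setting the mean shift bandwidth explicitly
from sklearn.datasets import make_classification
from sklearn.cluster import MeanShift
from sklearn.cluster import estimate_bandwidth
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# estimate a bandwidth from pairwise distances (quantile is an illustrative choice)
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=1)
# define the model with the estimated bandwidth
model = MeanShift(bandwidth=bandwidth)
# fit model and predict clusters
yhat = model.fit_predict(X)
print('bandwidth: %.3f, clusters: %d' % (bandwidth, len(set(yhat))))
```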

OPTICS clustering (where OPTICS is short for Ordering Points To Identify the Clustering Structure) is a modified version of DBSCAN described above.

We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly; but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings.

— OPTICS: ordering points to identify the clustering structure, 1999.

The technique is described in the paper:

- OPTICS: ordering points to identify the clustering structure, 1999.

It is implemented via the OPTICS class and the main configuration to tune is the “*eps*” and “*min_samples*” hyperparameters.

The complete example is listed below.

    # optics clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.cluster import OPTICS
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = OPTICS(eps=0.8, min_samples=10)
    # fit model and predict clusters
    yhat = model.fit_predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, I could not achieve a reasonable result on this dataset.

Spectral Clustering is a general class of clustering methods, drawn from linear algebra.

A promising alternative that has recently emerged in a number of fields is to use spectral methods for clustering. Here, one uses the top eigenvectors of a matrix derived from the distance between points.

— On Spectral Clustering: Analysis and an algorithm, 2002.

The technique is described in the paper:

- On Spectral Clustering: Analysis and an algorithm, 2002.

It is implemented via the SpectralClustering class and the main configuration to tune is the “*n_clusters*” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

    # spectral clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.cluster import SpectralClustering
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = SpectralClustering(n_clusters=2)
    # fit model and predict clusters
    yhat = model.fit_predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, reasonable clusters were found.

A Gaussian mixture model summarizes a multivariate probability density function with a mixture of Gaussian probability distributions as its name suggests.

For more on the model, see:

It is implemented via the GaussianMixture class and the main configuration to tune is the “*n_components*” hyperparameter used to specify the estimated number of clusters in the data.

The complete example is listed below.

    # gaussian mixture clustering
    from numpy import unique
    from numpy import where
    from sklearn.datasets import make_classification
    from sklearn.mixture import GaussianMixture
    from matplotlib import pyplot
    # define dataset
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
    # define the model
    model = GaussianMixture(n_components=2)
    # fit the model
    model.fit(X)
    # assign a cluster to each example
    yhat = model.predict(X)
    # retrieve unique clusters
    clusters = unique(yhat)
    # create scatter plot for samples from each cluster
    for cluster in clusters:
        # get row indexes for samples with this cluster
        row_ix = where(yhat == cluster)
        # create scatter of these samples
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show the plot
    pyplot.show()

In this case, we can see that the clusters were identified perfectly. This is not surprising given that the dataset was generated as a mixture of Gaussians.
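
Because the model is a mixture of probability distributions, it can also report how strongly each sample belongs to each component instead of a single hard label. The sketch below shows this with *predict_proba()*. The *random_state* value is an assumption added for repeatability.

```python
# sketch: soft cluster assignments from a gaussian mixture model
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=4)
# define and fit the model (random_state is an illustrative choice)
model = GaussianMixture(n_components=2, random_state=1)
model.fit(X)
# probability of each sample belonging to each component
probs = model.predict_proba(X)
print(probs.shape)
print(probs[:3].round(3))
```

Each row holds one probability per component and sums to one. The hard label from *predict()* corresponds to the column with the largest probability.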

This section provides more resources on the topic if you are looking to go deeper.

- Clustering by Passing Messages Between Data Points, 2007.
- BIRCH: An efficient data clustering method for large databases, 1996.
- A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996.
- Some methods for classification and analysis of multivariate observations, 1967.
- Web-Scale K-Means Clustering, 2010.
- Mean Shift: A robust approach toward feature space analysis, 2002.
- On Spectral Clustering: Analysis and an algorithm, 2002.

- Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
- Machine Learning: A Probabilistic Perspective, 2012.

- Cluster analysis, Wikipedia.
- Hierarchical clustering, Wikipedia.
- k-means clustering, Wikipedia.
- Mixture model, Wikipedia.

In this tutorial, you discovered how to fit and use top clustering algorithms in Python.

Specifically, you learned:

- Clustering is an unsupervised problem of finding natural groups in the feature space of input data.
- There are many different clustering algorithms, and no single best method for all datasets.
- How to implement, fit, and use top clustering algorithms in Python with the scikit-learn machine learning library.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 10 Clustering Algorithms With Python appeared first on Machine Learning Mastery.

The post What Is Argmax in Machine Learning? appeared first on Machine Learning Mastery.

Argmax is a mathematical function that you may encounter in applied machine learning.

For example, you may see “*argmax*” or “*arg max*” used in a research paper to describe an algorithm. You may also be instructed to use the argmax function in your algorithm implementation.

This may be the first time that you encounter the argmax function and you may wonder what it is and how it works.

In this tutorial, you will discover the argmax function and how it is used in machine learning.

After completing this tutorial, you will know:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Argmax?
- How Is Argmax Used in Machine Learning?
- How to Implement Argmax in Python

Argmax is a mathematical function.

It is typically applied to another function that takes an argument. For example, given a function *g()* that takes the argument *x*, the *argmax* operation of that function would be described as follows:

- result = argmax(g(x))

The *argmax* function returns the argument or arguments (*arg*) for the target function that returns the maximum (*max*) value from the target function.

Consider the example where *g(x)* is calculated as the square of the *x* value and the domain or extent of input values (*x*) is limited to integers from 1 to 5:

- g(1) = 1^2 = 1
- g(2) = 2^2 = 4
- g(3) = 3^2 = 9
- g(4) = 4^2 = 16
- g(5) = 5^2 = 25

We can intuitively see that the argmax for the function *g(x)* is 5.

That is, the argument (*x*) to the target function *g()* that results in the largest value from the target function (25) is 5. Argmax provides a shorthand for specifying this argument in an abstract way without knowing what the value might be in a specific case.

- argmax(g(x)) = 5

Note that this is not the *max()* of the values returned from the function. This would be 25.

It is also not the *max()* of the arguments, although in this case the argmax and the max of the arguments are the same, e.g. 5. The *argmax()* is 5 because g returns the largest value (25) when 5 is provided, not because 5 is the largest argument.

Typically, “*argmax*” is written as two separate words, e.g. “*arg max*”. For example:

- result = arg max(g(x))

It is also common to use the arg max function as an operation without brackets surrounding the target function. This is often how you will see the operation written and used in a research paper or textbook. For example:

- result = arg max g(x)

You can also use a similar operation to find the arguments to the target function that result in the minimum value from the target function, called *argmin* or “*arg min*.”
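
The worked example above can be sketched in a few lines of plain Python. Here the built-in *max()* and *min()* functions are given *g()* as the “*key*”, so they return the argument that produces the largest or smallest function value rather than the value itself. The variable names are chosen for illustration.

```python
# sketch: argmax and argmin of g(x) = x^2 over the integers 1 to 5
def g(x):
    return x ** 2

# domain of input values
domain = range(1, 6)
# argument that gives the largest and smallest value of g
arg_max = max(domain, key=g)
arg_min = min(domain, key=g)
print('arg max g(x) = %d, max g(x) = %d' % (arg_max, g(arg_max)))
print('arg min g(x) = %d, min g(x) = %d' % (arg_min, g(arg_min)))
```

Running the sketch reports an argmax of 5 with a maximum value of 25, and an argmin of 1 with a minimum value of 1.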

The argmax function is used throughout the field of mathematics and machine learning.

Nevertheless, there are specific situations where you will see argmax used in applied machine learning and may need to implement it yourself.

The most common situation for using argmax that you will encounter in applied machine learning is in finding the index of an array that results in the largest value.

Recall that an array is a list or vector of numbers.

It is common for multi-class classification models to predict a vector of probabilities (or probability-like values), with one probability for each class label. The probabilities represent the likelihood that a sample belongs to each of the class labels.

The predicted probabilities are ordered such that the predicted probability at index 0 belongs to the first class, the predicted probability at index 1 belongs to the second class, and so on.

Often, a single class label prediction is required from a set of predicted probabilities for a multi-class classification problem.

This conversion from a vector of predicted probabilities to a class label is most often described using the argmax operation and most often implemented using the argmax function.

Let’s make this concrete with an example.

Consider a multi-class classification problem with three classes: “*red*,” “*blue*,” and “*green*.” The class labels are mapped to integer values for modeling, as follows:

- red = 0
- blue = 1
- green = 2

Each class label integer value maps to an index of a 3-element vector that may be predicted by a model specifying the likelihood that an example belongs to each class.

Consider that a model has made one prediction for an input sample and predicted the following vector of probabilities:

- yhat = [0.4, 0.5, 0.1]

We can see that the example has a 40 percent probability of belonging to red, a 50 percent probability of belonging to blue, and a 10 percent probability of belonging to green.

We can apply the argmax function to the vector of probabilities. The vector is the function, the output of the function is the probabilities, and the input to the function is a vector element index or an array index.

- arg max yhat

We can intuitively see that in this case, the argmax of the vector of predicted probabilities (yhat) is 1, as the probability at array index 1 is the largest value.

Note that this is not the max() of the probabilities, which would be 0.5. Also note that this is not the max of the arguments, which would be 2. Instead it is the argument that results in the maximum value, e.g. 1 that results in 0.5.

- arg max yhat = 1

We can then map this integer value back to a class label, which would be “*blue*.”

- arg max yhat = “blue”
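
The mapping from predicted probabilities to a class label can be sketched as follows. The list of label names is an assumption that mirrors the integer encoding used in this example, where the position of each name in the list matches its class integer.

```python
# sketch: map the argmax of predicted probabilities back to a class label
from numpy import argmax
# label names ordered by their integer encoding: red=0, blue=1, green=2
labels = ['red', 'blue', 'green']
# predicted probabilities for one sample
yhat = [0.4, 0.5, 0.1]
# index of the largest probability
index = argmax(yhat)
print('index: %d, label: %s' % (index, labels[index]))
```

Running the sketch reports an index of 1, which maps to the label “*blue*.”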

The argmax function can be implemented in Python for a given vector of numbers.

First, we can define a function called *argmax()* that enumerates a provided vector and returns the index with the largest value.

The complete example is listed below.

    # argmax function
    def argmax(vector):
        index, value = 0, vector[0]
        for i,v in enumerate(vector):
            if v > value:
                index, value = i,v
        return index

    # define vector
    vector = [0.4, 0.5, 0.1]
    # get argmax
    result = argmax(vector)
    print('arg max of %s: %d' % (vector, result))

Running the example prints the argmax of our test data used in the previous section, which in this case is an index of 1.

arg max of [0.4, 0.5, 0.1]: 1

Thankfully, there is a built-in version of the argmax() function provided with the NumPy library.

This is the version that you should use in practice.

The example below demonstrates the *argmax()* NumPy function on the same vector of probabilities.

    # numpy implementation of argmax
    from numpy import argmax
    # define vector
    vector = [0.4, 0.5, 0.1]
    # get argmax
    result = argmax(vector)
    print('arg max of %s: %d' % (vector, result))

Running the example prints an index of 1, as is expected.

arg max of [0.4, 0.5, 0.1]: 1

It is more likely that you will have a collection of predicted probabilities for multiple samples.

This would be stored as a matrix with rows of predicted probabilities and each column representing a class label. The desired result of an argmax on this matrix would be a vector with one index (or class label integer) for each row of predictions.

This can be achieved with the *argmax()* NumPy function by setting the “*axis*” argument. By default, the argmax would be calculated for the entire matrix, returning a single number. Instead, we can set the axis value to 1 and calculate the argmax across the columns for each row of data.

The example below demonstrates this with a matrix of four rows of predicted probabilities for the three class labels.

    # numpy implementation of argmax
    from numpy import argmax
    from numpy import asarray
    # define vector
    probs = asarray([[0.4, 0.5, 0.1], [0.0, 0.0, 1.0], [0.9, 0.0, 0.1], [0.3, 0.3, 0.4]])
    print(probs.shape)
    # get argmax
    result = argmax(probs, axis=1)
    print(result)

Running the example first prints the shape of the matrix of predicted probabilities, confirming we have four rows with three columns per row.

The argmax of the matrix is then calculated and printed as a vector, showing four values. This is what we expect, where each row results in a single argmax value or index with the largest probability.

    (4, 3)
    [1 2 0 2]
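
To see why the “*axis*” argument matters, the sketch below applies *argmax()* to the same matrix without an axis and with an axis of 0. Without an axis, NumPy returns a single index into the flattened matrix, which can be converted to a row and column with *unravel_index()*. The final lines show that ties are resolved by returning the first index with the maximum value. The tie vector is a made-up example.

```python
# sketch: effect of the axis argument on the numpy argmax
from numpy import argmax
from numpy import asarray
from numpy import unravel_index
# matrix of predicted probabilities from the example above
probs = asarray([[0.4, 0.5, 0.1], [0.0, 0.0, 1.0], [0.9, 0.0, 0.1], [0.3, 0.3, 0.4]])
# no axis: a single index into the flattened matrix
flat_index = argmax(probs)
row, col = unravel_index(flat_index, probs.shape)
print(flat_index, row, col)
# axis=0: the row with the largest value in each column
per_column = argmax(probs, axis=0)
print(per_column)
# ties: the first index with the maximum value is returned
tie = argmax([0.5, 0.5, 0.0])
print(tie)
```

Neither of the first two results is a class label per sample, which is why an axis of 1 is used when converting rows of probabilities into predictions.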

In this tutorial, you discovered the argmax function and how it is used in machine learning.

Specifically, you learned:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost appeared first on Machine Learning Mastery.

Gradient boosting is a powerful ensemble machine learning algorithm.

It’s popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm or one of the main algorithms used in winning solutions to machine learning competitions, like those on Kaggle.

There are many implementations of gradient boosting available, including standard implementations in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm.

In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.

Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.

After completing this tutorial, you will know:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.

Let’s get started.

This tutorial is divided into five parts; they are:

- Gradient Boosting Overview
- Gradient Boosting With Scikit-Learn
    - Library Installation
    - Test Problems
    - Gradient Boosting
    - Histogram-Based Gradient Boosting
- Gradient Boosting With XGBoost
    - Library Installation
    - XGBoost for Classification
    - XGBoost for Regression
- Gradient Boosting With LightGBM
    - Library Installation
    - LightGBM for Classification
    - LightGBM for Regression
- Gradient Boosting With CatBoost
    - Library Installation
    - CatBoost for Classification
    - CatBoost for Regression

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss gradient is minimized as the model is fit, much like a neural network.
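
Although we will not cover the theory, a minimal sketch may help make the idea of adding trees to correct prior errors more concrete. The code below is a simplified illustration for squared error loss only, where the negative gradient of the loss is the residual. It is not how scikit-learn or the other libraries implement the algorithm, and the learning rate, tree depth, and number of trees are arbitrary assumptions.

```python
# sketch: a bare-bones boosting loop for squared error loss
from numpy import full
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# illustrative settings, not tuned values
learning_rate = 0.1
n_trees = 50
# start from a constant prediction: the mean of the target
yhat = full(y.shape, y.mean())
mse_history = [mean((y - yhat) ** 2)]
trees = []
for _ in range(n_trees):
    # for squared error, the negative gradient is the residual
    residual = y - yhat
    # fit a small tree to the errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=1)
    tree.fit(X, residual)
    trees.append(tree)
    # add a scaled-down correction to the running prediction
    yhat = yhat + learning_rate * tree.predict(X)
    mse_history.append(mean((y - yhat) ** 2))
print('Training MSE before: %.3f, after: %.3f' % (mse_history[0], mse_history[-1]))
```

The training error shrinks as trees are added, which is the sense in which each new tree corrects the errors of the models before it.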

Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.

**Note**: We will not be going into the theory behind how the gradient boosting algorithm works in this tutorial.

For more on the gradient boosting algorithm, see the tutorial:

The algorithm provides hyperparameters that should, and perhaps must, be tuned for a specific dataset. Although there are many hyperparameters to tune, perhaps the most important are as follows:

- The number of trees or estimators in the model.
- The learning rate of the model.
- The row and column sampling rate for stochastic models.
- The maximum tree depth.
- The minimum tree weight.
- The regularization terms alpha and lambda.
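
As a rough guide, the sketch below shows where most of these hyperparameters could be set on the scikit-learn *GradientBoostingClassifier*. The values are placeholders rather than recommendations, and the mapping is an assumption in two places: “*max_features*” samples features at each split rather than per tree, and “*min_weight_fraction_leaf*” is only an approximate analogue of a minimum tree weight. This class has no alpha and lambda regularization terms. In XGBoost, those are exposed as “*reg_alpha*” and “*reg_lambda*”.

```python
# sketch: common gradient boosting hyperparameters in scikit-learn
from sklearn.ensemble import GradientBoostingClassifier
# placeholder values for illustration only, not tuned settings
model = GradientBoostingClassifier(
    n_estimators=100,              # number of trees
    learning_rate=0.1,             # learning rate
    subsample=0.8,                 # row sampling rate (below 1.0 gives stochastic boosting)
    max_features=0.8,              # fraction of features considered at each split
    max_depth=3,                   # maximum tree depth
    min_weight_fraction_leaf=0.0,  # approximate analogue of minimum tree weight
)
# confirm the configuration
params = model.get_params()
print(params['n_estimators'], params['learning_rate'], params['subsample'])
```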

**Note**: We will not be exploring how to configure or tune the configuration of gradient boosting algorithms in this tutorial.

For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial:

There are many implementations of the gradient boosting algorithm available in Python. Perhaps the most used implementation is the version provided with the scikit-learn library.

Additional third-party libraries are available that provide computationally efficient alternate implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.

**Do you have a different favorite gradient boosting implementation?**

Let me know in the comments below.

When using gradient boosting on your predictive modeling project, you may want to test each implementation of the algorithm.

This tutorial provides examples of each implementation of the gradient boosting algorithm on classification and regression predictive modeling problems that you can copy-paste into your project.

Let’s take a look at each in turn.

**Note**: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each different implementation. As such, we are using synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.

This tutorial assumes you have Python and SciPy installed. If you need help, see the tutorial:

In this section, we will review how to use the gradient boosting algorithm implementation in the scikit-learn library.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.22.1

We will demonstrate the gradient boosting algorithm for classification and regression.

As such, we will use synthetic test problems from the scikit-learn library.

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples, with 10 input features, five of which will be informative. The remaining five are uninformative rather than redundant, as make_regression() has no option for redundant features.

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

Next, let’s look at how we can develop gradient boosting models in scikit-learn.

The scikit-learn library provides the GBM algorithm for regression and classification via the *GradientBoostingClassifier* and *GradientBoostingRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a GradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = GradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.915 (0.025)
    Prediction: 1

The example below first evaluates a GradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = GradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    MAE: -11.854 (1.121)
    Prediction: -80.661
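
Note that the reported MAE is negative. This is because the “*neg_mean_absolute_error*” scoring option in scikit-learn negates the error so that larger scores are always better. If a positive error is easier to read, the scores can be converted with the absolute value, as in the sketch below. It uses five folds and no repeats, an assumption made to keep the run short, so the numbers will differ from those above.

```python
# sketch: reporting a positive MAE from negated scikit-learn scores
from numpy import absolute
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# evaluate the model (fewer folds than above to keep the sketch fast)
model = GradientBoostingRegressor()
cv = RepeatedKFold(n_splits=5, n_repeats=1, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
# convert the negated scores to positive errors
mae = absolute(n_scores)
print('MAE: %.3f (%.3f)' % (mean(mae), std(mae)))
```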

The scikit-learn library provides an alternate implementation of the gradient boosting algorithm, referred to as histogram-based gradient boosting.

This is an alternate approach to implement gradient tree boosting inspired by the LightGBM library (described more later). This implementation is provided via the *HistGradientBoostingClassifier* and *HistGradientBoostingRegressor* classes.

The primary benefit of the histogram-based approach to gradient boosting is speed. These implementations are designed to be much faster to fit on training data.

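At the time of writing, this is an experimental implementation and requires that you add the following line to your code to enable access to these classes. From scikit-learn version 1.0 onward, the classes are no longer experimental and can be imported directly from *sklearn.ensemble* without this line.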

from sklearn.experimental import enable_hist_gradient_boosting

Without this line, you will see an error like:

ImportError: cannot import name 'HistGradientBoostingClassifier'

or

ImportError: cannot import name 'HistGradientBoostingRegressor'
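
If your code needs to run on both older and newer versions of scikit-learn, one option is to try the direct import first and fall back to the experimental flag only when it fails. The sketch below is one possible way to do this and assumes the installed version provides the histogram-based classes in some form.

```python
# sketch: import the histogram-based classes on old and new scikit-learn versions
try:
    # works when the classes are available without the experimental flag
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.ensemble import HistGradientBoostingRegressor
except ImportError:
    # older versions require the experimental flag to be imported first
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.ensemble import HistGradientBoostingRegressor
# confirm the classes are available
print(HistGradientBoostingClassifier.__name__)
print(HistGradientBoostingRegressor.__name__)
```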

Let’s take a closer look at how to use this implementation.

The example below first evaluates a HistGradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.935 (0.024)
    Prediction: 1

The example below first evaluates a HistGradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.723 (1.540)
    Prediction: -77.837

XGBoost, which is short for “*Extreme Gradient Boosting*,” is a library that provides an efficient implementation of the gradient boosting algorithm.

The main benefit of the XGBoost implementation is computational efficiency and often better model performance.

For more on the benefits and capability of XGBoost, see the tutorial:

You can install the XGBoost library using the pip Python installer, as follows:

sudo pip install xgboost

For additional installation instructions specific to your platform see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check xgboost version
    import xgboost
    print(xgboost.__version__)

Running the example, you should see the following version number or higher.

1.0.1

The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *XGBClassifier* and *XGBRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an XGBClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for classification
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = XGBClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.936 (0.019)
    Prediction: 1

The example below first evaluates an XGBRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for regression
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = XGBRegressor(objective='reg:squarederror')
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBRegressor(objective='reg:squarederror')
    model.fit(X, y)
    # make a single prediction
    row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -15.048 (1.316)
    Prediction: -93.434

LightGBM, short for Light Gradient Boosted Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of LightGBM is a set of changes to the training algorithm that make the process dramatically faster and, in many cases, result in a more effective model.

For more technical details on the LightGBM algorithm, see the paper:

You can install the LightGBM library using the pip Python installer, as follows:

sudo pip install lightgbm

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check lightgbm version
    import lightgbm
    print(lightgbm.__version__)

Running the example, you should see the following version number or higher.

2.3.1

The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *LGBMClassifier* and *LGBMRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an LGBMClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = LGBMClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.934 (0.021)
    Prediction: 1

The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from lightgbm import LGBMRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = LGBMRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.739 (1.408)
    Prediction: -82.040

CatBoost is a third-party library developed at Yandex that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of CatBoost (in addition to computational speed improvements) is support for categorical input variables. This gives the library its name: CatBoost, for “*Category Gradient Boosting*.”

For more technical details on the CatBoost algorithm, see the paper:

You can install the CatBoost library using the pip Python installer, as follows:

sudo pip install catboost

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check catboost version
    import catboost
    print(catboost.__version__)

Running the example, you should see the following version number or higher.

0.21

The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *CatBoostClassifier* and *CatBoostRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a CatBoostClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from catboost import CatBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.931 (0.026)
    Prediction: 1

The example below first evaluates a CatBoostRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from catboost import CatBoostRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -9.281 (0.951)
    Prediction: -74.212

This section provides more resources on the topic if you are looking to go deeper.

- How to Setup Your Python Environment for Machine Learning with Anaconda
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- How to Configure the Gradient Boosting Algorithm
- A Gentle Introduction to XGBoost for Applied Machine Learning

- Stochastic Gradient Boosting, 2002.
- XGBoost: A Scalable Tree Boosting System, 2016.
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
- CatBoost: gradient boosting with categorical features support, 2017.

- Scikit-Learn Homepage.
- sklearn.ensemble API.
- XGBoost Homepage.
- XGBoost Python API.
- LightGBM Project.
- LightGBM Python API.
- CatBoost Homepage.
- CatBoost API.

In this tutorial, you discovered how to use gradient boosting models for classification and regression in Python.

Specifically, you learned:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms including XGBoost, LightGBM and CatBoost.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost appeared first on Machine Learning Mastery.

The post How to Calculate Feature Importance With Python appeared first on Machine Learning Mastery.

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.

Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.

In this tutorial, you will discover feature importance scores for machine learning in Python.

After completing this tutorial, you will know:

- The role of feature importance in a predictive modeling problem.
- How to calculate and review feature importance from linear models and decision trees.
- How to calculate and review permutation feature importance scores.

Let’s get started.

This tutorial is divided into five parts; they are:

- Feature Importance
- Preparation
    - Check Scikit-Learn Version
    - Test Datasets
- Coefficients as Feature Importance
    - Linear Regression Feature Importance
    - Logistic Regression Feature Importance
- Decision Tree Feature Importance
    - CART Feature Importance
    - Random Forest Feature Importance
    - XGBoost Feature Importance
- Permutation Feature Importance
    - Permutation Feature Importance for Regression
    - Permutation Feature Importance for Classification

Feature importance refers to a class of techniques for assigning scores to the input features of a predictive model that indicate the relative importance of each feature when making a prediction.

Feature importance scores can be calculated for problems that involve predicting a numerical value, called regression, and those problems that involve predicting a class label, called classification.

The scores are useful and can be used in a range of situations in a predictive modeling problem, such as:

- Better understanding the data.
- Better understanding a model.
- Reducing the number of input features.

**Feature importance scores can provide insight into the dataset**. The relative scores can highlight which features may be most relevant to the target, and the converse, which features are the least relevant. This may be interpreted by a domain expert and could be used as the basis for gathering more or different data.

**Feature importance scores can provide insight into the model**. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance score provides insight into that specific model and which features are the most important and least important to the model when making a prediction. This is a type of model interpretation that can be performed for those models that support it.

**Feature importance can be used to improve a predictive model**. This can be achieved by using the importance scores to select those features to delete (lowest scores) or those features to keep (highest scores). This is a type of feature selection and can simplify the problem that is being modeled, speed up the modeling process (deleting features is called dimensionality reduction), and in some cases, improve the performance of the model.

Feature importance scores can be fed to a wrapper model, such as the SelectFromModel class, to perform feature selection.
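As a rough illustration of this kind of selection, the sketch below wraps a random forest in scikit-learn's SelectFromModel and keeps the five highest-scoring features. The choice of model, the number of features to keep, and the variable names are assumptions made for this example only.

```python
# sketch: keep the highest-scoring features with SelectFromModel
from numpy import inf
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# wrap a model that exposes importance scores and keep the five best features
# threshold=-inf disables the score cutoff so only max_features limits the selection
fs = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1), max_features=5, threshold=-inf)
# fit the wrapped model and drop the lower-scoring columns
X_selected = fs.fit_transform(X, y)
print(X.shape, X_selected.shape)
# boolean mask of the columns that were kept
print(fs.get_support())
```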

There are many ways to calculate feature importance scores and many models that can be used for this purpose.

Perhaps the simplest way is to calculate simple coefficient statistics between each feature and the target variable. For more on this approach, see the tutorial:

In this tutorial, we will look at three main types of more advanced feature importance; they are:

- Feature importance from model coefficients.
- Feature importance from decision trees.
- Feature importance from permutation testing.

Let’s take a closer look at each.

Before we dive in, let’s confirm our environment and prepare some test datasets.

First, confirm that you have a modern version of the scikit-learn library installed.

This is important because some of the models we will explore in this tutorial require a modern version of the library.

You can check the version of the library you have installed with the following code example:

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example will print the version of the library. At the time of writing, this is about version 0.22.

You need to be using this version of scikit-learn or higher.

0.22.1

Next, let’s define some test datasets that we can use as the basis for demonstrating and exploring feature importance scores.

Each test problem has five important and five unimportant features, and it may be interesting to see which methods are consistent at finding or differentiating the features based on their importance.

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five will be redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

We will use the make_regression() function to create a test regression dataset.

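Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative. Unlike the classification dataset, the remaining five will be uninformative rather than redundant: they are random inputs that play no part in generating the target.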

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)
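The non-informative features in the two datasets are of different kinds, which may matter when interpreting the scores later. The sketch below is one way to check this, under the assumption that the matrix rank is a fair summary: redundant features are linear combinations of the informative ones, so they add no rank, whereas the extra regression inputs are independent random columns. The *coef=True* argument to make_regression() also returns the true coefficients used to generate the target.

```python
# sketch: compare the non-informative features in the two test datasets
from numpy import count_nonzero
from numpy.linalg import matrix_rank
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression

# classification: the five redundant columns are linear combinations of the informative ones
Xc, yc = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
rank_c = matrix_rank(Xc)
# regression: coef=True also returns the true coefficients used to generate the target
Xr, yr, coef = make_regression(n_samples=1000, n_features=10, n_informative=5, coef=True, random_state=1)
rank_r = matrix_rank(Xr)
print('Classification rank: %d' % rank_c)
print('Regression rank: %d, non-zero coefficients: %d' % (rank_r, count_nonzero(coef)))
```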

Next, let’s take a closer look at coefficients as importance scores.

Linear machine learning algorithms fit a model where the prediction is the weighted sum of the input values.

Examples include linear regression, logistic regression, and extensions that add regularization, such as ridge regression and the elastic net.

All of these algorithms find a set of coefficients to use in the weighted sum in order to make a prediction. These coefficients can be used directly as a crude type of feature importance score.

Let’s take a closer look at using coefficients as feature importance for classification and regression. We will fit a model on the dataset to find the coefficients, then summarize the importance scores for each input feature and finally create a bar chart to get an idea of the relative importance of the features.

We can fit a LinearRegression model on the regression dataset and retrieve the *coef_* property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.
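To see why the scale assumption matters, consider the small sketch below. It uses synthetic data invented for this illustration (not the test dataset) and a plain least-squares fit from NumPy: two inputs contribute about equally to the target but are measured in very different units, so their raw coefficients differ by a factor of 1,000 until the inputs are standardized.

```python
# sketch: why inputs should share a scale before comparing coefficients
import numpy as np

rng = np.random.default_rng(1)
# two inputs that contribute equally to the target but use very different units
x1 = rng.normal(scale=1.0, size=1000)
x2 = rng.normal(scale=1000.0, size=1000)
y = 5.0 * x1 + 0.005 * x2
X = np.column_stack([x1, x2])
# least-squares coefficients on the raw (centered) inputs
raw_coef = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)[0]
# least-squares coefficients after standardizing each input
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_coef = np.linalg.lstsq(X_std, y - y.mean(), rcond=None)[0]
print('Raw coefficients:', raw_coef)
print('Standardized coefficients:', std_coef)
```

On the raw inputs, the second feature looks almost irrelevant. After standardizing, the two coefficients are of similar size, which better reflects their contribution.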

The complete example of linear regression coefficients for feature importance is listed below.

    # linear regression feature importance
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = LinearRegression()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.coef_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the coefficient value for each feature.

The scores suggest that the model found the five important features and marked all other features with a zero coefficient, essentially removing them from the model.

    Feature: 0, Score: 0.00000
    Feature: 1, Score: 12.44483
    Feature: 2, Score: -0.00000
    Feature: 3, Score: -0.00000
    Feature: 4, Score: 93.32225
    Feature: 5, Score: 86.50811
    Feature: 6, Score: 26.74607
    Feature: 7, Score: 3.28535
    Feature: 8, Score: -0.00000
    Feature: 9, Score: 0.00000

A bar chart is then created for the feature importance scores.

This approach may also be used with Ridge and ElasticNet models.

We can fit a LogisticRegression model on the classification dataset and retrieve the *coef_* property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.

The complete example of logistic regression coefficients for feature importance is listed below.

    # logistic regression for feature importance
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = LogisticRegression()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.coef_[0]
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the coefficient value for each feature.

Recall this is a classification problem with classes 0 and 1. Notice that the coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.

No clear pattern of important and unimportant features can be identified from these results, at least from what I can tell.

    Feature: 0, Score: 0.16320
    Feature: 1, Score: -0.64301
    Feature: 2, Score: 0.48497
    Feature: 3, Score: -0.46190
    Feature: 4, Score: 0.18432
    Feature: 5, Score: -0.11978
    Feature: 6, Score: -0.40602
    Feature: 7, Score: 0.03772
    Feature: 8, Score: -0.51785
    Feature: 9, Score: 0.26540

A bar chart is then created for the feature importance scores.
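Because the sign only indicates which class a feature pushes the prediction toward, one common convention (an assumption here, not something the example above does) is to rank features by the absolute value of their coefficients. The sketch below applies this to the logistic regression scores reported above.

```python
# rank the logistic regression coefficients reported above by magnitude
coefficients = [0.16320, -0.64301, 0.48497, -0.46190, 0.18432, -0.11978, -0.40602, 0.03772, -0.51785, 0.26540]
# order the feature indices from largest to smallest absolute coefficient
ranking = sorted(range(len(coefficients)), key=lambda i: abs(coefficients[i]), reverse=True)
for i in ranking:
    # the sign shows the direction of the effect, the magnitude its strength
    direction = 'class 1' if coefficients[i] > 0 else 'class 0'
    print('Feature: %d, |Score|: %.5f, favors %s' % (i, abs(coefficients[i]), direction))
```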

Now that we have seen the use of coefficients as importance scores, let’s look at the more common example of decision-tree-based importance scores.

Decision tree algorithms like classification and regression trees (CART) offer importance scores based on the reduction in the criterion used to select split points, like Gini or entropy.

This same approach can be used for ensembles of decision trees, such as the random forest and stochastic gradient boosting algorithms.
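To make the idea of a reduction in the criterion concrete, the sketch below computes the drop in Gini impurity for a single candidate split. The class labels are hypothetical and the calculation is simplified: a fitted tree would also weight each reduction by the share of training samples reaching the node, then add up the reductions for each feature and normalize the totals.

```python
# sketch: impurity reduction for one candidate split, the quantity behind tree importance
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# hypothetical class labels reaching a node and its two children after a split
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]
# weighted impurity of the children
children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
# the reduction credited to the feature used for this split
decrease = gini(parent) - children
print('Gini decrease: %.3f' % decrease)
```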

Let’s take a look at a worked example of each.

We can use the CART algorithm for feature importance implemented in scikit-learn as the *DecisionTreeRegressor* and *DecisionTreeClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.

Let’s take a look at an example of this for regression and classification.

The complete example of fitting a DecisionTreeRegressor and summarizing the calculated feature importance scores is listed below.

    # decision tree for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = DecisionTreeRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00294
    Feature: 1, Score: 0.00502
    Feature: 2, Score: 0.00318
    Feature: 3, Score: 0.00151
    Feature: 4, Score: 0.51648
    Feature: 5, Score: 0.43814
    Feature: 6, Score: 0.02723
    Feature: 7, Score: 0.00200
    Feature: 8, Score: 0.00244
    Feature: 9, Score: 0.00106

A bar chart is then created for the feature importance scores.
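The scores are relative rather than absolute. As a quick check, the sketch below takes the decision tree regression scores reported above, confirms that they add up to about 1, and lists the three highest-scoring features.

```python
# check the decision tree regression scores reported above
scores = [0.00294, 0.00502, 0.00318, 0.00151, 0.51648, 0.43814, 0.02723, 0.00200, 0.00244, 0.00106]
# the scores are relative: together they add up to about 1
total = sum(scores)
# indices of the three highest-scoring features
top3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
print('Total: %.5f' % total)
print('Top features:', top3)
```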

The complete example of fitting a DecisionTreeClassifier and summarizing the calculated feature importance scores is listed below.

    # decision tree for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = DecisionTreeClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps four of the 10 features as being important to prediction.

    Feature: 0, Score: 0.01486
    Feature: 1, Score: 0.01029
    Feature: 2, Score: 0.18347
    Feature: 3, Score: 0.30295
    Feature: 4, Score: 0.08124
    Feature: 5, Score: 0.00600
    Feature: 6, Score: 0.19646
    Feature: 7, Score: 0.02908
    Feature: 8, Score: 0.12820
    Feature: 9, Score: 0.04745

A bar chart is then created for the feature importance scores.

We can use the Random Forest algorithm for feature importance implemented in scikit-learn as the *RandomForestRegressor* and *RandomForestClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.

This approach can also be used with the bagging and extra trees algorithms.
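The sketch below shows one way this might look for both algorithms. ExtraTreesRegressor exposes *feature_importances_* directly. BaggingRegressor does not, so the example averages the scores of its fitted member trees, which is an assumption of this sketch and is only valid when every tree is trained on all input features (the default).

```python
# sketch: importance scores from extra trees and from a bagging ensemble
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# extra trees expose feature_importances_ directly
et = ExtraTreesRegressor(n_estimators=100, random_state=1)
et.fit(X, y)
et_importance = et.feature_importances_
# bagging does not, so average the scores of the fitted member trees
# (valid here because every tree sees all input features, the default)
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=1)
bag.fit(X, y)
bag_importance = mean([tree.feature_importances_ for tree in bag.estimators_], axis=0)
# summarize feature importance
for i in range(X.shape[1]):
    print('Feature: %0d, ExtraTrees: %.5f, Bagging: %.5f' % (i, et_importance[i], bag_importance[i]))
```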

Let’s take a look at an example of this for regression and classification.

The complete example of fitting a RandomForestRegressor and summarizing the calculated feature importance scores is listed below.

    # random forest for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = RandomForestRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00280
    Feature: 1, Score: 0.00545
    Feature: 2, Score: 0.00294
    Feature: 3, Score: 0.00289
    Feature: 4, Score: 0.52992
    Feature: 5, Score: 0.42046
    Feature: 6, Score: 0.02663
    Feature: 7, Score: 0.00304
    Feature: 8, Score: 0.00304
    Feature: 9, Score: 0.00283

A bar chart is then created for the feature importance scores.

The complete example of fitting a RandomForestClassifier and summarizing the calculated feature importance scores is listed below.

    # random forest for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = RandomForestClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.06523
    Feature: 1, Score: 0.10737
    Feature: 2, Score: 0.15779
    Feature: 3, Score: 0.20422
    Feature: 4, Score: 0.08709
    Feature: 5, Score: 0.09948
    Feature: 6, Score: 0.10009
    Feature: 7, Score: 0.04551
    Feature: 8, Score: 0.08830
    Feature: 9, Score: 0.04493

A bar chart is then created for the feature importance scores.

XGBoost is a library that provides an efficient and effective implementation of the stochastic gradient boosting algorithm.

This algorithm can be used with scikit-learn via the *XGBRegressor* and *XGBClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.

Gradient boosting is also provided by scikit-learn via the *GradientBoostingClassifier* and *GradientBoostingRegressor* classes, and the same approach to feature importance can be used.
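For readers who prefer to stay within scikit-learn, a minimal sketch with the *GradientBoostingClassifier* class might look like the following. The dataset configuration matches the other classification examples in this tutorial; the fixed random_state on the model is an assumption added for repeatability.

```python
# scikit-learn gradient boosting for feature importance on a classification problem (sketch)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# define dataset (same configuration as the other classification examples)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define the model; random_state is fixed only for repeatability (an assumption)
model = GradientBoostingClassifier(random_state=1)
# fit the model
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
```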

First, install the XGBoost library, such as with pip:

sudo pip install xgboost

Then confirm that the library was installed correctly and works by checking the version number.

    # check xgboost version
    import xgboost
    print(xgboost.__version__)

Running the example, you should see the following version number or higher.

0.90


Let’s take a look at an example of XGBoost for feature importance on regression and classification problems.

The complete example of fitting an XGBRegressor and summarizing the calculated feature importance scores is listed below.

    # xgboost for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = XGBRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00060
    Feature: 1, Score: 0.01917
    Feature: 2, Score: 0.00091
    Feature: 3, Score: 0.00118
    Feature: 4, Score: 0.49380
    Feature: 5, Score: 0.42342
    Feature: 6, Score: 0.05057
    Feature: 7, Score: 0.00419
    Feature: 8, Score: 0.00124
    Feature: 9, Score: 0.00491

A bar chart is then created for the feature importance scores.

The complete example of fitting an XGBClassifier and summarizing the calculated feature importance scores is listed below.

    # xgboost for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = XGBClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps seven of the 10 features as being important to prediction.

    Feature: 0, Score: 0.02464
    Feature: 1, Score: 0.08153
    Feature: 2, Score: 0.12516
    Feature: 3, Score: 0.28400
    Feature: 4, Score: 0.12694
    Feature: 5, Score: 0.10752
    Feature: 6, Score: 0.08624
    Feature: 7, Score: 0.04820
    Feature: 8, Score: 0.09357
    Feature: 9, Score: 0.02220

A bar chart is then created for the feature importance scores.

Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used.

First, a model is fit on the dataset, such as a model that does not support native feature importance scores. Then the model is used to make predictions on a dataset, although the values of a feature (column) in the dataset are scrambled. This is repeated for each feature in the dataset. Then this whole process is repeated 3, 5, 10 or more times. The result is a mean importance score for each input feature (and distribution of scores given the repeats).

This approach can be used for regression or classification and requires that a performance metric be chosen as the basis of the importance score, such as the mean squared error for regression and accuracy for classification.
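To make the procedure concrete, the sketch below implements the idea by hand for a regression model. It is a simplified illustration and not scikit-learn's implementation: the helper name manual_permutation_importance, its n_repeats and seed arguments, and the choice to report the increase in mean squared error are all assumptions made for this example.

```python
# manual permutation feature importance (illustrative sketch, not scikit-learn's implementation)
from numpy import mean
from numpy.random import default_rng
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

def manual_permutation_importance(model, X, y, n_repeats=5, seed=1):
    # hypothetical helper: mean increase in MSE when one column is scrambled
    rng = default_rng(seed)
    baseline = mean_squared_error(y, model.predict(X))
    scores = []
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # scramble the values of a single feature (column)
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            error = mean_squared_error(y, model.predict(X_perm))
            increases.append(error - baseline)
        # average the error increase over the repeats
        scores.append(mean(increases))
    return scores

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit a model that has no native importance scores
model = KNeighborsRegressor()
model.fit(X, y)
# calculate and summarize importance
importance = manual_permutation_importance(model, X, y)
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
```

A larger score means that scrambling the feature hurt the model more, which suggests the model relies on that feature.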

Permutation feature importance can be calculated via the permutation_importance() function, which takes a fit model, a dataset (train or test dataset is fine), and a scoring function.

Let’s take a look at this approach to feature importance with an algorithm that does not support feature importance natively, specifically k-nearest neighbors.

The complete example of fitting a KNeighborsRegressor and summarizing the calculated permutation feature importance scores is listed below.

    # permutation feature importance with knn for regression
    from sklearn.datasets import make_regression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.inspection import permutation_importance
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = KNeighborsRegressor()
    # fit the model
    model.fit(X, y)
    # perform permutation importance
    results = permutation_importance(model, X, y, scoring='neg_mean_squared_error')
    # get importance
    importance = results.importances_mean
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the permutation importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 175.52007
    Feature: 1, Score: 345.80170
    Feature: 2, Score: 126.60578
    Feature: 3, Score: 95.90081
    Feature: 4, Score: 9666.16446
    Feature: 5, Score: 8036.79033
    Feature: 6, Score: 929.58517
    Feature: 7, Score: 139.67416
    Feature: 8, Score: 132.06246
    Feature: 9, Score: 84.94768

A bar chart is then created for the feature importance scores.

The complete example of fitting a KNeighborsClassifier and summarizing the calculated permutation feature importance scores is listed below.

    # permutation feature importance with knn for classification
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.inspection import permutation_importance
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = KNeighborsClassifier()
    # fit the model
    model.fit(X, y)
    # perform permutation importance
    results = permutation_importance(model, X, y, scoring='accuracy')
    # get importance
    importance = results.importances_mean
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the permutation importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.04760
    Feature: 1, Score: 0.06680
    Feature: 2, Score: 0.05240
    Feature: 3, Score: 0.09300
    Feature: 4, Score: 0.05140
    Feature: 5, Score: 0.05520
    Feature: 6, Score: 0.07920
    Feature: 7, Score: 0.05560
    Feature: 8, Score: 0.05620
    Feature: 9, Score: 0.03080

A bar chart is then created for the feature importance scores.
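The repeats mentioned earlier also give a spread of scores for each feature, not just a mean. As a small sketch, the example below sets the *n_repeats* and *random_state* arguments of permutation_importance() explicitly and reports the mean and standard deviation for each feature. The choice of 10 repeats is arbitrary and only for illustration.

```python
# permutation importance with an explicit number of repeats (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import permutation_importance

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = KNeighborsClassifier()
model.fit(X, y)
# perform permutation importance with 10 repeats (an arbitrary choice)
results = permutation_importance(model, X, y, scoring='accuracy', n_repeats=10, random_state=1)
# summarize the mean and spread of the scores for each feature
for i in range(X.shape[1]):
    print('Feature: %0d, Mean: %.5f, Std: %.5f' % (i, results.importances_mean[i], results.importances_std[i]))
```

A feature whose mean score is small relative to its standard deviation may not be reliably important.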

This section provides more resources on the topic if you are looking to go deeper.

- How to Choose a Feature Selection Method For Machine Learning
- How to Perform Feature Selection with Categorical Data
- Feature Importance and Feature Selection With XGBoost in Python
- Feature Selection For Machine Learning in Python
- An Introduction to Feature Selection

- Feature selection, scikit-learn API.
- Permutation feature importance, scikit-learn API.
- sklearn.datasets.make_classification API.
- sklearn.datasets.make_regression API.
- XGBoost Python API Reference.
- sklearn.inspection.permutation_importance API.

In this tutorial, you discovered feature importance scores for machine learning in Python.

Specifically, you learned:

- The role of feature importance in a predictive modeling problem.
- How to calculate and review feature importance from linear models and decision trees.
- How to calculate and review permutation feature importance scores.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop Multi-Output Regression Models with Python appeared first on Machine Learning Mastery.

Multioutput regression refers to regression problems that involve predicting two or more numerical values given an input example.

An example might be to predict a coordinate given an input, e.g. predicting x and y values. Another example would be multi-step time series forecasting that involves predicting multiple future time steps of a given variable.

Many machine learning algorithms are designed for predicting a single numeric value, referred to simply as regression. Some algorithms do support multioutput regression inherently, such as linear regression and decision trees. There are also special workaround models that can be used to wrap and use those algorithms that do not natively support predicting multiple outputs.

In this tutorial, you will discover how to develop machine learning models for multioutput regression.

After completing this tutorial, you will know:

- The problem of multioutput regression in machine learning.
- How to develop machine learning models that inherently support multiple-output regression.
- How to develop wrapper models that allow algorithms that do not inherently support multiple outputs to be used for multiple-output regression.

Let’s get started.

This tutorial is divided into three parts; they are:

- Problem of Multioutput Regression
    - Check Scikit-Learn Version
    - Multioutput Regression Test Problem
- Inherently Multioutput Regression Algorithms
    - Linear Regression for Multioutput Regression
    - k-Nearest Neighbors for Multioutput Regression
    - Random Forest for Multioutput Regression
    - Evaluate Multioutput Regression With Cross-Validation
- Wrapper Multioutput Regression Algorithms
    - Separate Model for Each Output (MultiOutputRegressor)
    - Chained Models for Each Output (RegressorChain)

Regression refers to a predictive modeling problem that involves predicting a numerical value.

For example, predicting a size, weight, amount, number of sales, and number of clicks are regression problems. Typically, a single numeric value is predicted given input variables.

Some regression problems require the prediction of two or more numeric values. For example, predicting an x and y coordinate.

These problems are referred to as multiple-output regression, or multioutput regression.

- **Regression**: Predict a single numeric output given an input.
- **Multioutput Regression**: Predict two or more numeric outputs given an input.

In multioutput regression, typically the outputs are dependent upon the input and upon each other. This means that often the outputs are not independent of each other and may require a model that predicts both outputs together or each output contingent upon the other outputs.

Multi-step time series forecasting may be considered a type of multiple-output regression where a sequence of future values is predicted and each predicted value is dependent upon the prior values in the sequence.
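As a rough illustration of this framing, the sketch below turns a toy series into input and output windows and fits a model that predicts two future steps at once. The helper series_to_supervised() and the window sizes are hypothetical choices made for this example, and the series is contrived.

```python
# framing a univariate series as multi-step (multioutput) regression (illustrative sketch)
from numpy import array
from sklearn.linear_model import LinearRegression

def series_to_supervised(series, n_in, n_out):
    # hypothetical helper: sliding window of n_in inputs followed by n_out outputs
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return array(X), array(y)

# contrived series: 0.0, 1.0, ..., 19.0
series = [float(i) for i in range(20)]
# three prior steps as input, two future steps as output
X, y = series_to_supervised(series, n_in=3, n_out=2)
print(X.shape, y.shape)
# fit a model that predicts both future steps together
model = LinearRegression()
model.fit(X, y)
# forecast the two steps that follow 20, 21, 22
yhat = model.predict([[20.0, 21.0, 22.0]])
print(yhat[0])
```

Because the toy series increases by one at each step, the forecast should be close to 23 and 24.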

There are a number of strategies for handling multioutput regression and we will explore some of them in this tutorial.

First, confirm that you have a modern version of the scikit-learn library installed.

This is important because some of the models we will explore in this tutorial require a modern version of the library.

You can check the version of the library with the following code example:

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example will print the version of the library.

At the time of writing, this is about version 0.22. You need to be using this version of scikit-learn or higher.

0.22.1

We can define a test problem that we can use to demonstrate the different modeling strategies.

We will use the make_regression() function to create a test dataset for multiple-output regression. We will generate 1,000 examples with 10 input features, five of which will be informative and five of which will be uninformative. The problem will require the prediction of two numeric values.

- **Problem Input**: 10 numeric variables.
- **Problem Output**: 2 numeric variables.

The example below generates the dataset and summarizes the shape.

    # example of multioutput regression test problem
    from sklearn.datasets import make_regression
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # summarize dataset
    print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output elements of the dataset for modeling, confirming the chosen configuration.

(1000, 10) (1000, 2)

Next, let’s look at modeling this problem directly.

Some regression machine learning algorithms support multiple outputs directly.

This includes most of the popular machine learning algorithms implemented in the scikit-learn library, such as:

- LinearRegression (and related)
- KNeighborsRegressor
- DecisionTreeRegressor
- RandomForestRegressor (and related)

Let’s look at a few examples to make this concrete.

The example below fits a linear regression model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # linear regression for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearRegression()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-93.147146 23.26985013]

The example below fits a k-nearest neighbors model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # k-nearest neighbors for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.neighbors import KNeighborsRegressor
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = KNeighborsRegressor()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-109.74862659 0.38754079]

The example below fits a random forest model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # random forest for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = RandomForestRegressor()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-76.79505796 27.16551641]

We may want to evaluate a multioutput regression using k-fold cross-validation.

This can be achieved in the same way as evaluating any other machine learning model.

We will fit and evaluate a *DecisionTreeRegressor* model on the test problem using 10-fold cross-validation with three repeats. We will use the mean absolute error (MAE) performance metric as the score.

The complete example is listed below.

    # evaluate multioutput regression model with k-fold cross-validation
    from numpy import absolute
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = DecisionTreeRegressor()
    # evaluate model
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    # summarize performance
    n_scores = absolute(n_scores)
    print('Result: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the performance of the decision tree model for multioutput regression on the test problem. The mean and standard deviation of the MAE are reported, calculated across all folds and all repeats.

Importantly, error is reported across both output variables, rather than separate error scores for each output variable.

Result: 51.659 (3.455)
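If a separate error score for each output variable is needed, one option is to compute the metric directly with the *multioutput='raw_values'* argument of mean_absolute_error(). The sketch below uses a single train/test split instead of repeated k-fold cross-validation to keep it short; the split size and random_state values are assumptions made for illustration.

```python
# per-output error for a multioutput regression model (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# create datasets
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
# hold out a test set (split size chosen arbitrarily for this sketch)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define and fit model
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
# make predictions for both outputs
yhat = model.predict(X_test)
# one MAE value per output variable
per_output = mean_absolute_error(y_test, yhat, multioutput='raw_values')
# default behavior: the per-output errors are averaged uniformly
overall = mean_absolute_error(y_test, yhat)
print('MAE per output: %s' % per_output)
print('MAE overall: %.3f' % overall)
```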

Not all regression algorithms support multioutput regression.

One example is the support vector machine, although for regression, it is referred to as support vector regression, or SVR.

This algorithm does not support multiple outputs for a regression problem and will raise an error. We can demonstrate this with an example, listed below.

    # failure of support vector regression for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    # fit model
    model.fit(X, y)

Running the example reports an error message indicating that the model does not support multioutput regression.

ValueError: bad input shape (1000, 2)

There are two workarounds that we can adopt in order to use an algorithm like SVR for multioutput regression.

They are to create a separate model for each output and to create a linear sequence of models, one for each output, where the output of each model is dependent upon the output of the previous models.

Thankfully, the scikit-learn library supports both of these cases. Let’s take a closer look at each.

We can create a separate model for each output of the problem.

This assumes that the outputs are independent of each other, which might not be a correct assumption. Nevertheless, this approach can provide surprisingly effective predictions on a range of problems and may be worth trying, at least as a performance baseline.

You never know. The outputs for your problem may, in fact, be mostly independent, if not completely independent, and this strategy can help you find out.

This approach is supported by the MultiOutputRegressor class that takes a regression model as an argument. It will then create one instance of the provided model for each output in the problem.

The example below demonstrates using the *MultiOutputRegressor* class with linear SVR for the test problem.

    # example of linear SVR with the MultiOutputRegressor wrapper for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.multioutput import MultiOutputRegressor
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    wrapper = MultiOutputRegressor(model)
    # fit model
    wrapper.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = wrapper.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits a separate LinearSVR for each of the outputs in the problem using the *MultiOutputRegressor* wrapper class.

This wrapper can then be used directly to make a prediction on new data, confirming that multiple outputs are supported.

[-93.147146 23.26985013]
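To see the one-model-per-output behavior more directly, the sketch below inspects the fitted wrapper's *estimators_* attribute and rebuilds the wrapper's predictions by calling each fitted model in turn. The random_state and max_iter values passed to LinearSVR are assumptions added to make the run repeatable and to reduce convergence warnings.

```python
# inspecting the per-output models inside MultiOutputRegressor (illustrative sketch)
from numpy import allclose
from numpy import column_stack
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

# create datasets
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
# define and fit the wrapped model (random_state and max_iter are assumptions)
wrapper = MultiOutputRegressor(LinearSVR(random_state=1, max_iter=10000))
wrapper.fit(X, y)
# one fitted model per output variable
print(len(wrapper.estimators_))
# predict each output with its own model and stack the columns
manual = column_stack([est.predict(X[:5]) for est in wrapper.estimators_])
# compare with the predictions made by the wrapper itself
print(allclose(manual, wrapper.predict(X[:5])))
```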

Another approach to using single-output regression models for multioutput regression is to create a linear sequence of models.

The first model in the sequence uses the input and predicts one output; the second model uses the input and the output from the first model to make a prediction; the third model uses the input and output from the first two models to make a prediction, and so on.

This can be achieved using the RegressorChain class in the scikit-learn library.

The order of the models may be based on the order of the outputs in the dataset (the default) or specified via the “*order*” argument. For example, *order=[0,1]* would first predict the 0th output, then the 1st output, whereas an *order=[1,0]* would first predict the last output variable and then the first output variable in our test problem.

The example below uses the *RegressorChain* with the default output order to fit a linear SVR on the multioutput regression test problem.

    # example of fitting a chain of linear SVR for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.multioutput import RegressorChain
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    wrapper = RegressorChain(model)
    # fit model
    wrapper.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = wrapper.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example first fits a linear SVR to predict the first output variable, then a second linear SVR to predict the second output variable using the input and the output of the first model. These models are fit on the entire dataset.

The fit chain of models is then used directly to make a prediction on a new test instance, predicting the required two output variables.

[-93.147146 23.26938475]
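The chain order can also be set explicitly. As a small sketch, the example below passes *order=[1,0]* so that the second output variable is predicted first and the first output is predicted from the input plus that prediction. The random_state and max_iter values for LinearSVR are assumptions added for repeatability and to reduce convergence warnings.

```python
# chain of linear SVR with a custom output order (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.multioutput import RegressorChain
from sklearn.svm import LinearSVR

# create datasets
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
# define the chain: predict output 1 first, then output 0
model = LinearSVR(random_state=1, max_iter=10000)
wrapper = RegressorChain(model, order=[1, 0])
# fit model
wrapper.fit(X, y)
# make a prediction for the first row of the dataset
yhat = wrapper.predict(X[:1])
# predictions are still reported in the original column order of y
print(yhat[0])
```

Trying both orders and comparing performance can be a quick way to check whether the chain direction matters for a given problem.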

This section provides more resources on the topic if you are looking to go deeper.

- Multiclass and multilabel algorithms, API.
- sklearn.datasets.make_regression API.
- sklearn.multioutput.MultiOutputRegressor API.
- sklearn.multioutput.RegressorChain API.

In this tutorial, you discovered how to develop machine learning models for multioutput regression.

Specifically, you learned:

- The problem of multioutput regression in machine learning.
- How to develop machine learning models that inherently support multiple-output regression.
- How to develop wrapper models that allow algorithms that do not inherently support multiple outputs to be used for multiple-output regression.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post 4 Distance Measures for Machine Learning appeared first on Machine Learning Mastery.

Distance measures play an important role in machine learning.

They provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.

Different distance measures must be chosen and used depending on the types of the data. As such, it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

In this tutorial, you will discover distance measures in machine learning.

After completing this tutorial, you will know:

- The role and importance of distance measures in machine learning algorithms.
- How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
- How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Let’s get started.

This tutorial is divided into five parts; they are:

- Role of Distance Measures
- Hamming Distance
- Euclidean Distance
- Manhattan Distance (Taxicab or City Block)
- Minkowski Distance

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way you will encounter distance measures is when you are using a specific machine learning algorithm that uses distance measures at its core. The most famous algorithm of this type is the k-nearest neighbors algorithm, or KNN for short.

In the KNN algorithm, a classification or regression prediction is made for new examples by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k examples in the training dataset with the smallest distance are then selected and a prediction is made by averaging the outcome (mode of the class label or mean of the real value for regression).
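The description above can be sketched in a few lines of plain Python. This is a minimal illustration and not a tuned implementation: the helper names, the use of Euclidean distance, the value of k, and the tiny dataset are all assumptions made for the example.

```python
# k-nearest neighbors prediction from scratch (illustrative sketch)
from collections import Counter
from math import sqrt

def euclidean_distance(a, b):
    # distance between two real-valued rows
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

def knn_predict(train_X, train_y, row, k=3, classification=True):
    # hypothetical helper: distance from the new row to every training row
    distances = sorted((euclidean_distance(x, row), label) for x, label in zip(train_X, train_y))
    # keep the outcomes of the k closest training rows
    neighbors = [label for _, label in distances[:k]]
    if classification:
        # mode of the class labels
        return Counter(neighbors).most_common(1)[0][0]
    # mean of the real values for regression
    return sum(neighbors) / len(neighbors)

# contrived training data: two well-separated groups
train_X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
train_y = [0, 0, 0, 1, 1, 1]
# classify a new row that is close to the first group
print(knn_predict(train_X, train_y, [2.0, 2.0]))
# treat the same outcomes as real values and average them instead
print(knn_predict(train_X, train_y, [8.0, 9.0], classification=False))
```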

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is the learning vector quantization, or LVQ, algorithm that may also be considered a type of neural network.

Related is the self-organizing map algorithm, or SOM, that also uses distance measures and can be used for supervised or unsupervised learning. Another unsupervised learning algorithm that uses distance measures at its core is the K-means clustering algorithm.

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

A short list of some of the more popular machine learning algorithms that use distance measures at their core is as follows:

- K-Nearest Neighbors
- Learning Vector Quantization (LVQ)
- Self-Organizing Map (SOM)
- K-Means Clustering

Many kernel-based methods may also be considered distance-based algorithms. Perhaps the most widely known kernel method is the support vector machine algorithm, or SVM for short.

**Do you know more algorithms that use distance measures?**

Let me know in the comments below.

When calculating the distance between two examples or rows of data, it is possible that different data types are used for different columns of the examples. An example might have real values, boolean values, categorical values, and ordinal values. Different distance measures may be required for each that are summed together into a single distance score.

Numerical values may have different scales. This can greatly impact the calculation of distance measure and it is often a good practice to normalize or standardize numerical values prior to calculating the distance measure.
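A small sketch can show why scale matters. The rows below are made-up values (an age in years and an income in dollars), and the min_max_normalize() helper is a hypothetical function written for this example; it assumes every column has at least two distinct values.

```python
# effect of column scale on a distance measure (illustrative sketch)
from math import sqrt

def euclidean_distance(a, b):
    # distance between two real-valued rows
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

def min_max_normalize(rows):
    # hypothetical helper: rescale each column to the range 0-1
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)] for row in rows]

# made-up rows: [age in years, income in dollars]
rows = [[25, 50000], [60, 51000], [26, 90000]]
# raw distances are dominated by the income column
raw_01 = euclidean_distance(rows[0], rows[1])
raw_02 = euclidean_distance(rows[0], rows[2])
print('raw: %.3f %.3f' % (raw_01, raw_02))
# after normalizing, both columns contribute on the same scale
scaled = min_max_normalize(rows)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])
print('scaled: %.3f %.3f' % (scaled_01, scaled_02))
```

On the raw values, the second row looks far closer to the first than the third row does, simply because income is measured in larger numbers. After normalization, the large age gap and the large income gap are weighted about equally.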

Numerical error in regression problems may also be considered a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance measure that can be summed or averaged over all examples in a test set to give a total distance between the expected and predicted outcomes in the dataset. The calculation of the error, such as the mean squared error or mean absolute error, may resemble a standard distance measure.

As we can see, distance measures play an important role in machine learning. Perhaps four of the most commonly used distance measures in machine learning are as follows:

- Hamming Distance
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance

**What are some other distance measures you have used or heard of?**

Let me know in the comments below.

You need to know how to calculate each of these distance measures when implementing algorithms from scratch and the intuition for what is being calculated when using algorithms that make use of these distance measures.

Let’s take a closer look at each in turn.

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns of data.

For example, if a column had the categories ‘*red*,’ ‘*green*,’ and ‘*blue*,’ you might one-hot encode each example as a bitstring with one bit for each category.

- red = [1, 0, 0]
- green = [0, 1, 0]
- blue = [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

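For one-hot encoded strings, it might make more sense to report the sum of the bit differences between the strings, which will always be 0 (same category) or 2 (different categories), because two different one-hot strings differ in exactly two bit positions.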

- HammingDistance = sum for i to N abs(v1[i] – v2[i])
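As a quick check of the sum form, the sketch below applies it to the one-hot encoded colors from the list above. The function name hamming_sum() is a hypothetical name chosen for this example.

```python
# sum form of hamming distance on one-hot encoded categories (illustrative sketch)
def hamming_sum(a, b):
    # count the bit positions where the two strings differ
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# one-hot encoded categories
red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]
# identical categories have no differing bits
print(hamming_sum(red, red))
# different categories differ in two positions: the 1 bit of each string
print(hamming_sum(red, green))
print(hamming_sum(green, blue))
```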

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a Hamming distance score between 0 (identical) and 1 (all different).

- HammingDistance = (sum for i to N abs(v1[i] – v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings, listed below.

    # calculating hamming distance between bit strings
    # calculate hamming distance
    def hamming_distance(a, b):
        return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)
    # define data
    row1 = [0, 0, 0, 0, 0, 1]
    row2 = [0, 0, 0, 0, 1, 0]
    # calculate distance
    dist = hamming_distance(row1, row2)
    print(dist)

Running the example reports the Hamming distance between the two bitstrings.

We can see that there are two differences between the strings, or 2 out of 6 bit positions different, which averaged (2/6) is about 1/3 or 0.333.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

    # calculating hamming distance between bit strings
    from scipy.spatial.distance import hamming

    # define data
    row1 = [0, 0, 0, 0, 0, 1]
    row2 = [0, 0, 0, 0, 1, 0]
    # calculate distance
    dist = hamming(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333
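As a small sketch, the one-hot encoded colors from earlier in this section can be run through both summaries. The helper names `hamming_sum()` and `hamming_mean()` are hypothetical and only used for illustration.

```python
# one-hot encoded categories from the color example
red = [1, 0, 0]
green = [0, 1, 0]

# hypothetical helper: count of positions where the bitstrings differ
def hamming_sum(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# hypothetical helper: fraction of positions where the bitstrings differ
def hamming_mean(a, b):
    return hamming_sum(a, b) / len(a)

# same category: no bits differ
print(hamming_sum(red, red))
# different categories: one bit turns off and another turns on
print(hamming_sum(red, green))
# averaged over the three bit positions
print(hamming_mean(red, green))
```

For one-hot vectors, the summed version only tells us whether two categories match, which is why it can be a reasonable summary for this kind of data.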

Euclidean distance calculates the distance between two real-valued vectors.

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.
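The effect of scale can be sketched with a few made-up rows. The data below (an income column in dollars and an age column in years) and the `min_max_normalize()` helper are assumptions for illustration only; in practice you might use a scaler from a library instead.

```python
from math import sqrt

# euclidean distance between two rows
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# hypothetical helper: rescale each column to the range [0, 1]
# (assumes every column has at least two distinct values)
def min_max_normalize(rows):
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
            for row in rows]

# made-up rows: [income in dollars, age in years]
rows = [[50000, 25], [51000, 60], [90000, 26]]

# raw distances: the income column dominates, the 35 year age gap barely registers
raw_01 = euclidean_distance(rows[0], rows[1])
raw_02 = euclidean_distance(rows[0], rows[2])
print(raw_01, raw_02)

# distances after rescaling: both columns now contribute on a similar scale
scaled = min_max_normalize(rows)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])
print(scaled_01, scaled_02)
```

On the raw values, the second row looks far closer to the first than the third does, purely because of the income column. After rescaling, the large age difference counts about as much as the large income difference.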

Although there are other possible choices, most instance-based learners use Euclidean distance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

- EuclideanDistance = sqrt(sum for i to N (v1[i] – v2[i])^2)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores are squared distances, so they are no longer on the same scale, but they preserve the same ordering of examples from nearest to farthest and can still be used effectively within a machine learning algorithm for finding the most similar examples.

- SquaredEuclideanDistance = sum for i to N (v1[i] – v2[i])^2

This calculation is related to the L2 vector norm and is equivalent to the sum squared error and the root sum squared error if the square root is added.

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

    # calculating euclidean distance between vectors
    from math import sqrt

    # calculate euclidean distance
    def euclidean_distance(a, b):
        return sqrt(sum((e1-e2)**2 for e1, e2 in zip(a,b)))

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = euclidean_distance(row1, row2)
    print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example is listed below.

    # calculating euclidean distance between vectors
    from scipy.spatial.distance import euclidean

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = euclidean(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

6.082762530298219
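The earlier point about dropping the square root can be checked with a short sketch. The query row and candidate rows below are made up for illustration; the idea is simply to rank the candidates by distance with and without the square root and compare the rankings.

```python
from math import sqrt

# squared euclidean distance (no square root)
def squared_euclidean(a, b):
    return sum((e1 - e2) ** 2 for e1, e2 in zip(a, b))

# euclidean distance
def euclidean(a, b):
    return sqrt(squared_euclidean(a, b))

# made-up query row and candidate neighbors
query = [10, 20, 15, 10, 5]
candidates = [[12, 24, 18, 8, 7], [10, 21, 15, 10, 5], [30, 5, 15, 2, 9]]

# rank candidate indexes from nearest to farthest under each measure
indexes = range(len(candidates))
order_true = sorted(indexes, key=lambda i: euclidean(query, candidates[i]))
order_squared = sorted(indexes, key=lambda i: squared_euclidean(query, candidates[i]))
print(order_true)
print(order_squared)
```

Both rankings come out the same because the square root is a monotonically increasing function, so a nearest-neighbor search can skip it. The scores themselves are on a squared scale, so they should not be reported as distances.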

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

- ManhattanDistance = sum for i to N |v1[i] – v2[i]|

The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

    # calculating manhattan distance between vectors

    # calculate manhattan distance
    def manhattan_distance(a, b):
        return sum(abs(e1-e2) for e1, e2 in zip(a,b))

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = manhattan_distance(row1, row2)
    print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

    # calculating manhattan distance between vectors
    from scipy.spatial.distance import cityblock

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = cityblock(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

13

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “*order*” or “*p*“, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

- MinkowskiDistance = (sum for i to N (abs(v1[i] – v2[i]))^p)^(1/p)

Where “*p*” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

- *p=1*: Manhattan distance.
- *p=2*: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “*p*” that can be tuned.

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

    # calculating minkowski distance between vectors

    # calculate minkowski distance
    def minkowski_distance(a, b, p):
        return sum(abs(e1-e2)**p for e1, e2 in zip(a,b))**(1/p)

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance (p=1)
    dist = minkowski_distance(row1, row2, 1)
    print(dist)
    # calculate distance (p=2)
    dist = minkowski_distance(row1, row2, 2)
    print(dist)

Running the example first calculates and prints the Minkowski distance with *p* set to 1 to give the Manhattan distance, then with *p* set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

    13.0
    6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

    # calculating minkowski distance between vectors
    from scipy.spatial import minkowski_distance

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance (p=1)
    dist = minkowski_distance(row1, row2, 1)
    print(dist)
    # calculate distance (p=2)
    dist = minkowski_distance(row1, row2, 2)
    print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

    13.0
    6.082762530298219
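To get a feel for intermediate values of *p*, here is a small sketch that reuses the same two rows and evaluates the manual Minkowski calculation for p=1, p=1.5, and p=2. The choice of 1.5 is arbitrary and only for illustration.

```python
# calculate minkowski distance
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# evaluate the distance for a few orders, including an intermediate one
distances = {p: minkowski_distance(row1, row2, p) for p in [1, 1.5, 2]}
for p, dist in distances.items():
    print('p=%s: %.3f' % (p, dist))
```

For this pair of rows, the p=1.5 result falls between the Manhattan and Euclidean values, which is the sense in which *p* can be treated as a hyperparameter to tune.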

This section provides more resources on the topic if you are looking to go deeper.

- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- Distance computations (scipy.spatial.distance)
- scipy.spatial.distance.hamming API.
- scipy.spatial.distance.euclidean API.
- scipy.spatial.distance.cityblock API.
- scipy.spatial.minkowski_distance API.

- Instance-based learning, Wikipedia.
- Hamming distance, Wikipedia.
- Euclidean distance, Wikipedia.
- Taxicab geometry, Wikipedia.
- Minkowski distance, Wikipedia.

In this tutorial, you discovered distance measures in machine learning.

Specifically, you learned:

- The role and importance of distance measures in machine learning algorithms.
- How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
- How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 4 Distance Measures for Machine Learning appeared first on Machine Learning Mastery.

The post PyTorch Tutorial: How to Develop Deep Learning Models with Python appeared first on Machine Learning Mastery.

Predictive modeling with deep learning is a skill that modern developers need to know.

PyTorch is the premier open-source deep learning framework developed and maintained by Facebook.

At its core, PyTorch is a mathematical library that allows you to perform efficient computation and automatic differentiation on graph-based models. Achieving this directly is challenging, although thankfully, the modern PyTorch API provides classes and idioms that allow you to easily develop a suite of deep learning models.

In this tutorial, you will discover a step-by-step guide to developing deep learning models in PyTorch.

After completing this tutorial, you will know:

- The difference between Torch and PyTorch and how to install and confirm PyTorch is working.
- The five-step life-cycle of PyTorch models and how to define, fit, and evaluate models.
- How to develop PyTorch deep learning models for regression, classification, and predictive modeling tasks.

Let’s get started.

The focus of this tutorial is on using the PyTorch API for common deep learning model development tasks; we will not be diving into the math and theory of deep learning. For that, I recommend starting with this excellent book.

The best way to learn deep learning in Python is by doing. Dive in. You can circle back for more theory later.

I have designed each code example to use best practices and to be standalone so that you can copy and paste it directly into your project and adapt it to your specific needs. This will give you a massive head start over trying to figure out the API from official documentation alone.

It is a large tutorial, and as such, it is divided into three parts; they are:

- How to Install PyTorch
  - What Are Torch and PyTorch?
  - How to Install PyTorch
  - How to Confirm PyTorch Is Installed
- PyTorch Deep Learning Model Life-Cycle
  - Step 1: Prepare the Data
  - Step 2: Define the Model
  - Step 3: Train the Model
  - Step 4: Evaluate the Model
  - Step 5: Make Predictions
- How to Develop PyTorch Deep Learning Models
  - How to Develop an MLP for Binary Classification
  - How to Develop an MLP for Multiclass Classification
  - How to Develop an MLP for Regression
  - How to Develop a CNN for Image Classification

Work through this tutorial. It will take you 60 minutes, max!

**You do not need to understand everything (at least not right now)**. Your goal is to run through the tutorial end-to-end and get a result. You do not need to understand everything on the first pass. List down your questions as you go. Make heavy use of the API documentation to learn about all of the functions that you’re using.

**You do not need to know the math first**. Math is a compact way of describing how algorithms work, specifically tools from linear algebra, probability, and calculus. These are not the only tools that you can use to learn how algorithms work. You can also use code and explore algorithm behavior with different inputs and outputs. Knowing the math will not tell you what algorithm to choose or how to best configure it. You can only discover that through carefully controlled experiments.

**You do not need to know how the algorithms work**. It is important to know about the limitations and how to configure deep learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start by getting comfortable with the platform.

**You do not need to be a Python programmer**. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer; you know how to pick up the basics of a language really fast. Just get started and dive into the details later.

**You do not need to be a deep learning expert**. You can learn about the benefits and limitations of various algorithms later, and there are plenty of tutorials that you can read to brush up on the steps of a deep learning project.

In this section, you will discover what PyTorch is, how to install it, and how to confirm that it is installed correctly.

PyTorch is an open-source Python library for deep learning developed and maintained by Facebook.

The project started in 2016 and quickly became a popular framework among developers and researchers.

Torch (*Torch7*) is an open-source project for deep learning written in C and generally used via the Lua interface. It was a precursor project to PyTorch and is no longer actively developed. PyTorch includes “*Torch*” in the name, acknowledging the prior torch library with the “*Py*” prefix indicating the Python focus of the new project.

The PyTorch API is simple and flexible, making it a favorite for academics and researchers in the development of new deep learning models and applications. The extensive use has led to many extensions for specific applications (such as text, computer vision, and audio data), and many pre-trained models that can be used directly. As such, it may be the most popular library used by academics.

The flexibility of PyTorch comes at the cost of ease of use, especially for beginners, as compared to simpler interfaces like Keras. Choosing PyTorch over Keras means accepting a slightly steeper learning curve and more code in exchange for more flexibility, and perhaps a more vibrant academic community.

Before installing PyTorch, ensure that you have Python installed, such as Python 3.6 or higher.

If you don’t have Python installed, you can install it using Anaconda. This tutorial will show you how:

There are many ways to install the PyTorch open-source deep learning library.

The most common, and perhaps simplest, way to install PyTorch on your workstation is by using pip.

For example, on the command line, you can type:

sudo pip install torch

Perhaps the most popular application of deep learning is for computer vision, and the PyTorch computer vision package is called “torchvision.”

Installing torchvision is also highly recommended and it can be installed as follows:

sudo pip install torchvision

If you prefer to use an installation method more specific to your platform or package manager, you can see a complete list of installation instructions here:

There is no need to set up the GPU now.

All examples in this tutorial will work just fine on a modern CPU. If you want to configure PyTorch for your GPU, you can do that after completing this tutorial. Don’t get distracted!

Once PyTorch is installed, it is important to confirm that the library was installed successfully and that you can start using it.

Don’t skip this step.

If PyTorch is not installed correctly or raises an error on this step, you won’t be able to run the examples later.

Create a new file called *versions.py* and copy and paste the following code into the file.

    # check pytorch version
    import torch
    print(torch.__version__)

Save the file, then open your command line and change directory to where you saved the file.

Then type:

python versions.py

You should then see output like the following:

1.3.1

This confirms that PyTorch is installed correctly and that we are all using the same version.

This also shows you how to run a Python script from the command line. I recommend running all code from the command line in this manner, and not from a notebook or an IDE.

In this section, you will discover the life-cycle for a deep learning model and the PyTorch API that you can use to define models.

A model has a life-cycle, and this very simple knowledge provides the backbone for both modeling a dataset and understanding the PyTorch API.

The five steps in the life-cycle are as follows:

- 1. Prepare the Data.
- 2. Define the Model.
- 3. Train the Model.
- 4. Evaluate the Model.
- 5. Make Predictions.

Let’s take a closer look at each step in turn.

**Note**: There are many ways to achieve each of these steps using the PyTorch API, although I have aimed to show you the simplest, or most common, or most idiomatic.

If you discover a better approach, let me know in the comments below.

The first step is to load and prepare your data.

Neural network models require numerical input data and numerical output data.

You can use standard Python libraries to load and prepare tabular data, like CSV files. For example, Pandas can be used to load your CSV file, and tools from scikit-learn can be used to encode categorical data, such as class labels.

PyTorch provides the Dataset class that you can extend and customize to load your dataset.

For example, the constructor of your dataset object can load your data file (e.g. a CSV file). You can then override the *__len__()* function that can be used to get the length of the dataset (number of rows or samples), and the *__getitem__()* function that is used to get a specific sample by index.

When loading your dataset, you can also perform any required transforms, such as scaling or encoding.

A skeleton of a custom *Dataset* class is provided below.

    # dataset definition
    from torch.utils.data import Dataset

    class CSVDataset(Dataset):
        # load the dataset
        def __init__(self, path):
            # store the inputs and outputs
            self.X = ...
            self.y = ...

        # number of rows in the dataset
        def __len__(self):
            return len(self.X)

        # get a row at an index
        def __getitem__(self, idx):
            return [self.X[idx], self.y[idx]]

Once loaded, PyTorch provides the DataLoader class to navigate a *Dataset* instance during the training and evaluation of your model.

A *DataLoader* instance can be created for the training dataset, test dataset, and even a validation dataset.

The random_split() function can be used to split a dataset into train and test sets. Once split, a selection of rows from the *Dataset* can be provided to a DataLoader, along with the batch size and whether the data should be shuffled every epoch.

For example, we can define a *DataLoader* by passing in a selected sample of rows in the dataset.

    ...
    # create the dataset
    dataset = CSVDataset(...)
    # select rows from the dataset
    train, test = random_split(dataset, [[...], [...]])
    # create a data loader for train and test sets
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)

Once defined, a *DataLoader* can be enumerated, yielding one batch worth of samples each iteration.

    ...
    # train the model
    for i, (inputs, targets) in enumerate(train_dl):
        ...
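To make the data preparation step concrete, the sketch below runs the same pattern end to end on synthetic data held in memory. The `InMemoryDataset` class, the random inputs, and the 67/33 split sizes are assumptions chosen for illustration; they stand in for a dataset loaded from a CSV file.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split

# hypothetical dataset holding synthetic data in memory (stands in for a CSV file)
class InMemoryDataset(Dataset):
    def __init__(self, n_rows=100, n_inputs=4):
        # random inputs and a made-up binary target derived from them
        self.X = torch.rand(n_rows, n_inputs)
        self.y = (self.X.sum(dim=1, keepdim=True) > n_inputs / 2).float()

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# fix the seed so the split and shuffling are repeatable
torch.manual_seed(1)
dataset = InMemoryDataset()
# split into train and test sets by row counts
train, test = random_split(dataset, [67, 33])
# create data loaders for the train and test sets
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=1024, shuffle=False)
# enumerate the batches and record the shape of each
batch_shapes = [(tuple(inputs.shape), tuple(targets.shape)) for inputs, targets in train_dl]
print(len(train), len(test))
print(batch_shapes)
```

Note that the last training batch is smaller than the batch size when the number of rows is not an exact multiple of it.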

The next step is to define a model.

The idiom for defining a model in PyTorch involves defining a class that extends the Module class.

The constructor of your class defines the layers of the model and the forward() function is the override that defines how to forward propagate input through the defined layers of the model.

Many layers are available, such as Linear for fully connected layers, Conv2d for convolutional layers, and MaxPool2d for pooling layers.

Activation functions can also be defined as layers, such as ReLU, Softmax, and Sigmoid.

Below is an example of a simple MLP model with one layer.

    # model definition
    from torch.nn import Module
    from torch.nn import Linear
    from torch.nn import Sigmoid

    class MLP(Module):
        # define model elements
        def __init__(self, n_inputs):
            super(MLP, self).__init__()
            self.layer = Linear(n_inputs, 1)
            self.activation = Sigmoid()

        # forward propagate input
        def forward(self, X):
            X = self.layer(X)
            X = self.activation(X)
            return X

The weights of a given layer can also be initialized after the layer is defined in the constructor.

Common examples include the Xavier and He weight initialization schemes. For example:

    ...
    xavier_uniform_(self.layer.weight)

The training process requires that you define a loss function and an optimization algorithm.

Common loss functions include the following:

- BCELoss: Binary cross-entropy loss for binary classification.
- CrossEntropyLoss: Categorical cross-entropy loss for multi-class classification.
- MSELoss: Mean squared error loss for regression.

For more on loss functions generally, see the tutorial:

Stochastic gradient descent is used for optimization, and the standard algorithm is provided by the SGD class, although other versions of the algorithm are available, such as Adam.

    # define the optimization
    criterion = MSELoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)

Training the model involves enumerating the *DataLoader* for the training dataset.

First, a loop is required for the number of training epochs. Then an inner loop is required for the mini-batches for stochastic gradient descent.

    ...
    # enumerate epochs
    for epoch in range(100):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            ...

Each update to the model involves the same general pattern comprised of:

- Clearing the last error gradient.
- A forward pass of the input through the model.
- Calculating the loss for the model output.
- Backpropagating the error through the model.
- Updating the model in an effort to reduce loss.

For example:

    ...
    # clear the gradients
    optimizer.zero_grad()
    # compute the model output
    yhat = model(inputs)
    # calculate loss
    loss = criterion(yhat, targets)
    # credit assignment
    loss.backward()
    # update model weights
    optimizer.step()
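The update pattern can be seen working in a minimal, self-contained sketch. The synthetic data, the learning rate of 0.1, and the use of a single full batch per epoch (rather than mini-batches from a *DataLoader*) are simplifying assumptions for illustration, not the configuration used later in this tutorial.

```python
import torch
from torch.nn import Linear, Sigmoid, Module, BCELoss
from torch.optim import SGD

# single-layer model, as in the model definition step
class MLP(Module):
    def __init__(self, n_inputs):
        super().__init__()
        self.layer = Linear(n_inputs, 1)
        self.activation = Sigmoid()

    def forward(self, X):
        return self.activation(self.layer(X))

torch.manual_seed(1)
# synthetic data: class 1 when the two inputs sum to more than 1
inputs = torch.rand(64, 2)
targets = (inputs.sum(dim=1, keepdim=True) > 1.0).float()

model = MLP(2)
criterion = BCELoss()
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)

losses = list()
for epoch in range(200):
    # clear the gradients
    optimizer.zero_grad()
    # compute the model output
    yhat = model(inputs)
    # calculate loss
    loss = criterion(yhat, targets)
    # credit assignment
    loss.backward()
    # update model weights
    optimizer.step()
    # keep the loss value for reporting
    losses.append(loss.item())
print('first loss: %.3f, last loss: %.3f' % (losses[0], losses[-1]))
```

Tracking the loss per epoch like this is a simple way to confirm that the training loop is actually reducing the error.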

Once the model is fit, it can be evaluated on the test dataset.

This can be achieved by using the *DataLoader* for the test dataset and collecting the predictions for the test set, then comparing the predictions to the expected values of the test set and calculating a performance metric.

    ...
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        ...

A fit model can be used to make a prediction on new data.

For example, you might have a single image or a single row of data and want to make a prediction.

This requires that you wrap the data in a PyTorch Tensor data structure.

A Tensor is just the PyTorch version of a NumPy array for holding data. It also allows you to perform the automatic differentiation tasks in the model graph, like calling *backward()* when training the model.

The prediction too will be a Tensor, although you can retrieve the NumPy array by detaching the Tensor from the automatic differentiation graph and calling the NumPy function.

    ...
    # convert row to a Tensor
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
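An alternative to detaching the output, sketched below, is to make the prediction inside a *torch.no_grad()* block so that no differentiation graph is built in the first place. This is a different technique from the one used in this tutorial's examples; the untrained *Linear* layer and the row values are stand-ins for a fit model and real data.

```python
import torch
from torch import Tensor
from torch.nn import Linear

torch.manual_seed(1)
# stand-in for a fit model (untrained, for illustration only)
model = Linear(3, 1)
# switch layers such as dropout to inference behavior (no effect on Linear)
model.eval()

# made-up row of input data
row = [0.1, 0.2, 0.3]
# make a prediction without tracking gradients
with torch.no_grad():
    yhat = model(Tensor([row]))
# the output is not attached to a graph, so detach() is not needed
needs_grad = yhat.requires_grad
yhat = yhat.numpy()
print(needs_grad, yhat.shape)
```

Either approach gives a NumPy array; disabling gradient tracking can also save some memory and computation when making many predictions.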

Now that we are familiar with the PyTorch API at a high-level and the model life-cycle, let’s look at how we can develop some standard deep learning models from scratch.

In this section, you will discover how to develop, evaluate, and make predictions with standard deep learning models, including Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN).

A Multilayer Perceptron model, or MLP for short, is a standard fully connected neural network model.

It is comprised of layers of nodes where each node is connected to all outputs from the previous layer and the output of each node is connected to all inputs for nodes in the next layer.

An MLP is a model with one or more fully connected layers. This model is appropriate for tabular data, that is, data as it looks in a table or spreadsheet with one column for each variable and one row for each example. There are three predictive modeling problems you may want to explore with an MLP; they are binary classification, multiclass classification, and regression.

Let’s fit a model on a real dataset for each of these cases.

**Note**: The models in this section are effective, but not optimized. See if you can improve their performance. Post your findings in the comments below.

We will use the Ionosphere binary (two class) classification dataset to demonstrate an MLP for binary classification.

This dataset involves predicting whether there is a structure in the atmosphere or not given radar returns.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

We will use a LabelEncoder to encode the string labels to integer values 0 and 1. The model will be fit on 67 percent of the data, and the remaining 33 percent will be used for evaluation, split using the train_test_split() function.

It is a good practice to use ‘*relu*‘ activation with a ‘*He Uniform*‘ weight initialization. This combination goes a long way to overcome the problem of vanishing gradients when training deep neural network models. For more on ReLU, see the tutorial:

The model predicts the probability of class 1 and uses the sigmoid activation function. The model is optimized using stochastic gradient descent and seeks to minimize the binary cross-entropy loss.

The complete example is listed below.

    # pytorch mlp for binary classification
    from numpy import vstack
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.metrics import accuracy_score
    from torch.utils.data import Dataset
    from torch.utils.data import DataLoader
    from torch.utils.data import random_split
    from torch import Tensor
    from torch.nn import Linear
    from torch.nn import ReLU
    from torch.nn import Sigmoid
    from torch.nn import Module
    from torch.optim import SGD
    from torch.nn import BCELoss
    from torch.nn.init import kaiming_uniform_
    from torch.nn.init import xavier_uniform_

    # dataset definition
    class CSVDataset(Dataset):
        # load the dataset
        def __init__(self, path):
            # load the csv file as a dataframe
            df = read_csv(path, header=None)
            # store the inputs and outputs
            self.X = df.values[:, :-1]
            self.y = df.values[:, -1]
            # ensure input data is floats
            self.X = self.X.astype('float32')
            # label encode target and ensure the values are floats
            self.y = LabelEncoder().fit_transform(self.y)
            self.y = self.y.astype('float32')
            self.y = self.y.reshape((len(self.y), 1))

        # number of rows in the dataset
        def __len__(self):
            return len(self.X)

        # get a row at an index
        def __getitem__(self, idx):
            return [self.X[idx], self.y[idx]]

        # get indexes for train and test rows
        def get_splits(self, n_test=0.33):
            # determine sizes
            test_size = round(n_test * len(self.X))
            train_size = len(self.X) - test_size
            # calculate the split
            return random_split(self, [train_size, test_size])

    # model definition
    class MLP(Module):
        # define model elements
        def __init__(self, n_inputs):
            super(MLP, self).__init__()
            # input to first hidden layer
            self.hidden1 = Linear(n_inputs, 10)
            kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
            self.act1 = ReLU()
            # second hidden layer
            self.hidden2 = Linear(10, 8)
            kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
            self.act2 = ReLU()
            # third hidden layer and output
            self.hidden3 = Linear(8, 1)
            xavier_uniform_(self.hidden3.weight)
            self.act3 = Sigmoid()

        # forward propagate input
        def forward(self, X):
            # input to first hidden layer
            X = self.hidden1(X)
            X = self.act1(X)
            # second hidden layer
            X = self.hidden2(X)
            X = self.act2(X)
            # third hidden layer and output
            X = self.hidden3(X)
            X = self.act3(X)
            return X

    # prepare the dataset
    def prepare_data(path):
        # load the dataset
        dataset = CSVDataset(path)
        # calculate split
        train, test = dataset.get_splits()
        # prepare data loaders
        train_dl = DataLoader(train, batch_size=32, shuffle=True)
        test_dl = DataLoader(test, batch_size=1024, shuffle=False)
        return train_dl, test_dl

    # train the model
    def train_model(train_dl, model):
        # define the optimization
        criterion = BCELoss()
        optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
        # enumerate epochs
        for epoch in range(100):
            # enumerate mini batches
            for i, (inputs, targets) in enumerate(train_dl):
                # clear the gradients
                optimizer.zero_grad()
                # compute the model output
                yhat = model(inputs)
                # calculate loss
                loss = criterion(yhat, targets)
                # credit assignment
                loss.backward()
                # update model weights
                optimizer.step()

    # evaluate the model
    def evaluate_model(test_dl, model):
        predictions, actuals = list(), list()
        for i, (inputs, targets) in enumerate(test_dl):
            # evaluate the model on the test set
            yhat = model(inputs)
            # retrieve numpy array
            yhat = yhat.detach().numpy()
            actual = targets.numpy()
            actual = actual.reshape((len(actual), 1))
            # round to class values
            yhat = yhat.round()
            # store
            predictions.append(yhat)
            actuals.append(actual)
        predictions, actuals = vstack(predictions), vstack(actuals)
        # calculate accuracy
        acc = accuracy_score(actuals, predictions)
        return acc

    # make a class prediction for one row of data
    def predict(row, model):
        # convert row to data
        row = Tensor([row])
        # make prediction
        yhat = model(row)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        return yhat

    # prepare the data
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
    train_dl, test_dl = prepare_data(path)
    print(len(train_dl.dataset), len(test_dl.dataset))
    # define the network
    model = MLP(34)
    # train the model
    train_model(train_dl, model)
    # evaluate the model
    acc = evaluate_model(test_dl, model)
    print('Accuracy: %.3f' % acc)
    # make a single prediction (expect class=1)
    row = [1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300]
    yhat = predict(row, model)
    print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))

Running the example first reports the shape of the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?**

**Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 95 percent and then predicted a probability of 0.998 that the one row of data belongs to class 1.

    235 116
    Accuracy: 0.948
    Predicted: 0.998 (class=1)

We will use the Iris flowers multiclass classification dataset to demonstrate an MLP for multiclass classification.

This problem involves predicting the species of iris flower given measures of the flower.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

Given that it is a multiclass classification, the model must have one node for each class in the output layer and use the softmax activation function. The loss function is the cross entropy, which is appropriate for integer encoded class labels (e.g. 0 for one class, 1 for the next class, etc.).
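One detail worth knowing here: PyTorch's *CrossEntropyLoss* is documented as operating on raw, unnormalized scores (logits) and applying a log-softmax internally. The sketch below checks this on two made-up rows of scores by comparing it to *NLLLoss* applied to an explicit log-softmax; the score and label values are assumptions for illustration.

```python
import torch
from torch.nn import CrossEntropyLoss, NLLLoss
from torch.nn.functional import log_softmax

# made-up raw model outputs (logits) for two rows and three classes
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
# integer encoded class labels for the two rows
targets = torch.tensor([0, 2])

# cross entropy calculated directly on the raw scores
ce = CrossEntropyLoss()(logits, targets)
# negative log likelihood calculated on an explicit log-softmax of the scores
nll = NLLLoss()(log_softmax(logits, dim=1), targets)
print(ce.item(), nll.item())
```

Because the two values agree, a model that ends in a softmax layer and is trained with *CrossEntropyLoss* effectively has softmax applied twice during training. The model can still learn, but it may be worth experimenting with returning the raw scores from *forward()* and applying softmax only when probabilities are needed.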

The complete example of fitting and evaluating an MLP on the iris flowers dataset is listed below.

```python
# pytorch mlp for multiclass classification
from numpy import vstack
from numpy import argmax
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch import Tensor
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Softmax
from torch.nn import Module
from torch.optim import SGD
from torch.nn import CrossEntropyLoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path, header=None)
        # store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1]
        # ensure input data is floats
        self.X = self.X.astype('float32')
        # label encode target as integer class labels
        self.y = LabelEncoder().fit_transform(self.y)

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer and output
        self.hidden3 = Linear(8, 3)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Softmax(dim=1)

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # output layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(500):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        # convert to class labels
        yhat = argmax(yhat, axis=1)
        # reshape for stacking
        actual = actual.reshape((len(actual), 1))
        yhat = yhat.reshape((len(yhat), 1))
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate accuracy
    acc = accuracy_score(actuals, predictions)
    return acc

# make a class prediction for one row of data
def predict(row, model):
    # convert row to data
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
    return yhat

# prepare the data
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(4)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [5.1,3.5,1.4,0.2]
yhat = predict(row, model)
print('Predicted: %s (class=%d)' % (yhat, argmax(yhat)))
```

Running the example first reports the shape of the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 98 percent and then predicted the probability of a row of data belonging to each class, with class 0 having the highest probability.

```
100 50
Accuracy: 0.980
Predicted: [[9.5524162e-01 4.4516966e-02 2.4138369e-04]] (class=0)
```
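To make the reading of the predicted probabilities concrete, the short sketch below takes the probability vector from the sample output above and converts it into a class label. It is only an illustration of the argmax step; the numbers are copied from the sample output and will differ on your run.

```python
# interpret a softmax output vector as a class label (illustrative sketch)
from numpy import array
from numpy import argmax

# probabilities copied from the sample output above (one row, three classes)
yhat = array([[9.5524162e-01, 4.4516966e-02, 2.4138369e-04]])
# the probabilities across the classes sum to approximately one
total = yhat.sum()
# the predicted class is the index of the largest probability
label = int(argmax(yhat))
print('Sum: %.3f, class=%d' % (total, label))
```

The index returned by argmax() is the integer assigned by the LabelEncoder, not the original string label.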

We will use the Boston housing regression dataset to demonstrate an MLP for regression predictive modeling.

This problem involves predicting house value based on properties of the house and neighborhood.

The dataset will be downloaded automatically using Pandas.

This is a regression problem that involves predicting a single numeric value. As such, the output layer has a single node and uses the default or linear activation function (no activation function). The mean squared error (mse) loss is minimized when fitting the model.

Recall that this is regression, not classification; therefore, we cannot calculate classification accuracy. Instead, the model is evaluated using the mean squared error (MSE) and its square root (RMSE).

The complete example of fitting and evaluating an MLP on the Boston housing dataset is listed below.

```python
# pytorch mlp for regression
from numpy import vstack
from numpy import sqrt
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import MSELoss
from torch.nn.init import xavier_uniform_

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path, header=None)
        # store the inputs and outputs
        self.X = df.values[:, :-1].astype('float32')
        self.y = df.values[:, -1].astype('float32')
        # ensure target has the right shape
        self.y = self.y.reshape((len(self.y), 1))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        xavier_uniform_(self.hidden1.weight)
        self.act1 = Sigmoid()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        xavier_uniform_(self.hidden2.weight)
        self.act2 = Sigmoid()
        # third hidden layer and output
        self.hidden3 = Linear(8, 1)
        xavier_uniform_(self.hidden3.weight)

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer and output
        X = self.hidden3(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = MSELoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(100):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate mse
    mse = mean_squared_error(actuals, predictions)
    return mse

# make a prediction for one row of data
def predict(row, model):
    # convert row to data
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
    return yhat

# prepare the data
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(13)
# train the model
train_model(train_dl, model)
# evaluate the model
mse = evaluate_model(test_dl, model)
print('MSE: %.3f, RMSE: %.3f' % (mse, sqrt(mse)))
# make a single prediction
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
yhat = predict(row, model)
# yhat has shape (1, 1), so extract the single value before formatting
print('Predicted: %.3f' % yhat[0][0])
```

Running the example first reports the shape of the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved an MSE of about 82, which is an RMSE of about nine (units are thousands of dollars). A value of about 22 is then predicted for the single example.

```
339 167
MSE: 82.576, RMSE: 9.087
Predicted: 21.909
```
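The relationship between the two reported scores can be checked by hand. The sketch below first computes an MSE and RMSE for a few made-up values (the `actual` and `predicted` lists are hypothetical, not taken from the dataset), then confirms that the square root of the MSE from the sample output gives the reported RMSE.

```python
# relate mse and rmse (illustrative sketch with hypothetical values)
from math import sqrt

# hypothetical actual and predicted house values, in thousands of dollars
actual = [24.0, 21.6, 34.7]
predicted = [21.9, 22.0, 30.0]
# mse is the mean of the squared differences
small_mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
# rmse is the square root of mse, which returns the error to the original units
small_rmse = sqrt(small_mse)
print('MSE: %.3f, RMSE: %.3f' % (small_mse, small_rmse))

# the same conversion applied to the mse from the sample output above
rmse = sqrt(82.576)
print('RMSE: %.3f' % rmse)
```

Because the RMSE is in the same units as the target variable, it is usually the easier of the two scores to interpret.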

Convolutional Neural Networks, or CNNs for short, are a type of network designed for image input.

They are composed of convolutional layers that extract features (called feature maps) and pooling layers that distill features down to the most salient elements.

CNNs are best suited to image classification tasks, although they can be used on a wide array of tasks that take images as input.

A popular image classification task is the MNIST handwritten digit classification. It involves tens of thousands of handwritten digits that must be classified as a number between 0 and 9.

The torchvision API provides a convenience function to download and load this dataset directly.

The example below loads the dataset and plots the first few images.

```python
# load mnist dataset in pytorch
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose
from torchvision.transforms import ToTensor
from matplotlib import pyplot
# define location to save or load the dataset
path = '~/.torch/datasets/mnist'
# define the transforms to apply to the data
trans = Compose([ToTensor()])
# download and define the datasets
train = MNIST(path, train=True, download=True, transform=trans)
test = MNIST(path, train=False, download=True, transform=trans)
# define how to enumerate the datasets
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=32, shuffle=True)
# get one batch of images
i, (inputs, targets) = next(enumerate(train_dl))
# plot some images
for i in range(25):
    # define subplot
    pyplot.subplot(5, 5, i+1)
    # plot raw pixel data
    pyplot.imshow(inputs[i][0], cmap='gray')
# show the figure
pyplot.show()
```

Running the example loads the MNIST dataset. For reference, the default train and test datasets have the following shapes.

```
Train: X=(60000, 28, 28), y=(60000,)
Test: X=(10000, 28, 28), y=(10000,)
```

A plot is then created showing a grid of examples of handwritten images in the training dataset.

We can train a CNN model to classify the images in the MNIST dataset.

Note that the images are arrays of grayscale pixel data; therefore, a channel dimension must be added to the data before the images can be used as input to the model. The ToTensor transform does this, producing a tensor with the shape (1, 28, 28) for each image.

It is a good idea to scale the pixel values from the default range of 0-255 to have a zero mean and a standard deviation of 1.
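The scaling can be worked through on a few values with NumPy. This is only a sketch of the arithmetic: the pixel values are hypothetical, and the mean of 0.1307 and standard deviation of 0.3081 are the constants passed to Normalize in the complete example below.

```python
# standardize pixel values by hand (illustrative sketch)
from numpy import array

# hypothetical raw pixel values in the default range 0-255
pixels = array([0, 128, 255], dtype='float32')
# ToTensor first rescales pixel values to the range 0-1
scaled = pixels / 255.0
# Normalize then subtracts a mean and divides by a standard deviation
mean, std = 0.1307, 0.3081
standardized = (scaled - mean) / std
print(standardized)
```

Black pixels end up slightly below zero and white pixels well above it, which reflects the fact that most pixels in an MNIST image are background.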

The complete example of fitting and evaluating a CNN model on the MNIST dataset is listed below.

```python
# pytorch cnn for multiclass classification
from numpy import vstack
from numpy import argmax
from sklearn.metrics import accuracy_score
from torchvision.datasets import MNIST
from torchvision.transforms import Compose
from torchvision.transforms import ToTensor
from torchvision.transforms import Normalize
from torch.utils.data import DataLoader
from torch.nn import Conv2d
from torch.nn import MaxPool2d
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Softmax
from torch.nn import Module
from torch.optim import SGD
from torch.nn import CrossEntropyLoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_

# model definition
class CNN(Module):
    # define model elements
    def __init__(self, n_channels):
        super(CNN, self).__init__()
        # input to first hidden layer
        self.hidden1 = Conv2d(n_channels, 32, (3,3))
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # first pooling layer
        self.pool1 = MaxPool2d((2,2), stride=(2,2))
        # second hidden layer
        self.hidden2 = Conv2d(32, 32, (3,3))
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # second pooling layer
        self.pool2 = MaxPool2d((2,2), stride=(2,2))
        # fully connected layer
        self.hidden3 = Linear(5*5*32, 100)
        kaiming_uniform_(self.hidden3.weight, nonlinearity='relu')
        self.act3 = ReLU()
        # output layer
        self.hidden4 = Linear(100, 10)
        xavier_uniform_(self.hidden4.weight)
        self.act4 = Softmax(dim=1)

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        X = self.pool1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        X = self.pool2(X)
        # flatten (must match the input size of the fully connected layer)
        X = X.view(-1, 5*5*32)
        # third hidden layer
        X = self.hidden3(X)
        X = self.act3(X)
        # output layer
        X = self.hidden4(X)
        X = self.act4(X)
        return X

# prepare the dataset
def prepare_data(path):
    # define standardization
    trans = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    # load dataset
    train = MNIST(path, train=True, download=True, transform=trans)
    test = MNIST(path, train=False, download=True, transform=trans)
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=64, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(10):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        # convert to class labels
        yhat = argmax(yhat, axis=1)
        # reshape for stacking
        actual = actual.reshape((len(actual), 1))
        yhat = yhat.reshape((len(yhat), 1))
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate accuracy
    acc = accuracy_score(actuals, predictions)
    return acc

# prepare the data
path = '~/.torch/datasets/mnist'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = CNN(1)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
```

Running the example first reports the shape of the train and test datasets, then fits the model and evaluates it on the test dataset.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 98 percent on the test dataset.

```
60000 10000
Accuracy: 0.985
```
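The input size of the fully connected layer in the CNN above, Linear(5*5*32, 100), follows from how each layer changes the width and height of the image. The sketch below traces this arithmetic for a 28x28 input, assuming no padding and a convolution stride of 1, which are the defaults for Conv2d; the `output_size()` helper is a hypothetical name introduced only for this illustration.

```python
# trace the feature map size through the cnn layers (illustrative sketch)

# hypothetical helper: output width of a conv or pooling layer without padding
def output_size(size, kernel, stride=1):
    return (size - kernel) // stride + 1

size = 28
# first 3x3 convolution
size = output_size(size, 3)
# first 2x2 max pooling with a stride of 2
size = output_size(size, 2, 2)
# second 3x3 convolution
size = output_size(size, 3)
# second 2x2 max pooling with a stride of 2
size = output_size(size, 2, 2)
# flattened length is width * height * number of feature maps
flat = size * size * 32
print(size, flat)
```

If you change a kernel size or the number of filters, the input size of the first Linear layer must be updated to match.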

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications, 2018.
- Deep Learning with PyTorch, 2020.
- Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD, 2020.

- PyTorch Homepage.
- PyTorch Documentation.
- PyTorch Installation Guide.
- PyTorch, Wikipedia.
- PyTorch on GitHub.

In this tutorial, you discovered a step-by-step guide to developing deep learning models in PyTorch.

Specifically, you learned:

- The difference between Torch and PyTorch and how to install and confirm PyTorch is working.
- The five-step life-cycle of PyTorch models and how to define, fit, and evaluate models.
- How to develop PyTorch deep learning models for regression, classification, and predictive modeling tasks.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post PyTorch Tutorial: How to Develop Deep Learning Models with Python appeared first on Machine Learning Mastery.

The post Basic Data Cleaning for Machine Learning (That You Must Perform) appeared first on Machine Learning Mastery.

Data cleaning is a critically important step in any machine learning project.

In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform.

Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results.

In this tutorial, you will discover basic data cleaning you should always perform on your dataset.

After completing this tutorial, you will know:

- How to identify and remove column variables that only have a single value.
- How to identify and consider column variables with very few unique values.
- How to identify and remove rows that contain duplicate observations.

Let’s get started.

This tutorial is divided into five parts; they are:

- Identify Columns That Contain a Single Value
- Delete Columns That Contain a Single Value
- Consider Columns That Have Very Few Values
- Identify Rows that Contain Duplicate Data
- Delete Rows that Contain Duplicate Data

Columns that have a single observation or value are probably useless for modeling.

Here, a single value means that each row for that column has the same value. For example, the column *X1* has the value 1.0 for all rows in the dataset:

```
X1
1.0
1.0
1.0
1.0
1.0
...
```

Columns that have a single value for all rows do not contain any information for modeling.

Depending on the choice of data preparation and modeling algorithms, variables with a single value can also cause errors or unexpected results.

You can detect columns that have this property using the unique() NumPy function, which can be used to report the number of unique values in each column.

The example below loads the oil-spill classification dataset that contains 50 variables and summarizes the number of unique values for each column.

```python
# summarize the number of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the number of unique values in each column
for i in range(data.shape[1]):
    print(i, len(unique(data[:, i])))
```

Running the example loads the dataset directly from the URL and prints the number of unique values for each column.

We can see that column index 22 only has a single value and should be removed.

```
0 238
1 297
2 927
3 933
4 179
5 375
6 820
7 618
8 561
9 57
10 577
11 59
12 73
13 107
14 53
15 91
16 893
17 810
18 170
19 53
20 68
21 9
22 1
23 92
24 9
25 8
26 9
27 308
28 447
29 392
30 107
31 42
32 4
33 45
34 141
35 110
36 3
37 758
38 9
39 9
40 388
41 220
42 644
43 649
44 499
45 2
46 937
47 169
48 286
49 2
```

A simpler approach is to use the nunique() Pandas function that does the hard work for you.

Below is the same example using the Pandas function.

```python
# summarize the number of unique values for each column using pandas
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# summarize the number of unique values in each column
print(df.nunique())
```

Running the example, we get the same result: the column index and the number of unique values for each column.

```
0     238
1     297
2     927
3     933
4     179
5     375
6     820
7     618
8     561
9      57
10    577
11     59
12     73
13    107
14     53
15     91
16    893
17    810
18    170
19     53
20     68
21      9
22      1
23     92
24      9
25      8
26      9
27    308
28    447
29    392
30    107
31     42
32      4
33     45
34    141
35    110
36      3
37    758
38      9
39      9
40    388
41    220
42    644
43    649
44    499
45      2
46    937
47    169
48    286
49      2
dtype: int64
```

Variables or columns that have a single value should probably be removed from your dataset.

Columns are relatively easy to remove from a NumPy array or Pandas DataFrame.

One approach is to record all columns that have a single unique value, then delete them from the Pandas DataFrame by calling the drop() function.

The complete example is listed below.

```python
# delete columns with a single unique value
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i,v in enumerate(counts) if v == 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
```

Running the example first loads the dataset and reports the number of rows and columns.

The number of unique values for each column is calculated, and those columns that have a single unique value are identified. In this case, column index 22.

The identified columns are then removed from the DataFrame, and the number of rows and columns in the DataFrame are reported to confirm the change.

```
(937, 50)
[22]
(937, 49)
```
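The same deletion can be performed on a NumPy array. The sketch below is one way to do it using the delete() NumPy function on a small made-up array (the values are hypothetical, chosen so that the first column holds a single value).

```python
# delete columns with a single unique value from a numpy array (illustrative sketch)
from numpy import array
from numpy import delete
from numpy import unique

# hypothetical data where column 0 has the same value in every row
data = array([
    [1.0, 5.0, 0.2],
    [1.0, 6.0, 0.4],
    [1.0, 7.0, 0.4]])
print(data.shape)
# record the indexes of columns that have a single unique value
to_del = [i for i in range(data.shape[1]) if len(unique(data[:, i])) == 1]
print(to_del)
# delete() returns a new array and leaves the original unchanged
reduced = delete(data, to_del, axis=1)
print(reduced.shape)
```

Unlike the DataFrame drop() call with inplace=True, delete() does not modify the array in place, so the result must be assigned to a variable.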

In the previous section, we saw that some columns in the example dataset had very few unique values.

For example, there were columns that only had 2, 4, and 9 unique values. This might make sense for ordinal or categorical variables. In this case, the dataset only contains numerical variables. As such, only having 2, 4, or 9 unique numerical values in a column might be surprising.

These columns may or may not contribute to the skill of a model.

Depending on the choice of data preparation and modeling algorithms, variables with very few numerical values can also cause errors or unexpected results. For example, I have seen them cause errors when using power transforms for data preparation and when fitting linear models that assume a “*sensible*” data probability distribution.

To help highlight columns of this type, you can calculate the number of unique values for each variable as a percentage of the total number of rows in the dataset.

Let’s do this manually using NumPy. The complete example is listed below.

```python
# summarize the percentage of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the number of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    print('%d, %d, %.1f%%' % (i, num, percentage))
```

Running the example reports the column index and the number of unique values for each column, followed by the percentage of unique values out of all rows in the dataset.

Here, we can see that some columns have a very low percentage of unique values, such as below 1 percent.

```
0, 238, 25.4%
1, 297, 31.7%
2, 927, 98.9%
3, 933, 99.6%
4, 179, 19.1%
5, 375, 40.0%
6, 820, 87.5%
7, 618, 66.0%
8, 561, 59.9%
9, 57, 6.1%
10, 577, 61.6%
11, 59, 6.3%
12, 73, 7.8%
13, 107, 11.4%
14, 53, 5.7%
15, 91, 9.7%
16, 893, 95.3%
17, 810, 86.4%
18, 170, 18.1%
19, 53, 5.7%
20, 68, 7.3%
21, 9, 1.0%
22, 1, 0.1%
23, 92, 9.8%
24, 9, 1.0%
25, 8, 0.9%
26, 9, 1.0%
27, 308, 32.9%
28, 447, 47.7%
29, 392, 41.8%
30, 107, 11.4%
31, 42, 4.5%
32, 4, 0.4%
33, 45, 4.8%
34, 141, 15.0%
35, 110, 11.7%
36, 3, 0.3%
37, 758, 80.9%
38, 9, 1.0%
39, 9, 1.0%
40, 388, 41.4%
41, 220, 23.5%
42, 644, 68.7%
43, 649, 69.3%
44, 499, 53.3%
45, 2, 0.2%
46, 937, 100.0%
47, 169, 18.0%
48, 286, 30.5%
49, 2, 0.2%
```

We can update the example to only summarize those variables that have unique values that are less than 1 percent of the number of rows.

```python
# summarize the percentage of unique values for each column using numpy
from urllib.request import urlopen
from numpy import loadtxt
from numpy import unique
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
data = loadtxt(urlopen(path), delimiter=',')
# summarize the number of unique values in each column
for i in range(data.shape[1]):
    num = len(unique(data[:, i]))
    percentage = float(num) / data.shape[0] * 100
    if percentage < 1:
        print('%d, %d, %.1f%%' % (i, num, percentage))
```

Running the example, we can see that 11 of the 50 variables have a number of unique values that is less than 1 percent of the number of rows.

This does not mean that these columns should be deleted, but they require further attention.

For example:

- Perhaps the unique values can be encoded as ordinal values?
- Perhaps the unique values can be encoded as categorical values?
- Perhaps compare model skill with each variable removed from the dataset?

```
21, 9, 1.0%
22, 1, 0.1%
24, 9, 1.0%
25, 8, 0.9%
26, 9, 1.0%
32, 4, 0.4%
36, 3, 0.3%
38, 9, 1.0%
39, 9, 1.0%
45, 2, 0.2%
49, 2, 0.2%
```
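The first two suggestions, encoding a column with few unique values as ordinal or as categorical values, can be sketched with Pandas. The example below is one possible approach on a small made-up column; the column name `x` and its values are hypothetical, and whether an encoding helps on a real dataset needs to be tested.

```python
# encode a numerical column with few unique values (illustrative sketch)
from pandas import DataFrame
from pandas import get_dummies

# hypothetical column that only takes three distinct numerical values
df = DataFrame({'x': [0.1, 0.5, 0.9, 0.5, 0.1]})
# ordinal encoding: each unique value is mapped to its rank in sorted order
ordinal = df['x'].astype('category').cat.codes
print(ordinal.tolist())
# categorical (one hot) encoding: one binary column per unique value
onehot = get_dummies(df['x'], prefix='x')
print(onehot.shape)
```

The ordinal encoding keeps the order of the values but discards their spacing, whereas the one hot encoding discards the order as well.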

If we wanted to delete all 11 columns with unique values less than 1 percent of rows, the example below demonstrates this.

```python
# delete columns where number of unique values is less than 1% of the rows
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# get number of unique values for each column
counts = df.nunique()
# record columns to delete
to_del = [i for i,v in enumerate(counts) if (float(v)/df.shape[0]*100) < 1]
print(to_del)
# drop useless columns
df.drop(to_del, axis=1, inplace=True)
print(df.shape)
```

Running the example first loads the dataset and reports the number of rows and columns.

The number of unique values for each column is calculated, and those columns that have a number of unique values less than 1 percent of the rows are identified. In this case, 11 columns.

The identified columns are then removed from the DataFrame, and the number of rows and columns in the DataFrame are reported to confirm the change.

```
(937, 50)
[21, 22, 24, 25, 26, 32, 36, 38, 39, 45, 49]
(937, 39)
```
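The percentage of unique values can also be calculated for all columns at once by dividing the result of nunique() by the number of rows. The sketch below shows this on a small made-up DataFrame; the column names and values are hypothetical.

```python
# percentage of unique values per column using pandas (illustrative sketch)
from pandas import DataFrame

# hypothetical dataset with 400 rows and three columns
df = DataFrame({
    'a': range(400),
    'b': [0, 1] * 200,
    'c': [i % 40 for i in range(400)]})
# number of unique values as a percentage of the number of rows
percentages = df.nunique() / df.shape[0] * 100
print(percentages)
# names of the columns that fall below the 1 percent threshold
low = percentages[percentages < 1].index.tolist()
print(low)
```

The list of column names can then be passed to drop() in the same way as the list of column indexes in the example above.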

Rows that have identical data are probably useless, if not dangerously misleading during model evaluation.

Here, a duplicate row is a row in which the value in every column is identical to the value in the same column of another row.

From a probabilistic perspective, you can think of duplicate data as adjusting the priors for a class label or data distribution. This may help an algorithm like Naive Bayes if you wish to purposefully bias the priors. Typically, this is not the case and machine learning algorithms will perform better by identifying and removing rows with duplicate data.

From an algorithm evaluation perspective, duplicate rows will result in misleading performance. For example, if you are using a train/test split or k-fold cross-validation, then it is possible for a duplicate row or rows to appear in both train and test datasets and any evaluation of the model on these rows will be (or should be) correct. This will result in an optimistically biased estimate of performance on unseen data.

If you think this is not the case for your dataset or chosen model, design a controlled experiment to test it. This could be achieved by evaluating model skill with the raw dataset and the dataset with duplicates removed and comparing performance. Another experiment might involve augmenting the dataset with different numbers of randomly selected duplicate examples.
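Before running such an experiment, it can help to check whether any test rows also appear in the training set. The sketch below shows one way to count them with an inner merge on all columns; the data and the fixed split are hypothetical and chosen only to make the overlap easy to see.

```python
# count test rows that also appear in the training set (illustrative sketch)
from pandas import DataFrame

# hypothetical dataset that contains duplicate rows
df = DataFrame({
    'x1': [4.9, 5.8, 5.1, 4.9, 6.3, 5.8],
    'x2': [3.1, 2.7, 3.5, 3.1, 3.3, 2.7],
    'y': ['a', 'b', 'a', 'a', 'b', 'b']})
# a fixed split for illustration: first four rows for train, last two for test
train, test = df.iloc[:4], df.iloc[4:]
# an inner merge on all columns keeps only test rows that match a train row
leaked = test.merge(train.drop_duplicates(), how='inner')
print('%d of %d test rows also appear in train' % (len(leaked), len(test)))
```

Any model that memorizes the training data will predict these overlapping rows correctly, which is the source of the optimistic bias described above.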

The pandas function duplicated() will report whether a given row is duplicated or not. Each row is marked as either False to indicate that it is not a duplicate or True to indicate that it is a duplicate. If there are duplicates, the first occurrence of the row is marked False (by default), as we might expect.

The example below checks for duplicates.

```python
# locate rows of duplicate data
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
# calculate duplicates
dups = df.duplicated()
# report if there are any duplicates
print(dups.any())
# list all duplicate rows
print(df[dups])
```

Running the example first loads the dataset, then calculates row duplicates.

First, the presence of any duplicate rows is reported, and in this case, we can see that there are duplicates (True).

Then all duplicate rows are reported. In this case, we can see that three duplicate rows were identified and printed.

```
True
       0    1    2    3               4
34   4.9  3.1  1.5  0.1     Iris-setosa
37   4.9  3.1  1.5  0.1     Iris-setosa
142  5.8  2.7  5.1  1.9  Iris-virginica
```
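Note that the rows listed are the repeats, not the first occurrences. Which occurrences are flagged is controlled by the keep argument of duplicated(). The sketch below compares the three settings on a small made-up DataFrame (the values are hypothetical).

```python
# compare the keep settings of duplicated() (illustrative sketch)
from pandas import DataFrame

# hypothetical data where rows 0, 2 and 3 are identical
df = DataFrame([
    [4.9, 3.1, 'a'],
    [5.8, 2.7, 'b'],
    [4.9, 3.1, 'a'],
    [4.9, 3.1, 'a']])
# default: the first occurrence is not flagged, later repeats are
first = df.duplicated()
# the last occurrence is not flagged, earlier repeats are
last = df.duplicated(keep='last')
# every row that has at least one copy is flagged
every = df.duplicated(keep=False)
print(first.tolist())
print(last.tolist())
print(every.tolist())
```

Using keep=False is convenient when you want to inspect all copies of a duplicated row side by side before deciding what to delete.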

Rows of duplicate data should probably be deleted from your dataset prior to modeling.

There are many ways to achieve this, although Pandas provides the drop_duplicates() function that achieves exactly this.

The example below demonstrates deleting duplicate rows from a dataset.

```python
# delete rows of duplicate data from the dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
# load the dataset
df = read_csv(path, header=None)
print(df.shape)
# delete duplicate rows
df.drop_duplicates(inplace=True)
print(df.shape)
```

Running the example first loads the dataset and reports the number of rows and columns.

Next, the rows of duplicated data are identified and removed from the DataFrame. Then the shape of the DataFrame is reported to confirm the change.

```
(150, 5)
(147, 5)
```
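One detail worth knowing is that drop_duplicates() keeps the original row labels, so the index has gaps after rows are removed. If later code relies on a contiguous index, the sketch below shows one way to renumber the rows, using a small made-up DataFrame (the values are hypothetical).

```python
# renumber rows after deleting duplicates (illustrative sketch)
from pandas import DataFrame

# hypothetical data where row 1 repeats row 0
df = DataFrame([
    [4.9, 3.1],
    [4.9, 3.1],
    [5.8, 2.7]])
# the surviving rows keep their original labels, 0 and 2
deduped = df.drop_duplicates()
print(deduped.index.tolist())
# reset_index(drop=True) renumbers the rows from zero
clean = deduped.reset_index(drop=True)
print(clean.index.tolist())
```

Converting the DataFrame to a NumPy array with the values attribute makes this step unnecessary, since the array has no row labels.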

This section provides more resources on the topic if you are looking to go deeper.

- numpy.unique API.
- pandas.DataFrame.nunique API.
- pandas.DataFrame.drop API.
- pandas.DataFrame.duplicated API.
- pandas.DataFrame.drop_duplicates API.

In this tutorial, you discovered basic data cleaning you should always perform on your dataset.

Specifically, you learned:

- How to identify and remove column variables that only have a single value.
- How to identify and consider column variables with very few unique values.
- How to identify and remove rows that contain duplicate observations.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
