The post What Is Argmax in Machine Learning? appeared first on Machine Learning Mastery.

Argmax is a mathematical function that you may encounter in applied machine learning.

For example, you may see “*argmax*” or “*arg max*” used in a research paper to describe an algorithm. You may also be instructed to use the argmax function in your algorithm implementation.

This may be the first time that you encounter the argmax function and you may wonder what it is and how it works.

In this tutorial, you will discover the argmax function and how it is used in machine learning.

After completing this tutorial, you will know:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Argmax?
- How Is Argmax Used in Machine Learning?
- How to Implement Argmax in Python

Argmax is a mathematical function.

It is typically applied to another function that takes an argument. For example, given a function *g()* that takes the argument *x*, the *argmax* operation of that function would be described as follows:

- result = argmax(g(x))

The *argmax* function returns the argument or arguments (*arg*) for the target function that returns the maximum (*max*) value from the target function.

Consider the example where *g(x)* is calculated as the square of the *x* value and the domain or extent of input values (*x*) is limited to integers from 1 to 5:

- g(1) = 1^2 = 1
- g(2) = 2^2 = 4
- g(3) = 3^2 = 9
- g(4) = 4^2 = 16
- g(5) = 5^2 = 25

We can intuitively see that the argmax for the function *g(x)* is 5.

That is, the argument (*x*) to the target function *g()* that results in the largest value from the target function (25) is 5. Argmax provides a shorthand for specifying this argument in an abstract way without knowing what the value might be in a specific case.

- argmax(g(x)) = 5

Note that this is not the *max()* of the values returned from the function. This would be 25.

It is also not the *max()* of the arguments, although in this case the argmax and the max of the arguments are the same, i.e. 5. The *argmax()* is 5 because g returns the largest value (25) when 5 is provided, not because 5 is the largest argument.
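The distinction can be checked with a few lines of plain Python. This is a minimal sketch: *g* and the domain 1 to 5 come from the example above, while the second function *h* is a hypothetical one added here only to show a case where the argmax is not the largest argument.

```python
# Target function from the example: g(x) = x^2
def g(x):
    return x ** 2

# Hypothetical second function (not from the example) that peaks at x = 2
def h(x):
    return -(x - 2) ** 2

# Domain of candidate arguments: the integers 1 to 5
domain = range(1, 6)

# arg max: the argument that produces the largest output
argmax_g = max(domain, key=g)
# max: the largest output itself
max_g = max(g(x) for x in domain)
# for h, the arg max is 2 even though 5 is the largest argument
argmax_h = max(domain, key=h)

print('arg max g(x) = %d, max g(x) = %d' % (argmax_g, max_g))
print('arg max h(x) = %d' % argmax_h)
```

Passing the target function as the `key` of the built-in `max()` makes Python compare outputs but return the input, which is exactly the argmax idea.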

Typically, “*argmax*” is written as two separate words, e.g. “*arg max*”. For example:

- result = arg max(g(x))

It is also common to use the arg max function as an operation without brackets surrounding the target function. This is often how you will see the operation written and used in a research paper or textbook. For example:

- result = arg max g(x)

You can also use a similar operation to find the arguments to the target function that result in the minimum value from the target function, called *argmin* or “*arg min*.”
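For readers who prefer set notation, both operations can be written formally. This is a sketch using the common textbook convention, where the symbol *D* for the domain of allowed inputs is an assumption introduced here and does not appear in the text above. Note that the result is a set, which is why there may be more than one argument.

```latex
\operatorname*{arg\,max}_{x \in D} g(x) = \{\, x \in D : g(x) \ge g(x') \ \text{for all } x' \in D \,\}

\operatorname*{arg\,min}_{x \in D} g(x) = \{\, x \in D : g(x) \le g(x') \ \text{for all } x' \in D \,\}
```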

The argmax function is used throughout the field of mathematics and machine learning.

Nevertheless, there are specific situations where you will see argmax used in applied machine learning and may need to implement it yourself.

The most common situation for using argmax that you will encounter in applied machine learning is in finding the index of an array that results in the largest value.

Recall that an array is a list or vector of numbers.

It is common for multi-class classification models to predict a vector of probabilities (or probability-like values), with one probability for each class label. The probabilities represent the likelihood that a sample belongs to each of the class labels.

The predicted probabilities are ordered such that the predicted probability at index 0 belongs to the first class, the predicted probability at index 1 belongs to the second class, and so on.

Often, a single class label prediction is required from a set of predicted probabilities for a multi-class classification problem.

This conversion from a vector of predicted probabilities to a class label is most often described using the argmax operation and most often implemented using the argmax function.

Let’s make this concrete with an example.

Consider a multi-class classification problem with three classes: “*red*,” “*blue*,” and “*green*.” The class labels are mapped to integer values for modeling, as follows:

- red = 0
- blue = 1
- green = 2

Each class label integer value maps to an index of a 3-element vector that may be predicted by a model, specifying the likelihood that an example belongs to each class.

Consider that a model has made one prediction for an input sample and predicted the following vector of probabilities:

- yhat = [0.4, 0.5, 0.1]

We can see that the example has a 40 percent probability of belonging to red, a 50 percent probability of belonging to blue, and a 10 percent probability of belonging to green.

We can apply the argmax function to the vector of probabilities. The vector is the function, the output of the function is the probabilities, and the input to the function is a vector element index or an array index.

- arg max yhat

We can intuitively see that in this case, the argmax of the vector of predicted probabilities (yhat) is 1, as the probability at array index 1 is the largest value.

Note that this is not the max() of the probabilities, which would be 0.5. Also note that this is not the max of the arguments, which would be 2. Instead it is the argument that results in the maximum value, e.g. 1 that results in 0.5.

- arg max yhat = 1

We can then map this integer value back to a class label, which would be “*blue*.”

- arg max yhat = “blue”

The argmax function can be implemented in Python for a given vector of numbers.

First, we can define a function called *argmax()* that enumerates a provided vector and returns the index with the largest value.

The complete example is listed below.

    # argmax function
    def argmax(vector):
        index, value = 0, vector[0]
        for i, v in enumerate(vector):
            if v > value:
                index, value = i, v
        return index

    # define vector
    vector = [0.4, 0.5, 0.1]
    # get argmax
    result = argmax(vector)
    print('arg max of %s: %d' % (vector, result))

Running the example prints the argmax of our test data used in the previous section, which in this case is an index of 1.

arg max of [0.4, 0.5, 0.1]: 1
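One detail worth noting is how ties are handled. Because the manual function uses a strict `>` comparison, it returns the first index when several elements share the largest value. The sketch below repeats the function so it runs on its own, and adds a hypothetical helper, `all_argmax()`, that returns every tied index instead; the tied vector is made up for illustration.

```python
# manual argmax, as defined in the example above
def argmax(vector):
    index, value = 0, vector[0]
    for i, v in enumerate(vector):
        if v > value:
            index, value = i, v
    return index

# hypothetical helper: return every index that attains the maximum
def all_argmax(vector):
    best = max(vector)
    return [i for i, v in enumerate(vector) if v == best]

# illustrative vector with a tie between index 1 and index 2
tied = [0.2, 0.4, 0.4]

first = argmax(tied)
every = all_argmax(tied)

print('first arg max: %d' % first)
print('all arg max: %s' % every)
```

For class prediction, returning the first index is usually acceptable, but it does mean that ties are silently resolved in favor of the lower class integer.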

Thankfully, there is a built-in version of the argmax() function provided with the NumPy library.

This is the version that you should use in practice.

The example below demonstrates the *argmax()* NumPy function on the same vector of probabilities.

    # numpy implementation of argmax
    from numpy import argmax
    # define vector
    vector = [0.4, 0.5, 0.1]
    # get argmax
    result = argmax(vector)
    print('arg max of %s: %d' % (vector, result))

Running the example prints an index of 1, as is expected.

arg max of [0.4, 0.5, 0.1]: 1

It is more likely that you will have a collection of predicted probabilities for multiple samples.

This would be stored as a matrix with rows of predicted probabilities and each column representing a class label. The desired result of an argmax on this matrix would be a vector with one index (or class label integer) for each row of predictions.

This can be achieved with the *argmax()* NumPy function by setting the “*axis*” argument. By default, the argmax would be calculated for the entire matrix, returning a single number. Instead, we can set the axis value to 1 and calculate the argmax across the columns for each row of data.

The example below demonstrates this with a matrix of four rows of predicted probabilities for the three class labels.

    # numpy implementation of argmax
    from numpy import argmax
    from numpy import asarray
    # define matrix of predicted probabilities (one row per sample)
    probs = asarray([[0.4, 0.5, 0.1], [0.0, 0.0, 1.0], [0.9, 0.0, 0.1], [0.3, 0.3, 0.4]])
    print(probs.shape)
    # get argmax across the columns of each row
    result = argmax(probs, axis=1)
    print(result)

Running the example first prints the shape of the matrix of predicted probabilities, confirming we have four rows with three columns per row.

The argmax of the matrix is then calculated and printed as a vector, showing four values. This is what we expect, where each row results in a single argmax value or index with the largest probability.

    (4, 3)
    [1 2 0 2]
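For contrast, it can help to see what the other settings of the “*axis*” argument return. The sketch below uses a smaller, two-row matrix (the first two rows of the example above) and the NumPy function `unravel_index()` to convert the single number returned by default into a row and column position.

```python
from numpy import argmax
from numpy import asarray
from numpy import unravel_index

# two rows of predicted probabilities for three classes
probs = asarray([[0.4, 0.5, 0.1],
                 [0.0, 0.0, 1.0]])

# default (axis=None): index into the flattened matrix
flat_index = argmax(probs)
# convert the flat index back to a (row, column) position
row_col = unravel_index(flat_index, probs.shape)
# axis=1: one class index per row (per sample)
per_row = argmax(probs, axis=1)
# axis=0: one row index per column (per class)
per_col = argmax(probs, axis=0)

print(flat_index)  # 5
print(row_col)
print(per_row)     # [1 2]
print(per_col)     # [0 0 1]
```

Only `axis=1` gives one class label per sample, which is why it is the setting used for a matrix of predictions laid out with one row per sample.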


In this tutorial, you discovered the argmax function and how it is used in machine learning.

Specifically, you learned:

- Argmax is an operation that finds the argument that gives the maximum value from a target function.
- Argmax is most commonly used in machine learning for finding the class with the largest predicted probability.
- Argmax can be implemented manually, although the argmax() NumPy function is preferred in practice.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost appeared first on Machine Learning Mastery.

Gradient boosting is a powerful ensemble machine learning algorithm.

It’s popular for structured predictive modeling problems, such as classification and regression on tabular data, and is often the main algorithm or one of the main algorithms used in winning solutions to machine learning competitions, like those on Kaggle.

There are many implementations of gradient boosting available, including standard implementations in scikit-learn and efficient third-party libraries. Each uses a different interface and even different names for the algorithm.

In this tutorial, you will discover how to use gradient boosting models for classification and regression in Python.

Standardized code examples are provided for the four major implementations of gradient boosting in Python, ready for you to copy-paste and use in your own predictive modeling project.

After completing this tutorial, you will know:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms, including XGBoost, LightGBM, and CatBoost.

Let’s get started.

This tutorial is divided into five parts; they are:

- Gradient Boosting Overview
- Gradient Boosting With Scikit-Learn
    - Library Installation
    - Test Problems
    - Gradient Boosting
    - Histogram-Based Gradient Boosting
- Gradient Boosting With XGBoost
    - Library Installation
    - XGBoost for Classification
    - XGBoost for Regression
- Gradient Boosting With LightGBM
    - Library Installation
    - LightGBM for Classification
    - LightGBM for Regression
- Gradient Boosting With CatBoost
    - Library Installation
    - CatBoost for Classification
    - CatBoost for Regression

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Gradient boosting is also known as gradient tree boosting, stochastic gradient boosting (an extension), and gradient boosting machines, or GBM for short.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “*gradient boosting*,” as the loss gradient is minimized as the model is fit, much like a neural network.
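The idea of adding trees one at a time to correct prior errors can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any of the libraries in this tutorial implement the algorithm: it assumes a squared error loss (for which the negative gradient is simply the residual), uses depth-1 trees on a single input feature, and the data, `learning_rate`, `n_trees`, and `fit_stump()` are all made up for the example.

```python
# toy 1-D regression data (hypothetical values, for illustration only)
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 8.1, 8.9, 10.2, 10.8]

def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: choose the split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

learning_rate = 0.5
n_trees = 20

# start from a constant prediction: the mean of the targets
preds = [sum(y) / len(y)] * len(y)
history = [mse(preds, y)]

for _ in range(n_trees):
    # for squared error, the negative gradient of the loss is the residual
    residuals = [t - p for t, p in zip(y, preds)]
    # fit the next tree to the errors of the current ensemble
    tree = fit_stump(X, residuals)
    # add the new tree's (shrunken) correction to the ensemble prediction
    preds = [p + learning_rate * tree(x) for p, x in zip(preds, X)]
    history.append(mse(preds, y))

print('MSE before boosting: %.3f' % history[0])
print('MSE after %d trees: %.3f' % (n_trees, history[-1]))
```

Each pass fits a new tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added. The learning rate scales how much of each correction is applied.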

Gradient boosting is an effective machine learning algorithm and is often the main, or one of the main, algorithms used to win machine learning competitions (like Kaggle) on tabular and similar structured datasets.

**Note**: We will not be going into the theory behind how the gradient boosting algorithm works in this tutorial.

For more on the gradient boosting algorithm, see the tutorial:

The algorithm provides hyperparameters that should, and perhaps must, be tuned for a specific dataset. Although there are many hyperparameters to tune, perhaps the most important are as follows:

- The number of trees or estimators in the model.
- The learning rate of the model.
- The row and column sampling rate for stochastic models.
- The maximum tree depth.
- The minimum tree weight.
- The regularization terms alpha and lambda.
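As a rough guide to how these hyperparameters appear in code, the sketch below lists them using the argument names of XGBoost's scikit-learn wrapper (a library introduced later in this tutorial). The choice of XGBoost is an assumption made for illustration, since the other libraries use different names for some of these, and the values are placeholders rather than tuned recommendations.

```python
# illustrative values only, not tuned recommendations
xgb_params = {
    'n_estimators': 100,       # number of trees in the model
    'learning_rate': 0.1,      # learning rate (shrinkage) of the model
    'subsample': 0.8,          # row sampling rate
    'colsample_bytree': 0.8,   # column sampling rate per tree
    'max_depth': 6,            # maximum tree depth
    'min_child_weight': 1,     # minimum tree (child) weight
    'reg_alpha': 0.0,          # regularization term alpha (L1)
    'reg_lambda': 1.0,         # regularization term lambda (L2)
}

# with the library installed, the dictionary could be unpacked into the
# constructor, e.g. XGBClassifier(**xgb_params)
for name, value in xgb_params.items():
    print('%s = %s' % (name, value))
```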

**Note**: We will not be exploring how to configure or tune the configuration of gradient boosting algorithms in this tutorial.

For more on tuning the hyperparameters of gradient boosting algorithms, see the tutorial:

There are many implementations of the gradient boosting algorithm available in Python. Perhaps the most used implementation is the version provided with the scikit-learn library.

Additional third-party libraries are available that provide computationally efficient alternate implementations of the algorithm that often achieve better results in practice. Examples include the XGBoost library, the LightGBM library, and the CatBoost library.

**Do you have a different favorite gradient boosting implementation?**

Let me know in the comments below.

When using gradient boosting on your predictive modeling project, you may want to test each implementation of the algorithm.

This tutorial provides examples of each implementation of the gradient boosting algorithm on classification and regression predictive modeling problems that you can copy-paste into your project.

Let’s take a look at each in turn.

**Note**: We are not comparing the performance of the algorithms in this tutorial. Instead, we are providing code examples to demonstrate how to use each different implementation. As such, we are using synthetic test datasets to demonstrate evaluating and making a prediction with each implementation.

This tutorial assumes you have Python and SciPy installed. If you need help, see the tutorial:

In this section, we will review how to use the gradient boosting algorithm implementation in the scikit-learn library.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.22.1

We will demonstrate the gradient boosting algorithm for classification and regression.

As such, we will use synthetic test problems from the scikit-learn library.

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five of which will be redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five of which will be uninformative.

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

Next, let’s look at how we can develop gradient boosting models in scikit-learn.

The scikit-learn library provides the GBM algorithm for regression and classification via the *GradientBoostingClassifier* and *GradientBoostingRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a GradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = GradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.915 (0.025)
    Prediction: 1

The example below first evaluates a GradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = GradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = GradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    MAE: -11.854 (1.121)
    Prediction: -80.661
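The reported MAE is negative because the *neg_mean_absolute_error* scorer returns the error with its sign flipped, following the scikit-learn convention that a larger score is always better. The sketch below shows one way to report the value as a conventional positive error. The per-fold scores are hypothetical numbers invented for the example, not results from the model above.

```python
from statistics import mean
from statistics import pstdev

# hypothetical per-fold scores, as returned with scoring='neg_mean_absolute_error'
n_scores = [-11.2, -12.9, -10.8, -12.4, -11.9]

# flip the sign to recover the mean absolute error for each fold
mae_scores = [-score for score in n_scores]

# pstdev is the population standard deviation, matching numpy's std() default
print('MAE: %.3f (%.3f)' % (mean(mae_scores), pstdev(mae_scores)))
```

Either form carries the same information; a value closer to zero indicates a better model in both cases.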

The scikit-learn library provides an alternate implementation of the gradient boosting algorithm, referred to as histogram-based gradient boosting.

This is an alternate approach to implement gradient tree boosting inspired by the LightGBM library (described more later). This implementation is provided via the *HistGradientBoostingClassifier* and *HistGradientBoostingRegressor* classes.

The primary benefit of the histogram-based approach to gradient boosting is speed. These implementations are designed to be much faster to fit on training data.

At the time of writing, this is an experimental implementation and requires that you add the following line to your code to enable access to these classes.

from sklearn.experimental import enable_hist_gradient_boosting

Without this line, you will see an error like:

ImportError: cannot import name 'HistGradientBoostingClassifier'

or

ImportError: cannot import name 'HistGradientBoostingRegressor'
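Whether the enabling import is needed depends on the installed version. To the best of my knowledge, these classes stopped being experimental in scikit-learn 1.0, so the extra line is only required on older releases such as the 0.22 series used here. The helper below is a hypothetical sketch of that check; the function name is made up, and it only inspects a version string so it runs without scikit-learn installed.

```python
import re

def needs_experimental_import(version):
    """Return True if this scikit-learn version string predates 1.0."""
    # read the leading 'major.minor' numbers, ignoring suffixes such as 'rc1'
    match = re.match(r'(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError('unrecognized version string: %r' % version)
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (1, 0)

# possible usage, assuming scikit-learn is installed:
#   import sklearn
#   if needs_experimental_import(sklearn.__version__):
#       from sklearn.experimental import enable_hist_gradient_boosting

for version in ['0.22.1', '1.0', '1.3.2']:
    print(version, needs_experimental_import(version))
```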

Let’s take a closer look at how to use this implementation.

The example below first evaluates a HistGradientBoostingClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for classification in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    # enable the experimental estimators before importing them
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

Running the example first reports the evaluation of the model using repeated k-fold cross-validation, then the result of making a single prediction with a model fit on the entire dataset.

    Accuracy: 0.935 (0.024)
    Prediction: 1

The example below first evaluates a HistGradientBoostingRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # histogram-based gradient boosting for regression in scikit-learn
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    # enable the experimental estimators before importing them
    from sklearn.experimental import enable_hist_gradient_boosting
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = HistGradientBoostingRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = HistGradientBoostingRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.723 (1.540)
    Prediction: -77.837

XGBoost, which is short for “*Extreme Gradient Boosting*,” is a library that provides an efficient implementation of the gradient boosting algorithm.

The main benefit of the XGBoost implementation is computational efficiency and often better model performance.

For more on the benefits and capability of XGBoost, see the tutorial:

You can install the XGBoost library using the pip Python installer, as follows:

sudo pip install xgboost

For additional installation instructions specific to your platform see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check xgboost version
    import xgboost
    print(xgboost.__version__)

Running the example, you should see the following version number or higher.

1.0.1

The XGBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *XGBClassifier* and *XGBRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an XGBClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for classification
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = XGBClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBClassifier()
    model.fit(X, y)
    # make a single prediction (reshape the row into a 2D array with one sample)
    row = [2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.936 (0.019)
    Prediction: 1
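Unlike the scikit-learn examples, the XGBoost example converts the row to a NumPy array and reshapes it before calling *predict()*, presumably because the wrapper expects a two-dimensional array with one row per sample rather than a nested Python list. The sketch below shows only what the reshape does to the array's shape; the row is shortened to three values for brevity.

```python
from numpy import asarray

# a single sample, shortened to three features for illustration
row = [2.56999479, -0.13019997, 3.16075093]

# a flat list becomes a 1D array with one entry per feature
flat = asarray(row)
print(flat.shape)

# reshape to 2D: one row (sample) by len(row) columns (features)
batch = flat.reshape((1, len(row)))
print(batch.shape)
```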

The example below first evaluates an XGBRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # xgboost for regression
    from numpy import asarray
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = XGBRegressor(objective='reg:squarederror')
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = XGBRegressor(objective='reg:squarederror')
    model.fit(X, y)
    # make a single prediction (reshape the row into a 2D array with one sample)
    row = [2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]
    row = asarray(row).reshape((1, len(row)))
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -15.048 (1.316)
    Prediction: -93.434

LightGBM, short for Light Gradient Boosted Machine, is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of LightGBM is the changes to the training algorithm that make the process dramatically faster and, in many cases, result in a more effective model.

For more technical details on the LightGBM algorithm, see the paper:

You can install the LightGBM library using the pip Python installer, as follows:

sudo pip install lightgbm

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check lightgbm version
    import lightgbm
    print(lightgbm.__version__)

Running the example, you should see the following version number or higher.

2.3.1

The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *LGBMClassifier* and *LGBMRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates an LGBMClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = LGBMClassifier()
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMClassifier()
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.934 (0.021)
    Prediction: 1

The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # lightgbm for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from lightgbm import LGBMRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = LGBMRegressor()
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = LGBMRegressor()
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -12.739 (1.408)
    Prediction: -82.040

CatBoost is a third-party library developed at Yandex that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of CatBoost (in addition to computational speed improvements) is support for categorical input variables. This gives the library its name, CatBoost, for “*Category Gradient Boosting*.”

For more technical details on the CatBoost algorithm, see the paper:

You can install the CatBoost library using the pip Python installer, as follows:

sudo pip install catboost

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

    # check catboost version
    import catboost
    print(catboost.__version__)

Running the example, you should see the following version number or higher.

0.21

The CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via the *CatBoostClassifier* and *CatBoostRegressor* classes.

Let’s take a closer look at each in turn.

The example below first evaluates a CatBoostClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for classification
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from catboost import CatBoostClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # evaluate the model
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostClassifier(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
    yhat = model.predict(row)
    print('Prediction: %d' % yhat[0])

    Accuracy: 0.931 (0.026)
    Prediction: 1

The example below first evaluates a CatBoostRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error. Then a single model is fit on all available data and a single prediction is made.

The complete example is listed below.

    # catboost for regression
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from catboost import CatBoostRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # evaluate the model
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
    # fit the model on the whole dataset
    model = CatBoostRegressor(verbose=0, n_estimators=100)
    model.fit(X, y)
    # make a single prediction
    row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(row)
    print('Prediction: %.3f' % yhat[0])

    MAE: -9.281 (0.951)
    Prediction: -74.212

This section provides more resources on the topic if you are looking to go deeper.

- How to Setup Your Python Environment for Machine Learning with Anaconda
- A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning
- How to Configure the Gradient Boosting Algorithm
- A Gentle Introduction to XGBoost for Applied Machine Learning

- Stochastic Gradient Boosting, 2002.
- XGBoost: A Scalable Tree Boosting System, 2016.
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.
- CatBoost: gradient boosting with categorical features support, 2017.

- Scikit-Learn Homepage.
- sklearn.ensemble API.
- XGBoost Homepage.
- XGBoost Python API.
- LightGBM Project.
- LightGBM Python API.
- CatBoost Homepage.
- CatBoost API.

In this tutorial, you discovered how to use gradient boosting models for classification and regression in Python.

Specifically, you learned:

- Gradient boosting is an ensemble algorithm that fits boosted decision trees by minimizing an error gradient.
- How to evaluate and use gradient boosting with scikit-learn, including gradient boosting machines and the histogram-based algorithm.
- How to evaluate and use third-party gradient boosting algorithms including XGBoost, LightGBM and CatBoost.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Boosting with Scikit-Learn, XGBoost, LightGBM, and CatBoost appeared first on Machine Learning Mastery.

The post How to Calculate Feature Importance With Python appeared first on Machine Learning Mastery.

Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

There are many types and sources of feature importance scores, although popular examples include statistical correlation scores, coefficients calculated as part of linear models, decision trees, and permutation importance scores.

Feature importance scores play an important role in a predictive modeling project, including providing insight into the data, insight into the model, and the basis for dimensionality reduction and feature selection that can improve the efficiency and effectiveness of a predictive model on the problem.

In this tutorial, you will discover feature importance scores for machine learning in Python.

After completing this tutorial, you will know:

- The role of feature importance in a predictive modeling problem.
- How to calculate and review feature importance from linear models and decision trees.
- How to calculate and review permutation feature importance scores.

Let’s get started.

This tutorial is divided into five parts; they are:

- Feature Importance
- Preparation
    - Check Scikit-Learn Version
    - Test Datasets
- Coefficients as Feature Importance
    - Linear Regression Feature Importance
    - Logistic Regression Feature Importance
- Decision Tree Feature Importance
    - CART Feature Importance
    - Random Forest Feature Importance
    - XGBoost Feature Importance
- Permutation Feature Importance
    - Permutation Feature Importance for Regression
    - Permutation Feature Importance for Classification

Feature importance refers to a class of techniques that assign a score to each input feature of a predictive model, indicating the relative importance of that feature when making a prediction.

Feature importance scores can be calculated for problems that involve predicting a numerical value, called regression, and those problems that involve predicting a class label, called classification.

The scores are useful and can be used in a range of situations in a predictive modeling problem, such as:

- Better understanding the data.
- Better understanding a model.
- Reducing the number of input features.

**Feature importance scores can provide insight into the dataset**. The relative scores can highlight which features may be most relevant to the target, and the converse, which features are the least relevant. This may be interpreted by a domain expert and could be used as the basis for gathering more or different data.

**Feature importance scores can provide insight into the model**. Most importance scores are calculated by a predictive model that has been fit on the dataset. Inspecting the importance score provides insight into that specific model and which features are the most important and least important to the model when making a prediction. This is a type of model interpretation that can be performed for those models that support it.

**Feature importance can be used to improve a predictive model**. This can be achieved by using the importance scores to select those features to delete (lowest scores) or those features to keep (highest scores). This is a type of feature selection and can simplify the problem that is being modeled, speed up the modeling process (deleting features is called dimensionality reduction), and in some cases, improve the performance of the model.

Feature importance scores can be fed to a wrapper model, such as the SelectFromModel class, to perform feature selection.
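As a minimal sketch of this idea, the example below wraps a random forest in SelectFromModel and keeps the five features with the highest importance scores. The choice of RandomForestClassifier, the value max_features=5, and the variable names are assumptions made for illustration, not a recommended configuration.

```python
# sketch: feature selection driven by importance scores via SelectFromModel
from numpy import inf
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# threshold=-inf disables the score threshold so that only max_features controls the selection
fs = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1), max_features=5, threshold=-inf)
# fit the wrapped model and keep the five highest-scoring columns
X_selected = fs.fit_transform(X, y)
print(X.shape, X_selected.shape)
# boolean mask marking which of the original features were kept
print(fs.get_support())
```

The transformed array can then be used to fit a downstream model on the reduced set of inputs.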

There are many ways to calculate feature importance scores and many models that can be used for this purpose.

Perhaps the simplest way is to calculate simple coefficient statistics between each feature and the target variable. For more on this approach, see the tutorial:

In this tutorial, we will look at three main types of more advanced feature importance; they are:

- Feature importance from model coefficients.
- Feature importance from decision trees.
- Feature importance from permutation testing.

Let’s take a closer look at each.

Before we dive in, let’s confirm our environment and prepare some test datasets.

First, confirm that you have a modern version of the scikit-learn library installed.

This is important because some of the models we will explore in this tutorial require a modern version of the library.

You can check the version of the library you have installed with the following code example:

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example will print the version of the library. At the time of writing, this is about version 0.22.

You need to be using this version of scikit-learn or higher.

0.22.1

Next, let’s define some test datasets that we can use as the basis for demonstrating and exploring feature importance scores.

Each test problem has five important and five unimportant features, and it may be interesting to see which methods are consistent at finding or differentiating the features based on their importance.

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples, with 10 input features, five of which will be informative and the remaining five will be redundant. We will fix the random number seed to ensure we get the same examples each time the code is run.

An example of creating and summarizing the dataset is listed below.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

We will use the make_regression() function to create a test regression dataset.

Like the classification dataset, the regression dataset will have 1,000 examples with 10 input features, five of which will be informative. The remaining five will be uninformative; they do not contribute to the target.

    # test regression dataset
    from sklearn.datasets import make_regression
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the expected number of samples and features.

(1000, 10) (1000,)

Next, let’s take a closer look at coefficients as importance scores.

Linear machine learning algorithms fit a model where the prediction is the weighted sum of the input values.

Examples include linear regression, logistic regression, and extensions that add regularization, such as ridge regression and the elastic net.

All of these algorithms find a set of coefficients to use in the weighted sum in order to make a prediction. These coefficients can be used directly as a crude type of feature importance score.

Let’s take a closer look at using coefficients as feature importance for classification and regression. We will fit a model on the dataset to find the coefficients, then summarize the importance scores for each input feature and finally create a bar chart to get an idea of the relative importance of the features.

We can fit a LinearRegression model on the regression dataset and retrieve the *coef_* property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.

The complete example of linear regression coefficients for feature importance is listed below.

    # linear regression feature importance
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = LinearRegression()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.coef_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the coefficient value for each feature.

The scores suggest that the model found the five important features and marked all other features with a zero coefficient, essentially removing them from the model.

    Feature: 0, Score: 0.00000
    Feature: 1, Score: 12.44483
    Feature: 2, Score: -0.00000
    Feature: 3, Score: -0.00000
    Feature: 4, Score: 93.32225
    Feature: 5, Score: 86.50811
    Feature: 6, Score: 26.74607
    Feature: 7, Score: 3.28535
    Feature: 8, Score: -0.00000
    Feature: 9, Score: 0.00000

A bar chart is then created for the feature importance scores.

This approach may also be used with Ridge and ElasticNet models.
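A brief sketch of the same idea with regularized linear models is given below. The hyperparameter values (alpha=1.0, l1_ratio=0.5) are simply the library defaults written out explicitly, and the dictionary named coefs is an illustrative helper.

```python
# sketch: coefficients as importance scores for Ridge and ElasticNet
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define the models to compare
models = {'Ridge': Ridge(alpha=1.0), 'ElasticNet': ElasticNet(alpha=1.0, l1_ratio=0.5)}
coefs = {}
for name, model in models.items():
    # fit the model and keep its coefficients
    model.fit(X, y)
    coefs[name] = model.coef_
    # summarize feature importance
    print(name)
    for i, v in enumerate(model.coef_):
        print('Feature: %d, Score: %.5f' % (i, v))
```

Because regularization shrinks the coefficients, the absolute values may differ from those of plain linear regression even when the ranking of features is similar.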

We can fit a LogisticRegression model on the classification dataset and retrieve the *coef_* property that contains the coefficients found for each input variable.

These coefficients can provide the basis for a crude feature importance score. This assumes that the input variables have the same scale or have been scaled prior to fitting a model.

The complete example of logistic regression coefficients for feature importance is listed below.

    # logistic regression for feature importance
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = LogisticRegression()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.coef_[0]
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the coefficient value for each feature.

Recall this is a classification problem with classes 0 and 1. Notice that the coefficients are both positive and negative. The positive scores indicate a feature that predicts class 1, whereas the negative scores indicate a feature that predicts class 0.

No clear pattern of important and unimportant features can be identified from these results, at least from what I can tell.

    Feature: 0, Score: 0.16320
    Feature: 1, Score: -0.64301
    Feature: 2, Score: 0.48497
    Feature: 3, Score: -0.46190
    Feature: 4, Score: 0.18432
    Feature: 5, Score: -0.11978
    Feature: 6, Score: -0.40602
    Feature: 7, Score: 0.03772
    Feature: 8, Score: -0.51785
    Feature: 9, Score: 0.26540

A bar chart is then created for the feature importance scores.
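Because the sign of a coefficient only indicates the direction of the relationship, one possible way to compare features is to standardize the inputs first and then rank the features by the absolute value of their coefficients. The sketch below takes that approach. The use of StandardScaler in a pipeline and the ranking by magnitude are assumptions added for illustration and are not part of the example above.

```python
# sketch: ranking features by the magnitude of standardized logistic regression coefficients
from numpy import absolute
from numpy import argsort
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# scale the inputs so that coefficients are comparable across features
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)
# make_pipeline names each step after its lowercased class name
coef = pipeline.named_steps['logisticregression'].coef_[0]
# rank features from largest to smallest absolute coefficient
magnitude = absolute(coef)
ranking = argsort(magnitude)[::-1]
for i in ranking:
    print('Feature: %d, Coefficient: %.5f, Magnitude: %.5f' % (i, coef[i], magnitude[i]))
```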

Now that we have seen the use of coefficients as importance scores, let’s look at the more common example of decision-tree-based importance scores.

Decision tree algorithms like classification and regression trees (CART) offer importance scores based on the reduction in the criterion used to select split points, like Gini or entropy.

This same approach can be used for ensembles of decision trees, such as the random forest and stochastic gradient boosting algorithms.

Let’s take a look at a worked example of each.

We can use the CART algorithm for feature importance implemented in scikit-learn as the *DecisionTreeRegressor* and *DecisionTreeClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.
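As a small sketch of how these scores can be read, the example below fits a tree, checks that the scores sum to one (scikit-learn normalizes tree-based importance scores), and lists the features from most to least important. The fixed random_state and the variable name order are assumptions added for repeatability and illustration.

```python
# sketch: inspecting and ranking the feature_importances_ of a decision tree
from numpy import argsort
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# fit the model (random_state fixed here only so repeated runs agree)
model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)
importance = model.feature_importances_
# the scores are relative: they are normalized to sum to one
print('Sum of scores: %.3f' % importance.sum())
# list features from most to least important
order = argsort(importance)[::-1]
for i in order:
    print('Feature: %d, Score: %.5f' % (i, importance[i]))
```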

Let’s take a look at an example of this for regression and classification.

The complete example of fitting a DecisionTreeRegressor and summarizing the calculated feature importance scores is listed below.

    # decision tree for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = DecisionTreeRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00294
    Feature: 1, Score: 0.00502
    Feature: 2, Score: 0.00318
    Feature: 3, Score: 0.00151
    Feature: 4, Score: 0.51648
    Feature: 5, Score: 0.43814
    Feature: 6, Score: 0.02723
    Feature: 7, Score: 0.00200
    Feature: 8, Score: 0.00244
    Feature: 9, Score: 0.00106

A bar chart is then created for the feature importance scores.

The complete example of fitting a DecisionTreeClassifier and summarizing the calculated feature importance scores is listed below.

    # decision tree for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = DecisionTreeClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps four of the 10 features as being important to prediction.

    Feature: 0, Score: 0.01486
    Feature: 1, Score: 0.01029
    Feature: 2, Score: 0.18347
    Feature: 3, Score: 0.30295
    Feature: 4, Score: 0.08124
    Feature: 5, Score: 0.00600
    Feature: 6, Score: 0.19646
    Feature: 7, Score: 0.02908
    Feature: 8, Score: 0.12820
    Feature: 9, Score: 0.04745

A bar chart is then created for the feature importance scores.

We can use the Random Forest algorithm for feature importance implemented in scikit-learn as the *RandomForestRegressor* and *RandomForestClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.

This approach can also be used with the bagging and extra trees algorithms.
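For instance, a minimal sketch with the extra trees algorithm might look as follows. The class ExtraTreesClassifier exposes the same *feature_importances_* property; the n_estimators and random_state values are assumptions chosen for illustration.

```python
# sketch: feature importance from an extra trees ensemble
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define and fit the model
model = ExtraTreesClassifier(n_estimators=100, random_state=1)
model.fit(X, y)
# get importance, averaged over the trees in the ensemble
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```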

Let’s take a look at an example of this for regression and classification.

The complete example of fitting a RandomForestRegressor and summarizing the calculated feature importance scores is listed below.

    # random forest for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = RandomForestRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00280
    Feature: 1, Score: 0.00545
    Feature: 2, Score: 0.00294
    Feature: 3, Score: 0.00289
    Feature: 4, Score: 0.52992
    Feature: 5, Score: 0.42046
    Feature: 6, Score: 0.02663
    Feature: 7, Score: 0.00304
    Feature: 8, Score: 0.00304
    Feature: 9, Score: 0.00283

A bar chart is then created for the feature importance scores.

The complete example of fitting a RandomForestClassifier and summarizing the calculated feature importance scores is listed below.

    # random forest for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = RandomForestClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.06523
    Feature: 1, Score: 0.10737
    Feature: 2, Score: 0.15779
    Feature: 3, Score: 0.20422
    Feature: 4, Score: 0.08709
    Feature: 5, Score: 0.09948
    Feature: 6, Score: 0.10009
    Feature: 7, Score: 0.04551
    Feature: 8, Score: 0.08830
    Feature: 9, Score: 0.04493

A bar chart is then created for the feature importance scores.

XGBoost is a library that provides an efficient and effective implementation of the stochastic gradient boosting algorithm.

This algorithm can be used with scikit-learn via the *XGBRegressor* and *XGBClassifier* classes.

After being fit, the model provides a *feature_importances_* property that can be accessed to retrieve the relative importance scores for each input feature.

This algorithm is also provided by scikit-learn via the *GradientBoostingClassifier* and *GradientBoostingRegressor* classes, and the same approach to feature importance can be used.
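A minimal sketch using the scikit-learn implementation is shown below, which avoids the need for a third-party library. The default hyperparameters and the fixed random_state are assumptions made for illustration.

```python
# sketch: feature importance from scikit-learn's gradient boosting implementation
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# define and fit the model
model = GradientBoostingRegressor(random_state=1)
model.fit(X, y)
# get importance
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```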

First, install the XGBoost library, such as with pip:

sudo pip install xgboost

Then confirm that the library was installed correctly and works by checking the version number.

    # check xgboost version
    import xgboost
    print(xgboost.__version__)

Running the example, you should see the following version number or higher.

0.90

For more on the XGBoost library, start here:

Let’s take a look at an example of XGBoost for feature importance on regression and classification problems.

The complete example of fitting an XGBRegressor and summarizing the calculated feature importance scores is listed below.

    # xgboost for feature importance on a regression problem
    from sklearn.datasets import make_regression
    from xgboost import XGBRegressor
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = XGBRegressor()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.00060
    Feature: 1, Score: 0.01917
    Feature: 2, Score: 0.00091
    Feature: 3, Score: 0.00118
    Feature: 4, Score: 0.49380
    Feature: 5, Score: 0.42342
    Feature: 6, Score: 0.05057
    Feature: 7, Score: 0.00419
    Feature: 8, Score: 0.00124
    Feature: 9, Score: 0.00491

A bar chart is then created for the feature importance scores.

The complete example of fitting an XGBClassifier and summarizing the calculated feature importance scores is listed below.

    # xgboost for feature importance on a classification problem
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = XGBClassifier()
    # fit the model
    model.fit(X, y)
    # get importance
    importance = model.feature_importances_
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the importance score for each feature.

The results suggest perhaps seven of the 10 features as being important to prediction.

    Feature: 0, Score: 0.02464
    Feature: 1, Score: 0.08153
    Feature: 2, Score: 0.12516
    Feature: 3, Score: 0.28400
    Feature: 4, Score: 0.12694
    Feature: 5, Score: 0.10752
    Feature: 6, Score: 0.08624
    Feature: 7, Score: 0.04820
    Feature: 8, Score: 0.09357
    Feature: 9, Score: 0.02220

A bar chart is then created for the feature importance scores.

Permutation feature importance is a technique for calculating relative importance scores that is independent of the model used.

First, a model is fit on the dataset, such as a model that does not support native feature importance scores. Then the model is used to make predictions on a dataset, although the values of a feature (column) in the dataset are scrambled. This is repeated for each feature in the dataset. Then this whole process is repeated 3, 5, 10 or more times. The result is a mean importance score for each input feature (and distribution of scores given the repeats).
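The procedure can be sketched by hand in a few lines. The version below is a simplified illustration rather than the scikit-learn implementation: it scores each feature by the increase in mean squared error after its column is shuffled, and the names baseline, n_repeats, and increases are hypothetical.

```python
# sketch: a manual version of permutation feature importance for regression
from numpy import mean
from numpy.random import default_rng
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
# define dataset and fit a model that has no native importance scores
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
model = KNeighborsRegressor()
model.fit(X, y)
# error of the model on the unmodified data
baseline = mean_squared_error(y, model.predict(X))
rng = default_rng(1)
n_repeats = 5
importance = []
for j in range(X.shape[1]):
    increases = []
    for _ in range(n_repeats):
        # scramble the values of a single column, leaving the others intact
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        # a larger increase in error means the model relied more on this feature
        increases.append(mean_squared_error(y, model.predict(X_perm)) - baseline)
    # mean importance over the repeats
    importance.append(mean(increases))
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
```

The model is fit only once; each repeat changes only the data used to make predictions.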

This approach can be used for regression or classification and requires that a performance metric be chosen as the basis of the importance score, such as the mean squared error for regression and accuracy for classification.

Permutation feature importance can be calculated via the permutation_importance() function, which takes a fitted model, a dataset (a train or test dataset is fine), and a scoring function.

Let’s take a look at this approach to feature importance with an algorithm that does not support feature importance natively, specifically k-nearest neighbors.
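As a hedged sketch of the function's main arguments, the example below scores a model on a held-out test split, sets the number of repeats explicitly, and reports the spread of scores alongside the mean. The train/test split, n_repeats=10, and random_state=1 are assumptions chosen for illustration.

```python
# sketch: permutation_importance() on a hold-out set with explicit repeats
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# define dataset and hold back a test set
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model on the training set only
model = KNeighborsClassifier()
model.fit(X_train, y_train)
# shuffle each feature 10 times and measure the drop in accuracy on the test set
results = permutation_importance(model, X_test, y_test, scoring='accuracy', n_repeats=10, random_state=1)
# report the mean and standard deviation across the repeats
for i in range(X.shape[1]):
    print('Feature: %d, Mean: %.5f, Std: %.5f' % (i, results.importances_mean[i], results.importances_std[i]))
```

The raw scores for every repeat are available in results.importances, with one row per feature and one column per repeat.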

The complete example of fitting a KNeighborsRegressor and summarizing the calculated permutation feature importance scores is listed below.

    # permutation feature importance with knn for regression
    from sklearn.datasets import make_regression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.inspection import permutation_importance
    from matplotlib import pyplot
    # define dataset
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)
    # define the model
    model = KNeighborsRegressor()
    # fit the model
    model.fit(X, y)
    # perform permutation importance
    results = permutation_importance(model, X, y, scoring='neg_mean_squared_error')
    # get importance
    importance = results.importances_mean
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the mean permutation importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 175.52007
    Feature: 1, Score: 345.80170
    Feature: 2, Score: 126.60578
    Feature: 3, Score: 95.90081
    Feature: 4, Score: 9666.16446
    Feature: 5, Score: 8036.79033
    Feature: 6, Score: 929.58517
    Feature: 7, Score: 139.67416
    Feature: 8, Score: 132.06246
    Feature: 9, Score: 84.94768

A bar chart is then created for the feature importance scores.

The complete example of fitting a KNeighborsClassifier and summarizing the calculated permutation feature importance scores is listed below.

    # permutation feature importance with knn for classification
    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.inspection import permutation_importance
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
    # define the model
    model = KNeighborsClassifier()
    # fit the model
    model.fit(X, y)
    # perform permutation importance
    results = permutation_importance(model, X, y, scoring='accuracy')
    # get importance
    importance = results.importances_mean
    # summarize feature importance
    for i,v in enumerate(importance):
        print('Feature: %0d, Score: %.5f' % (i,v))
    # plot feature importance
    pyplot.bar([x for x in range(len(importance))], importance)
    pyplot.show()

Running the example fits the model, then reports the mean permutation importance score for each feature.

The results suggest perhaps two or three of the 10 features as being important to prediction.

    Feature: 0, Score: 0.04760
    Feature: 1, Score: 0.06680
    Feature: 2, Score: 0.05240
    Feature: 3, Score: 0.09300
    Feature: 4, Score: 0.05140
    Feature: 5, Score: 0.05520
    Feature: 6, Score: 0.07920
    Feature: 7, Score: 0.05560
    Feature: 8, Score: 0.05620
    Feature: 9, Score: 0.03080

A bar chart is then created for the feature importance scores.

This section provides more resources on the topic if you are looking to go deeper.

- How to Choose a Feature Selection Method For Machine Learning
- How to Perform Feature Selection with Categorical Data
- Feature Importance and Feature Selection With XGBoost in Python
- Feature Selection For Machine Learning in Python
- An Introduction to Feature Selection

- Feature selection, scikit-learn API.
- Permutation feature importance, scikit-learn API.
- sklearn.datasets.make_classification API.
- sklearn.datasets.make_regression API.
- XGBoost Python API Reference.
- sklearn.inspection.permutation_importance API.

In this tutorial, you discovered feature importance scores for machine learning in Python.

Specifically, you learned:

- The role of feature importance in a predictive modeling problem.
- How to calculate and review feature importance from linear models and decision trees.
- How to calculate and review permutation feature importance scores.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Multi-Output Regression Models with Python appeared first on Machine Learning Mastery.

Multioutput regression problems are regression problems that involve predicting two or more numerical values given an input example.

An example might be to predict a coordinate given an input, e.g. predicting x and y values. Another example would be multi-step time series forecasting that involves predicting multiple future time steps of a given variable.

Many machine learning algorithms are designed for predicting a single numeric value, referred to simply as regression. Some algorithms do support multioutput regression inherently, such as linear regression and decision trees. There are also special workaround models that can be used to wrap and use those algorithms that do not natively support predicting multiple outputs.

In this tutorial, you will discover how to develop machine learning models for multioutput regression.

After completing this tutorial, you will know:

- The problem of multioutput regression in machine learning.
- How to develop machine learning models that inherently support multiple-output regression.
- How to develop wrapper models that allow algorithms that do not inherently support multiple outputs to be used for multiple-output regression.

Let’s get started.

This tutorial is divided into three parts; they are:

- Problem of Multioutput Regression
  - Check Scikit-Learn Version
  - Multioutput Regression Test Problem
- Inherently Multioutput Regression Algorithms
  - Linear Regression for Multioutput Regression
  - k-Nearest Neighbors for Multioutput Regression
  - Random Forest for Multioutput Regression
  - Evaluate Multioutput Regression With Cross-Validation
- Wrapper Multioutput Regression Algorithms
  - Separate Model for Each Output (MultiOutputRegressor)
  - Chained Models for Each Output (RegressorChain)

Regression refers to a predictive modeling problem that involves predicting a numerical value.

For example, predicting a size, weight, amount, number of sales, and number of clicks are regression problems. Typically, a single numeric value is predicted given input variables.

Some regression problems require the prediction of two or more numeric values. For example, predicting an x and y coordinate.

These problems are referred to as multiple-output regression, or multioutput regression.

- **Regression**: Predict a single numeric output given an input.
- **Multioutput Regression**: Predict two or more numeric outputs given an input.

In multioutput regression, typically the outputs are dependent upon the input and upon each other. This means that often the outputs are not independent of each other and may require a model that predicts both outputs together or each output contingent upon the other outputs.

Multi-step time series forecasting may be considered a type of multiple-output regression where a sequence of future values is predicted and each predicted value is dependent upon the prior values in the sequence.
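To make the time series framing concrete, the sketch below turns a toy series into a multioutput regression dataset using a sliding window. The helper name make_windows() and the window sizes are illustrative assumptions, not part of the tutorial.

```python
# framing a series as multioutput regression with a sliding window (illustrative sketch)
import numpy as np

def make_windows(series, n_in, n_out):
    """Split a 1-D series into (input window, multi-step output) pairs."""
    X, y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        # n_in past values are the inputs
        X.append(series[start:start + n_in])
        # the next n_out values are the outputs to predict together
        y.append(series[start + n_in:start + n_in + n_out])
    return np.array(X), np.array(y)

# toy series: 0, 1, ..., 9
series = np.arange(10)
X, y = make_windows(series, n_in=3, n_out=2)
print(X.shape, y.shape)  # (6, 3) (6, 2)
print(X[0], y[0])        # [0 1 2] [3 4]
```

Each row of y holds two future values, so any of the multioutput strategies discussed in this tutorial could, in principle, be fit on X and y.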

There are a number of strategies for handling multioutput regression and we will explore some of them in this tutorial.

First, confirm that you have a modern version of the scikit-learn library installed.

This is important because some of the models we will explore in this tutorial require a modern version of the library.

You can check the version of the library with the following code example:

    # check scikit-learn version
    import sklearn
    print(sklearn.__version__)

Running the example will print the version of the library.

At the time of writing, this is about version 0.22. You need to be using this version of scikit-learn or higher.

0.22.1

We can define a test problem that we can use to demonstrate the different modeling strategies.

We will use the make_regression() function to create a test dataset for multiple-output regression. We will generate 1,000 examples with 10 input features, five of which will be informative and five of which will be uninformative. The problem will require the prediction of two numeric values.

- **Problem Input**: 10 numeric variables.
- **Problem Output**: 2 numeric variables.

The example below generates the dataset and summarizes the shape.

    # example of multioutput regression test problem
    from sklearn.datasets import make_regression
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # summarize dataset
    print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output elements of the dataset for modeling, confirming the chosen configuration.

(1000, 10) (1000, 2)

Next, let’s look at modeling this problem directly.

Some regression machine learning algorithms support multiple outputs directly.

This includes most of the popular machine learning algorithms implemented in the scikit-learn library, such as:

- LinearRegression (and related)
- KNeighborsRegressor
- DecisionTreeRegressor
- RandomForestRegressor (and related)

Let’s look at a few examples to make this concrete.

The example below fits a linear regression model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # linear regression for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearRegression()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-93.147146 23.26985013]

The example below fits a k-nearest neighbors model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # k-nearest neighbors for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.neighbors import KNeighborsRegressor
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = KNeighborsRegressor()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-109.74862659 0.38754079]

The example below fits a random forest model on the multioutput regression dataset, then makes a single prediction with the fit model.

    # random forest for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = RandomForestRegressor()
    # fit model
    model.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = model.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits the model and then makes a prediction for one input, confirming that the model predicted two required values.

[-76.79505796 27.16551641]

We may want to evaluate a multioutput regression using k-fold cross-validation.

This can be achieved in the same way as evaluating any other machine learning model.

We will fit and evaluate a *DecisionTreeRegressor* model on the test problem using 10-fold cross-validation with three repeats. We will use the mean absolute error (MAE) performance metric as the score.

The complete example is listed below.

    # evaluate multioutput regression model with k-fold cross-validation
    from numpy import absolute
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = DecisionTreeRegressor()
    # evaluate model
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
    # summarize performance
    n_scores = absolute(n_scores)
    print('Result: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the performance of the decision tree model for multioutput regression on the test problem. The mean and standard deviation of the MAE are reported, calculated across all folds and all repeats.

Importantly, error is reported across both output variables, rather than separate error scores for each output variable.

Result: 51.659 (3.455)
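If a separate error score for each output variable is wanted, one option is the multioutput argument of scikit-learn's mean_absolute_error() function. The sketch below uses a single train/test split instead of repeated k-fold cross-validation to keep it short; the split size and the random_state values are illustrative assumptions.

```python
# per-output MAE for a multioutput regression model (illustrative sketch)
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# same test problem as above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
# hold out part of the data for evaluation (split size is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# fit the model and predict both outputs
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)
yhat = model.predict(X_test)
# one MAE value per output variable
per_output = mean_absolute_error(y_test, yhat, multioutput='raw_values')
# default behavior: the unweighted average of the per-output errors
overall = mean_absolute_error(y_test, yhat)
print('MAE per output:', per_output)
print('MAE averaged over outputs: %.3f' % overall)
```

The averaged value is the mean of the per-output values, which is the kind of single score reported by the cross-validation example above.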

Not all regression algorithms support multioutput regression.

One example is the support vector machine, although for regression, it is referred to as support vector regression, or SVR.

This algorithm does not support multiple outputs for a regression problem and will raise an error. We can demonstrate this with an example, listed below.

    # failure of support vector regression for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    # fit model
    model.fit(X, y)

Running the example reports an error message indicating that the model does not support multioutput regression.

ValueError: bad input shape (1000, 2)

There are two workarounds that we can adopt in order to use an algorithm like SVR for multioutput regression.

They are to create a separate model for each output and to create a linear sequence of models, one for each output, where the output of each model is dependent upon the output of the previous models.

Thankfully, the scikit-learn library supports both of these cases. Let’s take a closer look at each.

We can create a separate model for each output of the problem.

This assumes that the outputs are independent of each other, which might not be a correct assumption. Nevertheless, this approach can provide surprisingly effective predictions on a range of problems and may be worth trying, at least as a performance baseline.

You never know. The outputs for your problem may, in fact, be mostly independent, if not completely independent, and this strategy can help you find out.

This approach is supported by the MultiOutputRegressor class that takes a regression model as an argument. It will then create one instance of the provided model for each output in the problem.

The example below demonstrates using the *MultiOutputRegressor* class with linear SVR for the test problem.

    # example of linear SVR with the MultiOutputRegressor wrapper for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.multioutput import MultiOutputRegressor
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    wrapper = MultiOutputRegressor(model)
    # fit model
    wrapper.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = wrapper.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example fits a separate LinearSVR for each of the outputs in the problem using the *MultiOutputRegressor* wrapper class.

This wrapper can then be used directly to make a prediction on new data, confirming that multiple outputs are supported.

[-93.147146 23.26985013]
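To see the "one model per output" behavior directly, the sketch below inspects the estimators_ attribute of the fitted wrapper and checks that each output column of the wrapper's prediction comes from the corresponding fitted model. The random_state and max_iter settings are assumptions added here for repeatability and are not part of the example above.

```python
# inspecting the per-output models inside MultiOutputRegressor (illustrative sketch)
from numpy import allclose
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import LinearSVR

# same test problem as above
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
# random_state and max_iter are illustrative settings, not from the tutorial
wrapper = MultiOutputRegressor(LinearSVR(random_state=1, max_iter=10000))
wrapper.fit(X, y)
# one fitted LinearSVR per output column
print(len(wrapper.estimators_))
# predictions from the wrapper and from each underlying model
yhat = wrapper.predict(X[:5])
first = wrapper.estimators_[0].predict(X[:5])
second = wrapper.estimators_[1].predict(X[:5])
print(allclose(yhat[:, 0], first), allclose(yhat[:, 1], second))
```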

Another approach to using single-output regression models for multioutput regression is to create a linear sequence of models.

The first model in the sequence uses the input and predicts one output; the second model uses the input and the output from the first model to make a prediction; the third model uses the input and output from the first two models to make a prediction, and so on.

This can be achieved using the RegressorChain class in the scikit-learn library.

The order of the models may be based on the order of the outputs in the dataset (the default) or specified via the “*order*” argument. For example, *order=[0,1]* would first predict the 0th output, then the 1st output, whereas an *order=[1,0]* would first predict the last output variable and then the first output variable in our test problem.

The example below uses the *RegressorChain* with the default output order to fit a linear SVR on the multioutput regression test problem.

    # example of fitting a chain of linear SVR for multioutput regression
    from sklearn.datasets import make_regression
    from sklearn.multioutput import RegressorChain
    from sklearn.svm import LinearSVR
    # create datasets
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, random_state=1)
    # define model
    model = LinearSVR()
    wrapper = RegressorChain(model)
    # fit model
    wrapper.fit(X, y)
    # make a prediction
    data_in = [[-2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
    yhat = wrapper.predict(data_in)
    # summarize prediction
    print(yhat[0])

Running the example first fits a linear SVR to predict the first output variable, then a second linear SVR to predict the second output variable using the input and the output of the first model. These models are fit on the entire dataset.

The fit chain of models is then used directly to make a prediction on a new test instance, predicting the required two output variables.

[-93.147146 23.26938475]
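The chaining idea can also be sketched by hand to show what each model in the sequence sees. The example below swaps in LinearRegression for LinearSVR so that the fit is deterministic, and adds noise to the test problem so the extra input column is not an exact linear function of the others; both are assumptions made for this illustration, not choices from the tutorial.

```python
# manual chain of two models compared with RegressorChain (illustrative sketch)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

# noise=5.0 is an illustrative addition to the tutorial's test problem
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, n_targets=2, noise=5.0, random_state=1)

# first model: inputs -> first output
model1 = LinearRegression().fit(X, y[:, 0])
# second model: inputs plus the first output -> second output
model2 = LinearRegression().fit(np.hstack([X, y[:, [0]]]), y[:, 1])

# at prediction time the first model's prediction feeds the second model
row = X[:1]
y0 = model1.predict(row)
y1 = model2.predict(np.hstack([row, y0.reshape(-1, 1)]))
manual = np.column_stack([y0, y1])

# the same chain built with the scikit-learn wrapper
chain = RegressorChain(LinearRegression(), order=[0, 1]).fit(X, y)
print(manual)
print(chain.predict(row))
```

Passing order=[1, 0] instead would fit the model for the second output first and feed its prediction to the model for the first output.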

This section provides more resources on the topic if you are looking to go deeper.

- Multiclass and multilabel algorithms, API.
- sklearn.datasets.make_regression API.
- sklearn.multioutput.MultiOutputRegressor API.
- sklearn.multioutput.RegressorChain API.

In this tutorial, you discovered how to develop machine learning models for multioutput regression.

Specifically, you learned:

- The problem of multioutput regression in machine learning.
- How to develop machine learning models that inherently support multiple-output regression.
- How to develop wrapper models that allow algorithms that do not inherently support multiple outputs to be used for multiple-output regression.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post 4 Distance Measures for Machine Learning appeared first on Machine Learning Mastery.

Distance measures play an important role in machine learning.

They provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.

Different distance measures must be chosen and used depending on the types of the data. As such, it is important to know how to implement and calculate a range of different popular distance measures and the intuitions for the resulting scores.

In this tutorial, you will discover distance measures in machine learning.

After completing this tutorial, you will know:

- The role and importance of distance measures in machine learning algorithms.
- How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
- How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Let’s get started.

This tutorial is divided into five parts; they are:

- Role of Distance Measures
- Hamming Distance
- Euclidean Distance
- Manhattan Distance (Taxicab or City Block)
- Minkowski Distance

Distance measures play an important role in machine learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

Perhaps the most likely way you will encounter distance measures is when you are using a specific machine learning algorithm that uses distance measures at its core. The most famous algorithm of this type is the k-nearest neighbors algorithm, or KNN for short.

In the KNN algorithm, a classification or regression prediction is made for new examples by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k examples in the training dataset with the smallest distance are then selected and a prediction is made by summarizing their outcomes (the mode of the class labels for classification or the mean of the real values for regression).
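The KNN procedure described above can be sketched in a few lines of plain Python for classification. The function names, the toy rows, and the choice of Euclidean distance with k=3 are illustrative assumptions rather than a reference implementation.

```python
# minimal k-nearest neighbors classification from scratch (illustrative sketch)
from collections import Counter
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two numeric rows."""
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

def knn_predict(train_rows, train_labels, new_row, k=3):
    """Predict a class label as the mode of the k nearest training labels."""
    # distance from the new row to every training row
    distances = [(euclidean_distance(row, new_row), label)
                 for row, label in zip(train_rows, train_labels)]
    # keep the k training rows with the smallest distance
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # most common label among the neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical training data: two well-separated groups
train_rows = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict(train_rows, train_labels, [1.1, 1.0]))
print(knn_predict(train_rows, train_labels, [5.0, 5.0]))
```

For regression, the last step would return the mean of the neighbors' target values instead of the most common label.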

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which use distance measures in a similar manner. Another popular instance-based algorithm that uses distance measures is the learning vector quantization, or LVQ, algorithm that may also be considered a type of neural network.

Related is the self-organizing map algorithm, or SOM, that also uses distance measures and can be used for supervised or unsupervised learning. Another unsupervised learning algorithm that uses distance measures at its core is the K-means clustering algorithm.

In instance-based learning the training examples are stored verbatim, and a distance function is used to determine which member of the training set is closest to an unknown test instance. Once the nearest training instance has been located, its class is predicted for the test instance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

A short list of some of the more popular machine learning algorithms that use distance measures at their core is as follows:

- K-Nearest Neighbors
- Learning Vector Quantization (LVQ)
- Self-Organizing Map (SOM)
- K-Means Clustering

Many kernel-based methods may also be considered distance-based algorithms. Perhaps the most widely known kernel method is the support vector machine algorithm, or SVM for short.

**Do you know more algorithms that use distance measures?**

Let me know in the comments below.

When calculating the distance between two examples or rows of data, it is possible that different data types are used for different columns of the examples. An example might have real values, boolean values, categorical values, and ordinal values. Different distance measures may be required for each that are summed together into a single distance score.

Numerical values may have different scales. This can greatly impact the calculation of distance measure and it is often a good practice to normalize or standardize numerical values prior to calculating the distance measure.
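The effect of differing scales can be seen with a small sketch. The rows below are hypothetical (age in years, income in dollars), and min-max normalization is just one of the rescaling options mentioned above; the helper assumes no column is constant.

```python
# how column scale affects Euclidean distance (illustrative sketch)
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]; assumes no column is constant."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

# hypothetical rows: [age in years, income in dollars]
rows = [[25, 50000], [60, 51000], [26, 90000]]

# raw distances are dominated by the income column
raw_01 = euclidean_distance(rows[0], rows[1])
raw_02 = euclidean_distance(rows[0], rows[2])
print(raw_01, raw_02)

# after normalization both columns contribute on the same scale
scaled = min_max_normalize(rows)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])
print(scaled_01, scaled_02)
```

On the raw values, the large age gap between the first two rows barely registers next to the income difference; after normalization it counts about as much as the income gap to the third row.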

Numerical error in regression problems may also be considered a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance measure that can be summed or averaged over all examples in a test set to give a total distance between the expected and predicted outcomes in the dataset. The calculation of the error, such as the mean squared error or mean absolute error, may resemble a standard distance measure.
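As a small worked sketch of this resemblance, the example below computes the mean absolute error and mean squared error for a pair of made-up expected and predicted vectors and shows the distance-like sums they are built from. The values are illustrative only.

```python
# regression error metrics viewed as distances (illustrative sketch)
expected = [10.0, 20.0, 15.0, 10.0, 5.0]
predicted = [12.0, 24.0, 18.0, 8.0, 7.0]

# sum of absolute differences, then averaged to give the MAE
sum_abs = sum(abs(e - p) for e, p in zip(expected, predicted))
mae = sum_abs / len(expected)

# sum of squared differences, then averaged to give the MSE
sum_sq = sum((e - p) ** 2 for e, p in zip(expected, predicted))
mse = sum_sq / len(expected)

print(sum_abs, mae)
print(sum_sq, mse)
```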

As we can see, distance measures play an important role in machine learning. Perhaps four of the most commonly used distance measures in machine learning are as follows:

- Hamming Distance
- Euclidean Distance
- Manhattan Distance
- Minkowski Distance

**What are some other distance measures you have used or heard of?**

Let me know in the comments below.

You need to know how to calculate each of these distance measures when implementing algorithms from scratch and the intuition for what is being calculated when using algorithms that make use of these distance measures.

Let’s take a closer look at each in turn.

Hamming distance calculates the distance between two binary vectors, also referred to as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns of data.

For example, if a column had the categories ‘*red*,’ ‘*green*,’ and ‘*blue*,’ you might one-hot encode each example as a bitstring with one bit for each category.

- red = [1, 0, 0]
- green = [0, 1, 0]
- blue = [0, 0, 1]

The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.

For a one-hot encoded string, it might make more sense to use the sum of the bit differences between the strings, which will always be 0 (same category) or 2 (different categories, since the two strings differ in exactly two bit positions).

- HammingDistance = sum for i to N abs(v1[i] – v2[i])
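The summed form can be checked on the one-hot encoded colors from above. Note that two different one-hot vectors differ in exactly two bit positions, so the sum is 2 for different categories and 0 for identical ones. The function name hamming_sum() is an illustrative choice.

```python
# summed Hamming distance between one-hot encoded categories (illustrative sketch)
def hamming_sum(a, b):
    """Count the bit positions where two bitstrings differ."""
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]

print(hamming_sum(red, red))     # 0
print(hamming_sum(red, green))   # 2
print(hamming_sum(green, blue))  # 2
```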

For bitstrings that may have many 1 bits, it is more common to calculate the average number of bit differences to give a Hamming distance score between 0 (identical) and 1 (all different).

- HammingDistance = (sum for i to N abs(v1[i] – v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings, listed below.

    # calculating hamming distance between bit strings

    # calculate hamming distance
    def hamming_distance(a, b):
        return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

    # define data
    row1 = [0, 0, 0, 0, 0, 1]
    row2 = [0, 0, 0, 0, 1, 0]
    # calculate distance
    dist = hamming_distance(row1, row2)
    print(dist)

Running the example reports the Hamming distance between the two bitstrings.

We can see that there are two differences between the strings, or 2 out of 6 bit positions different, which averaged (2/6) is about 1/3 or 0.333.

0.3333333333333333

We can also perform the same calculation using the hamming() function from SciPy. The complete example is listed below.

    # calculating hamming distance between bit strings
    from scipy.spatial.distance import hamming
    # define data
    row1 = [0, 0, 0, 0, 0, 1]
    row2 = [0, 0, 0, 0, 1, 0]
    # calculate distance
    dist = hamming(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

0.3333333333333333

Euclidean distance calculates the distance between two real-valued vectors.

You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large values will dominate the distance measure.

Although there are other possible choices, most instance-based learners use Euclidean distance.

— Page 135, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.

- EuclideanDistance = sqrt(sum for i to N (v1[i] – v2[i])^2)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the square root operation in an effort to speed up the calculation. The resulting scores are squared distances, so their magnitudes change, but their relative ordering is preserved, and they can still be used effectively within a machine learning algorithm for finding the most similar examples.

- SquaredEuclideanDistance = sum for i to N (v1[i] – v2[i])^2

This calculation is related to the L2 vector norm. Without the square root it is equivalent to the sum squared error, and with the square root it is equivalent to the root sum squared error.
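The sketch below illustrates why dropping the square root is safe when only the most similar examples are needed: ranking candidates by squared distance gives the same order as ranking by the true Euclidean distance. The query and candidate rows are made up for the illustration.

```python
# ranking by squared distance matches ranking by Euclidean distance (illustrative sketch)
from math import sqrt

def squared_euclidean(a, b):
    """Sum of squared differences, with no square root applied."""
    return sum((e1 - e2) ** 2 for e1, e2 in zip(a, b))

query = [10, 20, 15, 10, 5]
candidates = [
    [12, 24, 18, 8, 7],
    [10, 21, 15, 10, 5],
    [30, 20, 15, 10, 5],
]

# candidate indexes sorted from nearest to farthest under each score
by_squared = sorted(range(len(candidates)),
                    key=lambda i: squared_euclidean(query, candidates[i]))
by_true = sorted(range(len(candidates)),
                 key=lambda i: sqrt(squared_euclidean(query, candidates[i])))
print(by_squared, by_true)  # [1, 0, 2] [1, 0, 2]
```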

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued vectors, listed below.

    # calculating euclidean distance between vectors
    from math import sqrt

    # calculate euclidean distance
    def euclidean_distance(a, b):
        return sqrt(sum((e1-e2)**2 for e1, e2 in zip(a,b)))

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = euclidean_distance(row1, row2)
    print(dist)

Running the example reports the Euclidean distance between the two vectors.

6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example is listed below.

    # calculating euclidean distance between vectors
    from scipy.spatial.distance import euclidean
    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = euclidean(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

6.082762530298219

The Manhattan distance, also called the Taxicab distance or the City Block distance, calculates the distance between two real-valued vectors.

It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the shortest path that a taxicab would take between city blocks (coordinates on the grid).

It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an integer feature space.

Manhattan distance is calculated as the sum of the absolute differences between the two vectors.

- ManhattanDistance = sum for i to N abs(v1[i] – v2[i])

The Manhattan distance is related to the L1 vector norm and the sum absolute error and mean absolute error metric.
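The link to vector norms can be sketched with NumPy: the Manhattan distance between two vectors is the L1 norm of their difference, and the Euclidean distance is the L2 norm of the same difference. Using numpy.linalg.norm() here is an illustrative choice.

```python
# distances as vector norms of the difference (illustrative sketch)
import numpy as np

row1 = np.array([10, 20, 15, 10, 5])
row2 = np.array([12, 24, 18, 8, 7])
diff = row1 - row2

# L1 norm of the difference: sum of absolute differences
manhattan = np.linalg.norm(diff, ord=1)
# L2 norm of the difference: square root of the sum of squared differences
euclidean = np.linalg.norm(diff, ord=2)

print(manhattan)  # 13.0
print(euclidean)  # about 6.083
```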

We can demonstrate this with an example of calculating the Manhattan distance between two integer vectors, listed below.

    # calculating manhattan distance between vectors

    # calculate manhattan distance
    def manhattan_distance(a, b):
        return sum(abs(e1-e2) for e1, e2 in zip(a,b))

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = manhattan_distance(row1, row2)
    print(dist)

Running the example reports the Manhattan distance between the two vectors.

13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example is listed below.

    # calculating manhattan distance between vectors
    from scipy.spatial.distance import cityblock
    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance
    dist = cityblock(row1, row2)
    print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.

13

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the “*order*” or “*p*”, that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

- MinkowskiDistance = (sum for i to N (abs(v1[i] – v2[i]))^p)^(1/p)

Where “*p*” is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the same as the Euclidean distance.

- *p=1*: Manhattan distance.
- *p=2*: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.
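A short sketch can show how the score moves as *p* changes. The example below evaluates the formula above for a few orders on the vectors used in this tutorial; the values of *p* beyond 2 go slightly past what is discussed here and are included only to show that the distance keeps shrinking toward the largest single coordinate difference (known as the Chebyshev distance).

```python
# Minkowski distance for several values of p (illustrative sketch)
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# orders to try: Manhattan (1), an intermediate value (1.5), Euclidean (2), and larger
orders = [1, 1.5, 2, 3, 10]
distances = [minkowski_distance(row1, row2, p) for p in orders]
for p, dist in zip(orders, distances):
    print(p, dist)

# largest single coordinate difference, approached as p grows
largest_difference = max(abs(e1 - e2) for e1, e2 in zip(row1, row2))
print(largest_difference)  # 4
```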

It is common to use Minkowski distance when implementing a machine learning algorithm that uses distance measures as it gives control over the type of distance measure used for real-valued vectors via a hyperparameter “*p*” that can be tuned.
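As one way to treat *p* as a tunable hyperparameter, the sketch below grid searches the p argument of scikit-learn's KNeighborsClassifier, which uses the Minkowski metric. The synthetic dataset, the candidate values of *p*, and the other settings are illustrative assumptions.

```python
# tuning the Minkowski order p for k-nearest neighbors (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# small synthetic classification problem (hypothetical configuration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=1)
# KNN with the Minkowski metric; p selects the order
model = KNeighborsClassifier(n_neighbors=5, metric='minkowski')
# evaluate each candidate order with 5-fold cross-validation
grid = GridSearchCV(model, param_grid={'p': [1, 2, 3]}, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Which order performs best depends on the dataset, so the result on this toy problem should not be read as a general recommendation.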

We can demonstrate this calculation with an example of calculating the Minkowski distance between two real vectors, listed below.

    # calculating minkowski distance between vectors

    # calculate minkowski distance
    def minkowski_distance(a, b, p):
        return sum(abs(e1-e2)**p for e1, e2 in zip(a,b))**(1/p)

    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance (p=1)
    dist = minkowski_distance(row1, row2, 1)
    print(dist)
    # calculate distance (p=2)
    dist = minkowski_distance(row1, row2, 2)
    print(dist)

Running the example first calculates and prints the Minkowski distance with *p* set to 1 to give the Manhattan distance, then with *p* set to 2 to give the Euclidean distance, matching the values calculated on the same data from the previous sections.

    13.0
    6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The complete example is listed below.

    # calculating minkowski distance between vectors
    from scipy.spatial import minkowski_distance
    # define data
    row1 = [10, 20, 15, 10, 5]
    row2 = [12, 24, 18, 8, 7]
    # calculate distance (p=1)
    dist = minkowski_distance(row1, row2, 1)
    print(dist)
    # calculate distance (p=2)
    dist = minkowski_distance(row1, row2, 2)
    print(dist)

Running the example, we can see we get the same results, confirming our manual implementation.

    13.0
    6.082762530298219

This section provides more resources on the topic if you are looking to go deeper.

- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- Distance computations (scipy.spatial.distance)
- scipy.spatial.distance.hamming API.
- scipy.spatial.distance.euclidean API.
- scipy.spatial.distance.cityblock API.
- scipy.spatial.minkowski_distance API.

- Instance-based learning, Wikipedia.
- Hamming distance, Wikipedia.
- Euclidean distance, Wikipedia.
- Taxicab geometry, Wikipedia.
- Minkowski distance, Wikipedia.

In this tutorial, you discovered distance measures in machine learning.

Specifically, you learned:

- The role and importance of distance measures in machine learning algorithms.
- How to implement and calculate Hamming, Euclidean, and Manhattan distance measures.
- How to implement and calculate the Minkowski distance that generalizes the Euclidean and Manhattan distance measures.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post PyTorch Tutorial: How to Develop Deep Learning Models with Python appeared first on Machine Learning Mastery.

Predictive modeling with deep learning is a skill that modern developers need to know.

PyTorch is the premier open-source deep learning framework developed and maintained by Facebook.

At its core, PyTorch is a mathematical library that allows you to perform efficient computation and automatic differentiation on graph-based models. Achieving this directly is challenging, although thankfully, the modern PyTorch API provides classes and idioms that allow you to easily develop a suite of deep learning models.

In this tutorial, you will discover a step-by-step guide to developing deep learning models in PyTorch.

After completing this tutorial, you will know:

- The difference between Torch and PyTorch and how to install and confirm PyTorch is working.
- The five-step life-cycle of PyTorch models and how to define, fit, and evaluate models.
- How to develop PyTorch deep learning models for regression, classification, and predictive modeling tasks.

Let’s get started.

The focus of this tutorial is on using the PyTorch API for common deep learning model development tasks; we will not be diving into the math and theory of deep learning. For that, I recommend starting with this excellent book.

The best way to learn deep learning in Python is by doing. Dive in. You can circle back for more theory later.

I have designed each code example to use best practices and to be standalone so that you can copy and paste it directly into your project and adapt it to your specific needs. This will give you a massive head start over trying to figure out the API from official documentation alone.

It is a large tutorial, and as such, it is divided into three parts; they are:

- How to Install PyTorch
  - What Are Torch and PyTorch?
  - How to Install PyTorch
  - How to Confirm PyTorch Is Installed
- PyTorch Deep Learning Model Life-Cycle
  - Step 1: Prepare the Data
  - Step 2: Define the Model
  - Step 3: Train the Model
  - Step 4: Evaluate the Model
  - Step 5: Make Predictions
- How to Develop PyTorch Deep Learning Models
  - How to Develop an MLP for Binary Classification
  - How to Develop an MLP for Multiclass Classification
  - How to Develop an MLP for Regression
  - How to Develop a CNN for Image Classification

Work through this tutorial. It will take you 60 minutes, max!

**You do not need to understand everything (at least not right now)**. Your goal is to run through the tutorial end-to-end and get a result. Write down your questions as you go. Make heavy use of the API documentation to learn about all of the functions that you’re using.

**You do not need to know the math first**. Math is a compact way of describing how algorithms work, specifically tools from linear algebra, probability, and calculus. These are not the only tools that you can use to learn how algorithms work. You can also use code and explore algorithm behavior with different inputs and outputs. Knowing the math will not tell you what algorithm to choose or how to best configure it. You can only discover that through carefully controlled experiments.

**You do not need to know how the algorithms work**. It is important to know about the limitations and how to configure deep learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start by getting comfortable with the platform.

**You do not need to be a Python programmer**. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer; you know how to pick up the basics of a language really fast. Just get started and dive into the details later.

**You do not need to be a deep learning expert**. You can learn about the benefits and limitations of various algorithms later, and there are plenty of tutorials that you can read to brush up on the steps of a deep learning project.

In this section, you will discover what PyTorch is, how to install it, and how to confirm that it is installed correctly.

PyTorch is an open-source Python library for deep learning developed and maintained by Facebook.

The project started in 2016 and quickly became a popular framework among developers and researchers.

Torch (*Torch7*) is an open-source project for deep learning written in C and generally used via the Lua interface. It was a precursor project to PyTorch and is no longer actively developed. PyTorch includes “*Torch*” in the name, acknowledging the prior torch library with the “*Py*” prefix indicating the Python focus of the new project.

The PyTorch API is simple and flexible, making it a favorite for academics and researchers in the development of new deep learning models and applications. The extensive use has led to many extensions for specific applications (such as text, computer vision, and audio data), and many pre-trained models that can be used directly. As such, it may be the most popular library used by academics.

The flexibility of PyTorch comes at the cost of ease of use, especially for beginners, as compared to simpler interfaces like Keras. Choosing PyTorch over Keras means accepting a slightly steeper learning curve and more code in exchange for more flexibility, and perhaps a more vibrant academic community.

Before installing PyTorch, ensure that you have Python installed, such as Python 3.6 or higher.

If you don’t have Python installed, you can install it using Anaconda. This tutorial will show you how:

There are many ways to install the PyTorch open-source deep learning library.

The most common, and perhaps simplest, way to install PyTorch on your workstation is by using pip.

For example, on the command line, you can type:

sudo pip install torch

Perhaps the most popular application of deep learning is for computer vision, and the PyTorch computer vision package is called “torchvision.”

Installing torchvision is also highly recommended and it can be installed as follows:

sudo pip install torchvision

If you prefer to use an installation method more specific to your platform or package manager, you can see a complete list of installation instructions here:

There is no need to set up the GPU now.

All examples in this tutorial will work just fine on a modern CPU. If you want to configure PyTorch for your GPU, you can do that after completing this tutorial. Don’t get distracted!

Once PyTorch is installed, it is important to confirm that the library was installed successfully and that you can start using it.

Don’t skip this step.

If PyTorch is not installed correctly or raises an error on this step, you won’t be able to run the examples later.

Create a new file called *versions.py* and copy and paste the following code into the file.

```python
# check pytorch version
import torch
print(torch.__version__)
```

Save the file, then open your command line and change directory to where you saved the file.

Then type:

python versions.py

You should then see output like the following:

1.3.1

This confirms that PyTorch is installed correctly and that we are all using the same version.

This also shows you how to run a Python script from the command line. I recommend running all code from the command line in this manner, and not from a notebook or an IDE.

In this section, you will discover the life-cycle for a deep learning model and the PyTorch API that you can use to define models.

A model has a life-cycle, and this very simple knowledge provides the backbone for both modeling a dataset and understanding the PyTorch API.

The five steps in the life-cycle are as follows:

- 1. Prepare the Data.
- 2. Define the Model.
- 3. Train the Model.
- 4. Evaluate the Model.
- 5. Make Predictions.

Let’s take a closer look at each step in turn.

**Note**: There are many ways to achieve each of these steps using the PyTorch API, although I have aimed to show you the simplest, or most common, or most idiomatic.

If you discover a better approach, let me know in the comments below.

The first step is to load and prepare your data.

Neural network models require numerical input data and numerical output data.

You can use standard Python libraries to load and prepare tabular data, like CSV files. For example, Pandas can be used to load your CSV file, and tools from scikit-learn can be used to encode categorical data, such as class labels.

PyTorch provides the Dataset class that you can extend and customize to load your dataset.

For example, the constructor of your dataset object can load your data file (e.g. a CSV file). You can then override the *__len__()* function that can be used to get the length of the dataset (number of rows or samples), and the *__getitem__()* function that is used to get a specific sample by index.

When loading your dataset, you can also perform any required transforms, such as scaling or encoding.

A skeleton of a custom *Dataset* class is provided below.

```python
# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # store the inputs and outputs
        self.X = ...
        self.y = ...

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]
```
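
To make the skeleton concrete, here is a minimal sketch of the same idea backed by in-memory NumPy arrays instead of a CSV file. The class name `ArrayDataset` and the toy values are assumptions made for illustration; only `Dataset`, `__len__()`, and `__getitem__()` come from the description above.

```python
# a small in-memory dataset (hypothetical example, not from the tutorial)
import numpy as np
from torch.utils.data import Dataset

class ArrayDataset(Dataset):
    # store the inputs and outputs as float32 arrays
    def __init__(self, X, y):
        self.X = np.asarray(X, dtype='float32')
        self.y = np.asarray(y, dtype='float32').reshape((-1, 1))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# three rows with two input variables each
dataset = ArrayDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0, 1, 0])
n_rows = len(dataset)
first_inputs, first_target = dataset[0]
print(n_rows, first_inputs, first_target)
```

Indexing the dataset returns one (inputs, target) pair, which is the unit a *DataLoader* later groups into batches.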

Once loaded, PyTorch provides the DataLoader class to navigate a *Dataset* instance during the training and evaluation of your model.

A *DataLoader* instance can be created for the training dataset, test dataset, and even a validation dataset.

The random_split() function can be used to split a dataset into train and test sets. Once split, a selection of rows from the *Dataset* can be provided to a DataLoader, along with the batch size and whether the data should be shuffled every epoch.

For example, we can define a *DataLoader* by passing in a selected sample of rows in the dataset.

```python
...
# create the dataset
dataset = CSVDataset(...)
# select rows from the dataset (the list holds the number of rows in each split)
train, test = random_split(dataset, [train_size, test_size])
# create a data loader for train and test sets
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=1024, shuffle=False)
```

Once defined, a *DataLoader* can be enumerated, yielding one batch worth of samples each iteration.

```python
...
# train the model
for i, (inputs, targets) in enumerate(train_dl):
    ...
```
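
The splitting and batching steps can be tried end to end on random data. This is only a sketch: the 100-row `TensorDataset` and the 67/33 split sizes are assumptions chosen for illustration, while `random_split()` and `DataLoader` are used as described above.

```python
# split a toy dataset and enumerate it in batches (illustrative data)
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# 100 rows with 4 input variables and 1 target
X = torch.rand(100, 4)
y = torch.rand(100, 1)
dataset = TensorDataset(X, y)

# determine the size of each split
n_test = round(0.33 * len(dataset))
n_train = len(dataset) - n_test
train, test = random_split(dataset, [n_train, n_test])

# enumerate the training data in mini-batches of 32 rows
train_dl = DataLoader(train, batch_size=32, shuffle=True)
batch_shapes = [tuple(inputs.shape) for inputs, targets in train_dl]
print(len(train), len(test), batch_shapes)
```

The last batch is smaller than the others because 67 rows do not divide evenly by the batch size.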

The next step is to define a model.

The idiom for defining a model in PyTorch involves defining a class that extends the Module class.

The constructor of your class defines the layers of the model and the forward() function is the override that defines how to forward propagate input through the defined layers of the model.

Many layers are available, such as Linear for fully connected layers, Conv2d for convolutional layers, and MaxPool2d for pooling layers.

Activation functions can also be defined as layers, such as ReLU, Softmax, and Sigmoid.

Below is an example of a simple MLP model with one layer.

```python
# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        self.layer = Linear(n_inputs, 1)
        self.activation = Sigmoid()

    # forward propagate input
    def forward(self, X):
        X = self.layer(X)
        X = self.activation(X)
        return X
```
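
As a quick check of how such a model behaves, the sketch below instantiates the same one-layer design and pushes a random batch through it. The batch of 5 rows with 3 inputs is an arbitrary assumption for illustration.

```python
# run a forward pass through a one-layer MLP (random illustrative input)
import torch
from torch.nn import Linear, Module, Sigmoid

class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super().__init__()
        self.layer = Linear(n_inputs, 1)
        self.activation = Sigmoid()

    # forward propagate input
    def forward(self, X):
        return self.activation(self.layer(X))

model = MLP(3)
batch = torch.rand(5, 3)
# calling the model invokes forward() on the batch
output = model(batch)
print(output.shape)
```

Each input row produces one output value, and the sigmoid keeps that value between 0 and 1.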

The weights of a given layer can also be initialized after the layer is defined in the constructor.

Common examples include the Xavier and He weight initialization schemes. For example:

```python
...
xavier_uniform_(self.layer.weight)
```
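
To see what the initializer does, the following sketch applies it to a standalone layer and compares the weights against the Xavier uniform bound, sqrt(6 / (fan_in + fan_out)) for the default gain of 1. The layer sizes are assumptions picked so the bound works out to exactly 1.0.

```python
# inspect the range of Xavier uniform weights (illustrative layer sizes)
from math import sqrt
from torch.nn import Linear
from torch.nn.init import xavier_uniform_

# a layer with 4 inputs (fan_in) and 2 outputs (fan_out)
layer = Linear(4, 2)
xavier_uniform_(layer.weight)

# weights are sampled uniformly from [-bound, bound]
bound = sqrt(6.0 / (4 + 2))
largest = layer.weight.abs().max().item()
print('bound=%.3f, largest weight magnitude=%.3f' % (bound, largest))
```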

The training process requires that you define a loss function and an optimization algorithm.

Common loss functions include the following:

- BCELoss: Binary cross-entropy loss for binary classification.
- CrossEntropyLoss: Categorical cross-entropy loss for multi-class classification.
- MSELoss: Mean squared error loss for regression.
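
For intuition about what two of these losses measure, here is a plain-Python sketch of binary cross-entropy and mean squared error, each averaged over the samples. The helper names and toy values are hypothetical; in practice you would use the PyTorch classes listed above.

```python
# hand-rolled loss calculations for intuition (hypothetical helpers)
from math import log

# average binary cross-entropy between 0/1 labels and predicted probabilities
def binary_cross_entropy(y_true, y_prob):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total += -(t * log(p) + (1 - t) * log(1 - p))
    return total / len(y_true)

# average squared difference between targets and predictions
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

bce = binary_cross_entropy([1, 0], [0.5, 0.5])
mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
print('BCE: %.4f, MSE: %.4f' % (bce, mse))
```

A prediction of 0.5 for every row gives a cross-entropy of log(2), roughly 0.693, which is a useful baseline when judging a binary classifier's loss.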

For more on loss functions generally, see the tutorial:

Stochastic gradient descent is used for optimization, and the standard algorithm is provided by the SGD class, although other versions of the algorithm are available, such as Adam.

```python
# define the optimization
criterion = MSELoss()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
```
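
The effect of the `lr` and `momentum` arguments can be sketched without PyTorch. The loop below applies the usual momentum update (velocity = momentum * velocity + gradient, then weight = weight - lr * velocity) to a single weight. The quadratic objective and the larger learning rate of 0.1 are assumptions chosen so the example converges in a few hundred steps.

```python
# gradient descent with momentum on one weight (toy objective)

# gradient of the objective f(w) = (w - 3)^2, which is minimized at w = 3
def gradient(w):
    return 2.0 * (w - 3.0)

lr, momentum = 0.1, 0.9
w, velocity = 0.0, 0.0
for step in range(200):
    g = gradient(w)
    # accumulate a running direction from past gradients
    velocity = momentum * velocity + g
    # move the weight against the accumulated direction
    w = w - lr * velocity
print('w after 200 steps: %.4f' % w)
```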

Training the model involves enumerating the *DataLoader* for the training dataset.

First, a loop is required for the number of training epochs. Then an inner loop is required for the mini-batches for stochastic gradient descent.

```python
...
# enumerate epochs
for epoch in range(100):
    # enumerate mini batches
    for i, (inputs, targets) in enumerate(train_dl):
        ...
```

Each update to the model involves the same general pattern comprised of:

- Clearing the last error gradient.
- A forward pass of the input through the model.
- Calculating the loss for the model output.
- Backpropagating the error through the model.
- Updating the model in an effort to reduce loss.

For example:

```python
...
# clear the gradients
optimizer.zero_grad()
# compute the model output
yhat = model(inputs)
# calculate loss
loss = criterion(yhat, targets)
# credit assignment
loss.backward()
# update model weights
optimizer.step()
```
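
Put together, the five-part update can be run on a toy problem. This is a sketch under stated assumptions: the data follows the made-up line y = 2x + 1, the whole dataset is used as a single batch, and the learning rate of 0.1 is chosen for quick convergence rather than taken from the tutorial.

```python
# a complete training loop on synthetic data (illustrative setup)
import torch
from torch.nn import Linear, MSELoss
from torch.optim import SGD

torch.manual_seed(1)
# 20 points on the line y = 2x + 1
inputs = torch.linspace(-1, 1, 20).unsqueeze(1)
targets = 2 * inputs + 1

model = Linear(1, 1)
criterion = MSELoss()
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)

losses = []
for step in range(100):
    # clear the gradients
    optimizer.zero_grad()
    # compute the model output
    yhat = model(inputs)
    # calculate loss
    loss = criterion(yhat, targets)
    # credit assignment
    loss.backward()
    # update model weights
    optimizer.step()
    losses.append(loss.item())
print('first loss: %.4f, last loss: %.4f' % (losses[0], losses[-1]))
```

Recording the loss at each step is a simple way to confirm that the updates are actually reducing the error.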

Once the model is fit, it can be evaluated on the test dataset.

This can be achieved by using the *DataLoader* for the test dataset and collecting the predictions for the test set, then comparing the predictions to the expected values of the test set and calculating a performance metric.

```python
...
for i, (inputs, targets) in enumerate(test_dl):
    # evaluate the model on the test set
    yhat = model(inputs)
    ...
```
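
The collect-then-compare step can be sketched with NumPy alone. Here the two batches of predicted probabilities and expected labels are made-up stand-ins for what the loop over the test *DataLoader* would produce.

```python
# collect predictions per batch, then score them (made-up batches)
from numpy import array, vstack

# each pair holds predicted probabilities and expected labels for one batch
batches = [
    (array([[0.9], [0.2]]), array([[1.0], [0.0]])),
    (array([[0.6], [0.7]]), array([[0.0], [1.0]])),
]

predictions, actuals = [], []
for yhat, targets in batches:
    # round probabilities to class values 0 or 1
    predictions.append(yhat.round())
    actuals.append(targets)

# stack the per-batch arrays into single columns
predictions, actuals = vstack(predictions), vstack(actuals)
# fraction of rows where the predicted class matches the expected class
accuracy = float((predictions == actuals).mean())
print('Accuracy: %.3f' % accuracy)
```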

A fit model can be used to make a prediction on new data.

For example, you might have a single image or a single row of data and want to make a prediction.

This requires that you wrap the data in a PyTorch Tensor data structure.

A Tensor is just the PyTorch version of a NumPy array for holding data. It also allows you to perform the automatic differentiation tasks in the model graph, like calling *backward()* when training the model.

The prediction too will be a Tensor, although you can retrieve the NumPy array by detaching the Tensor from the automatic differentiation graph and calling the NumPy function.

```python
...
# convert row to data
row = Tensor([row])
# make prediction
yhat = model(row)
# retrieve numpy array
yhat = yhat.detach().numpy()
```
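
The round trip from a Python list to a Tensor and back to a NumPy array can be checked with a model whose answer is known in advance. In this sketch the "model" is a single `Linear` layer whose weights are set to 1 and bias to 0, an assumption made purely so the prediction equals the sum of the inputs.

```python
# predict for one row with a layer that sums its inputs (illustrative model)
from torch import Tensor, no_grad
from torch.nn import Linear

model = Linear(3, 1)
# fix the parameters so the output is predictable
with no_grad():
    model.weight.fill_(1.0)
    model.bias.fill_(0.0)

row = [1.0, 2.0, 3.0]
# wrap the row in a list so the Tensor has shape (1, 3): one sample, three inputs
yhat = model(Tensor([row]))
# detach from the differentiation graph and convert to a NumPy array
yhat = yhat.detach().numpy()
print(yhat)
```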

Now that we are familiar with the PyTorch API at a high-level and the model life-cycle, let’s look at how we can develop some standard deep learning models from scratch.

In this section, you will discover how to develop, evaluate, and make predictions with standard deep learning models, including Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN).

A Multilayer Perceptron model, or MLP for short, is a standard fully connected neural network model.

It is comprised of layers of nodes where each node is connected to all outputs from the previous layer and the output of each node is connected to all inputs for nodes in the next layer.

An MLP is a model with one or more fully connected layers. This model is appropriate for tabular data, that is data as it looks in a table or spreadsheet with one column for each variable and one row for each sample. There are three predictive modeling problems you may want to explore with an MLP; they are binary classification, multiclass classification, and regression.

Let’s fit a model on a real dataset for each of these cases.

**Note**: The models in this section are effective, but not optimized. See if you can improve their performance. Post your findings in the comments below.

We will use the Ionosphere binary (two class) classification dataset to demonstrate an MLP for binary classification.

This dataset involves predicting whether there is a structure in the atmosphere or not given radar returns.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

We will use a LabelEncoder to encode the string labels to integer values 0 and 1. The model will be fit on 67 percent of the data, and the remaining 33 percent will be used for evaluation, split using the random_split() function.
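
Both preparation steps are simple enough to sketch in plain Python. The label values and the row count of 351 below are assumptions used for illustration, and the sorted-index encoding only mimics what a label encoder does (classes are sorted, then numbered from 0).

```python
# sketch of label encoding and split sizes (illustrative values)

# map each distinct string label to an integer, in sorted order
labels = ['g', 'b', 'g', 'g', 'b']
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]

# work out how many rows go to the test and train sets
n_rows = 351
test_size = round(0.33 * n_rows)
train_size = n_rows - test_size
print(encoded, train_size, test_size)
```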

It is a good practice to use ‘*relu*‘ activation with a ‘*He Uniform*‘ weight initialization. This combination goes a long way to overcome the problem of vanishing gradients when training deep neural network models. For more on ReLU, see the tutorial:

The model predicts the probability of class 1 and uses the sigmoid activation function. The model is optimized using stochastic gradient descent and seeks to minimize the binary cross-entropy loss.

The complete example is listed below.

```python
# pytorch mlp for binary classification
from numpy import vstack
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import BCELoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path, header=None)
        # store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1]
        # ensure input data is floats
        self.X = self.X.astype('float32')
        # label encode target and ensure the values are floats
        self.y = LabelEncoder().fit_transform(self.y)
        self.y = self.y.astype('float32')
        self.y = self.y.reshape((len(self.y), 1))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer and output
        self.hidden3 = Linear(8, 1)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Sigmoid()

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer and output
        X = self.hidden3(X)
        X = self.act3(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = BCELoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(100):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # round to class values
        yhat = yhat.round()
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate accuracy
    acc = accuracy_score(actuals, predictions)
    return acc

# make a class prediction for one row of data
def predict(row, model):
    # convert row to data
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
    return yhat

# prepare the data
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(34)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
# make a single prediction (expect class=1)
row = [1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300]
yhat = predict(row, model)
print('Predicted: %.3f (class=%d)' % (yhat, yhat.round()))
```

Running the example first reports the number of rows in the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?**

**Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 95 percent and then predicted a probability of 0.998 that the one row of data belongs to class 1.

```
235 116
Accuracy: 0.948
Predicted: 0.998 (class=1)
```

We will use the Iris flowers multiclass classification dataset to demonstrate an MLP for multiclass classification.

This problem involves predicting the species of iris flower given measures of the flower.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

Given that it is a multiclass classification, the model must have one node for each class in the output layer and use the softmax activation function. The loss function is the cross entropy, which is appropriate for integer encoded class labels (e.g. 0 for one class, 1 for the next class, etc.).
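
To illustrate how softmax outputs, integer class labels, and cross-entropy fit together, here is a plain-Python sketch for a single sample. The raw scores and the label are made-up values, and the `softmax()` helper is hypothetical. Keep in mind that PyTorch's CrossEntropyLoss applies a log-softmax to its input internally, so this sketch shows the underlying calculation rather than a drop-in replacement.

```python
# softmax, class selection, and cross-entropy for one sample (toy values)
from math import exp, log

# convert raw scores into probabilities that sum to 1
def softmax(scores):
    # subtracting the maximum keeps the exponentials numerically stable
    shifted = [s - max(scores) for s in scores]
    exps = [exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# one raw score per class from the output layer
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
# the predicted class is the index of the largest probability
predicted = probs.index(max(probs))
# cross-entropy is the negative log probability assigned to the true class
label = 0
loss = -log(probs[label])
print(probs, predicted, loss)
```

The integer label is used directly as an index into the probabilities, which is why no one hot encoding of the target is needed.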

The complete example of fitting and evaluating an MLP on the iris flowers dataset is listed below.

```python
# pytorch mlp for multiclass classification
from numpy import vstack
from numpy import argmax
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from torch import Tensor
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch.nn import Linear
from torch.nn import ReLU
from torch.nn import Softmax
from torch.nn import Module
from torch.optim import SGD
from torch.nn import CrossEntropyLoss
from torch.nn.init import kaiming_uniform_
from torch.nn.init import xavier_uniform_

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path, header=None)
        # store the inputs and outputs
        self.X = df.values[:, :-1]
        self.y = df.values[:, -1]
        # ensure input data is floats
        self.X = self.X.astype('float32')
        # label encode target and ensure the values are floats
        self.y = LabelEncoder().fit_transform(self.y)

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
        self.act2 = ReLU()
        # third hidden layer and output
        self.hidden3 = Linear(8, 3)
        xavier_uniform_(self.hidden3.weight)
        self.act3 = Softmax(dim=1)

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # output layer
        X = self.hidden3(X)
        X = self.act3(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(500):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        # convert to class labels
        yhat = argmax(yhat, axis=1)
        # reshape for stacking
        actual = actual.reshape((len(actual), 1))
        yhat = yhat.reshape((len(yhat), 1))
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate accuracy
    acc = accuracy_score(actuals, predictions)
    return acc

# make a class prediction for one row of data
def predict(row, model):
    # convert row to data
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
    return yhat

# prepare the data
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(4)
# train the model
train_model(train_dl, model)
# evaluate the model
acc = evaluate_model(test_dl, model)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [5.1,3.5,1.4,0.2]
yhat = predict(row, model)
print('Predicted: %s (class=%d)' % (yhat, argmax(yhat)))
```

Running the example first reports the number of rows in the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 98 percent and then predicted the probability of the row of data belonging to each class, with class 0 having the highest probability.

```
100 50
Accuracy: 0.980
Predicted: [[9.5524162e-01 4.4516966e-02 2.4138369e-04]] (class=0)
```

We will use the Boston housing regression dataset to demonstrate an MLP for regression predictive modeling.

This problem involves predicting house value based on properties of the house and neighborhood.

The dataset will be downloaded automatically using Pandas, but you can learn more about it here.

This is a regression problem that involves predicting a single numeric value. As such, the output layer has a single node and uses the default or linear activation function (no activation function). The mean squared error (mse) loss is minimized when fitting the model.

Recall that this is regression, not classification; therefore, we cannot calculate classification accuracy. For more on this, see the tutorial:

The complete example of fitting and evaluating an MLP on the Boston housing dataset is listed below.

```python
# pytorch mlp for regression
from numpy import vstack
from numpy import sqrt
from pandas import read_csv
from sklearn.metrics import mean_squared_error
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split
from torch import Tensor
from torch.nn import Linear
from torch.nn import Sigmoid
from torch.nn import Module
from torch.optim import SGD
from torch.nn import MSELoss
from torch.nn.init import xavier_uniform_

# dataset definition
class CSVDataset(Dataset):
    # load the dataset
    def __init__(self, path):
        # load the csv file as a dataframe
        df = read_csv(path, header=None)
        # store the inputs and outputs
        self.X = df.values[:, :-1].astype('float32')
        self.y = df.values[:, -1].astype('float32')
        # ensure target has the right shape
        self.y = self.y.reshape((len(self.y), 1))

    # number of rows in the dataset
    def __len__(self):
        return len(self.X)

    # get a row at an index
    def __getitem__(self, idx):
        return [self.X[idx], self.y[idx]]

    # get indexes for train and test rows
    def get_splits(self, n_test=0.33):
        # determine sizes
        test_size = round(n_test * len(self.X))
        train_size = len(self.X) - test_size
        # calculate the split
        return random_split(self, [train_size, test_size])

# model definition
class MLP(Module):
    # define model elements
    def __init__(self, n_inputs):
        super(MLP, self).__init__()
        # input to first hidden layer
        self.hidden1 = Linear(n_inputs, 10)
        xavier_uniform_(self.hidden1.weight)
        self.act1 = Sigmoid()
        # second hidden layer
        self.hidden2 = Linear(10, 8)
        xavier_uniform_(self.hidden2.weight)
        self.act2 = Sigmoid()
        # third hidden layer and output
        self.hidden3 = Linear(8, 1)
        xavier_uniform_(self.hidden3.weight)

    # forward propagate input
    def forward(self, X):
        # input to first hidden layer
        X = self.hidden1(X)
        X = self.act1(X)
        # second hidden layer
        X = self.hidden2(X)
        X = self.act2(X)
        # third hidden layer and output
        X = self.hidden3(X)
        return X

# prepare the dataset
def prepare_data(path):
    # load the dataset
    dataset = CSVDataset(path)
    # calculate split
    train, test = dataset.get_splits()
    # prepare data loaders
    train_dl = DataLoader(train, batch_size=32, shuffle=True)
    test_dl = DataLoader(test, batch_size=1024, shuffle=False)
    return train_dl, test_dl

# train the model
def train_model(train_dl, model):
    # define the optimization
    criterion = MSELoss()
    optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
    # enumerate epochs
    for epoch in range(100):
        # enumerate mini batches
        for i, (inputs, targets) in enumerate(train_dl):
            # clear the gradients
            optimizer.zero_grad()
            # compute the model output
            yhat = model(inputs)
            # calculate loss
            loss = criterion(yhat, targets)
            # credit assignment
            loss.backward()
            # update model weights
            optimizer.step()

# evaluate the model
def evaluate_model(test_dl, model):
    predictions, actuals = list(), list()
    for i, (inputs, targets) in enumerate(test_dl):
        # evaluate the model on the test set
        yhat = model(inputs)
        # retrieve numpy array
        yhat = yhat.detach().numpy()
        actual = targets.numpy()
        actual = actual.reshape((len(actual), 1))
        # store
        predictions.append(yhat)
        actuals.append(actual)
    predictions, actuals = vstack(predictions), vstack(actuals)
    # calculate mse
    mse = mean_squared_error(actuals, predictions)
    return mse

# make a prediction for one row of data
def predict(row, model):
    # convert row to data
    row = Tensor([row])
    # make prediction
    yhat = model(row)
    # retrieve numpy array
    yhat = yhat.detach().numpy()
    return yhat

# prepare the data
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
train_dl, test_dl = prepare_data(path)
print(len(train_dl.dataset), len(test_dl.dataset))
# define the network
model = MLP(13)
# train the model
train_model(train_dl, model)
# evaluate the model
mse = evaluate_model(test_dl, model)
print('MSE: %.3f, RMSE: %.3f' % (mse, sqrt(mse)))
# make a single prediction
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
yhat = predict(row, model)
print('Predicted: %.3f' % yhat)
```

Running the example first reports the number of rows in the train and test datasets, then fits the model and evaluates it on the test dataset. Finally, a prediction is made for a single row of data.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved an MSE of about 83, which is an RMSE of about nine (units are thousands of dollars). A value of about 22 is then predicted for the single example.

```
339 167
MSE: 82.576, RMSE: 9.087
Predicted: 21.909
```
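
The relationship between the two reported scores is easy to verify by hand. The sketch below computes MSE and RMSE for three made-up house values (hypothetical numbers, in thousands of dollars) and then takes the square root of the MSE reported above.

```python
# relate MSE to RMSE (toy values plus the score reported above)
from math import sqrt

# three hypothetical actual and predicted house values
actuals = [24.0, 21.6, 34.7]
predictions = [22.0, 23.6, 30.7]

# mean of the squared errors
mse = sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)
# the square root puts the error back in the units of the target
rmse = sqrt(mse)
print('MSE: %.3f, RMSE: %.3f' % (mse, rmse))

# the same conversion applied to the MSE from the run above
reported_rmse = sqrt(82.576)
print('RMSE: %.3f' % reported_rmse)
```

Because RMSE is in the same units as the target, it is usually the easier of the two scores to interpret.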

Convolutional Neural Networks, or CNNs for short, are a type of network designed for image input.

They are composed of convolutional layers that extract features (called feature maps) and pooling layers that distill those features down to the most salient elements.
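
It can help to trace how these layers shrink an image. The sketch below does the arithmetic for an assumed configuration: a 28x28 input, two 3x3 convolutions with no padding and a stride of 1, each followed by 2x2 pooling with a stride of 2, and 32 feature maps at the end. The helper functions are hypothetical and only implement the standard output-size formula.

```python
# trace the spatial size of an image through conv and pooling layers

# output width (or height) of a convolution
def conv_output_size(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1

# output width (or height) of a pooling layer
def pool_output_size(size, kernel, stride):
    return (size - kernel) // stride + 1

size = 28
size = conv_output_size(size, kernel=3)            # 28 -> 26
size = pool_output_size(size, kernel=2, stride=2)  # 26 -> 13
size = conv_output_size(size, kernel=3)            # 13 -> 11
size = pool_output_size(size, kernel=2, stride=2)  # 11 -> 5

# number of values after flattening 32 feature maps of size 5x5
n_features = size * size * 32
print(size, n_features)
```

The flattened size matters because it must match the number of inputs of the first fully connected layer that follows the convolutional part of the model.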

CNNs are best suited to image classification tasks, although they can be used on a wide array of tasks that take images as input.

A popular image classification task is the MNIST handwritten digit classification. It involves tens of thousands of handwritten digits that must be classified as a number between 0 and 9.

The torchvision API provides a convenience function to download and load this dataset directly.

The example below loads the dataset and plots the first few images.

```python
# load mnist dataset in pytorch
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import Compose
from torchvision.transforms import ToTensor
from matplotlib import pyplot
# define location to save or load the dataset
path = '~/.torch/datasets/mnist'
# define the transforms to apply to the data
trans = Compose([ToTensor()])
# download and define the datasets
train = MNIST(path, train=True, download=True, transform=trans)
test = MNIST(path, train=False, download=True, transform=trans)
# define how to enumerate the datasets
train_dl = DataLoader(train, batch_size=32, shuffle=True)
test_dl = DataLoader(test, batch_size=32, shuffle=True)
# get one batch of images
i, (inputs, targets) = next(enumerate(train_dl))
# plot some images
for i in range(25):
    # define subplot
    pyplot.subplot(5, 5, i+1)
    # plot raw pixel data
    pyplot.imshow(inputs[i][0], cmap='gray')
# show the figure
pyplot.show()
```

Running the example downloads the MNIST dataset if needed and retrieves one batch of 32 training images. The default split contains 60,000 training images and 10,000 test images, each 28x28 pixels.

A plot is then created showing a grid of examples of handwritten images in the training dataset.

We can train a CNN model to classify the images in the MNIST dataset.

Note that the images are arrays of grayscale pixel data, so a channel dimension is needed before the images can be used as input to the model. The ToTensor transform handles this for us, returning each image as a tensor with the shape (1, 28, 28).

It is a good idea to scale the pixel values from the default range of 0-255 to have a zero mean and a standard deviation of 1. For more on scaling pixel values, see the tutorial:

The complete example of fitting and evaluating a CNN model on the MNIST dataset is listed below.

    # pytorch cnn for multiclass classification
    from numpy import vstack
    from numpy import argmax
    from sklearn.metrics import accuracy_score
    from torchvision.datasets import MNIST
    from torchvision.transforms import Compose
    from torchvision.transforms import ToTensor
    from torchvision.transforms import Normalize
    from torch.utils.data import DataLoader
    from torch.nn import Conv2d
    from torch.nn import MaxPool2d
    from torch.nn import Linear
    from torch.nn import ReLU
    from torch.nn import Softmax
    from torch.nn import Module
    from torch.optim import SGD
    from torch.nn import CrossEntropyLoss
    from torch.nn.init import kaiming_uniform_
    from torch.nn.init import xavier_uniform_
    # model definition
    class CNN(Module):
        # define model elements
        def __init__(self, n_channels):
            super(CNN, self).__init__()
            # input to first hidden layer
            self.hidden1 = Conv2d(n_channels, 32, (3,3))
            kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
            self.act1 = ReLU()
            # first pooling layer
            self.pool1 = MaxPool2d((2,2), stride=(2,2))
            # second hidden layer
            self.hidden2 = Conv2d(32, 32, (3,3))
            kaiming_uniform_(self.hidden2.weight, nonlinearity='relu')
            self.act2 = ReLU()
            # second pooling layer
            self.pool2 = MaxPool2d((2,2), stride=(2,2))
            # fully connected layer
            self.hidden3 = Linear(5*5*32, 100)
            kaiming_uniform_(self.hidden3.weight, nonlinearity='relu')
            self.act3 = ReLU()
            # output layer
            self.hidden4 = Linear(100, 10)
            xavier_uniform_(self.hidden4.weight)
            self.act4 = Softmax(dim=1)
        # forward propagate input
        def forward(self, X):
            # input to first hidden layer
            X = self.hidden1(X)
            X = self.act1(X)
            X = self.pool1(X)
            # second hidden layer
            X = self.hidden2(X)
            X = self.act2(X)
            X = self.pool2(X)
            # flatten: 5x5 feature maps with 32 channels, matching the input size of hidden3
            X = X.view(-1, 5*5*32)
            # third hidden layer
            X = self.hidden3(X)
            X = self.act3(X)
            # output layer
            X = self.hidden4(X)
            X = self.act4(X)
            return X
    # prepare the dataset
    def prepare_data(path):
        # define standardization
        trans = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
        # load dataset
        train = MNIST(path, train=True, download=True, transform=trans)
        test = MNIST(path, train=False, download=True, transform=trans)
        # prepare data loaders
        train_dl = DataLoader(train, batch_size=64, shuffle=True)
        test_dl = DataLoader(test, batch_size=1024, shuffle=False)
        return train_dl, test_dl
    # train the model
    def train_model(train_dl, model):
        # define the optimization
        # note: CrossEntropyLoss applies log-softmax internally and is normally given raw scores;
        # the Softmax output layer is kept here as in the original listing
        criterion = CrossEntropyLoss()
        optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
        # enumerate epochs
        for epoch in range(10):
            # enumerate mini batches
            for i, (inputs, targets) in enumerate(train_dl):
                # clear the gradients
                optimizer.zero_grad()
                # compute the model output
                yhat = model(inputs)
                # calculate loss
                loss = criterion(yhat, targets)
                # credit assignment
                loss.backward()
                # update model weights
                optimizer.step()
    # evaluate the model
    def evaluate_model(test_dl, model):
        predictions, actuals = list(), list()
        for i, (inputs, targets) in enumerate(test_dl):
            # evaluate the model on the test set
            yhat = model(inputs)
            # retrieve numpy array
            yhat = yhat.detach().numpy()
            actual = targets.numpy()
            # convert to class labels
            yhat = argmax(yhat, axis=1)
            # reshape for stacking
            actual = actual.reshape((len(actual), 1))
            yhat = yhat.reshape((len(yhat), 1))
            # store
            predictions.append(yhat)
            actuals.append(actual)
        predictions, actuals = vstack(predictions), vstack(actuals)
        # calculate accuracy
        acc = accuracy_score(actuals, predictions)
        return acc
    # prepare the data
    path = '~/.torch/datasets/mnist'
    train_dl, test_dl = prepare_data(path)
    print(len(train_dl.dataset), len(test_dl.dataset))
    # define the network
    model = CNN(1)
    # train the model
    train_model(train_dl, model)
    # evaluate the model
    acc = evaluate_model(test_dl, model)
    print('Accuracy: %.3f' % acc)

Running the example first reports the number of examples in the train and test datasets, then fits the model and evaluates it on the test dataset.

**What result did you get?
Can you change the model to do better?**

Post your findings to the comments below.

In this case, we can see that the model achieved a classification accuracy of about 98.5 percent on the test dataset.

    60000 10000
    Accuracy: 0.985
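The evaluation step in the listing above relies on argmax to turn each row of predicted probabilities into a class label. The sketch below shows that step in isolation on a small, made-up array of probabilities (three images, four classes); the values and expected labels are invented for illustration only.

```python
import numpy as np

# made-up softmax outputs for 3 images over 4 classes (each row sums to 1)
yhat = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.05, 0.05, 0.10, 0.80],
    [0.40, 0.30, 0.20, 0.10],
])
# made-up true class labels for the same 3 images
actual = np.array([1, 3, 2])

# argmax along axis 1 picks the index of the largest probability in each row
labels = np.argmax(yhat, axis=1)
# accuracy is the fraction of rows where the predicted label matches
accuracy = float(np.mean(labels == actual))
print(labels, accuracy)
```

Here the first two rows are classified correctly and the third is not, so the accuracy is two out of three.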

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications, 2018.
- Deep Learning with PyTorch, 2020.
- Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD, 2020.

- PyTorch Homepage.
- PyTorch Documentation.
- PyTorch Installation Guide.
- PyTorch, Wikipedia.
- PyTorch on GitHub.

In this tutorial, you discovered a step-by-step guide to developing deep learning models in PyTorch.

Specifically, you learned:

- The difference between Torch and PyTorch and how to install and confirm PyTorch is working.
- The five-step life-cycle of PyTorch models and how to define, fit, and evaluate models.
- How to develop PyTorch deep learning models for regression, classification, and predictive modeling tasks.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post PyTorch Tutorial: How to Develop Deep Learning Models with Python appeared first on Machine Learning Mastery.

The post Basic Data Cleaning for Machine Learning (That You Must Perform) appeared first on Machine Learning Mastery.

Data cleaning is a critically important step in any machine learning project.

In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform.

Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results.

In this tutorial, you will discover basic data cleaning you should always perform on your dataset.

After completing this tutorial, you will know:

- How to identify and remove column variables that only have a single value.
- How to identify and consider column variables with very few unique values.
- How to identify and remove rows that contain duplicate observations.

Let’s get started.

This tutorial is divided into five parts; they are:

- Identify Columns That Contain a Single Value
- Delete Columns That Contain a Single Value
- Consider Columns That Have Very Few Values
- Identify Rows that Contain Duplicate Data
- Delete Rows that Contain Duplicate Data

Columns that have a single observation or value are probably useless for modeling.

Here, a single value means that each row for that column has the same value. For example, the column *X1* has the value 1.0 for all rows in the dataset:

    X1
    1.0
    1.0
    1.0
    1.0
    1.0
    ...

Columns that have a single value for all rows do not contain any information for modeling.

Depending on the choice of data preparation and modeling algorithms, variables with a single value can also cause errors or unexpected results.

You can detect columns that have this property using the unique() NumPy function, which returns the unique values in an array. The number of unique values in each column is then the length of that result.

The example below loads the oil-spill classification dataset that contains 50 variables and summarizes the number of unique values for each column.

    # summarize the number of unique values for each column using numpy
    from urllib.request import urlopen
    from numpy import loadtxt
    from numpy import unique
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    data = loadtxt(urlopen(path), delimiter=',')
    # summarize the number of unique values in each column
    for i in range(data.shape[1]):
        print(i, len(unique(data[:, i])))

Running the example loads the dataset directly from the URL and prints the number of unique values for each column.

We can see that column index 22 only has a single value and should be removed.

    0 238
    1 297
    2 927
    3 933
    4 179
    5 375
    6 820
    7 618
    8 561
    9 57
    10 577
    11 59
    12 73
    13 107
    14 53
    15 91
    16 893
    17 810
    18 170
    19 53
    20 68
    21 9
    22 1
    23 92
    24 9
    25 8
    26 9
    27 308
    28 447
    29 392
    30 107
    31 42
    32 4
    33 45
    34 141
    35 110
    36 3
    37 758
    38 9
    39 9
    40 388
    41 220
    42 644
    43 649
    44 499
    45 2
    46 937
    47 169
    48 286
    49 2

A simpler approach is to use the nunique() Pandas function that does the hard work for you.

Below is the same example using the Pandas function.

    # summarize the number of unique values for each column using pandas
    from pandas import read_csv
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    df = read_csv(path, header=None)
    # summarize the number of unique values in each column
    print(df.nunique())

Running the example, we get the same result: the column index and the number of unique values for each column.

    0     238
    1     297
    2     927
    3     933
    4     179
    5     375
    6     820
    7     618
    8     561
    9      57
    10    577
    11     59
    12     73
    13    107
    14     53
    15     91
    16    893
    17    810
    18    170
    19     53
    20     68
    21      9
    22      1
    23     92
    24      9
    25      8
    26      9
    27    308
    28    447
    29    392
    30    107
    31     42
    32      4
    33     45
    34    141
    35    110
    36      3
    37    758
    38      9
    39      9
    40    388
    41    220
    42    644
    43    649
    44    499
    45      2
    46    937
    47    169
    48    286
    49      2
    dtype: int64

Variables or columns that have a single value should probably be removed from your dataset.

Columns are relatively easy to remove from a NumPy array or Pandas DataFrame.

One approach is to record all columns that have a single unique value, then delete them from the Pandas DataFrame by calling the drop() function.

The complete example is listed below.

    # delete columns with a single unique value
    from pandas import read_csv
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    df = read_csv(path, header=None)
    print(df.shape)
    # get number of unique values for each column
    counts = df.nunique()
    # record columns to delete
    to_del = [i for i,v in enumerate(counts) if v == 1]
    print(to_del)
    # drop useless columns
    df.drop(to_del, axis=1, inplace=True)
    print(df.shape)

Running the example first loads the dataset and reports the number of rows and columns.

The number of unique values for each column is calculated, and those columns that have a single unique value are identified. In this case, that is column index 22.

The identified columns are then removed from the DataFrame, and the number of rows and columns in the DataFrame are reported to confirm the change.

    (937, 50)
    [22]
    (937, 49)
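The same deletion can be sketched for a NumPy array. The example below uses a small made-up array rather than the oil-spill dataset, and `np.delete` is just one possible way to drop the columns; it is a sketch under those assumptions rather than the only approach.

```python
import numpy as np

# small made-up array: column 1 holds the same value in every row
data = np.array([
    [0.1, 1.0, 5.0],
    [0.4, 1.0, 6.0],
    [0.9, 1.0, 5.0],
])
# indices of columns with exactly one unique value
to_del = [i for i in range(data.shape[1]) if len(np.unique(data[:, i])) == 1]
# np.delete returns a new array without the listed columns
cleaned = np.delete(data, to_del, axis=1)
print(to_del, data.shape, cleaned.shape)
```

Note that, unlike the in-place drop() call on a DataFrame, this returns a new array and leaves the original unchanged.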

In the previous section, we saw that some columns in the example dataset had very few unique values.

For example, there were columns that only had 2, 4, and 9 unique values. This might make sense for ordinal or categorical variables. In this case, the dataset only contains numerical variables. As such, only having 2, 4, or 9 unique numerical values in a column might be surprising.

These columns may or may not contribute to the skill of a model.

Depending on the choice of data preparation and modeling algorithms, variables with very few numerical values can also cause errors or unexpected results. For example, I have seen them cause errors when using power transforms for data preparation and when fitting linear models that assume a “*sensible*” data probability distribution.

To help highlight columns of this type, you can calculate the number of unique values for each variable as a percentage of the total number of rows in the dataset.

Let’s do this manually using NumPy. The complete example is listed below.

    # summarize the percentage of unique values for each column using numpy
    from urllib.request import urlopen
    from numpy import loadtxt
    from numpy import unique
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    data = loadtxt(urlopen(path), delimiter=',')
    # summarize the number of unique values in each column
    for i in range(data.shape[1]):
        num = len(unique(data[:, i]))
        percentage = float(num) / data.shape[0] * 100
        print('%d, %d, %.1f%%' % (i, num, percentage))

Running the example reports the column index and the number of unique values for each column, followed by the percentage of unique values out of all rows in the dataset.

Here, we can see that some columns have a very low percentage of unique values, such as below 1 percent.

    0, 238, 25.4%
    1, 297, 31.7%
    2, 927, 98.9%
    3, 933, 99.6%
    4, 179, 19.1%
    5, 375, 40.0%
    6, 820, 87.5%
    7, 618, 66.0%
    8, 561, 59.9%
    9, 57, 6.1%
    10, 577, 61.6%
    11, 59, 6.3%
    12, 73, 7.8%
    13, 107, 11.4%
    14, 53, 5.7%
    15, 91, 9.7%
    16, 893, 95.3%
    17, 810, 86.4%
    18, 170, 18.1%
    19, 53, 5.7%
    20, 68, 7.3%
    21, 9, 1.0%
    22, 1, 0.1%
    23, 92, 9.8%
    24, 9, 1.0%
    25, 8, 0.9%
    26, 9, 1.0%
    27, 308, 32.9%
    28, 447, 47.7%
    29, 392, 41.8%
    30, 107, 11.4%
    31, 42, 4.5%
    32, 4, 0.4%
    33, 45, 4.8%
    34, 141, 15.0%
    35, 110, 11.7%
    36, 3, 0.3%
    37, 758, 80.9%
    38, 9, 1.0%
    39, 9, 1.0%
    40, 388, 41.4%
    41, 220, 23.5%
    42, 644, 68.7%
    43, 649, 69.3%
    44, 499, 53.3%
    45, 2, 0.2%
    46, 937, 100.0%
    47, 169, 18.0%
    48, 286, 30.5%
    49, 2, 0.2%
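The same percentage can also be calculated with Pandas in a single expression. The sketch below uses a small made-up DataFrame (not the oil-spill data) so that the arithmetic is easy to check by hand.

```python
import pandas as pd

# made-up data: 8 rows, columns with 8, 2, and 1 unique values
df = pd.DataFrame({
    0: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    1: [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    2: [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
})
# unique values per column as a percentage of the number of rows
percent_unique = df.nunique() / len(df) * 100
print(percent_unique)
```

The result is a Series indexed by column, so it can be filtered directly, for example with `percent_unique[percent_unique < 1]`.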

We can update the example to only summarize those variables that have unique values that are less than 1 percent of the number of rows.

    # summarize the percentage of unique values for each column using numpy
    from urllib.request import urlopen
    from numpy import loadtxt
    from numpy import unique
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    data = loadtxt(urlopen(path), delimiter=',')
    # summarize the number of unique values in each column
    for i in range(data.shape[1]):
        num = len(unique(data[:, i]))
        percentage = float(num) / data.shape[0] * 100
        if percentage < 1:
            print('%d, %d, %.1f%%' % (i, num, percentage))

Running the example, we can see that 11 of the 50 variables have a number of unique values that is less than 1 percent of the number of rows. Columns reported as 1.0% have 9 unique values across 937 rows, which is about 0.96 percent and only rounds up to 1.0 when printed.

This does not mean that these columns should be deleted, but they require further attention.

For example:

- Perhaps the unique values can be encoded as ordinal values?
- Perhaps the unique values can be encoded as categorical values?
- Perhaps compare model skill with each variable removed from the dataset?

    21, 9, 1.0%
    22, 1, 0.1%
    24, 9, 1.0%
    25, 8, 0.9%
    26, 9, 1.0%
    32, 4, 0.4%
    36, 3, 0.3%
    38, 9, 1.0%
    39, 9, 1.0%
    45, 2, 0.2%
    49, 2, 0.2%
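As one way to explore the encoding ideas listed above, a column with few distinct values can be converted to integer codes. The sketch below uses a made-up Pandas Series and the category dtype; whether ordinal codes suit a real column depends on what its values mean, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

# made-up column with only three distinct numerical values
col = pd.Series([0.5, 2.0, 0.5, 7.0, 2.0, 0.5])
# category codes number the distinct values in sorted order:
# 0.5 -> 0, 2.0 -> 1, 7.0 -> 2
codes = col.astype('category').cat.codes
print(codes.tolist())   # prints: [0, 1, 0, 2, 1, 0]
```

Because the codes follow the sorted order of the values, they preserve the ordering of the original numbers while discarding the spacing between them.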

For example, if we wanted to delete all 11 columns whose number of unique values is less than 1 percent of the number of rows, the example below demonstrates this.

    # delete columns where number of unique values is less than 1% of the rows
    from pandas import read_csv
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
    # load the dataset
    df = read_csv(path, header=None)
    print(df.shape)
    # get number of unique values for each column
    counts = df.nunique()
    # record columns to delete
    to_del = [i for i,v in enumerate(counts) if (float(v)/df.shape[0]*100) < 1]
    print(to_del)
    # drop useless columns
    df.drop(to_del, axis=1, inplace=True)
    print(df.shape)

Running the example first loads the dataset and reports the number of rows and columns.

The number of unique values for each column is calculated, and those columns that have a number of unique values less than 1 percent of the rows are identified. In this case, 11 columns.

The identified columns are then removed from the DataFrame, and the number of rows and columns in the DataFrame are reported to confirm the change.

    (937, 50)
    [21, 22, 24, 25, 26, 32, 36, 38, 39, 45, 49]
    (937, 39)

Rows that have identical data are probably useless, if not dangerously misleading during model evaluation.

Here, a duplicate row is a row where the value in every column matches the value in the same column of another row.

From a probabilistic perspective, you can think of duplicate data as adjusting the priors for a class label or data distribution. This may help an algorithm like Naive Bayes if you wish to purposefully bias the priors. Typically, this is not the case and machine learning algorithms will perform better by identifying and removing rows with duplicate data.

From an algorithm evaluation perspective, duplicate rows will result in misleading performance. For example, if you are using a train/test split or k-fold cross-validation, then it is possible for a duplicate row or rows to appear in both train and test datasets and any evaluation of the model on these rows will be (or should be) correct. This will result in an optimistically biased estimate of performance on unseen data.

If you think this is not the case for your dataset or chosen model, design a controlled experiment to test it. This could be achieved by evaluating model skill with the raw dataset and the dataset with duplicates removed and comparing performance. Another experiment might involve augmenting the dataset with different numbers of randomly selected duplicate examples.
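The leakage described above can be seen in a tiny, made-up example. The sketch below uses plain Python tuples as rows and a simple ordered split (the helper names `split` and `overlap` are hypothetical); a real experiment would use a shuffled split or cross-validation on your own data.

```python
# made-up rows; two of them are exact duplicates of earlier rows
rows = [
    (4.9, 3.1, 'a'),
    (5.8, 2.7, 'b'),
    (4.9, 3.1, 'a'),
    (6.0, 3.0, 'c'),
    (5.8, 2.7, 'b'),
    (5.1, 3.5, 'a'),
]

def split(data, n_train):
    """Simple ordered split: the first n_train rows train, the rest test."""
    return data[:n_train], data[n_train:]

def overlap(train, test):
    """Rows that appear in both the train and the test portion."""
    return set(train) & set(test)

# split the raw data: one duplicated row lands on both sides
train, test = split(rows, 4)
leaked = overlap(train, test)

# dict.fromkeys keeps the first occurrence of each row, in order
deduped = list(dict.fromkeys(rows))
train_d, test_d = split(deduped, 3)
leaked_d = overlap(train_d, test_d)
print(leaked, leaked_d)
```

With the raw rows, one test row is an exact copy of a training row; after removing duplicates, the two portions no longer share any rows.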

The pandas function duplicated() will report whether a given row is duplicated or not. All rows are marked as either False to indicate that it is not a duplicate or True to indicate that it is a duplicate. If there are duplicates, the first occurrence of the row is marked False (by default), as we might expect.

The example below checks for duplicates.

    # locate rows of duplicate data
    from pandas import read_csv
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
    # load the dataset
    df = read_csv(path, header=None)
    # calculate duplicates
    dups = df.duplicated()
    # report if there are any duplicates
    print(dups.any())
    # list all duplicate rows
    print(df[dups])

Running the example first loads the dataset, then calculates row duplicates.

First, the presence of any duplicate rows is reported, and in this case, we can see that there are duplicates (True).

Then all duplicate rows are reported. In this case, we can see that the three duplicate rows that were identified are printed.

    True
           0    1    2    3               4
    34   4.9  3.1  1.5  0.1     Iris-setosa
    37   4.9  3.1  1.5  0.1     Iris-setosa
    142  5.8  2.7  5.1  1.9  Iris-virginica
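Which copy of a duplicated row is marked depends on the `keep` argument of duplicated(). The sketch below shows the three settings on a small made-up DataFrame in which rows 0, 2, and 3 are identical.

```python
import pandas as pd

# made-up data: rows 0, 2, and 3 are identical
df = pd.DataFrame([
    [4.9, 3.1],
    [5.8, 2.7],
    [4.9, 3.1],
    [4.9, 3.1],
])
first = df.duplicated()              # default keep='first': first copy is False
last = df.duplicated(keep='last')    # last copy is False
every = df.duplicated(keep=False)    # every copy is marked True
print(first.tolist())   # prints: [False, False, True, True]
print(last.tolist())    # prints: [True, False, True, False]
print(every.tolist())   # prints: [True, False, True, True]
```

Using `keep=False` is handy when inspecting data, because it lists every member of each duplicate group rather than only the repeats.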

Rows of duplicate data should probably be deleted from your dataset prior to modeling.

There are many ways to achieve this, although Pandas provides the drop_duplicates() function that achieves exactly this.

The example below demonstrates deleting duplicate rows from a dataset.

    # delete rows of duplicate data from the dataset
    from pandas import read_csv
    # define the location of the dataset
    path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
    # load the dataset
    df = read_csv(path, header=None)
    print(df.shape)
    # delete duplicate rows
    df.drop_duplicates(inplace=True)
    print(df.shape)

Running the example first loads the dataset and reports the number of rows and columns.

Next, the rows of duplicated data are identified and removed from the DataFrame. Then the shape of the DataFrame is reported to confirm the change.

    (150, 5)
    (147, 5)

This section provides more resources on the topic if you are looking to go deeper.

- numpy.unique API.
- pandas.DataFrame.nunique API.
- pandas.DataFrame.drop API.
- pandas.DataFrame.duplicated API.
- pandas.DataFrame.drop_duplicates API.

In this tutorial, you discovered basic data cleaning you should always perform on your dataset.

Specifically, you learned:

- How to identify and remove column variables that only have a single value.
- How to identify and consider column variables with very few unique values.
- How to identify and remove rows that contain duplicate observations.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Neural Networks are Function Approximation Algorithms appeared first on Machine Learning Mastery.

Supervised learning in machine learning can be described in terms of function approximation.

Given a dataset comprised of inputs and outputs, we assume that there is an unknown underlying function that is consistent in mapping inputs to outputs in the target domain and resulted in the dataset. We then use supervised learning algorithms to approximate this function.

Neural networks are an example of a supervised machine learning algorithm that is perhaps best understood in the context of function approximation. This can be demonstrated with examples of neural networks approximating simple one-dimensional functions that aid in developing the intuition for what is being learned by the model.

In this tutorial, you will discover the intuition behind neural networks as function approximation algorithms.

After completing this tutorial, you will know:

- Training a neural network on data approximates the unknown underlying mapping function from inputs to outputs.
- One dimensional input and output datasets provide a useful basis for developing the intuitions for function approximation.
- How to develop and evaluate a small neural network for function approximation.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Function Approximation
- Definition of a Simple Function
- Approximating a Simple Function

Function approximation is a technique for estimating an unknown underlying function using historical or available observations from the domain.

Artificial neural networks learn to approximate a function.

In supervised learning, a dataset is comprised of inputs and outputs, and the supervised learning algorithm learns how to best map examples of inputs to examples of outputs.

We can think of this mapping as being governed by a mathematical function, called the **mapping function**, and it is this function that a supervised learning algorithm seeks to best approximate.

Neural networks are an example of a supervised learning algorithm and seek to approximate the function represented by your data. This is achieved by calculating the error between the predicted outputs and the expected outputs and minimizing this error during the training process.

It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

— Page 169, Deep Learning, 2016.

We say “*approximate*” because although we suspect such a mapping function exists, we don’t know anything about it.

The true function that maps inputs to outputs is unknown and is often referred to as the **target function**. It is the target of the learning process, the function we are trying to approximate using only the data that is available. If we knew the target function, we would not need to approximate it, i.e. we would not need a supervised machine learning algorithm. Therefore, function approximation is only a useful tool when the underlying target mapping function is unknown.

All we have are observations from the domain that contain examples of inputs and outputs. This implies things about the size and quality of the data; for example:

- The more examples we have, the more we might be able to figure out about the mapping function.
- The less noise we have in observations, the crisper the approximation we can make of the mapping function.
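The effect of noise can be illustrated with a small experiment. Note that the sketch below swaps in a least-squares quadratic fit (`numpy.polyfit`) in place of a neural network so that it runs quickly, and the mapping function, noise level, and random seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(-50, 51, dtype=float)
y_true = x ** 2                      # stands in for the "unknown" mapping function

def approximation_error(noise_std):
    """Fit a quadratic to noisy observations; return the MSE against the true function."""
    y_obs = y_true + noise_std * rng.standard_normal(x.shape)
    coefs = np.polyfit(x, y_obs, deg=2)
    y_fit = np.polyval(coefs, x)
    return float(np.mean((y_fit - y_true) ** 2))

clean_error = approximation_error(0.0)     # noise-free observations
noisy_error = approximation_error(200.0)   # heavily corrupted observations
print(clean_error, noisy_error)
```

With noise-free observations the fit recovers the function almost exactly, whereas the noisy observations lead to an approximation that deviates from the true function.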

So why do we like using neural networks for function approximation?

The reason is that they are a **universal approximator**. In theory, they can be used to approximate any function.

… the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any […] function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units

— Page 198, Deep Learning, 2016.

Regression predictive modeling involves predicting a numerical quantity given inputs. Classification predictive modeling involves predicting a class label given inputs.

Both of these predictive modeling problems can be seen as examples of function approximation.

To make this concrete, we can review a worked example.

In the next section, let’s define a simple function that we can later approximate.

We can define a simple function with one numerical input variable and one numerical output variable and use this as the basis for understanding neural networks for function approximation.

We can define a domain of numbers as our input, such as floating-point values from -50 to 50.

We can then select a mathematical operation to apply to the inputs to get the output values. The selected mathematical operation will be the mapping function, and because we are choosing it, we will know what it is. In practice, this is not the case and is the reason why we would use a supervised learning algorithm like a neural network to learn or discover the mapping function.

In this case, we will use the square of the input as the mapping function, defined as:

- y = x^2

Where *y* is the output variable and *x* is the input variable.

We can develop an intuition for this mapping function by enumerating the values in the range of our input variable and calculating the output value for each input and plotting the result.

The example below implements this in Python.

    # example of creating a univariate dataset with a given mapping function
    from matplotlib import pyplot
    # define the input data
    x = [i for i in range(-50,51)]
    # define the output data
    y = [i**2.0 for i in x]
    # plot the input versus the output
    pyplot.scatter(x,y)
    pyplot.title('Input (x) versus Output (y)')
    pyplot.xlabel('Input Variable (x)')
    pyplot.ylabel('Output Variable (y)')
    pyplot.show()

Running the example first creates a list of integer values across the entire input domain.

The output values are then calculated using the mapping function, then a plot is created with the input values on the x-axis and the output values on the y-axis.

The input and output variables represent our dataset.

Next, we can then pretend to forget that we know what the mapping function is and use a neural network to re-learn or re-discover the mapping function.

We can fit a neural network model on examples of inputs and outputs and see if the model can learn the mapping function.

This is a very simple mapping function, so we would expect a small neural network could learn it quickly.

We will define the network using the Keras deep learning library and use some data preparation tools from the scikit-learn library.

First, let’s define the dataset.

    ...
    # define the dataset
    x = asarray([i for i in range(-50,51)])
    y = asarray([i**2.0 for i in x])
    print(x.min(), x.max(), y.min(), y.max())

Next, we can reshape the data so that the input and output variables are columns with one observation per row, as is expected when using supervised learning models.

    ...
    # reshape arrays into rows and cols
    x = x.reshape((len(x), 1))
    y = y.reshape((len(y), 1))

Next, we will need to scale the inputs and the outputs.

The inputs will have a range between -50 and 50, whereas the outputs will have a range between 0 (0^2) and 2,500 (50^2, which is also (-50)^2). Large input and output values can make training neural networks unstable; therefore, it is a good idea to scale the data first.

We can use the MinMaxScaler to separately normalize the input values and the output values to values in the range between 0 and 1.

    ...
    # separately scale the input and output variables
    scale_x = MinMaxScaler()
    x = scale_x.fit_transform(x)
    scale_y = MinMaxScaler()
    y = scale_y.fit_transform(y)
    print(x.min(), x.max(), y.min(), y.max())
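To see what this normalization does, the scaling to the range 0 to 1 and its inverse can be sketched by hand with NumPy. The helper names below (`minmax_fit`, `minmax_transform`, `minmax_inverse`) are made up for illustration and only cover the default 0-to-1 range; in practice the MinMaxScaler class is the more convenient choice.

```python
import numpy as np

# same dataset as in the tutorial, already shaped as one column each
x = np.arange(-50, 51, dtype=float).reshape(-1, 1)
y = x ** 2

def minmax_fit(a):
    """Column-wise minimum and maximum of the data."""
    return a.min(axis=0), a.max(axis=0)

def minmax_transform(a, lo, hi):
    """Scale values so that lo maps to 0 and hi maps to 1."""
    return (a - lo) / (hi - lo)

def minmax_inverse(a_scaled, lo, hi):
    """Map scaled values back to the original units."""
    return a_scaled * (hi - lo) + lo

x_lo, x_hi = minmax_fit(x)
y_lo, y_hi = minmax_fit(y)
x_s = minmax_transform(x, x_lo, x_hi)
y_s = minmax_transform(y, y_lo, y_hi)
y_back = minmax_inverse(y_s, y_lo, y_hi)
print(x_s.min(), x_s.max(), y_s.min(), y_s.max())
```

The inverse step is what later allows the predictions to be reported in the original units of the target variable.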

We can now define a neural network model.

With some trial and error, I chose a model with two hidden layers and 10 nodes in each layer. Perhaps experiment with other configurations to see if you can do better.

    ...
    # design the neural network model
    model = Sequential()
    model.add(Dense(10, input_dim=1, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1))
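To get a feel for how small this model is, we can count its trainable parameters by hand. The sketch below assumes each node in a fully connected layer has one weight per input plus one bias, which is the standard arrangement for a dense layer.

```python
# (inputs, nodes) for each Dense layer in the model described above
layers = [(1, 10), (10, 10), (10, 1)]
# each node has one weight per input plus one bias
params = [n_in * n_out + n_out for n_in, n_out in layers]
print(params, sum(params))   # prints: [20, 110, 11] 141
```

With only 141 parameters, the model is tiny, which is part of why it trains quickly even for hundreds of epochs.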

We will fit the model using a mean squared error loss and use the efficient Adam version of stochastic gradient descent to optimize the model.

This means the model will seek to minimize the mean squared error between the predictions made and the expected output values (*y*) while it tries to approximate the mapping function.

    ...
    # define the loss function and optimization algorithm
    model.compile(loss='mse', optimizer='adam')
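For reference, the mean squared error being minimized can be written as follows. The symbols are chosen here for illustration: n is the number of examples, y_i is the expected output for example i, and \hat{y}_i is the model's prediction for that example.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Because the differences are squared, the result is in squared units of the target variable; taking the square root gives the root mean squared error (RMSE) in the original units.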

We don’t have a lot of data (e.g. about 100 rows), so we will fit the model for 500 epochs and use a small batch size of 10.

Again, these values were found after a little trial and error; try different values and see if you can do better.

    ...
    # fit the model on the training dataset
    model.fit(x, y, epochs=500, batch_size=10, verbose=0)

Once fit, we can evaluate the model.

We will make a prediction for each example in the dataset and calculate the error. A perfect approximation would have an error of 0.0. This is not possible in general because of noise in the observations, incomplete data, and the complexity of the unknown underlying mapping function.

In this case, it is possible because we have all observations, there is no noise in the data, and the underlying function is not complex.

First, we can make the prediction.

    ...
    # make predictions for the input data
    yhat = model.predict(x)

We then must invert the scaling that we performed.

This is so the error is reported in the original units of the target variable.

    ...
    # inverse transforms
    x_plot = scale_x.inverse_transform(x)
    y_plot = scale_y.inverse_transform(y)
    yhat_plot = scale_y.inverse_transform(yhat)

We can then calculate and report the prediction error in the original units of the target variable.

    ...
    # report model error
    print('MSE: %.3f' % mean_squared_error(y_plot, yhat_plot))

Finally, we can create a scatter plot of the real mapping of inputs to outputs and compare it to the mapping of inputs to the predicted outputs and see what the approximation of the mapping function looks like spatially.

This is helpful for developing the intuition behind what neural networks are learning.

    ...
    # plot x vs y
    pyplot.scatter(x_plot,y_plot, label='Actual')
    # plot x vs yhat
    pyplot.scatter(x_plot,yhat_plot, label='Predicted')
    pyplot.title('Input (x) versus Output (y)')
    pyplot.xlabel('Input Variable (x)')
    pyplot.ylabel('Output Variable (y)')
    pyplot.legend()
    pyplot.show()

Tying this together, the complete example is listed below.

    # example of fitting a neural net on x vs x^2
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import mean_squared_error
    from keras.models import Sequential
    from keras.layers import Dense
    from numpy import asarray
    from matplotlib import pyplot
    # define the dataset
    x = asarray([i for i in range(-50,51)])
    y = asarray([i**2.0 for i in x])
    print(x.min(), x.max(), y.min(), y.max())
    # reshape arrays into rows and cols
    x = x.reshape((len(x), 1))
    y = y.reshape((len(y), 1))
    # separately scale the input and output variables
    scale_x = MinMaxScaler()
    x = scale_x.fit_transform(x)
    scale_y = MinMaxScaler()
    y = scale_y.fit_transform(y)
    print(x.min(), x.max(), y.min(), y.max())
    # design the neural network model
    model = Sequential()
    model.add(Dense(10, input_dim=1, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1))
    # define the loss function and optimization algorithm
    model.compile(loss='mse', optimizer='adam')
    # fit the model on the training dataset
    model.fit(x, y, epochs=500, batch_size=10, verbose=0)
    # make predictions for the input data
    yhat = model.predict(x)
    # inverse transforms
    x_plot = scale_x.inverse_transform(x)
    y_plot = scale_y.inverse_transform(y)
    yhat_plot = scale_y.inverse_transform(yhat)
    # report model error
    print('MSE: %.3f' % mean_squared_error(y_plot, yhat_plot))
    # plot x vs y
    pyplot.scatter(x_plot,y_plot, label='Actual')
    # plot x vs yhat
    pyplot.scatter(x_plot,yhat_plot, label='Predicted')
    pyplot.title('Input (x) versus Output (y)')
    pyplot.xlabel('Input Variable (x)')
    pyplot.ylabel('Output Variable (y)')
    pyplot.legend()
    pyplot.show()

Running the example first reports the range of values for the input and output variables, then the range of the same variables after scaling. This confirms that the scaling operation was performed as we expected.
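To make the scaling step concrete, here is a minimal pure-Python sketch of min-max scaling and its inverse. The helper names (minmax_fit, minmax_transform, minmax_inverse) are made up for illustration and are not part of scikit-learn; the sketch only assumes the usual min-max formula, (value - min) / (max - min).

```python
# Minimal sketch of min-max scaling and its inverse (illustrative helpers,
# not the scikit-learn MinMaxScaler implementation).
def minmax_fit(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)

def minmax_transform(values, lo, hi):
    """Map each value from the range [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def minmax_inverse(scaled, lo, hi):
    """Map each value from [0, 1] back onto the range [lo, hi]."""
    return [s * (hi - lo) + lo for s in scaled]

# same data as the tutorial: x in [-50, 50] and y = x^2
x = [float(i) for i in range(-50, 51)]
y = [i ** 2.0 for i in x]

x_lo, x_hi = minmax_fit(x)
y_lo, y_hi = minmax_fit(y)
x_scaled = minmax_transform(x, x_lo, x_hi)
y_scaled = minmax_transform(y, y_lo, y_hi)

# ranges before and after scaling
print(min(x), max(x), min(y), max(y))
print(min(x_scaled), max(x_scaled), min(y_scaled), max(y_scaled))

# the inverse transform recovers the original units
x_back = minmax_inverse(x_scaled, x_lo, x_hi)
```

The inverse function matters later in the example: predictions are made in the scaled space and must be mapped back before the error is reported in the original units.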

The model is then fit and evaluated on the dataset.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the mean squared error is about 1,300, in squared units. If we calculate the square root, this gives us the root mean squared error (RMSE) in the original units. We can see that the average error is about 36 units, which is fine, but not great.
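The conversion from MSE to RMSE is a single square root. As a quick check, using the MSE value from the run reported in this tutorial (your own value will differ):

```python
from math import sqrt

# MSE reported by the example run in this tutorial (squared units)
mse = 1300.776
# RMSE is the square root of the MSE, in the original units of y
rmse = sqrt(mse)
print('RMSE: %.3f' % rmse)
```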

**What results did you get?** Can you do better?

Let me know in the comments below.

-50 50 0.0 2500.0
0.0 1.0 0.0 1.0
MSE: 1300.776

A scatter plot is then created comparing the inputs versus the real outputs, and the inputs versus the predicted outputs.

The difference between these two data series is the error in the approximation of the mapping function. We can see that the approximation is reasonable; it captures the general shape. We can see that there are errors, especially around the 0 input values.

This suggests that there is plenty of room for improvement, such as using a different activation function or different network architecture to better approximate the mapping function.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.

In this tutorial, you discovered the intuition behind neural networks as function approximation algorithms.

Specifically, you learned:

- Training a neural network on data approximates the unknown underlying mapping function from inputs to outputs.
- One dimensional input and output datasets provide a useful basis for developing the intuitions for function approximation.
- How to develop and evaluate a small neural network for function approximation.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Neural Networks are Function Approximation Algorithms appeared first on Machine Learning Mastery.

The post Imbalanced Multiclass Classification with the E.coli Dataset appeared first on Machine Learning Mastery.

Multiclass classification problems are those where a label must be predicted, but there are more than two labels that may be predicted.

These are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. They are made more challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes with very few examples of other classes.

Problems of this type are referred to as imbalanced multiclass classification problems and they require both the careful design of an evaluation metric and test harness and choice of machine learning models. The E.coli protein localization sites dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass E.coli dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.


Let’s get started.

This tutorial is divided into five parts; they are:

- E.coli Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
    - Evaluate Machine Learning Algorithms
    - Evaluate Data Oversampling
- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*E.coli*” dataset, also referred to as the “*protein localization sites*” dataset.

The dataset describes the problem of classifying E.coli proteins by their cell localization sites using their amino acid sequences. That is, predicting how a protein will bind to a cell based on the chemical composition of the protein before it is folded.

The dataset is credited to Kenta Nakai and was developed into its current form by Paul Horton and Kenta Nakai in their 1996 paper titled “A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins.” In it, they achieved a classification accuracy of 81 percent.

336 E.coli proteins were classified into 8 classes with an accuracy of 81% …

— A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins, 1996.

The dataset comprises 336 examples of E.coli proteins, and each example is described using seven input variables calculated from the protein's amino acid sequence.

Ignoring the sequence name, the input features are described as follows:

- **mcg**: McGeoch’s method for signal sequence recognition.
- **gvh**: von Heijne’s method for signal sequence recognition.
- **lip**: von Heijne’s Signal Peptidase II consensus sequence score.
- **chg**: Presence of charge on N-terminus of predicted lipoproteins.
- **aac**: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
- **alm1**: score of the ALOM membrane-spanning region prediction program.
- **alm2**: score of ALOM program after excluding putative cleavable signal regions from the sequence.

There are eight classes described as follows:

- **cp**: cytoplasm
- **im**: inner membrane without signal sequence
- **pp**: periplasm
- **imU**: inner membrane, non cleavable signal sequence
- **om**: outer membrane
- **omL**: outer membrane lipoprotein
- **imL**: inner membrane lipoprotein
- **imS**: inner membrane, cleavable signal sequence

The distribution of examples across the classes is not equal and in some cases severely imbalanced.

For example, the “*cp*” class has 143 examples, whereas the “*imL*” and “*imS*” classes have just two examples each.

Next, let’s take a closer look at the data.


First, download and unzip the dataset and save it in your current working directory with the name “*ecoli.csv*“.

Note that this version of the dataset has the first column (sequence name) removed as it does not contain generalizable information for modeling.

Review the contents of the file.

The first few lines of the file should look as follows:

0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp
0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
...

We can see that the input variables all appear numeric, and the class labels are string values that will need to be label encoded prior to modeling.
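As a rough illustration of what label encoding does, the sketch below maps the eight class names to integers in sorted order, which is also the order scikit-learn's LabelEncoder uses for its classes. This is a pure-Python stand-in written for this explanation, not the library code, and the variable names are made up.

```python
# the eight class names in the E.coli dataset
labels = ['cp', 'im', 'pp', 'imU', 'om', 'omL', 'imL', 'imS']

# sort the unique labels, then number them from zero
classes = sorted(set(labels))
to_int = {name: i for i, name in enumerate(classes)}

# encode strings as integers, then decode them back again
encoded = [to_int[name] for name in labels]
decoded = [classes[i] for i in encoded]

for name in classes:
    print('%s -> %d' % (name, to_int[name]))
```

Because the ordering is alphabetical rather than by frequency, the integer assigned to a class says nothing about how common it is; decoding is just an index lookup into the sorted class list.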

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location of the file and the fact that there is no header line.

...
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

Next, we can calculate a five-number summary for each input variable.

...
# describe the dataset
# use the full option name; the short 'precision' is ambiguous in newer pandas
set_option('display.precision', 3)
print(dataframe.describe())

Finally, we can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from pandas import set_option
from collections import Counter
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# describe the dataset
# use the full option name; the short 'precision' is ambiguous in newer pandas
set_option('display.precision', 3)
print(dataframe.describe())
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns, which are 336 rows and 7 input variables and 1 target variable.

Reviewing the summary of each variable, it appears that the variables have been centered, that is, shifted to have a mean of 0.5. It also appears that the variables have been normalized, meaning all values are in the range between about 0 and 1; at least no variables have values outside this range.

The class distribution is then summarized, confirming the severe skew in the observations for each class. We can see that the “*cp*” class is dominant with about 42 percent of the examples, and minority classes such as “*imS*”, “*imL*”, and “*omL*” each have about 1.5 percent or less of the dataset.

There may not be sufficient data to generalize from these minority classes. One approach might be to simply remove the examples with these classes.

(336, 8)
             0        1        2        3        4        5        6
count  336.000  336.000  336.000  336.000  336.000  336.000  336.000
mean     0.500    0.500    0.495    0.501    0.500    0.500    0.500
std      0.195    0.148    0.088    0.027    0.122    0.216    0.209
min      0.000    0.160    0.480    0.500    0.000    0.030    0.000
25%      0.340    0.400    0.480    0.500    0.420    0.330    0.350
50%      0.500    0.470    0.480    0.500    0.495    0.455    0.430
75%      0.662    0.570    0.480    0.500    0.570    0.710    0.710
max      0.890    1.000    1.000    1.000    0.880    1.000    0.990
Class=cp, Count=143, Percentage=42.560%
Class=im, Count=77, Percentage=22.917%
Class=imS, Count=2, Percentage=0.595%
Class=imL, Count=2, Percentage=0.595%
Class=imU, Count=35, Percentage=10.417%
Class=om, Count=20, Percentage=5.952%
Class=omL, Count=5, Percentage=1.488%
Class=pp, Count=52, Percentage=15.476%

We can also take a look at the distribution of the input variables by creating a histogram for each.

The complete example of creating histograms of all input variables is listed below.

# create histograms of all variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# create a histogram plot of each variable
df.hist(bins=25)
# show the plot
pyplot.show()

We can see that variables such as 0, 5, and 6 may have a multi-modal distribution. The variables 2 and 3 may have a binary distribution and variables 1 and 4 may have a Gaussian-like distribution.

Depending on the choice of model, the dataset may benefit from standardization, normalization, and perhaps a power transform.

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use *k=5*, meaning each fold will contain about 336/5 or about 67 examples.

Stratified means that each fold will aim to contain the same mixture of examples by class as the entire training dataset. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 5 * 3, or 15, times and the mean and standard deviation of these runs will be reported.
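The arithmetic behind these numbers can be sketched in a few lines. The fold sizes below assume the remainder is spread over the first folds, which is a common convention; stratification may shift individual folds by an example or two, so treat the sizes as approximate.

```python
n_examples = 336
n_splits = 5
n_repeats = 3

# split the examples as evenly as possible across the folds
base, extra = divmod(n_examples, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]

# each repeat fits and evaluates one model per fold
n_fits = n_splits * n_repeats

print('Approximate fold sizes:', fold_sizes)
print('Model fits per algorithm:', n_fits)
```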

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

All classes are equally important. As such, in this case, we will use classification accuracy to evaluate models.

First, we can define a function to load the dataset, split the columns into input and output variables, and use a label encoder to ensure class labels are numbered sequentially.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

We can define a function to evaluate a candidate model using stratified repeated 5-fold cross-validation, then return a list of scores calculated on the model for each fold and repeat.

The *evaluate_model()* function below implements this.

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

We can then call the *load_dataset()* function to load and confirm the E.coli dataset.

...
# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))

In this case, we will evaluate the baseline strategy of predicting the majority class in all cases.

This can be implemented automatically using the DummyClassifier class and setting the “*strategy*” to “*most_frequent*” that will predict the most common class (e.g. class ‘*cp*‘) in the training dataset. As such, we would expect this model to achieve a classification accuracy of about 42 percent given this is the distribution of the most common class in the training dataset.

...
# define the reference model
model = DummyClassifier(strategy='most_frequent')

We can then evaluate the model by calling our *evaluate_model()* function and report the mean and standard deviation of the results.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this all together, the complete example of evaluating the baseline model on the E.coli dataset using classification accuracy is listed below.

# baseline model and test harness for the ecoli dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and reports the number of cases correctly as 336 and the distribution of class labels as we expect.

The *DummyClassifier* with the *most_frequent* strategy is then evaluated using repeated stratified k-fold cross-validation, and the mean classification accuracy is reported as about 42.6 percent, with a standard deviation of about 0.6 percent.

(336, 7) (336,) Counter({0: 143, 1: 77, 7: 52, 4: 35, 5: 20, 6: 5, 3: 2, 2: 2})
Mean Accuracy: 0.426 (0.006)

Warnings are reported during the evaluation of the model; for example:

Warning: The least populated class in y has only 2 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.

This is because some of the classes do not have a sufficient number of examples for the 5-fold cross-validation, e.g. classes “*imS*” and “*imL*“.
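The warning can be reproduced with simple arithmetic. The sketch below uses the class counts reported earlier and lists the classes that have fewer members than the number of folds; it is only an illustration of the idea, not scikit-learn's internal check.

```python
# class counts as reported for the full E.coli dataset
counts = {'cp': 143, 'im': 77, 'pp': 52, 'imU': 35,
          'om': 20, 'omL': 5, 'imL': 2, 'imS': 2}
n_splits = 5

# a class needs at least n_splits members to place one example in every fold
too_small = sorted(name for name, n in counts.items() if n < n_splits)

for name, n in counts.items():
    print('%s: %d examples, %.1f per fold on average' % (name, n, n / n_splits))
print('Fewer members than folds:', too_small)
```

With only two members, a class can appear in at most two of the five test folds, so most folds never test it at all.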

In this case, we will remove these examples from the dataset. This can be achieved by updating the *load_dataset()* to remove those rows with these classes, e.g. four rows.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    df = read_csv(full_path, header=None)
    # remove rows for the minority classes
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    # retrieve numpy array
    data = df.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

We can then re-run the example to establish a baseline in classification accuracy.

The complete example is listed below.

# baseline model and test harness for the ecoli dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    df = read_csv(full_path, header=None)
    # remove rows for the minority classes
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    # retrieve numpy array
    data = df.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example confirms that the number of examples was reduced by four, from 336 to 332.

We can also see that the number of classes was reduced from eight to six (class 0 through to class 5).

The baseline in performance was established at 43.1 percent. This score provides a baseline on this dataset by which all other classification algorithms can be compared. Achieving a score above about 43.1 percent indicates that a model has skill on this dataset, and a score at or below this value indicates that the model does not have skill on this dataset.
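The baseline values follow directly from the class counts: a model that always predicts the majority class is correct exactly as often as that class occurs. A small sketch (the helper name majority_baseline is made up for illustration):

```python
def majority_baseline(counts):
    """Accuracy of always predicting the most common class."""
    return max(counts.values()) / sum(counts.values())

# class counts for the full dataset
all_counts = {'cp': 143, 'im': 77, 'pp': 52, 'imU': 35,
              'om': 20, 'omL': 5, 'imL': 2, 'imS': 2}
# class counts after removing the two smallest classes
kept_counts = {k: v for k, v in all_counts.items() if k not in ('imS', 'imL')}

before = majority_baseline(all_counts)
after = majority_baseline(kept_counts)
print('Baseline with all classes: %.3f' % before)
print('Baseline after removal: %.3f' % after)
```

The small rise in the baseline comes only from the smaller denominator (332 instead of 336 examples), not from any change in the model.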

(332, 7) (332,) Counter({0: 143, 1: 77, 5: 52, 2: 35, 3: 20, 4: 5})
Mean Accuracy: 0.431 (0.005)

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better classification accuracy using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the E.coli dataset:

- Linear Discriminant Analysis (LDA)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

# define models to test
def get_models():
    models, names = list(), list()
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # SVM
    models.append(LinearSVC())
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # ET
    models.append(ExtraTreesClassifier(n_estimators=1000))
    names.append('ET')
    return models, names

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the E.coli dataset is listed below.

# spot check machine learning algorithms on the ecoli dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    df = read_csv(full_path, header=None)
    # remove rows for the minority classes
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    # retrieve numpy array
    data = df.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # SVM
    models.append(LinearSVC())
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # ET
    models.append(ExtraTreesClassifier(n_estimators=1000))
    names.append('ET')
    return models, names

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that all of the tested algorithms have skill, achieving an accuracy above the default of 43.1 percent.

The results suggest that most algorithms do well on this dataset. Random Forest achieves the best mean accuracy at 89.5 percent, followed closely by LDA at 88.6 percent, SVM at 88.3 percent, and Extra Trees at 88 percent, with Bagged Decision Trees trailing at 85.1 percent.

>LDA 0.886 (0.027)
>SVM 0.883 (0.027)
>BAG 0.851 (0.037)
>RF 0.895 (0.032)
>ET 0.880 (0.030)

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

We can see that the distributions of scores for the ensembles of decision trees clustered together separate from the other algorithms tested. In most cases, the mean and median are close on the plot, suggesting a somewhat symmetrical distribution of scores that may indicate the models are stable.

With so many classes and so few examples in many of the classes, the dataset may benefit from oversampling.

We can test whether applying the SMOTE algorithm to all classes except the majority class (*cp*) results in a lift in performance.

Generally, SMOTE does not appear to help ensembles of decision trees, so we will change the set of algorithms tested to the following:

- Multinomial Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- Support Vector Machine (SVM)
- k-Nearest Neighbors (KNN)
- Gaussian Process (GP)

The updated version of the *get_models()* function to define these models is listed below.

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', multi_class='multinomial'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # SVM
    models.append(LinearSVC())
    names.append('SVM')
    # KNN
    models.append(KNeighborsClassifier(n_neighbors=3))
    names.append('KNN')
    # GP
    models.append(GaussianProcessClassifier())
    names.append('GP')
    return models, names

We can use the SMOTE implementation from the imbalanced-learn library, and a Pipeline from the same library to first apply SMOTE to the training dataset, then fit a given model as part of the cross-validation procedure.

SMOTE will synthesize new examples using k-nearest neighbors in the training dataset, where by default, *k* is set to 5.

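This is too large for some of the classes in our dataset. For example, the *omL* class has only five examples, and only about four of them fall in the training portion of each fold. Therefore, we will try a *k* value of 2.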
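To give a feel for how a synthetic example is formed, here is a deliberately simplified pure-Python sketch of the core SMOTE idea: pick a minority example, pick one of its *k* nearest minority neighbors, and interpolate a new point on the line between them. The data points and the function names are hypothetical, and this is not the imbalanced-learn implementation.

```python
import random
from math import dist

def nearest_neighbors(point, others, k):
    """Return the k points in others that are closest to point."""
    return sorted(others, key=lambda o: dist(point, o))[:k]

def smote_like_sample(minority, k, rng):
    """Create one synthetic point between a minority example and a neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = rng.choice(nearest_neighbors(base, others, k))
    gap = rng.random()  # position along the line, in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbor)]
    return synthetic, base, neighbor

rng = random.Random(1)
# hypothetical minority-class examples with two input variables
minority = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25]]
synthetic, base, neighbor = smote_like_sample(minority, k=2, rng=rng)
print('base:', base)
print('neighbor:', neighbor)
print('synthetic:', synthetic)
```

The sketch also shows why *k* is limited by class size: each example can only choose neighbors from the other members of its class, so a class with four training examples offers at most three.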

...
# create pipeline
steps = [('o', SMOTE(k_neighbors=2)), ('m', models[i])]
pipeline = Pipeline(steps=steps)
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)

Tying this together, the complete example of using SMOTE oversampling on the E.coli dataset is listed below.

# spot check smote with machine learning algorithms on the ecoli dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    df = read_csv(full_path, header=None)
    # remove rows for the minority classes
    df = df[df[7] != 'imS']
    df = df[df[7] != 'imL']
    # retrieve numpy array
    data = df.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', multi_class='multinomial'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # SVM
    models.append(LinearSVC())
    names.append('SVM')
    # KNN
    models.append(KNeighborsClassifier(n_neighbors=3))
    names.append('KNN')
    # GP
    models.append(GaussianProcessClassifier())
    names.append('GP')
    return models, names

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # create pipeline
    steps = [('o', SMOTE(k_neighbors=2)), ('m', models[i])]
    pipeline = Pipeline(steps=steps)
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that LDA with SMOTE resulted in a small drop from 88.6 percent to about 87.9 percent, whereas SVM with SMOTE saw a small increase from about 88.3 percent to about 88.8 percent.

SVM also appears to be the best-performing method when using SMOTE in this case, although it does not achieve an improvement as compared to random forest in the previous section.

>LR 0.875 (0.024)
>LDA 0.879 (0.029)
>SVM 0.888 (0.025)
>KNN 0.835 (0.040)
>GP 0.876 (0.023)

Box and whisker plots of classification accuracy scores are created for each algorithm.

We can see that LDA has a number of performance outliers with high 90-percent values, which is quite interesting. It might suggest that LDA could perform better if focused on the abundant classes.

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the Random Forest model, which achieved a classification accuracy of about 89.5 percent, as our final model.

First, we can define the model.

...
# define model to evaluate
model = RandomForestClassifier(n_estimators=1000)

Once defined, we can fit it on the entire training dataset.

...
# fit the model
model.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the encoded class label for each example.

We can then use the label encoder to inverse transform to get the string class label.

For example:

...
# define a row of data
row = [...]
# predict the class label
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.

    # fit a model and make predictions on the ecoli dataset
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestClassifier
    # load the dataset
    def load_dataset(full_path):
        # load the dataset as a data frame
        df = read_csv(full_path, header=None)
        # remove rows for the minority classes
        df = df[df[7] != 'imS']
        df = df[df[7] != 'imL']
        # retrieve numpy array
        data = df.values
        # split into input and output elements
        X, y = data[:, :-1], data[:, -1]
        # label encode the target variable
        le = LabelEncoder()
        y = le.fit_transform(y)
        return X, y, le
    # define the location of the dataset
    full_path = 'ecoli.csv'
    # load the dataset
    X, y, le = load_dataset(full_path)
    # define model to evaluate
    model = RandomForestClassifier(n_estimators=1000)
    # fit the model
    model.fit(X, y)
    # known class "cp"
    row = [0.49,0.29,0.48,0.50,0.56,0.24,0.35]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected cp)' % (label))
    # known class "im"
    row = [0.06,0.61,0.48,0.50,0.49,0.92,0.37]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected im)' % (label))
    # known class "imU"
    row = [0.72,0.42,0.48,0.50,0.65,0.77,0.79]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected imU)' % (label))
    # known class "om"
    row = [0.78,0.68,0.48,0.50,0.83,0.40,0.29]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected om)' % (label))
    # known class "omL"
    row = [0.77,0.57,1.00,0.50,0.37,0.54,0.0]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected omL)' % (label))
    # known class "pp"
    row = [0.74,0.49,0.48,0.50,0.42,0.54,0.36]
    yhat = model.predict([row])
    label = le.inverse_transform(yhat)[0]
    print('>Predicted=%s (expected pp)' % (label))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for one example taken from each of the six classes.

We can see that the correct class label is predicted for each of the chosen examples. Nevertheless, on average, we expect that 1 in 10 predictions will be wrong and these errors may not be equally distributed across the classes.

    >Predicted=cp (expected cp)
    >Predicted=im (expected im)
    >Predicted=imU (expected imU)
    >Predicted=om (expected om)
    >Predicted=omL (expected omL)
    >Predicted=pp (expected pp)
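The "1 in 10" figure follows directly from the accuracy of the final model. As a rough sketch, assuming the cross-validated accuracy of about 89.5 percent carries over to new data, the expected error rate can be computed as follows; the function name *expected_errors()* is hypothetical.

```python
# expected error rate implied by a mean classification accuracy
def expected_errors(accuracy):
    error_rate = 1.0 - accuracy
    # on average, one prediction in this many is expected to be wrong
    return error_rate, 1.0 / error_rate

# accuracy of about 89.5 percent quoted for the final random forest model
error_rate, one_in = expected_errors(0.895)
print('Error rate: %.3f (about 1 in %.1f predictions)' % (error_rate, one_in))
```

This is an average over all classes; it says nothing about how the errors are spread across the individual classes.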

This section provides more resources on the topic if you are looking to go deeper.

- Expert System For Predicting Protein Localization Sites In Gram‐negative Bacteria, 1991.
- A Knowledge Base For Predicting Protein Localization Sites In Eukaryotic Cells, 1992.
- A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins, 1996.

- pandas.read_csv API.
- sklearn.dummy.DummyClassifier API.
- imblearn.over_sampling.SMOTE API.
- imblearn.pipeline.Pipeline API.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced multiclass E.coli dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Multiclass Classification with the E.coli Dataset appeared first on Machine Learning Mastery.

The post Imbalanced Multiclass Classification with the Glass Identification Dataset appeared first on Machine Learning Mastery.

Multiclass classification problems are those where a label must be predicted, but there are more than two labels that may be predicted.

These are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. They are made more challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes with very few examples of other classes.

Problems of this type are referred to as imbalanced multiclass classification problems and they require both the careful design of an evaluation metric and test harness and choice of machine learning models. The glass identification dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Glass Identification Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*Glass Identification*” dataset, or simply “*glass*.”

The dataset describes the chemical properties of glass and involves classifying samples of glass using their chemical properties as one of six classes. The dataset was credited to Vina Spiehler in 1987.

Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:

- **RI**: Refractive index
- **Na**: Sodium
- **Mg**: Magnesium
- **Al**: Aluminum
- **Si**: Silicon
- **K**: Potassium
- **Ca**: Calcium
- **Ba**: Barium
- **Fe**: Iron

The chemical compositions are measured as the weight percent in corresponding oxide.

There are seven types of glass listed; they are:

- **Class 1**: building windows (float processed)
- **Class 2**: building windows (non-float processed)
- **Class 3**: vehicle windows (float processed)
- **Class 4**: vehicle windows (non-float processed)
- **Class 5**: containers
- **Class 6**: tableware
- **Class 7**: headlamps

Float glass refers to the process used to make the glass.

There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.

- **Class 1**: 70 examples
- **Class 2**: 76 examples
- **Class 3**: 17 examples
- **Class 4**: 0 examples
- **Class 5**: 13 examples
- **Class 6**: 9 examples
- **Class 7**: 29 examples

Although there are minority classes, all classes are equally important in this prediction problem.

The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). There are 163 examples of window glass and 51 examples of non-window glass.

- **Window Glass**: 163 examples
- **Non-Window Glass**: 51 examples

Another division of the observations would be between float processed glass and non-float processed glass, in the case of window glass only. This division is more balanced.

- **Float Glass**: 87 examples
- **Non-Float Glass**: 76 examples
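These groupings can be checked with a few lines of arithmetic. The sketch below simply sums the per-class counts listed earlier; the variable names are chosen for illustration.

```python
# number of examples per glass class, as listed above
counts = {1: 70, 2: 76, 3: 17, 4: 0, 5: 13, 6: 9, 7: 29}

# window glass is classes 1-4, non-window glass is classes 5-7
window = sum(counts[c] for c in (1, 2, 3, 4))
non_window = sum(counts[c] for c in (5, 6, 7))

# within window glass: float processed (1, 3) vs non-float processed (2, 4)
float_glass = counts[1] + counts[3]
non_float_glass = counts[2] + counts[4]

total = sum(counts.values())
print('Total: %d' % total)
print('Window: %d, Non-Window: %d' % (window, non_window))
print('Float: %d, Non-Float: %d' % (float_glass, non_float_glass))
```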

Next, let’s take a closer look at the data.

First, download the dataset and save it in your current working directory with the name “*glass.csv*”.

Note that this version of the dataset has the first column (the row number) removed as it does not contain generalizable information for modeling.

Review the contents of the file.

The first few lines of the file should look as follows:

    1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
    1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
    1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
    1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
    1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
    ...

We can see that the input variables are numeric and that the class label is an integer in the final column.

All of the chemical input variables have the same units, although the first variable, the refractive index, has different units. As such, data scaling may be required for some modeling algorithms.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location of the dataset and the fact that there is no header line.

    ...
    # define the dataset location
    filename = 'glass.csv'
    # load the csv file as a data frame
    dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the *DataFrame*.

    ...
    # summarize the shape of the dataset
    print(dataframe.shape)

We can also summarize the number of examples in each class using the Counter object.

    ...
    # summarize the class distribution
    target = dataframe.values[:,-1]
    counter = Counter(target)
    for k,v in counter.items():
        per = v / len(target) * 100
        print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

    # load and summarize the dataset
    from pandas import read_csv
    from collections import Counter
    # define the dataset location
    filename = 'glass.csv'
    # load the csv file as a data frame
    dataframe = read_csv(filename, header=None)
    # summarize the shape of the dataset
    print(dataframe.shape)
    # summarize the class distribution
    target = dataframe.values[:,-1]
    counter = Counter(target)
    for k,v in counter.items():
        per = v / len(target) * 100
        print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns, which are 214 rows and 9 input variables and 1 target variable.

The class distribution is then summarized, confirming the severe skew in the observations for each class.

    (214, 10)
    Class=1, Count=70, Percentage=32.710%
    Class=2, Count=76, Percentage=35.514%
    Class=3, Count=17, Percentage=7.944%
    Class=5, Count=13, Percentage=6.075%
    Class=6, Count=9, Percentage=4.206%
    Class=7, Count=29, Percentage=13.551%

We can also take a look at the distribution of the input variables by creating a histogram for each.

The complete example of creating histograms of all variables is listed below.

    # create histograms of all variables
    from pandas import read_csv
    from matplotlib import pyplot
    # define the dataset location
    filename = 'glass.csv'
    # load the csv file as a data frame
    df = read_csv(filename, header=None)
    # create a histogram plot of each variable
    df.hist()
    # show the plot
    pyplot.show()

We can see that some of the variables have a Gaussian-like distribution and others appear to have an exponential or even a bimodal distribution.

Depending on the choice of algorithm, the data may benefit from standardization of some variables and perhaps a power transform.
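To make the idea of standardization concrete, here is a minimal plain-Python sketch that rescales a single variable to zero mean and unit standard deviation. It covers standardization only, not the power transform. The function name *standardize()* is hypothetical, the use of the population standard deviation is an assumption, and the values are the refractive index and silicon columns from the five rows of *glass.csv* shown earlier.

```python
# plain-python sketch of standardizing a single variable (illustrative only)
from statistics import mean, pstdev

def standardize(values):
    # subtract the mean and divide by the population standard deviation
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# refractive index and silicon values from the first five rows of glass.csv
ri = [1.52101, 1.51761, 1.51618, 1.51766, 1.51742]
si = [71.78, 72.73, 72.99, 72.61, 73.08]

for name, column in (('RI', ri), ('Si', si)):
    scaled = standardize(column)
    print('%s: mean=%.3f, std=%.3f' % (name, abs(mean(scaled)), pstdev(scaled)))
```

After the transform, both variables sit on the same scale even though the raw values differ by well over an order of magnitude, which is what distance-based algorithms such as KNN and SVM tend to benefit from.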

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=5, meaning each fold will contain about 214/5, or about 42 examples.

Stratified means that each fold will aim to contain the same mixture of examples by class as the entire training dataset. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 5 * 3 or 15 times and the mean and standard deviation of these runs will be reported.
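The numbers in this procedure can be worked through in a short sketch. It uses only the dataset size and the class counts listed earlier, and the per-class fold sizes are approximations rather than the exact split a library would produce.

```python
# arithmetic behind repeated stratified k-fold on the glass dataset
n_examples = 214
n_splits, n_repeats = 5, 3

# examples per class, taken from the class distribution listed earlier
counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}

# approximate size of each test fold and total number of model fits
fold_size = n_examples / n_splits
n_fits = n_splits * n_repeats
print('Fold size: about %.1f examples' % fold_size)
print('Model fits: %d' % n_fits)

# with stratification, each fold holds roughly count / n_splits of a class
for label, count in counts.items():
    print('Class %d: about %.1f examples per fold' % (label, count / n_splits))
```

The smallest class has only 9 examples, so each test fold contains just one or two of them. This is one reason the accuracy estimates for a single fold can be noisy and why repeating the procedure helps.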

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

All classes are equally important. There are minority classes that are only represented with 4 percent or 6 percent of the data, yet no class has more than about 35 percent dominance of the dataset.

As such, in this case, we will use classification accuracy to evaluate models.
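As a reminder of what the metric measures, classification accuracy is the fraction of predictions that match the true label. The sketch below uses a hypothetical *accuracy()* helper and made-up labels that are not taken from the dataset.

```python
# classification accuracy: fraction of predictions that match the true label
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# hypothetical labels for ten examples, not taken from the dataset
y_true = [0, 0, 1, 1, 1, 2, 3, 4, 5, 5]
y_pred = [0, 1, 1, 1, 1, 2, 3, 0, 5, 5]
print('Accuracy: %.3f' % accuracy(y_true, y_pred))
```

Note that every example counts equally in this score, regardless of its class, so errors on a rare class move the result less than errors on a common class.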

First, we can define a function to load the dataset, split the columns into input and output variables, and use a label encoder to ensure class labels are numbered sequentially from 0 to 5.

    # load the dataset
    def load_dataset(full_path):
        # load the dataset as a data frame
        data = read_csv(full_path, header=None)
        # retrieve numpy array
        data = data.values
        # split into input and output elements
        X, y = data[:, :-1], data[:, -1]
        # label encode the target variable to have the classes 0 to 5
        y = LabelEncoder().fit_transform(y)
        return X, y

We can define a function to evaluate a candidate model using stratified repeated 5-fold cross-validation then return a list of scores calculated on the model for each fold and repeat. The *evaluate_model()* function below implements this.

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores

We can then call the *load_dataset()* function to load and confirm the glass identification dataset.

    ...
    # define the location of the dataset
    full_path = 'glass.csv'
    # load the dataset
    X, y = load_dataset(full_path)
    # summarize the loaded dataset
    print(X.shape, y.shape, Counter(y))

In this case, we will evaluate the baseline strategy of predicting the majority class in all cases.

This can be implemented automatically using the DummyClassifier class and setting the “*strategy*” to “*most_frequent*” that will predict the most common class (e.g. class 2) in the training dataset.

As such, we would expect this model to achieve a classification accuracy of about 35 percent given this is the distribution of the most common class in the training dataset.
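The expected baseline can be worked out by hand before running any model. The sketch below rebuilds the label column from the class counts reported earlier and computes the accuracy of always predicting the most common class. It is calculated on the whole dataset, so the cross-validated estimate should be close to it but need not be identical.

```python
from collections import Counter

# rebuild the label column from the class counts reported earlier
counts = {1: 70, 2: 76, 3: 17, 5: 13, 6: 9, 7: 29}
y = [label for label, count in counts.items() for _ in range(count)]

# the majority-class strategy always predicts the most common label
majority_label, majority_count = Counter(y).most_common(1)[0]
baseline = majority_count / len(y)
print('Majority class: %d' % majority_label)
print('Expected baseline accuracy: %.3f' % baseline)
```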

    ...
    # define the reference model
    model = DummyClassifier(strategy='most_frequent')

We can then evaluate the model by calling our *evaluate_model()* function and report the mean and standard deviation of the results.

    ...
    # evaluate the model
    scores = evaluate_model(X, y, model)
    # summarize performance
    print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this all together, the complete example of evaluating the baseline model on the glass identification dataset using classification accuracy is listed below.

    # baseline model and test harness for the glass identification dataset
    from collections import Counter
    from numpy import mean
    from numpy import std
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    # load the dataset
    def load_dataset(full_path):
        # load the dataset as a data frame
        data = read_csv(full_path, header=None)
        # retrieve numpy array
        data = data.values
        # split into input and output elements
        X, y = data[:, :-1], data[:, -1]
        # label encode the target variable to have the classes 0 to 5
        y = LabelEncoder().fit_transform(y)
        return X, y
    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores
    # define the location of the dataset
    full_path = 'glass.csv'
    # load the dataset
    X, y = load_dataset(full_path)
    # summarize the loaded dataset
    print(X.shape, y.shape, Counter(y))
    # define the reference model
    model = DummyClassifier(strategy='most_frequent')
    # evaluate the model
    scores = evaluate_model(X, y, model)
    # summarize performance
    print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and reports the number of cases correctly as 214 and the distribution of class labels as we expect.

The *DummyClassifier* with the “*most_frequent*” strategy is then evaluated using repeated stratified k-fold cross-validation, and the mean classification accuracy is reported as about 35.5 percent, with a standard deviation of about 1.1 percent.

This score provides a baseline on this dataset by which all other classification algorithms can be compared. Achieving a score above about 35.5 percent indicates that a model has skill on this dataset, and a score at or below this value indicates that the model does not have skill on this dataset.

    (214, 9) (214,) Counter({1: 76, 0: 70, 5: 29, 2: 17, 3: 13, 4: 9})
    Mean Accuracy: 0.355 (0.011)

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better classification accuracy using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s evaluate a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention and what doesn’t.

We will evaluate the following machine learning models on the glass dataset:

- Support Vector Machine (SVM)
- k-Nearest Neighbors (KNN)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

    # define models to test
    def get_models():
        models, names = list(), list()
        # SVM
        models.append(SVC(gamma='auto'))
        names.append('SVM')
        # KNN
        models.append(KNeighborsClassifier())
        names.append('KNN')
        # Bagging
        models.append(BaggingClassifier(n_estimators=1000))
        names.append('BAG')
        # RF
        models.append(RandomForestClassifier(n_estimators=1000))
        names.append('RF')
        # ET
        models.append(ExtraTreesClassifier(n_estimators=1000))
        names.append('ET')
        return models, names

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

    ...
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize performance
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

    ...
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the glass identification dataset is listed below.

    # spot check machine learning algorithms on the glass identification dataset
    from numpy import mean
    from numpy import std
    from pandas import read_csv
    from matplotlib import pyplot
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import BaggingClassifier
    # load the dataset
    def load_dataset(full_path):
        # load the dataset as a data frame
        data = read_csv(full_path, header=None)
        # retrieve numpy array
        data = data.values
        # split into input and output elements
        X, y = data[:, :-1], data[:, -1]
        # label encode the target variable to have the classes 0 to 5
        y = LabelEncoder().fit_transform(y)
        return X, y
    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores
    # define models to test
    def get_models():
        models, names = list(), list()
        # SVM
        models.append(SVC(gamma='auto'))
        names.append('SVM')
        # KNN
        models.append(KNeighborsClassifier())
        names.append('KNN')
        # Bagging
        models.append(BaggingClassifier(n_estimators=1000))
        names.append('BAG')
        # RF
        models.append(RandomForestClassifier(n_estimators=1000))
        names.append('RF')
        # ET
        models.append(ExtraTreesClassifier(n_estimators=1000))
        names.append('ET')
        return models, names
    # define the location of the dataset
    full_path = 'glass.csv'
    # load the dataset
    X, y = load_dataset(full_path)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize performance
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that all of the tested algorithms have skill, achieving an accuracy above the default of 35.5 percent.

The results suggest that ensembles of decision trees perform well on this dataset, with perhaps random forest performing the best overall achieving a classification accuracy of approximately 79.6 percent.

    >SVM 0.669 (0.057)
    >KNN 0.647 (0.055)
    >BAG 0.767 (0.070)
    >RF 0.796 (0.062)
    >ET 0.776 (0.057)
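The comparison against the baseline can be made explicit with a short sketch. It copies the mean accuracies reported above and the baseline of 0.355 from the previous section; the "lift" column is simply the difference between the two and is a name chosen here for illustration.

```python
# mean accuracy for each model, copied from the results reported above
results = {'SVM': 0.669, 'KNN': 0.647, 'BAG': 0.767, 'RF': 0.796, 'ET': 0.776}
# baseline accuracy of the majority class model
baseline = 0.355

# a model has skill if its mean accuracy is above the baseline
for name, score in results.items():
    print('>%s skill=%s lift=%.3f' % (name, score > baseline, score - baseline))

# model with the highest mean accuracy
best = max(results, key=results.get)
print('Best: %s (%.3f)' % (best, results[best]))
```

Keep in mind that the gap between the three tree ensembles is about 0.02 to 0.03, which is smaller than their reported standard deviations of roughly 0.06, so the ranking among them should be treated as tentative.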

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.
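The quantities drawn in each box can also be computed directly. The following sketch uses the standard library and a hypothetical sample of 15 scores (not real results from this tutorial); the choice of the 'inclusive' quartile method is an assumption about how the plot interpolates between values.

```python
# summary statistics that a box and whisker plot draws (illustrative only)
from statistics import mean, median, quantiles

# hypothetical accuracy scores from 15 cross-validation runs, not real results
scores = [0.70, 0.72, 0.74, 0.75, 0.77, 0.78, 0.79, 0.80,
          0.81, 0.82, 0.83, 0.84, 0.86, 0.88, 0.91]

# quartiles: the box spans from the first (q1) to the third (q3) quartile
q1, q2, q3 = quantiles(scores, n=4, method='inclusive')
print('Box: %.3f to %.3f' % (q1, q3))
print('Median (orange line): %.3f' % median(scores))
print('Mean (green triangle): %.3f' % mean(scores))
```

When the mean and median are close, as in this made-up sample, the distribution of scores is roughly symmetrical.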

We can see that the distributions of scores for the ensembles of decision trees are clustered together, separate from the other algorithms tested. In most cases, the mean and median are close on the plot, suggesting a somewhat symmetrical distribution of scores that may indicate the models are stable.

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the Random Forest model as our final model, which achieved a classification accuracy of about 79.6 percent.

First, we can define the model.

    ...
    # define model to evaluate
    model = RandomForestClassifier(n_estimators=1000)

Once defined, we can fit it on the entire training dataset.

    ...
    # fit the model
    model.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function.

This will return the class label for each example.

For example:

    ...
    # define a row of data
    row = [...]
    # predict the class label
    yhat = model.predict([row])

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.

    # fit a model and make predictions on the glass identification dataset
    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestClassifier
    # load the dataset
    def load_dataset(full_path):
        # load the dataset as a data frame
        data = read_csv(full_path, header=None)
        # retrieve numpy array
        data = data.values
        # split into input and output elements
        X, y = data[:, :-1], data[:, -1]
        # label encode the target variable to have the classes 0 to 5
        y = LabelEncoder().fit_transform(y)
        return X, y
    # define the location of the dataset
    full_path = 'glass.csv'
    # load the dataset
    X, y = load_dataset(full_path)
    # define model to evaluate
    model = RandomForestClassifier(n_estimators=1000)
    # fit the model
    model.fit(X, y)
    # predict() returns an array, so the first element is taken before formatting
    # known class 0 (class=1 in the dataset)
    row = [1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 0)' % (yhat[0]))
    # known class 1 (class=2 in the dataset)
    row = [1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 1)' % (yhat[0]))
    # known class 2 (class=3 in the dataset)
    row = [1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 2)' % (yhat[0]))
    # known class 3 (class=5 in the dataset)
    row = [1.51915,12.73,1.85,1.86,72.69,0.60,10.09,0.00,0.00]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 3)' % (yhat[0]))
    # known class 4 (class=6 in the dataset)
    row = [1.51115,17.38,0.00,0.34,75.41,0.00,6.65,0.00,0.00]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 4)' % (yhat[0]))
    # known class 5 (class=7 in the dataset)
    row = [1.51556,13.87,0.00,2.54,73.23,0.14,9.41,0.81,0.01]
    yhat = model.predict([row])
    print('>Predicted=%d (expected 5)' % (yhat[0]))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for one example taken from each of the six classes.

We can see that the correct class label is predicted for each of the chosen examples. Nevertheless, on average, we expect that 1 in 5 predictions will be wrong and these errors may not be equally distributed across the classes.

    >Predicted=0 (expected 0)
    >Predicted=1 (expected 1)
    >Predicted=2 (expected 2)
    >Predicted=3 (expected 3)
    >Predicted=4 (expected 4)
    >Predicted=5 (expected 5)
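Because the label encoder is not returned by *load_dataset()* in this example, the predictions are encoded labels from 0 to 5 rather than the original class numbers. One way to translate them back, sketched below, is a small lookup that assumes the encoded labels follow the sorted order of the original classes (consistent with the code comments above); the *decode()* helper is hypothetical.

```python
# map encoded labels (0-5) back to the original glass classes (illustrative only)
# original class numbers present in the dataset, in sorted order
original_classes = [1, 2, 3, 5, 6, 7]
# descriptions taken from the list of glass types earlier in the tutorial
descriptions = {
    1: 'building windows (float processed)',
    2: 'building windows (non-float processed)',
    3: 'vehicle windows (float processed)',
    5: 'containers',
    6: 'tableware',
    7: 'headlamps',
}

def decode(encoded_label):
    # position in the sorted class list gives the original class number
    original = original_classes[encoded_label]
    return original, descriptions[original]

for encoded in range(6):
    print('>Encoded=%d Class=%d (%s)' % ((encoded,) + decode(encoded)))
```

An alternative is to return the fitted label encoder from the loading function and call its *inverse_transform()* function on the predictions.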

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API.
- sklearn.dummy.DummyClassifier API.
- sklearn.ensemble.RandomForestClassifier API.

- Glass Identification Dataset, UCI Machine Learning Repository.
- Glass Identification Dataset.
- Glass Identification Dataset Description.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
