The post How to Develop LARS Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

Lasso Regression is a popular type of regularized linear regression that includes an L1 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

**Least Angle Regression** or **LARS** for short provides an alternate, efficient way of fitting a Lasso regularized regression model that does not require any hyperparameters.

In this tutorial, you will discover how to develop and evaluate LARS Regression models in Python.

After completing this tutorial, you will know:

- LARS Regression provides an alternate way to train a Lasso regularized linear regression model that adds a penalty to the loss function during training.
- How to evaluate a LARS Regression model and use a final model to make predictions for new data.
- How to configure the LARS Regression model for a new dataset automatically using a cross-validation version of the estimator.

Let’s get started.

This tutorial is divided into three parts; they are:

- LARS Regression
- Example of LARS Regression
- Tuning LARS Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2
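As a minimal sketch, the loss above can be computed directly in plain Python. The function name `sum_squared_error` and the sample values are made up for illustration.

```python
# sum of squared errors between expected values and predictions
def sum_squared_error(y, yhat):
    # accumulate the squared residual for each paired example
    return sum((y_i - yhat_i) ** 2 for y_i, yhat_i in zip(y, yhat))

# hypothetical expected targets and model predictions
y = [3.0, 5.0, 7.0]
yhat = [2.5, 5.0, 8.0]
loss = sum_squared_error(y, yhat)
print('loss: %.3f' % loss)
```

Squaring means large individual errors dominate the total, which is one reason the fitted coefficients can be sensitive to a few unusual rows.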

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples) or with more input predictors (*p*) than samples (*n*) (so-called *p >> n problems*).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

A popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)

Because coefficients can be driven all the way to zero, the L1 penalty effectively removes input features from the model. This acts as a type of automatic feature selection method.
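The penalty and its link to feature selection can be sketched in a few lines of plain Python. The coefficient values below are hypothetical, chosen so that two of them are exactly zero.

```python
# l1 penalty: sum of the absolute coefficient values
def l1_penalty(coefficients):
    return sum(abs(beta) for beta in coefficients)

# hypothetical coefficients; two of them have been shrunk to exactly zero
beta = [0.5, -1.5, 0.0, 2.0, 0.0]
penalty = l1_penalty(beta)
# features whose coefficient is zero no longer contribute to predictions
selected = [j for j, b in enumerate(beta) if b != 0.0]
print('penalty: %.1f' % penalty)
print('selected features:', selected)
```

Only the features listed in `selected` still influence the model, which is the sense in which the penalty performs feature selection.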

… a consequence of penalizing the absolute values is that some parameters are actually set to 0 for some value of lambda. Thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection.

— Page 125, Applied Predictive Modeling, 2013.

This penalty can be added to the cost function for linear regression and is referred to as Least Absolute Shrinkage And Selection Operator (LASSO), or more commonly, “*Lasso*” (with title case) for short.

The Lasso trains the model using a least-squares loss training procedure.

**Least Angle Regression**, LAR or LARS for short, is an alternative approach to solving the optimization problem of fitting the penalized model. Technically, LARS is a forward stepwise version of feature selection for regression that can be adapted for the Lasso model.

Unlike the Lasso, it does not require a hyperparameter that controls the weighting of the penalty in the loss function. Instead, the weighting is discovered automatically by LARS.

… least angle regression (LARS), is a broad framework that encompasses the lasso and similar models. The LARS model can be used to fit lasso models more efficiently, especially in high-dimensional problems.

— Page 126, Applied Predictive Modeling, 2013.
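To see the sequence of penalty values that LARS visits, one option is the `lars_path` function in scikit-learn. The sketch below assumes a small synthetic dataset from `make_regression` rather than the housing data, and uses `method='lasso'` so that the path follows the Lasso solutions.

```python
# trace the penalty values visited along the lars path
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# synthetic regression problem (an assumption for illustration)
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, noise=0.1, random_state=1)
# method='lasso' asks lars to follow the lasso solution path
alphas, active, coefs = lars_path(X, y, method='lasso')
# alphas shrink along the path; coefs has one column per entry in alphas
print('steps: %d' % len(alphas))
print('first alpha: %.3f, last alpha: %.3f' % (alphas[0], alphas[-1]))
print('coefficient path shape:', coefs.shape)
```

Each column of `coefs` is the coefficient vector at the matching penalty value, so the whole range of penalty weights is explored in a single fit.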

Now that we are familiar with LARS penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the LARS Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

```
(506, 14)
         0     1     2   3      4      5  ...   8      9    10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
```

The scikit-learn Python machine learning library provides an implementation of the LARS penalized regression algorithm via the Lars class.

```python
...
# define model
model = Lars()
```

We can evaluate the LARS Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

```python
# evaluate an lars regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lars
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lars()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the LARS Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.
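The `neg_mean_absolute_error` scoring option returns the MAE with its sign flipped so that larger scores are always better, which is why the listing takes the absolute value of the scores. A minimal plain-Python sketch of the metric, with made-up values:

```python
# mean absolute error: average magnitude of the prediction errors
def mean_absolute_error(y, yhat):
    return sum(abs(y_i - yhat_i) for y_i, yhat_i in zip(y, yhat)) / len(y)

# hypothetical expected targets and predictions
y = [24.0, 21.6, 34.7]
yhat = [25.0, 20.6, 31.7]
mae = mean_absolute_error(y, yhat)
# a negated score keeps the convention that larger is better
neg_score = -mae
print('MAE: %.3f' % mae)
print('negated score: %.3f, recovered MAE: %.3f' % (neg_score, abs(neg_score)))
```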

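Your specific results may vary given differences in library versions or numerical precision. The LARS fit itself is deterministic and the cross-validation splits use a fixed seed, so repeated runs in the same environment should give the same score.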

In this case, we can see that the model achieved a MAE of about 3.432.

Mean MAE: 3.432 (0.552)

We may decide to use the LARS Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

```python
# make a prediction with a lars regression model on the dataset
from pandas import read_csv
from sklearn.linear_model import Lars
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lars()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction (predict returns an array, so take the first element)
print('Predicted: %.3f' % yhat[0])
```

Running the example fits the model and makes a prediction for the new row of data.

Your specific results may vary given differences in library versions or numerical precision.

Predicted: 29.904

Next, we can look at configuring the model hyperparameters.

The LARS training algorithm automatically discovers the best value for the *lambda* hyperparameter used in the Lasso algorithm.

In scikit-learn, this hyperparameter is called “*alpha*”: it is an argument of the Lasso class, and the cross-validation version of LARS reports the chosen value via the *alpha_* attribute.

Nevertheless, the process of automatically discovering the best model and *alpha* hyperparameter is still based on a single training dataset.

An alternative approach is to fit the model on multiple subsets of the training dataset and choose the best internal model configuration across the folds, in this case, the value of *alpha*. Generally, this is referred to as a cross-validation estimator.
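The idea of choosing a penalty weight across folds can be sketched in plain Python on a toy single-feature problem. This is an illustration of the concept only, not how LarsCV is implemented; the data, the candidate values, and the helper names are all made up, and the one-feature model has no intercept so that it has a simple closed form.

```python
# choose a penalty weight by k-fold cross-validation (toy single-feature problem)

def soft_threshold(value, threshold):
    # move value toward zero by threshold; values within the threshold become zero
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_1d(x, y, alpha):
    # closed-form minimizer of sum((y - b * x)^2) + alpha * abs(b), no intercept
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return soft_threshold(xy, alpha / 2.0) / xx

def cv_error(x, y, alpha, n_folds):
    # mean absolute error on each held-out fold, averaged across folds
    errors = []
    for k in range(n_folds):
        # hold out every n_folds-th example, starting at offset k
        test_idx = set(range(k, len(x), n_folds))
        x_train = [v for i, v in enumerate(x) if i not in test_idx]
        y_train = [v for i, v in enumerate(y) if i not in test_idx]
        b = fit_lasso_1d(x_train, y_train, alpha)
        fold_err = [abs(y[i] - b * x[i]) for i in test_idx]
        errors.append(sum(fold_err) / len(fold_err))
    return sum(errors) / len(errors)

# hypothetical data where y is roughly 2 * x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
candidates = [0.0, 1.0, 10.0, 100.0]
scores = {alpha: cv_error(x, y, alpha, n_folds=3) for alpha in candidates}
# keep the candidate with the lowest average held-out error
best_alpha = min(scores, key=scores.get)
print('best alpha: %s' % best_alpha)
```

Because every candidate is scored on data it was not fit on, the chosen value is less tied to the quirks of any single training subset.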

The scikit-learn library offers a cross-validation version of LARS for finding a more robust value for *alpha* via the LarsCV class.

The example below demonstrates how to fit a *LarsCV* model and report the *alpha* value found via cross-validation.

```python
# use the automatically configured lars regression algorithm
from pandas import read_csv
from sklearn.linear_model import LarsCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = LarsCV(cv=cv, n_jobs=-1)
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)
```

Running the example fits the *LarsCV* model using repeated cross-validation and reports an optimal *alpha* value found across the runs.

alpha: 0.001623

This version of the LARS model may prove more robust in practice.

We can evaluate it using the same procedure we did in the previous section, although in this case, each model fit is based on the hyperparameters found via repeated k-fold cross-validation internally (e.g. cross-validation of a cross-validation estimator).

The complete example is listed below.

```python
# evaluate an lars cross-validation regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LarsCV
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = LarsCV(cv=cv, n_jobs=-1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example will evaluate the cross-validated estimation of model hyperparameters using repeated cross-validation.

Your specific results may vary given differences in library versions or numerical precision.

In this case, we can see that we achieved slightly better results with 3.374 vs. 3.432 in the previous section.

Mean MAE: 3.374 (0.558)


In this tutorial, you discovered how to develop and evaluate LARS Regression models in Python.

Specifically, you learned:

- LARS Regression provides an alternate way to train a Lasso regularized linear regression model that adds a penalty to the loss function during training.
- How to evaluate a LARS Regression model and use a final model to make predictions for new data.
- How to configure the LARS Regression model for a new dataset automatically using a cross-validation version of the estimator.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Nearest Shrunken Centroids With Python appeared first on Machine Learning Mastery.

Nearest Centroids is a classification machine learning algorithm. It involves predicting a class label for new examples based on which class-based centroid from the training dataset the example is closest to.

The **Nearest Shrunken Centroids** algorithm is an extension that involves shifting class-based centroids toward the centroid of the entire training dataset and removing those input variables that are less useful at discriminating the classes.

As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables.

In this tutorial, you will discover the Nearest Shrunken Centroids classification machine learning algorithm.

After completing this tutorial, you will know:

- The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Nearest Centroids Algorithm
- Nearest Centroids With Scikit-Learn
- Tuning Nearest Centroid Hyperparameters

Nearest Centroids is a classification machine learning algorithm.

The algorithm involves first summarizing the training dataset into a set of centroids (centers), then using the centroids to make predictions for new examples.

For each class, the centroid of the data is found by taking the average value of each predictor (per class) in the training set. The overall centroid is computed using the data from all of the classes.

— Page 307, Applied Predictive Modeling, 2013.

A centroid is the geometric center of a data distribution, such as the mean. In multiple dimensions, this would be the mean value along each dimension, forming a point of center of the distribution across each variable.
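As a small illustration, the centroid of a handful of rows is just the column-wise mean. The helper name `centroid` and the rows are hypothetical.

```python
# centroid of a set of rows: the mean of each input variable
def centroid(rows):
    n = len(rows)
    # zip(*rows) iterates over columns rather than rows
    return [sum(column) / n for column in zip(*rows)]

# three hypothetical rows with two input variables each
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
center = centroid(rows)
print(center)
```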

The Nearest Centroids algorithm assumes that the centroids in the input feature space are different for each target label. The training data is split into groups by class label, then the centroid for each group of data is calculated. Each centroid is simply the mean value of each of the input variables. If there are two classes, then two centroids or points are calculated; three classes give three centroids, and so on.

The centroids then represent the “*model*.” Given new examples, such as those in the test set or new data, the distance between a given row of data and each centroid is calculated and the closest centroid is used to assign a class label to the example.
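The whole procedure fits in a few lines of plain Python. The sketch below is a from-scratch illustration with made-up data and helper names, not the scikit-learn implementation.

```python
# nearest centroid classification from scratch
from math import sqrt

def fit_centroids(X, y):
    # group rows by class label, then average each input variable per group
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def euclidean(a, b):
    # straight-line distance between two points
    return sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def predict(centroids, row):
    # assign the label of the closest centroid
    return min(centroids, key=lambda label: euclidean(centroids[label], row))

# hypothetical training data with two classes
X = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]]
y = [0, 0, 1, 1]
centroids = fit_centroids(X, y)
label = predict(centroids, [2.0, 1.0])
print('Predicted Class: %d' % label)
```

The fitted "model" is only the dictionary of centroids, so prediction cost grows with the number of classes rather than the number of training rows.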

Distance measures such as Euclidean distance are used for numerical data, and Hamming distance for categorical data. For numerical data, it is best practice to scale input variables via normalization or standardization prior to training the model. This ensures that input variables with large values don’t dominate the distance calculation.
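A minimal sketch of standardization in plain Python follows. The rows are hypothetical, with a second variable on a much larger scale than the first, and the helper assumes no column is constant (which would give a zero standard deviation).

```python
# standardize columns so that no variable dominates a distance calculation
from math import sqrt

def standardize(rows):
    # rescale each column to zero mean and unit (population) standard deviation
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# hypothetical rows: the second variable has a much larger scale than the first
rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = standardize(rows)
for row in scaled:
    print(['%.3f' % v for v in row])
```

After scaling, both columns span a comparable range, so each contributes similarly to a Euclidean distance.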

An extension to the nearest centroid method for classification is to shrink the centroids of each input variable towards the centroid of the entire training dataset. Those variables that are shrunk down to the value of the data centroid can then be removed as they do not help to discriminate between the class labels.

As such, the amount of shrinkage applied to the centroids is a hyperparameter that can be tuned for the dataset and used to perform an automatic form of feature selection. Thus, it is appropriate for a dataset with a large number of input variables, some of which may be irrelevant or noisy.

Consequently, the nearest shrunken centroid model also conducts feature selection during the model training process.

— Page 307, Applied Predictive Modeling, 2013.
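The shrinkage step can be sketched with a soft-threshold applied to the gap between each class centroid and the overall centroid. This is a simplified illustration with made-up values: it omits the per-variable scaling by within-class standard deviation that the published method applies before thresholding.

```python
# simplified centroid shrinkage toward the overall centroid

def soft_threshold(value, threshold):
    # move value toward zero by threshold; values within the threshold become zero
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def shrink_centroids(class_centroids, overall_centroid, threshold):
    # shrink each class centroid toward the overall centroid, one variable at a time
    shrunken = {}
    for label, center in class_centroids.items():
        shrunken[label] = [o + soft_threshold(c - o, threshold)
                           for c, o in zip(center, overall_centroid)]
    return shrunken

# hypothetical centroids for two classes and two input variables
class_centroids = {0: [1.0, 4.9], 1: [9.0, 5.1]}
overall_centroid = [5.0, 5.0]
shrunken = shrink_centroids(class_centroids, overall_centroid, threshold=0.5)
# a variable is uninformative once every class centroid equals the overall centroid
removed = [j for j, o in enumerate(overall_centroid)
           if all(center[j] == o for center in shrunken.values())]
print(shrunken)
print('removed variables:', removed)
```

The second variable barely differs between the classes, so shrinkage collapses it onto the overall centroid and it stops influencing which centroid is nearest.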

This approach is referred to as “*Nearest Shrunken Centroids*” and was first described by Robert Tibshirani, et al. in their 2002 paper titled “Diagnosis Of Multiple Cancer Types By Shrunken Centroids Of Gene Expression.”

The Nearest Shrunken Centroids is available in the scikit-learn Python machine learning library via the NearestCentroid class.

The class allows the configuration of the distance metric used in the algorithm via the “*metric*” argument, which defaults to ‘*euclidean*’ for the Euclidean distance metric.

This can be changed to other built-in metrics such as ‘*manhattan*.’

```python
...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean')
```

By default, no shrinkage is used, but shrinkage can be specified via the “*shrink_threshold*” argument, which takes a floating point value between 0 and 1.

```python
...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean', shrink_threshold=0.5)
```

We can demonstrate the Nearest Shrunken Centroids with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example creates and summarizes the dataset.

```python
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
```

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 20) (1000,)

We can fit and evaluate a Nearest Shrunken Centroids model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration of Euclidean distance and no shrinkage.

```python
...
# create the nearest centroid model
model = NearestCentroid()
```

The complete example of evaluating the Nearest Shrunken Centroids model for the synthetic binary classification task is listed below.

```python
# evaluate a nearest centroid model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given differences in library versions or numerical precision.

In this case, we can see that the model achieved a mean accuracy of about 71 percent.

Mean Accuracy: 0.711 (0.055)

We may decide to use the Nearest Shrunken Centroids as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function passing in a new row of data.

We can demonstrate this with a complete example listed below.

```python
# make a prediction with a nearest centroid model on the dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# fit model
model.fit(X, y)
# define new data
row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
# make a prediction
yhat = model.predict([row])
# summarize prediction (predict returns an array, so take the first element)
print('Predicted Class: %d' % yhat[0])
```

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Nearest Shrunken Centroid method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the shrinkage controlled via the “*shrink_threshold*” argument. It is a good idea to test values between 0 and 1 on a grid with a step size such as 0.1 or 0.01.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

```python
# grid search shrinkage for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
```

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given differences in library versions or numerical precision.

In this case, we can see that we achieved slightly better results than the default, with 71.4 percent vs. 71.1 percent. We can see that the grid search selected a *shrink_threshold* value of 0.53.

```
Mean Accuracy: 0.714
Config: {'shrink_threshold': 0.53}
```

The other key configuration is the distance measure used, which can be chosen based on the distribution of the input variables.

Any of the built-in distance measures can be used. Common distance measures include:

- ‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’

Given that our input variables are numeric, our dataset only supports ‘*euclidean*’ and ‘*manhattan*.’

We can include these metrics in our grid search; the complete example is listed below.

```python
# grid search shrinkage and distance metric for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
grid['metric'] = ['euclidean', 'manhattan']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
```

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

In this case, we can see that we get slightly better accuracy of 75 percent using no shrinkage and the manhattan instead of the euclidean distance measure.

```
Mean Accuracy: 0.750
Config: {'metric': 'manhattan', 'shrink_threshold': 0.0}
```

A good extension to these experiments would be to add data normalization or standardization to the data as part of a modeling Pipeline.
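One way to try that extension is sketched below, assuming scikit-learn's `Pipeline` and `StandardScaler` with the same synthetic dataset and test harness used earlier. The step names `'scale'` and `'model'` are arbitrary labels, and no result is claimed for it here.

```python
# evaluate a nearest centroid model with standardized inputs
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# scaling inside the pipeline is fit on the training folds only
pipeline = Pipeline(steps=[('scale', StandardScaler()), ('model', NearestCentroid())])
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the whole pipeline
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Wrapping the scaler in the pipeline keeps statistics from the held-out fold out of the training step, which avoids data leakage during cross-validation.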


In this tutorial, you discovered the Nearest Shrunken Centroids classification machine learning algorithm.

Specifically, you learned:

- The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop LASSO Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Lasso Regression** is a popular type of regularized linear regression that includes an L1 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task. This penalty allows some coefficient values to go to the value of zero, allowing input variables to be effectively removed from the model, providing a type of automatic feature selection.

In this tutorial, you will discover how to develop and evaluate Lasso Regression models in Python.

After completing this tutorial, you will know:

- Lasso Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Lasso Regression model and use a final model to make predictions for new data.
- How to configure the Lasso Regression model for a new dataset via grid search and automatically.

Let’s get started.

This tutorial is divided into three parts; they are:

- Lasso Regression
- Example of Lasso Regression
- Tuning Lasso Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or with more input predictors (*p*) than samples (*n*) (so-called *p >> n problems*).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

A popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)

Because coefficients can be driven all the way to zero, the L1 penalty effectively removes input features from the model. This acts as a type of automatic feature selection.

… a consequence of penalizing the absolute values is that some parameters are actually set to 0 for some value of lambda. Thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection.

— Page 125, Applied Predictive Modeling, 2013.

This penalty can be added to the cost function for linear regression and is referred to as Least Absolute Shrinkage And Selection Operator regularization (LASSO), or more commonly, “*Lasso*” (with title case) for short.

A popular alternative to ridge regression is the least absolute shrinkage and selection operator model, frequently called the lasso.

— Page 124, Applied Predictive Modeling, 2013.

A hyperparameter called “*lambda*” controls the weighting of the penalty in the loss function. A default value of 1.0 gives full weighting to the penalty; a value of 0 excludes the penalty. Very small values of *lambda*, such as 1e-3 or smaller, are common.

- lasso_loss = loss + (lambda * l1_penalty)
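The combined loss can be sketched in plain Python as follows. The values are hypothetical, and the argument is named `lam` because `lambda` is a reserved word in Python.

```python
# lasso loss: sum of squared errors plus a weighted l1 penalty
def lasso_loss(y, yhat, coefficients, lam):
    loss = sum((y_i - yhat_i) ** 2 for y_i, yhat_i in zip(y, yhat))
    l1_penalty = sum(abs(beta) for beta in coefficients)
    return loss + lam * l1_penalty

# hypothetical targets, predictions and model coefficients
y = [3.0, 5.0, 7.0]
yhat = [2.5, 5.0, 8.0]
beta = [0.5, -1.5, 2.0]
# lam=0.0 leaves only the squared error; larger values weight the penalty more
print('lambda=0.0: %.3f' % lasso_loss(y, yhat, beta, 0.0))
print('lambda=1.0: %.3f' % lasso_loss(y, yhat, beta, 1.0))
```

Note that scikit-learn's Lasso scales the squared-error term by 1 / (2 * n_samples), so its penalty weight is not numerically identical to the *lambda* in this sketch.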

Now that we are familiar with Lasso penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Lasso Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

    (506, 14)
             0     1     2  3      4      5  ...  8      9    10      11    12    13
    0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
    1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
    2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
    3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
    4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2

    [5 rows x 14 columns]

The scikit-learn Python machine learning library provides an implementation of the Lasso penalized regression algorithm via the Lasso class.

Confusingly, the *lambda* term can be configured via the “*alpha*” argument when defining the class. The default value is 1.0 or a full penalty.

    ...
    # define model
    model = Lasso(alpha=1.0)

We can evaluate the Lasso Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

    # evaluate a lasso regression model on the dataset
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from pandas import read_csv
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import Lasso
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Lasso(alpha=1.0)
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Lasso Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.711.

Mean MAE: 3.711 (0.549)

We may decide to use the Lasso Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

    # make a prediction with a lasso regression model on the dataset
    from pandas import read_csv
    from sklearn.linear_model import Lasso
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Lasso(alpha=1.0)
    # fit model
    model.fit(X, y)
    # define new data
    row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
    # make a prediction (predict() returns an array with one value per row)
    yhat = model.predict([row])
    # summarize prediction
    print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 30.998

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameter of *alpha=1.0* is appropriate for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best for our dataset.

One approach would be to grid search *alpha* values from perhaps 1e-5 to 100 on a log-10 scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. We will try the latter in this case.
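For reference, the first approach, a log-10 grid from 1e-5 to 100, could be defined with the NumPy *logspace()* function. This is only a sketch of how such a grid might be built; the variable name *log_grid* is arbitrary.

```python
# sketch: define a log-10 grid of alpha values from 1e-5 to 100
from numpy import logspace
# exponents -5 through 2 inclusive, one value per power of ten
log_grid = logspace(-5, 2, num=8)
print(log_grid)
```

The resulting array could be assigned to *grid['alpha']* in place of the evenly spaced values.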

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

    # grid search hyperparameters for lasso regression
    from numpy import arange
    from pandas import read_csv
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import Lasso
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Lasso()
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['alpha'] = arange(0, 1, 0.01)
    # define search
    search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('MAE: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

You might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.
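If you would rather address the warning than ignore it, one option is to give the solver more iterations via the *max_iter* argument. The sketch below shows this on a small synthetic dataset (not the housing data); the values of *alpha* and *max_iter* are arbitrary choices for illustration, not recommendations.

```python
# sketch: give the lasso solver more iterations to converge
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
# synthetic dataset used only for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=1)
# raise the iteration limit above the library default
model = Lasso(alpha=0.01, max_iter=10000)
model.fit(X, y)
# report how many iterations the solver actually used
print('Iterations used: %d' % model.n_iter_)
```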

In this case, we can see that we achieved slightly better results than the default: 3.379 vs. 3.711. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the model assigned an *alpha* weight of 0.01 to the penalty.

    MAE: -3.379
    Config: {'alpha': 0.01}

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the LassoCV class.

To use the class, the model is fit on the training dataset as per normal and the hyperparameters are tuned automatically during the training process. The fit model can then be used to make a prediction.

By default, the model will test 100 *alpha* values. We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “*alphas*” argument.

The example below demonstrates this.

    # use the automatically configured lasso regression algorithm
    from numpy import arange
    from pandas import read_csv
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import RepeatedKFold
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    model = LassoCV(alphas=arange(0, 1, 0.01), cv=cv, n_jobs=-1)
    # fit model
    model.fit(X, y)
    # summarize chosen configuration
    print('alpha: %f' % model.alpha_)

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

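In this case, we can see that the model chose the hyperparameter of alpha=0.0. This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected. One likely contributor is that LassoCV selects *alpha* using mean squared error, whereas our grid search was scored with MAE.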

alpha: 0.000000

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop and evaluate Lasso Regression models in Python.

Specifically, you learned:

- Lasso Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Lasso Regression model and use a final model to make predictions for new data.
- How to configure the Lasso Regression model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop LASSO Regression Models in Python appeared first on Machine Learning Mastery.

The post How to Develop Ridge Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Ridge Regression** is a popular type of regularized linear regression that includes an L2 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

In this tutorial, you will discover how to develop and evaluate Ridge Regression models in Python.

After completing this tutorial, you will know:

- Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
- How to configure the Ridge Regression model for a new dataset via grid search and automatically.

Let’s get started.

**Update Oct/2020**: Updated code in the grid search procedure to match description.

This tutorial is divided into three parts; they are:

- Ridge Regression
- Example of Ridge Regression
- Tuning Ridge Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (yhat) and the expected target values (y).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or fewer samples (*n*) than input predictors (*p*) or variables (so-called *p >> n problems*).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values (*beta*). This is called an L2 penalty.

- l2_penalty = sum j=0 to p beta_j^2

An L2 penalty minimizes the size of all coefficients, although it does not allow any coefficient to become exactly zero, so no input variables are removed from the model.

The effect of this penalty is that the parameter estimates are only allowed to become large if there is a proportional reduction in SSE. In effect, this method shrinks the estimates towards 0 as the lambda penalty becomes large (these techniques are sometimes called “shrinkage methods”).

— Page 123, Applied Predictive Modeling, 2013.
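The shrinkage effect described in the quote can be seen by fitting the same model with increasingly large penalty weights and summarizing the size of the coefficients each time. The sketch below assumes a synthetic dataset from *make_regression()* rather than the housing data, and the list of *alpha* values is an arbitrary choice for illustration.

```python
# sketch: ridge coefficients shrink as the penalty weight grows
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
# synthetic dataset with a fixed seed for repeatability
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=1)
sizes = list()
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    # fit a ridge model with this penalty weight
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # sum of the squared coefficient values (the L2 penalty)
    size = (model.coef_ ** 2).sum()
    sizes.append(size)
    print('alpha=%.2f, l2_penalty=%.3f' % (alpha, size))
```

The reported penalty should fall as *alpha* rises, while none of the coefficients is expected to reach exactly zero.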

This penalty can be added to the cost function for linear regression and is referred to as Tikhonov regularization (after the author), or Ridge Regression more generally.

A hyperparameter is used called “*lambda*” that controls the weighting of the penalty to the loss function. A default value of 1.0 will fully weight the penalty; a value of 0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.

- ridge_loss = loss + (lambda * l2_penalty)

Now that we are familiar with Ridge penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Ridge Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

    # load and summarize the housing dataset
    from pandas import read_csv
    # load dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    # summarize shape
    print(dataframe.shape)
    # summarize first few lines
    print(dataframe.head())

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

    (506, 14)
             0     1     2  3      4      5  ...  8      9    10      11    12    13
    0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
    1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
    2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
    3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
    4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2

    [5 rows x 14 columns]

The scikit-learn Python machine learning library provides an implementation of the Ridge Regression algorithm via the Ridge class.

Confusingly, the lambda term can be configured via the “*alpha*” argument when defining the class. The default value is 1.0 or a full penalty.

    ...
    # define model
    model = Ridge(alpha=1.0)

We can evaluate the Ridge Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

    # evaluate a ridge regression model on the dataset
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from pandas import read_csv
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import Ridge
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Ridge(alpha=1.0)
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Ridge Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a MAE of about 3.382.

Mean MAE: 3.382 (0.519)

We may decide to use the Ridge Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

    # make a prediction with a ridge regression model on the dataset
    from pandas import read_csv
    from sklearn.linear_model import Ridge
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Ridge(alpha=1.0)
    # fit model
    model.fit(X, y)
    # define new data
    row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
    # make a prediction (predict() returns an array with one value per row)
    yhat = model.predict([row])
    # summarize prediction
    print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 30.253

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameter of *alpha=1.0* is appropriate for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best for our dataset.

One approach would be to grid search *alpha* values from perhaps 1e-5 to 100 on a log scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. We will try the latter in this case.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

    # grid search hyperparameters for ridge regression
    from numpy import arange
    from pandas import read_csv
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import Ridge
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = Ridge()
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['alpha'] = arange(0, 1, 0.01)
    # define search
    search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('MAE: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

In this case, we can see that we achieved slightly better results than the default: 3.379 vs. 3.382. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the model assigned an *alpha* weight of 0.51 to the penalty.

    MAE: -3.379
    Config: {'alpha': 0.51}

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the RidgeCV class.

To use this class, it is fit on the training dataset and used to make a prediction. During the training process, it automatically tunes the hyperparameter values.

By default, the model will only test the *alpha* values (0.1, 1.0, 10.0). We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “*alphas*” argument.

The example below demonstrates this.

    # use the automatically configured ridge regression algorithm
    from numpy import arange
    from pandas import read_csv
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import RepeatedKFold
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')
    # fit model
    model.fit(X, y)
    # summarize chosen configuration
    print('alpha: %f' % model.alpha_)

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

In this case, we can see that the model chose the identical hyperparameter of *alpha=0.51* that we found via our manual grid search.

alpha: 0.510000

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop and evaluate Ridge Regression models in Python.

Specifically, you learned:

- Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
- How to configure the Ridge Regression model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post How to Develop Elastic Net Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Elastic net** is a popular type of regularized linear regression that combines two popular penalties, specifically the L1 and L2 penalty functions.

In this tutorial, you will discover how to develop Elastic Net regularized regression in Python.

After completing this tutorial, you will know:

- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.

Let’s get started.

This tutorial is divided into three parts; they are:

- Elastic Net Regression
- Example of Elastic Net Regression
- Tuning Elastic Net Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or fewer samples (*n*) than input predictors (*p*) or variables (so-called *p >> n problems*).

One approach to addressing the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values. This is called an L2 penalty. An L2 penalty minimizes the size of all coefficients, although it prevents any coefficients from being removed from the model.

- l2_penalty = sum j=0 to p beta_j^2

Another popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)
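The practical difference between the two penalties can be seen by counting how many coefficients each model sets to exactly zero. The sketch below is an illustration under stated assumptions: it uses a synthetic dataset from *make_regression()* in which only 5 of 20 inputs are informative, and a penalty weight of 1.0 for both models.

```python
# sketch: count coefficients set to zero by the L1 and L2 penalties
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
# synthetic dataset where most input variables are uninformative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=5.0, random_state=1)
# fit one model with an L1 penalty and one with an L2 penalty
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
# count the coefficients that are exactly zero in each model
n_zero_lasso = (lasso.coef_ == 0.0).sum()
n_zero_ridge = (ridge.coef_ == 0.0).sum()
print('Lasso zero coefficients: %d' % n_zero_lasso)
print('Ridge zero coefficients: %d' % n_zero_ridge)
```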

Elastic net is a penalized linear regression model that includes both the L1 and L2 penalties during training.

Using the terminology from “The Elements of Statistical Learning,” a hyperparameter “*alpha*” is provided to assign how much weight is given to each of the L1 and L2 penalties. Alpha is a value between 0 and 1 and is used to weight the contribution of the L1 penalty and one minus the alpha value is used to weight the L2 penalty.

- elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)

For example, an alpha of 0.5 would provide a 50 percent contribution of each penalty to the loss function. An alpha value of 0 gives all weight to the L2 penalty and a value of 1 gives all weight to the L1 penalty.
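The weighting can be checked with a few lines of NumPy. This sketch follows the formula as written in this tutorial, not the exact scaling used inside scikit-learn; the function name *elastic_net_penalty()* and the coefficient values are made up for the example.

```python
# sketch of the elastic net penalty from the formula above (illustrative only)
from numpy import array
from numpy import absolute

def elastic_net_penalty(beta, alpha):
    # L1 penalty: sum of the absolute coefficient values
    l1_penalty = absolute(beta).sum()
    # L2 penalty: sum of the squared coefficient values
    l2_penalty = (beta ** 2).sum()
    # alpha weights the L1 penalty, one minus alpha weights the L2 penalty
    return (alpha * l1_penalty) + ((1.0 - alpha) * l2_penalty)

# hypothetical coefficients: l1_penalty is 6 and l2_penalty is 14
beta = array([1.0, -2.0, 3.0])
for alpha in [0.0, 0.5, 1.0]:
    print('alpha=%.1f, penalty=%.1f' % (alpha, elastic_net_penalty(beta, alpha)))
```

An *alpha* of 0 returns the L2 penalty alone, an *alpha* of 1 returns the L1 penalty alone, and 0.5 returns the average of the two.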

The parameter alpha determines the mix of the penalties, and is often pre-chosen on qualitative grounds.

— Page 663, The Elements of Statistical Learning, 2016.

The benefit is that elastic net allows a balance of both penalties, which can result in better performance than a model with either one or the other penalty on some problems.

Another hyperparameter is provided called “*lambda*” that controls the weighting of the sum of both penalties to the loss function. A default value of 1.0 applies the fully weighted penalty; a value of 0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.

- elastic_net_loss = loss + (lambda * elastic_net_penalty)

Now that we are familiar with elastic net penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Elastic Net regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total).

We can also see that all input variables are numeric.

The scikit-learn Python machine learning library provides an implementation of the Elastic Net penalized regression algorithm via the ElasticNet class.

Confusingly, the *alpha* hyperparameter can be set via the “*l1_ratio*” argument that controls the contribution of the L1 and L2 penalties and the *lambda* hyperparameter can be set via the “*alpha*” argument that controls the contribution of the sum of both penalties to the loss function.

By default, an equal balance of 0.5 is used for “*l1_ratio*” and a full weighting of 1.0 is used for alpha.

    ...
    # define model
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
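One way to confirm the role of the “*l1_ratio*” argument is to set it to 1.0, which gives all of the weight to the L1 penalty and should therefore match a Lasso model with the same “*alpha*”. The sketch below checks this on a synthetic dataset (an assumption made for the example, not the housing data).

```python
# sketch: an elastic net with l1_ratio=1.0 behaves like the lasso
from numpy import allclose
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
# synthetic dataset used only for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=1)
# elastic net with all weight on the L1 penalty
enet = ElasticNet(alpha=1.0, l1_ratio=1.0)
enet.fit(X, y)
# lasso with the same penalty weighting
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
# compare the fitted coefficients of the two models
same = allclose(enet.coef_, lasso.coef_)
print('Same coefficients: %s' % same)
```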

We can evaluate the Elastic Net model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

    # evaluate an elastic net model on the dataset
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from pandas import read_csv
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import ElasticNet
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # force scores to be positive
    scores = absolute(scores)
    print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Elastic Net algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a MAE of about 3.682.

Mean MAE: 3.682 (0.530)

We may decide to use the Elastic Net as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

    # make a prediction with an elastic net model on the dataset
    from pandas import read_csv
    from sklearn.linear_model import ElasticNet
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = ElasticNet(alpha=1.0, l1_ratio=0.5)
    # fit model
    model.fit(X, y)
    # define new data
    row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
    # make a prediction (predict() returns an array with one value per row)
    yhat = model.predict([row])
    # summarize prediction
    print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 31.047

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameters of alpha=1.0 and l1_ratio=0.5 are any good for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best.

One approach would be to grid search *l1_ratio* values between 0 and 1 with a 0.1 or 0.01 separation and *alpha* values from perhaps 1e-5 to 100 on a log-10 scale and discover what works best for a dataset.

    # grid search hyperparameters for the elastic net
    from numpy import arange
    from pandas import read_csv
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import ElasticNet
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model
    model = ElasticNet()
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
    grid['l1_ratio'] = arange(0, 1, 0.01)
    # define search
    search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('MAE: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)
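To get a sense of how much work this grid search involves, the number of configurations can be counted with the ParameterGrid class before running the search. This is a small sketch using the same grid as above; the variable names *n_configs* and *n_fits* are arbitrary, and the count of model fits ignores the final refit on all data.

```python
# sketch: count the configurations in the elastic net grid
from numpy import arange
from sklearn.model_selection import ParameterGrid
# define the same grid used in the search
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# every combination of alpha and l1_ratio is one configuration
n_configs = len(ParameterGrid(grid))
# each configuration is fit once per fold: 10 folds times 3 repeats
n_fits = n_configs * 10 * 3
print('Configurations: %d, model fits: %d' % (n_configs, n_fits))
```

If the search takes too long, a coarser *l1_ratio* separation such as 0.1 reduces the number of configurations by a factor of ten.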

You might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.

In this case, we can see that we achieved slightly better results than the default: 3.378 vs. 3.682. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the model assigned an alpha weight of 0.01 to the penalty and, with an *l1_ratio* of 0.97, gives almost all of the weight to the L1 penalty.

    MAE: -3.378
    Config: {'alpha': 0.01, 'l1_ratio': 0.97}

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the ElasticNetCV class.

To use this class, it is first fit on the dataset, then used to make a prediction. It will automatically find appropriate hyperparameters.

By default, the model will test 100 alpha values and use a default *l1_ratio* of 0.5. We can specify our own lists of values to test via the “*l1_ratio*” and “*alphas*” arguments, as we did with the manual grid search.

The example below demonstrates this.

    # use automatically configured elastic net algorithm
    from numpy import arange
    from pandas import read_csv
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import RepeatedKFold
    # load the dataset
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
    dataframe = read_csv(url, header=None)
    data = dataframe.values
    X, y = data[:, :-1], data[:, -1]
    # define model evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define model
    ratios = arange(0, 1, 0.01)
    alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
    model = ElasticNetCV(l1_ratio=ratios, alphas=alphas, cv=cv, n_jobs=-1)
    # fit model
    model.fit(X, y)
    # summarize chosen configuration
    print('alpha: %f' % model.alpha_)
    print('l1_ratio_: %f' % model.l1_ratio_)

Again, you might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.

In this case, we can see that an alpha of 0.0 was chosen, removing both penalties from the loss function.

This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected.

    alpha: 0.000000
    l1_ratio_: 0.470000

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop Elastic Net regularized regression in Python.

Specifically, you learned:

- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Elastic Net Regression Models in Python appeared first on Machine Learning Mastery.

The post Robust Regression for Machine Learning in Python appeared first on Machine Learning Mastery.

Algorithms used for regression tasks are also referred to as “*regression*” algorithms, with the most widely known and perhaps most successful being linear regression.

Linear regression fits a line or hyperplane that best describes the linear relationship between inputs and the target numeric value. If the data contains outlier values, the line can become biased, resulting in worse predictive performance. **Robust regression** refers to a suite of algorithms that are robust in the presence of outliers in training data.

In this tutorial, you will discover robust regression algorithms for machine learning.

After completing this tutorial, you will know:

- Robust regression algorithms can be used for data with outliers in the input or target values.
- How to evaluate robust regression algorithms for a regression predictive modeling task.
- How to compare robust regression algorithms using their line of best fit on the dataset.

Let’s get started.

This tutorial is divided into four parts; they are:

- Regression With Outliers
- Regression Dataset With Outliers
- Robust Regression Algorithms
- Compare Robust Regression Algorithms

Regression predictive modeling involves predicting a numeric variable given some input, often numerical input.

Machine learning algorithms used for regression predictive modeling tasks are also referred to as “*regression*” or “*regression algorithms*.” The most common method is linear regression.

Many regression algorithms are linear in that they assume that the relationship between the input variable or variables and the target variable is linear, such as a line in two-dimensions, a plane in three dimensions, and a hyperplane in higher dimensions. This is a reasonable assumption for many prediction tasks.

Linear regression assumes that the probability distribution of each variable is well behaved, such as a Gaussian distribution. The less well behaved the probability distribution for a feature is in a dataset, the less likely it is that linear regression will find a good fit.

A specific problem with the probability distribution of variables when using linear regression is outliers. These are observations that are far outside the expected distribution. For example, if a variable has a Gaussian distribution, then an observation that is 3 or 4 (or more) standard deviations from the mean is considered an outlier.
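As a rough illustration of that rule of thumb, the sketch below flags values that lie more than three standard deviations from the mean. The sample values, the threshold of 3, and the helper name `flag_outliers` are assumptions made for this example only.

```python
# flag values more than k standard deviations from the mean (illustrative sketch)
from numpy import absolute
from numpy import array

# hypothetical helper: returns a boolean mask of suspected outliers
def flag_outliers(values, k=3.0):
    # standardize each value as a z-score
    z = (values - values.mean()) / values.std()
    return absolute(z) > k

# twenty well-behaved made-up values plus one extreme observation
data = array([10.0, 11.0, 9.0, 10.5, 9.5] * 4 + [60.0])
mask = flag_outliers(data)
# report the values that were flagged
print(data[mask])
```

Note that the extreme value itself inflates the mean and standard deviation used in the calculation, so a simple rule like this can miss outliers in very small samples.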

A dataset may have outliers on either the input variables or the target variable, and both can cause problems for a linear regression algorithm.

Outliers in a dataset can skew summary statistics calculated for the variable, such as the mean and standard deviation, which in turn can skew the model towards the outlier values, away from the central mass of observations. This results in models that try to balance performing well on both outliers and normal data, and end up performing worse on both overall.
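The effect on summary statistics can be seen with a few made-up numbers. The sketch below compares the mean, standard deviation, and median of a small sample before and after a single extreme value is added; the values are invented for illustration.

```python
# show how a single outlier shifts the mean and standard deviation (illustrative sketch)
from numpy import array
from numpy import mean
from numpy import median
from numpy import std

# small made-up sample without outliers
clean = array([9.0, 10.0, 10.0, 11.0, 10.0])
# the same sample with one extreme value appended
dirty = array([9.0, 10.0, 10.0, 11.0, 10.0, 100.0])

# the mean and standard deviation move a lot, the median does not
print('clean: mean=%.2f std=%.2f median=%.2f' % (mean(clean), std(clean), median(clean)))
print('dirty: mean=%.2f std=%.2f median=%.2f' % (mean(dirty), std(dirty), median(dirty)))
```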

The solution instead is to use modified versions of linear regression that specifically address the expectation of outliers in the dataset. These methods are referred to as robust regression algorithms.

We can define a synthetic regression dataset using the make_regression() function.

In this case, we want a dataset that we can plot and understand easily. This can be achieved by using a single input variable and a single output variable. We don’t want the task to be too easy, so we will add a large amount of statistical noise.

    ...
    X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)

Once we have the dataset, we can augment it by adding outliers. Specifically, we will add outliers to the input variables.

This can be done by shifting some of the input values up or down by a multiple of the standard deviation of the input variable, such as 2 to 4 standard deviations. We will add 10 outliers to the dataset.

    # add some artificial outliers
    seed(1)
    for i in range(10):
        factor = randint(2, 4)
        if random() > 0.5:
            X[i] += factor * X.std()
        else:
            X[i] -= factor * X.std()

We can tie this together into a function that will prepare the dataset. This function can then be called and we can plot the dataset with the input values on the x-axis and the target or outcome on the y-axis.

The complete example of preparing and plotting the dataset is listed below.

    # create a regression dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from sklearn.datasets import make_regression
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # load dataset
    X, y = get_dataset()
    # summarize shape
    print(X.shape, y.shape)
    # scatter plot of input vs output
    pyplot.scatter(X, y)
    pyplot.show()

Running the example creates the synthetic regression dataset and adds outlier values.

The dataset is then plotted, and we can clearly see the linear relationship in the data, with statistical noise, and a modest number of outliers as points far from the main mass of data.

Now that we have a dataset, let’s fit different regression models on it.

In this section, we will consider different robust regression algorithms for the dataset.

Before diving into robust regression algorithms, let’s start with linear regression.

We can evaluate linear regression using repeated k-fold cross-validation on the regression dataset with outliers. We will measure mean absolute error and this will provide a lower bound on model performance on this task that we might expect some robust regression algorithms to out-perform.

    # evaluate a model
    def evaluate_model(X, y, model):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        return absolute(scores)

We can also plot the model’s line of best fit on the dataset. To do this, we first fit the model on the entire training dataset, then create an input dataset that is a grid across the entire input domain, make a prediction for each, then draw a line for the inputs and predicted outputs.

This plot shows how the model “*sees*” the problem, specifically the relationship between the input and output variables. The idea is that the line will be skewed by the outliers when using linear regression.

    # plot the dataset and the model's line of best fit
    def plot_best_fit(X, y, model):
        # fit the model on all data
        model.fit(X, y)
        # plot the dataset
        pyplot.scatter(X, y)
        # plot the line of best fit
        xaxis = arange(X.min(), X.max(), 0.01)
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        pyplot.plot(xaxis, yaxis, color='r')
        # show the plot
        pyplot.title(type(model).__name__)
        pyplot.show()

Tying this together, the complete example for linear regression is listed below.

    # linear regression on a dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from numpy import arange
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # evaluate a model
    def evaluate_model(X, y, model):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        return absolute(scores)

    # plot the dataset and the model's line of best fit
    def plot_best_fit(X, y, model):
        # fit the model on all data
        model.fit(X, y)
        # plot the dataset
        pyplot.scatter(X, y)
        # plot the line of best fit
        xaxis = arange(X.min(), X.max(), 0.01)
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        pyplot.plot(xaxis, yaxis, color='r')
        # show the plot
        pyplot.title(type(model).__name__)
        pyplot.show()

    # load dataset
    X, y = get_dataset()
    # define the model
    model = LinearRegression()
    # evaluate model
    results = evaluate_model(X, y, model)
    print('Mean MAE: %.3f (%.3f)' % (mean(results), std(results)))
    # plot the line of best fit
    plot_best_fit(X, y, model)

Running the example first reports the mean MAE for the model on the dataset.

We can see that linear regression achieves a MAE of about 5.26 on this dataset, providing an upper-bound in error.

Mean MAE: 5.260 (1.149)

Next, the dataset is plotted as a scatter plot showing the outliers and this is overlaid with the line of best fit from the linear regression algorithm.

In this case, we can see that the line of best fit is not aligning with the data and it has been skewed by the outliers. In turn, we expect this has caused the model to have a worse-than-expected performance on the dataset.

Huber regression is a type of robust regression that is aware of the possibility of outliers in a dataset and assigns them less weight than other examples in the dataset.

We can use Huber regression via the HuberRegressor class in scikit-learn. The “*epsilon*” argument controls what is considered an outlier, where smaller values treat more of the data as outliers, and in turn, make the model more robust to outliers. The default is 1.35.
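To get a feel for how a threshold can reduce the influence of large errors, the sketch below implements the common textbook form of the Huber loss: quadratic for small residuals and linear for large ones. The helper name `huber_loss` and the `delta` parameter are assumptions for this illustration; `delta` plays a role loosely analogous to “*epsilon*”, and the exact formulation and scaling used inside HuberRegressor may differ.

```python
# textbook Huber loss: quadratic for small residuals, linear for large ones (illustrative sketch)
from numpy import absolute
from numpy import array
from numpy import where

# hypothetical helper; delta is the threshold between the two regimes
def huber_loss(residuals, delta=1.35):
    r = absolute(residuals)
    # loss used when the residual is within the threshold
    quadratic = 0.5 * r ** 2
    # loss used when the residual exceeds the threshold
    linear = delta * (r - 0.5 * delta)
    return where(r <= delta, quadratic, linear)

# made-up residuals, two small and two large
residuals = array([0.5, 1.0, 5.0, 10.0])
print(huber_loss(residuals))
# squared error for comparison grows much faster on the large residuals
print(0.5 * residuals ** 2)
```

Because the penalty grows only linearly beyond the threshold, examples with large residuals pull on the fitted line far less than they would under squared error.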

The example below evaluates Huber regression on the regression dataset with outliers, first evaluating the model with repeated cross-validation and then plotting the line of best fit.

    # huber regression on a dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from numpy import arange
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from sklearn.datasets import make_regression
    from sklearn.linear_model import HuberRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # evaluate a model
    def evaluate_model(X, y, model):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        return absolute(scores)

    # plot the dataset and the model's line of best fit
    def plot_best_fit(X, y, model):
        # fit the model on all data
        model.fit(X, y)
        # plot the dataset
        pyplot.scatter(X, y)
        # plot the line of best fit
        xaxis = arange(X.min(), X.max(), 0.01)
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        pyplot.plot(xaxis, yaxis, color='r')
        # show the plot
        pyplot.title(type(model).__name__)
        pyplot.show()

    # load dataset
    X, y = get_dataset()
    # define the model
    model = HuberRegressor()
    # evaluate model
    results = evaluate_model(X, y, model)
    print('Mean MAE: %.3f (%.3f)' % (mean(results), std(results)))
    # plot the line of best fit
    plot_best_fit(X, y, model)

Running the example first reports the mean MAE for the model on the dataset.

We can see that Huber regression achieves a MAE of about 4.435 on this dataset, outperforming the linear regression model in the previous section.

Mean MAE: 4.435 (1.868)

Next, the dataset is plotted as a scatter plot showing the outliers and this is overlaid with the line of best fit from the algorithm.

In this case, we can see that the line of best fit is better aligned with the main body of the data, and does not appear to be obviously influenced by the outliers that are present.

Random Sample Consensus, or RANSAC for short, is another robust regression algorithm.

RANSAC tries to separate data into outliers and inliers and fits the model on the inliers.

The scikit-learn library provides an implementation via the RANSACRegressor class.
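The inlier and outlier split can be inspected after fitting. The sketch below is a minimal illustration on made-up data (a noisy line whose first five targets are pushed far off the line); it reads the `inlier_mask_` and `estimator_` attributes of a fitted RANSACRegressor. The data, the noise level, and the random seeds are assumptions chosen for this example.

```python
# inspect which samples RANSAC treats as inliers (illustrative sketch)
from numpy import linspace
from numpy.random import default_rng
from sklearn.linear_model import RANSACRegressor

# made-up data: a noisy line with the first five targets pushed far off the line
rng = default_rng(1)
X = linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=50)
y[:5] += 50.0

# fit the model; random_state fixes the random subsets that are sampled
model = RANSACRegressor(random_state=1)
model.fit(X, y)

# boolean mask of samples classified as inliers
inliers = model.inlier_mask_
print('Inliers: %d, Outliers: %d' % (inliers.sum(), (~inliers).sum()))
# slope of the final model, which is fit on the inliers only
print('Slope: %.3f' % model.estimator_.coef_[0])
```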

The example below evaluates RANSAC regression on the regression dataset with outliers, first evaluating the model with repeated cross-validation and then plotting the line of best fit.

    # ransac regression on a dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from numpy import arange
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from sklearn.datasets import make_regression
    from sklearn.linear_model import RANSACRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # evaluate a model
    def evaluate_model(X, y, model):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        return absolute(scores)

    # plot the dataset and the model's line of best fit
    def plot_best_fit(X, y, model):
        # fit the model on all data
        model.fit(X, y)
        # plot the dataset
        pyplot.scatter(X, y)
        # plot the line of best fit
        xaxis = arange(X.min(), X.max(), 0.01)
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        pyplot.plot(xaxis, yaxis, color='r')
        # show the plot
        pyplot.title(type(model).__name__)
        pyplot.show()

    # load dataset
    X, y = get_dataset()
    # define the model
    model = RANSACRegressor()
    # evaluate model
    results = evaluate_model(X, y, model)
    print('Mean MAE: %.3f (%.3f)' % (mean(results), std(results)))
    # plot the line of best fit
    plot_best_fit(X, y, model)

Running the example first reports the mean MAE for the model on the dataset.

We can see that RANSAC regression achieves a MAE of about 4.454 on this dataset, outperforming the linear regression model but perhaps not Huber regression.

Mean MAE: 4.454 (2.165)

Next, the dataset is plotted as a scatter plot showing the outliers, and this is overlaid with the line of best fit from the algorithm.

In this case, we can see that the line of best fit is aligned with the main body of the data, perhaps even better than the plot for Huber regression.

Theil Sen regression involves fitting multiple regression models on subsets of the training data and combining the coefficients together in the end.
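For a single input variable, the classic form of this idea fits a line through every pair of points and takes the median of the resulting slopes. The sketch below shows that simplified version only; the helper name `theil_sen_1d` and the data are made up for illustration, and the scikit-learn implementation works differently in its details (for example, it supports multiple input variables).

```python
# classic single-feature Theil-Sen estimate: median of pairwise slopes (illustrative sketch)
from itertools import combinations
from numpy import arange
from numpy import median

# hypothetical helper for one input variable only
def theil_sen_1d(x, y):
    # slope of the line through every pair of points with distinct x values
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2) if x[j] != x[i]]
    # combine the slopes with the median, which ignores extreme values
    slope = median(slopes)
    # intercept as the median of the per-point offsets
    intercept = median(y - slope * x)
    return slope, intercept

# made-up data on the line y = 3x + 2 with one corrupted target
x = arange(10, dtype=float)
y = 3.0 * x + 2.0
y[0] = 100.0
slope, intercept = theil_sen_1d(x, y)
print('slope=%.3f intercept=%.3f' % (slope, intercept))
```

Only the pairs that involve the corrupted point produce unusual slopes, and the median discards them, so the estimate recovers the underlying line.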

The scikit-learn library provides an implementation via the TheilSenRegressor class.

The example below evaluates Theil Sen regression on the regression dataset with outliers, first evaluating the model with repeated cross-validation and then plotting the line of best fit.

    # theilsen regression on a dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from numpy import arange
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from sklearn.datasets import make_regression
    from sklearn.linear_model import TheilSenRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # evaluate a model
    def evaluate_model(X, y, model):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        return absolute(scores)

    # plot the dataset and the model's line of best fit
    def plot_best_fit(X, y, model):
        # fit the model on all data
        model.fit(X, y)
        # plot the dataset
        pyplot.scatter(X, y)
        # plot the line of best fit
        xaxis = arange(X.min(), X.max(), 0.01)
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        pyplot.plot(xaxis, yaxis, color='r')
        # show the plot
        pyplot.title(type(model).__name__)
        pyplot.show()

    # load dataset
    X, y = get_dataset()
    # define the model
    model = TheilSenRegressor()
    # evaluate model
    results = evaluate_model(X, y, model)
    print('Mean MAE: %.3f (%.3f)' % (mean(results), std(results)))
    # plot the line of best fit
    plot_best_fit(X, y, model)

Running the example first reports the mean MAE for the model on the dataset.

We can see that Theil Sen regression achieves a MAE of about 4.371 on this dataset, outperforming the linear regression model as well as RANSAC and Huber regression.

Mean MAE: 4.371 (1.961)

Next, the dataset is plotted as a scatter plot showing the outliers, and this is overlaid with the line of best fit from the algorithm.

In this case, we can see that the line of best fit is aligned with the main body of the data.

Now that we are familiar with some popular robust regression algorithms and how to use them, we can look at how we might compare them directly.

It can be useful to run an experiment to directly compare the robust regression algorithms on the same dataset. We can compare the mean performance of each method, and more usefully, use tools like a box and whisker plot to compare the distribution of scores across the repeated cross-validation folds.

The complete example is listed below.

    # compare robust regression algorithms on a regression dataset with outliers
    from random import random
    from random import randint
    from random import seed
    from numpy import mean
    from numpy import std
    from numpy import absolute
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedKFold
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import HuberRegressor
    from sklearn.linear_model import RANSACRegressor
    from sklearn.linear_model import TheilSenRegressor
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # dictionary of model names and model objects
    def get_models():
        models = dict()
        models['Linear'] = LinearRegression()
        models['Huber'] = HuberRegressor()
        models['RANSAC'] = RANSACRegressor()
        models['TheilSen'] = TheilSenRegressor()
        return models

    # evaluate a model
    def evaluate_model(X, y, model, name):
        # define model evaluation method
        cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        # force scores to be positive
        scores = absolute(scores)
        return scores

    # load the dataset
    X, y = get_dataset()
    # retrieve models
    models = get_models()
    results = dict()
    for name, model in models.items():
        # evaluate the model
        results[name] = evaluate_model(X, y, model, name)
        # summarize progress
        print('>%s %.3f (%.3f)' % (name, mean(results[name]), std(results[name])))
    # plot model performance for comparison
    pyplot.boxplot(results.values(), labels=results.keys(), showmeans=True)
    pyplot.show()

Running the example evaluates each model in turn, reporting the mean and standard deviation of the MAE scores for each.

Note: your specific results will differ given the stochastic nature of the learning algorithms and evaluation procedure. Try running the example a few times.

We can see some minor differences between these scores and those reported in the previous section, although the differences may or may not be statistically significant. The general pattern of the robust regression methods performing better than linear regression holds, with TheilSen achieving better performance than the other methods.

    >Linear 5.260 (1.149)
    >Huber 4.435 (1.868)
    >RANSAC 4.405 (2.206)
    >TheilSen 4.371 (1.961)

A plot is created showing a box and whisker plot summarizing the distribution of results for each evaluated algorithm.

We can clearly see the distributions for the robust regression algorithms sitting and extending lower than the linear regression algorithm.

It may also be interesting to compare robust regression algorithms based on a plot of their line of best fit.

The example below fits each robust regression algorithm and plots their line of best fit on the same plot in the context of a scatter plot of the entire training dataset.

    # plot line of best fit for multiple robust regression algorithms
    from random import random
    from random import randint
    from random import seed
    from numpy import arange
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import HuberRegressor
    from sklearn.linear_model import RANSACRegressor
    from sklearn.linear_model import TheilSenRegressor
    from matplotlib import pyplot

    # prepare the dataset
    def get_dataset():
        X, y = make_regression(n_samples=100, n_features=1, tail_strength=0.9, effective_rank=1, n_informative=1, noise=3, bias=50, random_state=1)
        # add some artificial outliers
        seed(1)
        for i in range(10):
            factor = randint(2, 4)
            if random() > 0.5:
                X[i] += factor * X.std()
            else:
                X[i] -= factor * X.std()
        return X, y

    # list of model objects
    def get_models():
        models = list()
        models.append(LinearRegression())
        models.append(HuberRegressor())
        models.append(RANSACRegressor())
        models.append(TheilSenRegressor())
        return models

    # plot the model's line of best fit
    def plot_best_fit(X, y, xaxis, model):
        # fit the model on all data
        model.fit(X, y)
        # calculate outputs for grid across the domain
        yaxis = model.predict(xaxis.reshape((len(xaxis), 1)))
        # plot the line of best fit
        pyplot.plot(xaxis, yaxis, label=type(model).__name__)

    # load the dataset
    X, y = get_dataset()
    # define a uniform grid across the input domain
    xaxis = arange(X.min(), X.max(), 0.01)
    for model in get_models():
        # plot the line of best fit
        plot_best_fit(X, y, xaxis, model)
    # plot the dataset
    pyplot.scatter(X, y)
    # show the plot
    pyplot.title('Robust Regression')
    pyplot.legend()
    pyplot.show()

Running the example creates a plot showing the dataset as a scatter plot and the line of best fit for each algorithm.

We can clearly see the off-axis line for the linear regression algorithm and the much better lines for the robust regression algorithms that follow the main body of the data.

This section provides more resources on the topic if you are looking to go deeper.

- Linear Models, scikit-learn.
- sklearn.datasets.make_regression API.
- sklearn.linear_model.LinearRegression API.
- sklearn.linear_model.HuberRegressor API.
- sklearn.linear_model.RANSACRegressor API.
- sklearn.linear_model.TheilSenRegressor API.

- Robust regression, Wikipedia.
- M-estimator, Wikipedia.
- Random sample consensus, Wikipedia.
- Theil–Sen estimator, Wikipedia.

In this tutorial, you discovered robust regression algorithms for machine learning.

Specifically, you learned:

- Robust regression algorithms can be used for data with outliers in the input or target values.
- How to evaluate robust regression algorithms for a regression predictive modeling task.
- How to compare robust regression algorithms using their line of best fit on the dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Robust Regression for Machine Learning in Python appeared first on Machine Learning Mastery.

The post Gaussian Processes for Classification With Python appeared first on Machine Learning Mastery.

Gaussian Processes are a generalization of the Gaussian probability distribution and can be used as the basis for sophisticated non-parametric machine learning algorithms for classification and regression.

They are a type of kernel model, like SVMs, and unlike SVMs, they are capable of predicting highly calibrated class membership probabilities, although the choice and configuration of the kernel used at the heart of the method can be challenging.

In this tutorial, you will discover the Gaussian Processes Classifier machine learning algorithm for classification.

After completing this tutorial, you will know:

- The Gaussian Processes Classifier is a non-parametric algorithm that can be applied to binary classification tasks.
- How to fit, evaluate, and make predictions with the Gaussian Processes Classifier model with Scikit-Learn.
- How to tune the hyperparameters of the Gaussian Processes Classifier algorithm on a given dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Gaussian Processes for Classification
- Gaussian Processes With Scikit-Learn
- Tune Gaussian Processes Hyperparameters

Gaussian Processes, or GP for short, are a generalization of the Gaussian probability distribution (e.g. the bell-shaped function).

Gaussian probability distribution functions summarize the distribution of random variables, whereas Gaussian processes summarize the properties of the functions, e.g. the parameters of the functions. As such, you can think of Gaussian processes as one level of abstraction or indirection above Gaussian functions.

A Gaussian process is a generalization of the Gaussian probability distribution. Whereas a probability distribution describes random variables which are scalars or vectors (for multivariate distributions), a stochastic process governs the properties of functions.

— Page 2, Gaussian Processes for Machine Learning, 2006.

Gaussian processes can be used as a machine learning algorithm for classification predictive modeling.

Gaussian processes are a type of kernel method, like SVMs, although they are able to predict highly calibrated probabilities, unlike SVMs.

Gaussian processes require specifying a kernel that controls how examples relate to each other; specifically, it defines the covariance function of the data. The model places this covariance over an unobserved function of the inputs, called the latent function or the “*nuisance*” function.

The latent function f plays the role of a nuisance function: we do not observe values of f itself (we observe only the inputs X and the class labels y) and we are not particularly interested in the values of f …

— Page 40, Gaussian Processes for Machine Learning, 2006.

The way that examples are grouped using the kernel controls how the model “*perceives*” the examples, given that it assumes that examples that are “*close*” to each other have the same class label.

Therefore, it is important to both test different kernel functions for the model and different configurations for sophisticated kernel functions.

… a covariance function is the crucial ingredient in a Gaussian process predictor, as it encodes our assumptions about the function which we wish to learn.

— Page 79, Gaussian Processes for Machine Learning, 2006.
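As a small illustration of how a kernel encodes closeness, the sketch below evaluates an RBF kernel from scikit-learn on three made-up one-dimensional examples. Calling a kernel object on an array returns the matrix of pairwise covariances; the data and the length scale of 1.0 are assumptions chosen for this example.

```python
# covariance between examples under an RBF kernel (illustrative sketch)
from numpy import array
from sklearn.gaussian_process.kernels import RBF

# three made-up one-dimensional examples: two close together, one far away
X = array([[0.0], [0.1], [3.0]])
kernel = RBF(length_scale=1.0)
# calling the kernel returns the matrix of pairwise covariances
K = kernel(X)
print(K.round(3))
```

The two nearby examples have a covariance close to 1, while the distant example has a covariance close to 0 with both of them, which is how the model comes to treat nearby examples as similar.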

It also requires a link function that interprets the internal representation and predicts the probability of class membership. The logistic function can be used, allowing the modeling of a Binomial probability distribution for binary classification.

For the binary discriminative case one simple idea is to turn the output of a regression model into a class probability using a response function (the inverse of a link function), which “squashes” its argument, which can lie in the domain (−inf, inf), into the range [0, 1], guaranteeing a valid probabilistic interpretation.

— Page 35, Gaussian Processes for Machine Learning, 2006.
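The squashing behavior described in the quote can be sketched with the logistic function. The helper name `logistic` and the latent values below are made up for illustration.

```python
# logistic response function mapping latent values to probabilities (illustrative sketch)
from numpy import array
from numpy import exp

# squash any real value into the range [0, 1]
def logistic(f):
    return 1.0 / (1.0 + exp(-f))

# made-up latent function values
latent = array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(logistic(latent).round(3))
```

Large negative latent values map to probabilities near 0, large positive values map to probabilities near 1, and a latent value of 0 maps to 0.5.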

Gaussian processes and Gaussian processes for classification are complex topics.

To learn more see the text:

The Gaussian Processes Classifier is available in the scikit-learn Python machine learning library via the GaussianProcessClassifier class.

The class allows you to specify the kernel to use via the “*kernel*” argument and defaults to 1 * RBF(1.0), i.e. an RBF kernel.

    ...
    # define model
    model = GaussianProcessClassifier(kernel=1*RBF(1.0))

Given that a kernel is specified, the model will attempt to best configure the kernel for the training dataset.

This is controlled via the “*optimizer*” argument, which sets the optimization method, and the “*n_restarts_optimizer*” argument, which sets the number of times the optimization is repeated from new starting points in an attempt to overcome local optima. The related “*max_iter_predict*” argument limits the number of iterations used when approximating the posterior during prediction.

By default, a single optimization run is performed, and this can be turned off by setting “*optimizer*” to *None*.

    ...
    # define model
    model = GaussianProcessClassifier(optimizer=None)
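The effect of these settings can be checked by comparing the kernel before and after fitting. The sketch below is a minimal illustration on a small made-up dataset: one model optimizes the kernel hyperparameters with two extra restarts, and the other disables optimization. The dataset sizes, the number of restarts, and the random seeds are assumptions chosen for this example; the fitted kernel is read from the `kernel_` attribute.

```python
# compare the kernel with and without hyperparameter optimization (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# small made-up binary classification dataset
X, y = make_classification(n_samples=50, n_features=5, n_informative=3, n_redundant=1, random_state=1)
kernel = 1 * RBF(1.0)
# kernel hyperparameters are optimized during fit, with two extra restarts
tuned = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=2, random_state=1)
tuned.fit(X, y)
# optimizer=None leaves the kernel hyperparameters as specified
fixed = GaussianProcessClassifier(kernel=kernel, optimizer=None)
fixed.fit(X, y)
# report the kernel used by each fitted model
print('Tuned kernel:', tuned.kernel_)
print('Fixed kernel:', fixed.kernel_)
```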

We can demonstrate the Gaussian Processes Classifier with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 100 examples, each with 20 input variables.

The example below creates and summarizes the dataset.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(100, 20) (100,)

We can fit and evaluate a Gaussian Processes Classifier model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration.

    ...
    # create the model
    model = GaussianProcessClassifier()

The complete example of evaluating the Gaussian Processes Classifier model for the synthetic binary classification task is listed below.

    # evaluate a gaussian process classifier model on the dataset
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.gaussian_process import GaussianProcessClassifier
    # define dataset
    X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = GaussianProcessClassifier()
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize result
    print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Gaussian Processes Classifier algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a mean accuracy of about 79.0 percent.

Mean Accuracy: 0.790 (0.101)

We may decide to use the Gaussian Processes Classifier as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

    # make a prediction with a gaussian process classifier model on the dataset
    from sklearn.datasets import make_classification
    from sklearn.gaussian_process import GaussianProcessClassifier
    # define dataset
    X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = GaussianProcessClassifier()
    # fit model
    model.fit(X, y)
    # define new data
    row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
    # make a prediction
    yhat = model.predict([row])
    # summarize prediction (predict() returns an array, so take the first element)
    print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Gaussian Processes Classifier method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the kernel controlled via the “*kernel*” argument. The scikit-learn library provides many built-in kernels that can be used.

Perhaps some of the more common examples include:

- RBF
- DotProduct
- Matern
- RationalQuadratic
- WhiteKernel

You can learn more about the kernels offered by the library in the Gaussian Process Kernels API documentation, listed in the further reading section.

We will evaluate the performance of the Gaussian Processes Classifier with each of these common kernels, using default arguments.

    ...
    # define grid
    grid = dict()
    grid['kernel'] = [1*RBF(), 1*DotProduct(), 1*Matern(), 1*RationalQuadratic(), 1*WhiteKernel()]

    # grid search kernel for gaussian process classifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.gaussian_process.kernels import DotProduct
    from sklearn.gaussian_process.kernels import Matern
    from sklearn.gaussian_process.kernels import RationalQuadratic
    from sklearn.gaussian_process.kernels import WhiteKernel
    # define dataset
    X, y = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = GaussianProcessClassifier()
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['kernel'] = [1*RBF(), 1*DotProduct(), 1*Matern(), 1*RationalQuadratic(), 1*WhiteKernel()]
    # define search
    search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize best
    print('Best Mean Accuracy: %.3f' % results.best_score_)
    print('Best Config: %s' % results.best_params_)
    # summarize all
    means = results.cv_results_['mean_test_score']
    params = results.cv_results_['params']
    for mean, param in zip(means, params):
        print(">%.3f with: %r" % (mean, param))

In this case, we can see that the *RationalQuadratic* kernel achieved a lift in performance with an accuracy of about 91.3 percent as compared to 79.0 percent achieved with the RBF kernel in the previous section.

    Best Mean Accuracy: 0.913
    Best Config: {'kernel': 1**2 * RationalQuadratic(alpha=1, length_scale=1)}
    >0.790 with: {'kernel': 1**2 * RBF(length_scale=1)}
    >0.800 with: {'kernel': 1**2 * DotProduct(sigma_0=1)}
    >0.830 with: {'kernel': 1**2 * Matern(length_scale=1, nu=1.5)}
    >0.913 with: {'kernel': 1**2 * RationalQuadratic(alpha=1, length_scale=1)}
    >0.510 with: {'kernel': 1**2 * WhiteKernel(noise_level=1)}

This section provides more resources on the topic if you are looking to go deeper.

- Gaussian Processes for Machine Learning, 2006.
- Gaussian Processes for Machine Learning, Homepage.
- Machine Learning: A Probabilistic Perspective, 2012.
- Pattern Recognition and Machine Learning, 2006.

- sklearn.gaussian_process.GaussianProcessClassifier API.
- sklearn.gaussian_process.GaussianProcessRegressor API.
- Gaussian Processes, Scikit-Learn User Guide.
- Gaussian Process Kernels API.

In this tutorial, you discovered the Gaussian Processes Classifier classification machine learning algorithm.

Specifically, you learned:

- The Gaussian Processes Classifier is a non-parametric algorithm that can be applied to binary classification tasks.
- How to fit, evaluate, and make predictions with the Gaussian Processes Classifier model with Scikit-Learn.
- How to tune the hyperparameters of the Gaussian Processes Classifier algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gaussian Processes for Classification With Python appeared first on Machine Learning Mastery.

The post Radius Neighbors Classifier Algorithm With Python appeared first on Machine Learning Mastery.

The Radius Neighbors Classifier is an extension to the k-nearest neighbors algorithm that makes predictions using all examples within the radius of a new example rather than the k closest neighbors.

As such, the radius-based approach to selecting neighbors is more appropriate for sparse data, preventing examples that are far away in the feature space from contributing to a prediction.

In this tutorial, you will discover the **Radius Neighbors Classifier** classification machine learning algorithm.

After completing this tutorial, you will know:

- The Radius Neighbors Classifier is a simple extension of the k-nearest neighbors classification algorithm.
- How to fit, evaluate, and make predictions with the Radius Neighbors Classifier model with Scikit-Learn.
- How to tune the hyperparameters of the Radius Neighbors Classifier algorithm on a given dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Radius Neighbors Classifier
- Radius Neighbors Classifier With Scikit-Learn
- Tune Radius Neighbors Classifier Hyperparameters

Radius Neighbors is a classification machine learning algorithm.

It is based on the k-nearest neighbors algorithm, or kNN. kNN involves taking the entire training dataset and storing it. Then, at prediction time, the k closest examples in the training dataset are located for each new example to be classified. The mode (most common value) of the class labels of the k neighbors is then assigned to the new example.


The Radius Neighbors Classifier is similar in that training involves storing the entire training dataset. The way that the training dataset is used during prediction is different.

Instead of locating the k-neighbors, the Radius Neighbors Classifier locates all examples in the training dataset that are within a given radius of the new example. The radius neighbors are then used to make a prediction for the new example.
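The idea can be sketched in a few lines of plain Python. This is a minimal illustration rather than the scikit-learn implementation: the function name *radius_neighbors_predict()*, the tiny dataset, and the choice to return *None* when no training example falls inside the radius are all assumptions made for the example.

```python
# minimal sketch of a radius neighbors prediction using only the standard library
from collections import Counter
from math import dist

def radius_neighbors_predict(X_train, y_train, row, radius):
    # collect the labels of all training rows within the radius of the new row
    labels = [label for x, label in zip(X_train, y_train) if dist(x, row) <= radius]
    # hypothetical handling of the case where no neighbors are found
    if not labels:
        return None
    # the mode (most common label) among the neighbors is the prediction
    return Counter(labels).most_common(1)[0][0]

# tiny illustrative dataset with inputs already in the range 0-1
X_train = [[0.1, 0.1], [0.2, 0.1], [0.15, 0.3], [0.9, 0.9], [0.8, 0.95]]
y_train = [0, 0, 1, 1, 1]
# three training rows fall within the radius: two of class 0 and one of class 1
print(radius_neighbors_predict(X_train, y_train, [0.15, 0.15], radius=0.2))
# no training rows fall within the radius of this point
print(radius_neighbors_predict(X_train, y_train, [0.5, 0.5], radius=0.2))
```

Note that, unlike kNN, the number of neighbors is not fixed: it depends on how many training examples happen to lie inside the radius, and it may be zero.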

The radius is defined in the feature space and generally assumes that the input variables are numeric and scaled to the range 0-1, e.g. normalized.

The radius-based approach to locating neighbors is appropriate for those datasets where it is desirable for the contribution of neighbors to be proportional to the density of examples in the feature space.

Given a fixed radius, dense regions of the feature space will contribute more examples to a prediction and sparse regions will contribute fewer. The latter effect is the most desirable, as it prevents examples that are very far from the new example in the feature space from contributing to the prediction.

As such, the Radius Neighbors Classifier may be more appropriate for prediction problems where there are sparse regions of the feature space.

Given that the radius is fixed in all dimensions of the feature space, it will become less effective as the number of input features is increased, which causes examples in the feature space to spread further and further apart. This property is referred to as the curse of dimensionality.
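A small simulation can make this effect visible. The sketch below is an illustration under assumed settings (1,000 uniformly random points in the unit hypercube, a radius of 0.5 around the center, and a fixed random seed); it counts the fraction of points that fall within the radius as the number of features grows.

```python
# illustrate how a fixed radius captures fewer points as dimensionality grows
import random
from math import dist

random.seed(1)
radius = 0.5
fractions = []
for n_features in [2, 5, 10, 20]:
    # center of the unit hypercube
    center = [0.5] * n_features
    # sample points uniformly in the range 0-1 for each feature
    points = [[random.random() for _ in range(n_features)] for _ in range(1000)]
    # count the points that fall within the radius of the center
    inside = sum(dist(p, center) <= radius for p in points)
    fractions.append(inside / len(points))
    print('%d features: %.3f of points within radius %.1f' % (n_features, fractions[-1], radius))
```

The fraction of points inside the radius should fall sharply as features are added, which is why the radius typically needs to be re-tuned, or the number of features reduced, on higher-dimensional datasets.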

The Radius Neighbors Classifier is available in the scikit-learn Python machine learning library via the RadiusNeighborsClassifier class.

The class allows you to specify the size of the radius used when making a prediction via the “*radius*” argument, which defaults to 1.0.

    ...
    # create the model
    model = RadiusNeighborsClassifier(radius=1.0)

Another important hyperparameter is the “*weights*” argument that controls whether neighbors contribute to the prediction in a ‘*uniform*‘ manner or inverse to the distance (‘*distance*‘) from the example. Uniform weight is used by default.

    ...
    # create the model
    model = RadiusNeighborsClassifier(weights='uniform')
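The difference between the two weighting schemes can be shown with a small worked example. This is a simplified sketch, not the library's code: the *vote()* helper and the three neighbor distances and labels are made up for illustration, and it assumes all distances are greater than zero.

```python
# worked example of uniform versus inverse-distance voting among neighbors
from collections import defaultdict

def vote(distances, labels, weights='uniform'):
    # accumulate a weighted vote total for each class label
    totals = defaultdict(float)
    for d, label in zip(distances, labels):
        # uniform: every neighbor counts once; distance: closer neighbors count more
        totals[label] += 1.0 if weights == 'uniform' else 1.0 / d
    # the class with the largest total wins
    return max(totals, key=totals.get)

# one very close neighbor of class 1 and two more distant neighbors of class 0
distances = [0.1, 0.5, 0.6]
labels = [1, 0, 0]
print(vote(distances, labels, weights='uniform'))
print(vote(distances, labels, weights='distance'))
```

With uniform weights, class 0 wins by two votes to one. With inverse-distance weights, the single close neighbor contributes a weight of 10, which outweighs the combined weight of about 3.67 from the two distant neighbors, so class 1 wins.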

We can demonstrate the Radius Neighbors Classifier with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example below creates and summarizes the dataset.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 20) (1000,)

We can fit and evaluate a Radius Neighbors Classifier model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration.

    ...
    # create the model
    model = RadiusNeighborsClassifier()

It is important that the feature space is scaled prior to preparing and using the model.

We can achieve this by using the MinMaxScaler to normalize the input features and use a Pipeline to first apply the scaling, then use the model.

    ...
    # define model
    model = RadiusNeighborsClassifier()
    # create pipeline
    pipeline = Pipeline(steps=[('norm', MinMaxScaler()),('model',model)])

The complete example of evaluating the Radius Neighbors Classifier model for the synthetic binary classification task is listed below.

    # evaluate a radius neighbors classifier model on the dataset
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import RadiusNeighborsClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = RadiusNeighborsClassifier()
    # create pipeline
    pipeline = Pipeline(steps=[('norm', MinMaxScaler()),('model',model)])
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize result
    print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Radius Neighbors Classifier algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a mean accuracy of about 75.4 percent.

Mean Accuracy: 0.754 (0.042)

We may decide to use the Radius Neighbors Classifier as our final model and make predictions on new data.

This can be achieved by fitting the model pipeline on all available data and calling the *predict()* function passing in a new row of data.

We can demonstrate this with a complete example listed below.

    # make a prediction with a radius neighbors classifier model on the dataset
    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import RadiusNeighborsClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = RadiusNeighborsClassifier()
    # create pipeline
    pipeline = Pipeline(steps=[('norm', MinMaxScaler()),('model',model)])
    # fit model
    pipeline.fit(X, y)
    # define new data
    row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
    # make a prediction
    yhat = pipeline.predict([row])
    # summarize prediction (predict() returns an array, so take the first element)
    print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Radius Neighbors Classifier method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the radius controlled via the “*radius*” argument. It is a good idea to test a range of values, perhaps around the value of 1.0.

We will explore values from 0.8 up to, but not including, 1.5 with a grid spacing of 0.01 on our synthetic dataset.

    ...
    # define grid
    grid = dict()
    grid['model__radius'] = arange(0.8, 1.5, 0.01)

Note that we are grid searching the “*radius*” hyperparameter of the *RadiusNeighborsClassifier* within the *Pipeline* where the model is named “*model*” and, therefore, the radius parameter is accessed via *model->radius* with a double underscore (*__*) separator, e.g. “*model__radius*“.
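If you are unsure which names are available, the pipeline can list them for you. The short sketch below builds the same pipeline, filters the output of *get_params()* for entries that start with the step name, and then sets the radius through the double-underscore name. The filtering by the “*model__*” prefix is just a convenience chosen for this example.

```python
# list and set the tunable parameter names of a model inside a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import RadiusNeighborsClassifier

# create pipeline with a scaling step named 'norm' and a model step named 'model'
pipeline = Pipeline(steps=[('norm', MinMaxScaler()), ('model', RadiusNeighborsClassifier())])
# parameters of the model step are prefixed with the step name and a double underscore
names = [name for name in pipeline.get_params() if name.startswith('model__')]
print(names)
# the same naming scheme is used to set a value on the nested model
pipeline.set_params(model__radius=0.8)
print(pipeline.named_steps['model'].radius)
```

Any name printed in this list can be used as a key in the grid passed to *GridSearchCV*.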

    # grid search radius for radius neighbors classifier
    from numpy import arange
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import RadiusNeighborsClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = RadiusNeighborsClassifier()
    # create pipeline
    pipeline = Pipeline(steps=[('norm', MinMaxScaler()),('model',model)])
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['model__radius'] = arange(0.8, 1.5, 0.01)
    # define search
    search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('Mean Accuracy: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

In this case, we can see that we achieved better results using a radius of 0.8 that gave an accuracy of about 87.2 percent compared to a radius of 1.0 in the previous example that gave an accuracy of about 75.4 percent.

    Mean Accuracy: 0.872
    Config: {'model__radius': 0.8}

Another key hyperparameter is the manner in which examples in the radius contribute to the prediction via the “*weights*” argument. This can be set to “*uniform*” (the default), “*distance*” for inverse distance, or a custom function.

We can test both of these built-in weightings and see which performs better with our radius of 0.8.

    ...
    # define grid
    grid = dict()
    grid['model__weights'] = ['uniform', 'distance']

The complete example is listed below.

    # grid search weights for radius neighbors classifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.neighbors import RadiusNeighborsClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
    # define model
    model = RadiusNeighborsClassifier(radius=0.8)
    # create pipeline
    pipeline = Pipeline(steps=[('norm', MinMaxScaler()),('model',model)])
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['model__weights'] = ['uniform', 'distance']
    # define search
    search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('Mean Accuracy: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

In this case, we can see an additional lift in mean classification accuracy from about 87.2 percent with ‘*uniform*‘ weights in the previous example to about 89.3 percent with ‘*distance*‘ weights in this case.

    Mean Accuracy: 0.893
    Config: {'model__weights': 'distance'}

Another hyperparameter that you might wish to explore is the distance metric used via the ‘*metric*’ argument, which defaults to ‘*minkowski*’.

With the default power parameter (*p=2*), the Minkowski metric is equivalent to ‘*euclidean*’ distance, so it might be more interesting to compare results to ‘*cityblock*’ (Manhattan) distance.
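To see how these metrics relate, the sketch below implements the Minkowski distance directly. The *minkowski()* helper is written for this illustration only; it shows that a power of 2 gives the Euclidean distance and a power of 1 gives the city block (Manhattan) distance.

```python
# relate the minkowski distance to euclidean and city block distance
def minkowski(a, b, p):
    # sum the absolute differences raised to the power p, then take the p-th root
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = [0.0, 0.0], [3.0, 4.0]
# p=2 is the euclidean (straight line) distance
print(minkowski(a, b, 2))
# p=1 is the city block (sum of absolute differences) distance
print(minkowski(a, b, 1))
```

For the two points above, the Euclidean distance is 5.0 and the city block distance is 7.0, so the same radius value will capture a different set of neighbors depending on the metric chosen.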

This section provides more resources on the topic if you are looking to go deeper.

- Applied Predictive Modeling, 2013.
- An Introduction to Statistical Learning with Applications in R, 2014.

- Nearest Neighbors, Scikit-Learn User Guide.
- sklearn.neighbors.RadiusNeighborsClassifier API.
- sklearn.pipeline.Pipeline API.
- sklearn.preprocessing.MinMaxScaler API.

In this tutorial, you discovered the Radius Neighbors Classifier classification machine learning algorithm.

Specifically, you learned:

- The Radius Neighbors Classifier is a simple extension of the k-nearest neighbors classification algorithm.
- How to fit, evaluate, and make predictions with the Radius Neighbors Classifier model with Scikit-Learn.
- How to tune the hyperparameters of the Radius Neighbors Classifier algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Linear Discriminant Analysis With Python appeared first on Machine Learning Mastery.

Linear Discriminant Analysis involves developing a probabilistic model per class based on the specific distribution of observations for each input variable. A new example is then classified by calculating the conditional probability of it belonging to each class and selecting the class with the highest probability.

As such, it is a relatively simple probabilistic classification model that makes strong assumptions about the distribution of each input variable, although it can make effective predictions even when these expectations are violated (e.g. it fails gracefully).

In this tutorial, you will discover the Linear Discriminant Analysis classification machine learning algorithm in Python.

After completing this tutorial, you will know:

- The Linear Discriminant Analysis is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Linear Discriminant Analysis model with Scikit-Learn.
- How to tune the hyperparameters of the Linear Discriminant Analysis algorithm on a given dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Linear Discriminant Analysis
- Linear Discriminant Analysis With scikit-learn
- Tune LDA Hyperparameters

Linear Discriminant Analysis, or LDA for short, is a classification machine learning algorithm.

It works by calculating summary statistics for the input features by class label, such as the mean and standard deviation. These statistics represent the model learned from the training data. In practice, linear algebra operations are used to calculate the required quantities efficiently via matrix decomposition.

Predictions are made by estimating the probability that a new example belongs to each class label based on the values of each input feature. The class that results in the largest probability is then assigned to the example. As such, LDA may be considered a simple application of Bayes Theorem for classification.
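These steps can be sketched with NumPy. The code below is a simplified illustration of the textbook calculation, not the scikit-learn implementation: the function names *fit_lda()* and *predict_lda()*, the synthetic two-class data, and the use of a direct matrix inverse are all assumptions made to keep the example short.

```python
# simplified sketch of linear discriminant analysis with numpy
import numpy as np

def fit_lda(X, y):
    # summary statistics per class: mean vector and prior probability
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    # pooled covariance shared by all classes, from class-centered data
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(X) - len(classes))
    return classes, means, priors, np.linalg.inv(cov)

def predict_lda(model, X):
    classes, means, priors, cov_inv = model
    # linear discriminant score for each class: x'S^-1 m - 0.5 m'S^-1 m + log(prior)
    scores = X @ cov_inv @ means.T - 0.5 * np.sum(means @ cov_inv * means, axis=1) + np.log(priors)
    # assign the class with the largest score
    return classes[np.argmax(scores, axis=1)]

# two gaussian classes with different means and the same spread
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = fit_lda(X, y)
yhat = predict_lda(model, X)
print('Training accuracy: %.3f' % np.mean(yhat == y))
```

The class with the largest discriminant score is also the class with the largest posterior probability under the Gaussian assumptions, which is why the score can be used directly to make a prediction.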

LDA assumes that the input variables are numeric and normally distributed and that they have the same variance (spread). If this is not the case, it may be desirable to transform the data to have a Gaussian distribution and standardize or normalize the data prior to modeling.

… the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance

— Page 142, An Introduction to Statistical Learning with Applications in R, 2014.

LDA estimates a single covariance matrix shared by all classes, and highly correlated input variables can make this estimate unstable; if they are present, a PCA transform may be helpful to remove the linear dependence.

… practitioners should be particularly rigorous in pre-processing data before using LDA. We recommend that predictors be centered and scaled and that near-zero variance predictors be removed.

— Page 293, Applied Predictive Modeling, 2013.

Nevertheless, the model can perform well, even when violating these expectations.

The LDA model is naturally multi-class. This means that it supports two-class classification problems and extends to more than two classes (multi-class classification) without modification or augmentation.

It is a linear classification algorithm, like logistic regression. This means that classes are separated in the feature space by lines or hyperplanes. Extensions of the method can be used that allow other shapes, like Quadratic Discriminant Analysis (QDA), which allows curved shapes in the decision boundary.

… unlike LDA, QDA assumes that each class has its own covariance matrix.

— Page 149, An Introduction to Statistical Learning with Applications in R, 2014.

Now that we are familiar with LDA, let’s look at how to fit and evaluate models using the scikit-learn library.

The Linear Discriminant Analysis is available in the scikit-learn Python machine learning library via the LinearDiscriminantAnalysis class.

The method can be used directly without configuration, although the implementation does offer arguments for customization, such as the choice of solver and the use of a penalty.

    ...
    # create the lda model
    model = LinearDiscriminantAnalysis()

We can demonstrate the Linear Discriminant Analysis method with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 10 input variables.

The example creates and summarizes the dataset.

    # test classification dataset
    from sklearn.datasets import make_classification
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
    # summarize the dataset
    print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 10) (1000,)

We can fit and evaluate a Linear Discriminant Analysis model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

The complete example of evaluating the Linear Discriminant Analysis model for the synthetic binary classification task is listed below.

    # evaluate a lda model on the dataset
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
    # define model
    model = LinearDiscriminantAnalysis()
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # summarize result
    print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Linear Discriminant Analysis algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a mean accuracy of about 89.3 percent.

Mean Accuracy: 0.893 (0.033)

We may decide to use the Linear Discriminant Analysis as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function passing in a new row of data.

We can demonstrate this with a complete example listed below.

    # make a prediction with a lda model on the dataset
    from sklearn.datasets import make_classification
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
    # define model
    model = LinearDiscriminantAnalysis()
    # fit model
    model.fit(X, y)
    # define new data
    row = [0.12777556,-3.64400522,-2.23268854,-1.82114386,1.75466361,0.1243966,1.03397657,2.35822076,1.01001752,0.56768485]
    # make a prediction
    yhat = model.predict([row])
    # summarize prediction (predict() returns an array, so take the first element)
    print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 1

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Linear Discriminant Analysis method must be configured for your specific dataset.

An important hyperparameter is the solver, which defaults to ‘*svd*‘ but can also be set to other values for solvers that support the shrinkage capability.

The example below demonstrates this using the GridSearchCV class with a grid of different solver values.

    # grid search solver for lda
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
    # define model
    model = LinearDiscriminantAnalysis()
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['solver'] = ['svd', 'lsqr', 'eigen']
    # define search
    search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('Mean Accuracy: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

In this case, we can see that the default SVD solver performs the best compared to the other built-in solvers.

    Mean Accuracy: 0.893
    Config: {'solver': 'svd'}

Next, we can explore whether using shrinkage with the model improves performance.

Shrinkage adds a penalty to the model that acts as a type of regularizer, reducing the complexity of the model.

Regularization reduces the variance associated with the sample based estimate at the expense of potentially increased bias. This bias variance trade-off is generally regulated by one or more (degree-of-belief) parameters that control the strength of the biasing towards the “plausible” set of (population) parameter values.

— Regularized Discriminant Analysis, 1989.

This can be set via the “*shrinkage*” argument and can be set to a value between 0 and 1. We will test values on a grid with a spacing of 0.01.

In order to use the penalty, a solver must be chosen that supports this capability, such as ‘*eigen*’ or ‘*lsqr*‘. We will use the latter in this case.
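To build some intuition for what a shrinkage value does, the sketch below applies one common form of covariance shrinkage: blending the estimated covariance matrix with a scaled identity matrix. The *shrink_covariance()* helper and the example matrix are assumptions for illustration, and this is not meant to reproduce the exact internal calculation performed by the LDA implementation.

```python
# illustrate shrinking a covariance matrix toward a scaled identity matrix
import numpy as np

def shrink_covariance(cov, shrinkage):
    # target: identity matrix scaled by the average variance
    mu = np.trace(cov) / cov.shape[0]
    # blend the estimate with the target; 0 keeps the estimate, 1 uses only the target
    return (1.0 - shrinkage) * cov + shrinkage * mu * np.eye(cov.shape[0])

# example covariance with strongly correlated variables
cov = np.array([[4.0, 1.8], [1.8, 1.0]])
for shrinkage in [0.0, 0.5, 1.0]:
    print('shrinkage=%.1f' % shrinkage)
    print(shrink_covariance(cov, shrinkage))
```

A value of 0 leaves the estimate unchanged, a value of 1 discards the correlations entirely, and values in between trade a little bias for a more stable estimate.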

The complete example of tuning the shrinkage hyperparameter is listed below.

    # grid search shrinkage for lda
    from numpy import arange
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
    # define model
    model = LinearDiscriminantAnalysis(solver='lsqr')
    # define model evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['shrinkage'] = arange(0, 1, 0.01)
    # define search
    search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
    # perform the search
    results = search.fit(X, y)
    # summarize
    print('Mean Accuracy: %.3f' % results.best_score_)
    print('Config: %s' % results.best_params_)

In this case, we can see that using shrinkage offers a slight lift in performance from about 89.3 percent to about 89.4 percent, with a value of 0.02.

    Mean Accuracy: 0.894
    Config: {'shrinkage': 0.02}

This section provides more resources on the topic if you are looking to go deeper.

- Applied Predictive Modeling, 2013.
- An Introduction to Statistical Learning with Applications in R, 2014.

- sklearn.discriminant_analysis.LinearDiscriminantAnalysis API.
- Linear and Quadratic Discriminant Analysis, scikit-learn.

In this tutorial, you discovered the Linear Discriminant Analysis classification machine learning algorithm in Python.

Specifically, you learned:

- The Linear Discriminant Analysis is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Linear Discriminant Analysis model with Scikit-Learn.
- How to tune the hyperparameters of the Linear Discriminant Analysis algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Multi-Core Machine Learning in Python With Scikit-Learn appeared first on Machine Learning Mastery.

Common machine learning tasks that can be made parallel include training models like ensembles of decision trees, evaluating models using resampling procedures like k-fold cross-validation, and tuning model hyperparameters, such as grid and random search.

Using multiple cores for common machine learning tasks can dramatically decrease the execution time, often roughly in proportion to the number of cores available on your system. A common laptop or desktop computer may have 2, 4, or 8 cores. Larger server systems may have 32, 64, or more cores available, allowing machine learning tasks that take hours to be completed in minutes.

In this tutorial, you will discover how to configure scikit-learn for multi-core machine learning.

After completing this tutorial, you will know:

- How to train machine learning models using multiple cores.
- How to make the evaluation of machine learning models parallel.
- How to use multiple cores to tune machine learning model hyperparameters.

Let’s get started.

This tutorial is divided into five parts; they are:

- Multi-Core Scikit-Learn
- Multi-Core Model Training
- Multi-Core Model Evaluation
- Multi-Core Hyperparameter Tuning
- Recommendations

Machine learning can be computationally expensive.

There are three main centers of this computational cost; they are:

- Training machine learning models.
- Evaluating machine learning models.
- Hyperparameter tuning machine learning models.

Worse, these concerns compound.

For example, evaluating machine learning models using a resampling technique like k-fold cross-validation requires that the training process is repeated multiple times.

- Evaluation Requires Repeated Training

Tuning model hyperparameters compounds this further as it requires the evaluation procedure repeated for each combination of hyperparameters tested.

- Tuning Requires Repeated Evaluation

Most, if not all, modern computers have multi-core CPUs. This includes your workstation, your laptop, as well as larger servers.

You can configure your machine learning models to harness multiple cores of your computer, dramatically speeding up computationally expensive operations.

The scikit-learn Python machine learning library provides this capability via the n_jobs argument on key machine learning tasks, such as model training, model evaluation, and hyperparameter tuning.

This configuration argument allows you to specify the number of cores to use for the task. The default is None, which will use a single core unless a joblib backend context configures otherwise. You can also specify a number of cores as an integer, such as 1 or 2. Finally, you can specify -1, in which case the task will use all of the cores available on your system.

**n_jobs**: Specify the number of cores to use for key machine learning tasks.

Common values are:

- **n_jobs=None**: Use a single core or the default configured by your backend library.
- **n_jobs=4**: Use the specified number of cores, in this case 4.
- **n_jobs=-1**: Use all available cores.
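
As a quick check, the snippet below sketches how these values resolve to a concrete number of workers on your machine. It assumes the joblib library (which scikit-learn uses for parallelism) is installed and uses its *effective_n_jobs()* helper; the numbers reported will depend on your system.

```python
# check how common n_jobs values resolve to a worker count on this machine
from joblib import effective_n_jobs

# resolve each setting to the number of workers joblib would use
resolved = dict()
for n_jobs in [None, 1, 4, -1]:
    resolved[n_jobs] = effective_n_jobs(n_jobs)
    print('n_jobs=%s -> %d worker(s)' % (n_jobs, resolved[n_jobs]))
```

The value reported for -1 is the number of cores that joblib believes are available to the current process.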

**What is a core?**

A CPU may have multiple physical CPU cores, which is essentially like having multiple CPUs. Each core may also support hyper-threading, a technology that presents each physical core to the operating system as two logical cores, which under many circumstances allows you to double the number of cores.

For example, my workstation has four physical cores, which are doubled to eight cores due to hyper-threading. Therefore, I can experiment with 1-8 cores or specify -1 to use all cores on my workstation.
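
If you are not sure how many cores your own system has, a minimal sketch like the one below can report them. It assumes a recent version of joblib in which *cpu_count()* accepts the *only_physical_cores* argument; on older versions the sketch falls back to the logical count.

```python
# report logical and (where supported) physical core counts
from os import cpu_count as os_cpu_count
import joblib

# logical cores reported by the operating system, including hyper-threaded cores
logical = os_cpu_count()
print('logical cores: %s' % logical)
# physical cores as seen by joblib; the argument is assumed to be supported
try:
    physical = joblib.cpu_count(only_physical_cores=True)
except TypeError:
    # older joblib without the argument: fall back to the count it can report
    physical = joblib.cpu_count()
print('physical cores (as seen by joblib): %d' % physical)
```

Note that joblib reports the cores available to the current process, which may be fewer than the operating system total on shared or containerized systems.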

Now that we are familiar with the scikit-learn library’s capability to support multi-core parallel processing for machine learning, let’s work through some examples.

You will get different timings for all of the examples in this tutorial; share your results in the comments. You may also need to change the number of cores to match the number of cores on your system.

**Note**: Yes, I am aware of the timeit API, but chose against it for this tutorial. We are not profiling the code examples per se; instead, I want you to focus on how and when to use the multi-core capabilities of scikit-learn and that they offer real benefits. I wanted the code examples to be clean and simple to read, even for beginners. I leave it as an extension for you to update all examples to use the timeit API and get more accurate timings. Share your results in the comments.
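
As a starting point for that extension, here is a minimal sketch of timing model training with the *repeat()* function from the timeit module. The smaller dataset, the 50 trees, and the two cores are assumptions chosen only so the repeated runs finish quickly; they are not the configuration used in the rest of this tutorial.

```python
# sketch of timing model training with timeit rather than a single time() call
from timeit import repeat
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# small dataset and model so that the repeated runs finish quickly
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
model = RandomForestClassifier(n_estimators=50, n_jobs=2)
# time three independent fits, one call to fit() per measurement
timings = repeat(lambda: model.fit(X, y), number=1, repeat=3)
# report the fastest run, which is usually the least affected by other activity
print('best of %d: %.3f seconds' % (len(timings), min(timings)))
```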

Many machine learning algorithms support multi-core training via an n_jobs argument when the model is defined.

This affects not just the training of the model, but also the use of the model when making predictions.
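
The sketch below illustrates this for a random forest. The fixed *random_state* and the small model are assumptions added so that the comparison is quick and repeatable; the point is only that changing *n_jobs* on a fitted model changes how predictions are computed, not what they are.

```python
# sketch showing that n_jobs also applies when making predictions
from numpy import allclose
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# fit using two cores
model = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=1)
model.fit(X, y)
# predictions are computed using the same n_jobs setting
proba_multi = model.predict_proba(X)
# n_jobs can be changed on the fitted model, here to predict with one core
model.set_params(n_jobs=1)
proba_single = model.predict_proba(X)
# confirm the predicted probabilities match
same = allclose(proba_multi, proba_single)
print('same predictions: %s' % same)
```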

A popular example is the ensemble of decision trees, such as bagged decision trees, random forest, and gradient boosting.

In this section we will explore accelerating the training of a RandomForestClassifier model using multiple cores. We will use a synthetic classification task for our experiments.

In this case, we will define a random forest model with 500 trees and use a single core to train the model.

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=1)

We can record the time before and after the call to the *fit()* function using the *time()* function. We can then subtract the start time from the end time and report the execution time in the number of seconds.

The complete example of evaluating the execution time of training a random forest model with a single core is listed below.

    # example of timing the training of a random forest model on one core
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=1)
    # record current time
    start = time()
    # fit the model
    model.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example reports the time taken to train the model with a single core.

In this case, we can see that it takes about 10 seconds.

How long does it take on your system? Share your results in the comments below.

10.702 seconds

We can now change the example to use all of the physical cores on the system, in this case, four.

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=4)

The complete example of multi-core training of the model with four cores is listed below.

    # example of timing the training of a random forest model on 4 cores
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=4)
    # record current time
    start = time()
    # fit the model
    model.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example reports the time taken to train the model with four cores.

In this case, we can see that the execution time more than halved to about 3.151 seconds.

**How long does it take on your system?** Share your results in the comments below.

3.151 seconds

We can now change the number of cores to eight to account for the hyper-threading supported by the four physical cores.

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=8)

We can achieve the same effect by setting *n_jobs* to -1 to automatically use all cores; for example:

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=-1)

We will stick to manually specifying the number of cores for now.

The complete example of multi-core training of the model with eight cores is listed below.

    # example of timing the training of a random forest model on 8 cores
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=500, n_jobs=8)
    # record current time
    start = time()
    # fit the model
    model.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example reports the time taken to train the model with eight cores.

In this case, we can see that we got another drop in execution time from about 3.151 seconds to about 2.521 seconds by using all cores.

How long does it take on your system? Share your results in the comments below.

2.521 seconds

We can make the relationship between the number of cores used during training and execution speed more concrete by comparing all values between one and eight and plotting the result.

The complete example is listed below.

    # example of comparing number of cores used during training to execution speed
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    results = list()
    # compare timing for number of cores
    n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
    for n in n_cores:
        # capture current time
        start = time()
        # define the model
        model = RandomForestClassifier(n_estimators=500, n_jobs=n)
        # fit the model
        model.fit(X, y)
        # capture current time
        end = time()
        # store execution time
        result = end - start
        print('>cores=%d: %.3f seconds' % (n, result))
        results.append(result)
    pyplot.plot(n_cores, results)
    pyplot.show()

Running the example first reports the execution time for each number of cores used during training.

We can see a steady decrease in execution time from one to eight cores, although the dramatic benefits stop after four physical cores.

How long does it take on your system? Share your results in the comments below.

    >cores=1: 10.798 seconds
    >cores=2: 5.743 seconds
    >cores=3: 3.964 seconds
    >cores=4: 3.158 seconds
    >cores=5: 2.868 seconds
    >cores=6: 2.631 seconds
    >cores=7: 2.528 seconds
    >cores=8: 2.440 seconds

A plot is also created to show the relationship between the number of cores used during training and the execution speed, showing that we continue to see a benefit all the way to eight cores.
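
To make the diminishing returns easier to see, we can turn the timings reported above into a speedup and a parallel efficiency for each number of cores. This is only a small arithmetic sketch using the numbers from my run; substitute your own timings.

```python
# compute speedup and parallel efficiency from the timings reported above
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
seconds = [10.798, 5.743, 3.964, 3.158, 2.868, 2.631, 2.528, 2.440]
# time taken with a single core is the baseline
baseline = seconds[0]
for n, t in zip(n_cores, seconds):
    # speedup: how many times faster than the single-core run
    speedup = baseline / t
    # efficiency: fraction of the ideal n-times speedup that was achieved
    efficiency = speedup / n
    print('>cores=%d: speedup=%.2fx, efficiency=%.0f%%' % (n, speedup, efficiency * 100))
```

With these numbers, four cores give a speedup of about 3.4 times, whereas eight cores give only about 4.4 times, which is consistent with four of the eight cores being hyper-threaded rather than physical.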

Now that we are familiar with the benefit of multi-core training of machine learning models, let’s look at multi-core model evaluation.

The gold standard for model evaluation is k-fold cross-validation.

This is a resampling procedure that requires that the model is trained and evaluated *k* times on different partitioned subsets of the dataset. The result is an estimate of the performance of a model when making predictions on data not used during training that can be used to compare and select a good or best model for a dataset.

In addition, it is also a good practice to repeat this evaluation process multiple times, referred to as repeated k-fold cross-validation.
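
To get a sense of how much work this involves, the short sketch below counts the number of models that must be fit, assuming 10 folds and three repeats. Each of these fits is independent of the others, which is what makes the procedure a good candidate for multiple cores.

```python
# count how many times the model is fit under repeated k-fold cross-validation
from sklearn.model_selection import RepeatedStratifiedKFold

# 10 folds repeated 3 times
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# each split requires one model to be trained and evaluated
n_fits = cv.get_n_splits()
print('model fits required: %d' % n_fits)
```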

The evaluation procedure can be configured to use multiple cores, where each model training and evaluation happens on a separate core. This can be done by setting the *n_jobs* argument on the call to the cross_val_score() function.

We can explore the effect of multiple cores on model evaluation.

First, let’s evaluate the model using a single core.

    ...
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)

We will evaluate the random forest model and use a single core in the training of the model (for now).

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)

The complete example is listed below.

    # example of evaluating a model using a single core
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # record current time
    start = time()
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example evaluates the model using 10-fold cross-validation with three repeats.

In this case, we see that the evaluation of the model took about 6.412 seconds.

How long does it take on your system? Share your results in the comments below.

6.412 seconds

We can update the example to use all eight cores of the system and expect a large speedup.

    ...
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=8)

The complete example is listed below.

    # example of evaluating a model using 8 cores
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # record current time
    start = time()
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=8)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example evaluates the model using multiple cores.

In this case, we can see the execution timing dropped from 6.412 seconds to about 2.371 seconds, giving a welcome speedup.

How long does it take on your system? Share your results in the comments below.

2.371 seconds

As we did in the previous section, we can time the execution speed for each number of cores from one to eight to get an idea of the relationship.

The complete example is listed below.

    # compare execution speed for model evaluation vs number of cpu cores
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from matplotlib import pyplot
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    results = list()
    # compare timing for number of cores
    n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
    for n in n_cores:
        # define the model
        model = RandomForestClassifier(n_estimators=100, n_jobs=1)
        # define the evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # record the current time
        start = time()
        # evaluate the model
        n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=n)
        # record the current time
        end = time()
        # store execution time
        result = end - start
        print('>cores=%d: %.3f seconds' % (n, result))
        results.append(result)
    pyplot.plot(n_cores, results)
    pyplot.show()

Running the example first reports the execution time in seconds for each number of cores for evaluating the model.

We can see that there is not a dramatic improvement above four physical cores.

We can also see a difference here when evaluating with eight cores compared to the previous experiment. In this case, evaluating performance took 1.492 seconds whereas the standalone case took about 2.371 seconds.

This highlights the limitation of the evaluation methodology we are using where we are reporting the performance of a single run rather than repeated runs. There is some spin-up time required to load classes into memory and perform any JIT optimization.

Regardless of the accuracy of our flimsy profiling, we do see the familiar speedup of model evaluation with the increase of cores used during the process.

How long does it take on your system? Share your results in the comments below.

    >cores=1: 6.339 seconds
    >cores=2: 3.765 seconds
    >cores=3: 2.404 seconds
    >cores=4: 1.826 seconds
    >cores=5: 1.806 seconds
    >cores=6: 1.686 seconds
    >cores=7: 1.587 seconds
    >cores=8: 1.492 seconds

A plot of the relationship between the number of cores and the execution speed is also created.

We can also make the model training process parallel during the model evaluation procedure.

Although this is possible, should we?

To explore this question, let’s first consider the case where model training uses all cores and model evaluation uses a single core.

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=8)
    ...
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)

The complete example is listed below.

    # example of using multiple cores for model training but not model evaluation
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=8)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # record current time
    start = time()
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=1)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example evaluates the model using a single core, but each trained model uses all eight cores.

In this case, we can see that the model evaluation takes more than 10 seconds, much longer than the 1 or 2 seconds when we use a single core for training and all cores for parallel model evaluation.

How long does it take on your system? Share your results in the comments below.

10.461 seconds

What if we split the number of cores between the training and evaluation procedures?

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=4)
    ...
    # evaluate the model
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=4)

The complete example is listed below.

    # example of using multiple cores for model training and evaluation
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model, trained using four cores
    model = RandomForestClassifier(n_estimators=100, n_jobs=4)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # record current time
    start = time()
    # evaluate the model using four cores
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=4)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example evaluates the model using four cores, and each model is trained using four different cores.

We can see an improvement over training with all cores and evaluating with one core, but at least for this model on this dataset, it is more efficient to use all cores for model evaluation and a single core for model training.

How long does it take on your system? Share your results in the comments below.

3.434 seconds
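
If you want to check which division of cores works best on your own system, the comparison can be wrapped in a loop. The sketch below is one way to do it; the particular splits, the 50 trees, and the reduced number of folds and repeats are assumptions chosen to keep the run short, so adapt them to your model and number of cores.

```python
# sketch of comparing ways to divide cores between training and evaluation
from time import time
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define a reduced evaluation procedure to keep the comparison quick
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
# pairs of (cores for training, cores for evaluation); change to suit your system
splits = [(1, 4), (4, 1), (2, 2)]
results = dict()
for train_jobs, eval_jobs in splits:
    # define the model with the number of cores used for training
    model = RandomForestClassifier(n_estimators=50, n_jobs=train_jobs)
    # time the evaluation with the number of cores used for evaluation
    start = time()
    cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=eval_jobs)
    elapsed = time() - start
    results[(train_jobs, eval_jobs)] = elapsed
    print('>train=%d, eval=%d: %.3f seconds' % (train_jobs, eval_jobs, elapsed))
# report the fastest configuration
best = min(results, key=results.get)
print('fastest: train=%d, eval=%d' % best)
```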

It is common to tune the hyperparameters of a machine learning model using a grid search or a random search.

The scikit-learn library provides these capabilities via the GridSearchCV and RandomizedSearchCV classes respectively.

Both of these search procedures can be made parallel by setting the *n_jobs* argument, assigning each hyperparameter configuration to a core for evaluation.
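
The examples in this section focus on the grid search, so for completeness, here is a minimal sketch of a multi-core random search. The number of sampled configurations, the 3-fold cross-validation, and the two cores are assumptions chosen only to keep the example small.

```python
# sketch of a multi-core random search
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
# define the model, trained using a single core
model = RandomForestClassifier(n_estimators=50, n_jobs=1)
# define the values to sample from
distributions = dict()
distributions['max_features'] = [1, 2, 3, 4, 5]
# sample 3 configurations and evaluate each with 3-fold cross-validation on 2 cores
search = RandomizedSearchCV(model, distributions, n_iter=3, cv=3, n_jobs=2, random_state=1)
# perform search
search.fit(X, y)
# report the best configuration found
print(search.best_params_)
```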

The model evaluation itself could also be multi-core, as we saw in the previous section, and the model training for a given evaluation can also be parallel, as we saw in the section before that. Therefore, the stack of potentially multi-core processes is starting to get challenging to configure.

In this specific implementation, we can make the model training parallel, but we don’t have control over how each model hyperparameter configuration and each model evaluation is made multi-core. The documentation is not clear at the time of writing, but I would guess that the evaluation of each hyperparameter configuration is split into jobs that are distributed across the cores.
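
Whatever the exact scheduling, we can at least count how much independent work a search contains. The sketch below assumes a grid of five *max_features* values evaluated with 10-fold cross-validation and three repeats, and assumes that each configuration and split pair can be treated as one model fit.

```python
# count the model fits in a grid search (configurations x cross-validation splits)
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection import RepeatedStratifiedKFold

# define grid
grid = dict()
grid['max_features'] = [1, 2, 3, 4, 5]
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# number of hyperparameter configurations in the grid
n_configs = len(ParameterGrid(grid))
# number of train/test splits evaluated for each configuration
n_splits = cv.get_n_splits()
# total number of model fits, not counting any final refit on all data
n_fits = n_configs * n_splits
print('%d configurations x %d splits = %d model fits' % (n_configs, n_splits, n_fits))
```

With this many independent fits, there is plenty of work to keep a handful of cores busy at the level of the search alone.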

Let’s explore the benefits of performing model hyperparameter tuning using multiple cores.

First, let’s evaluate a grid of different configurations of the random forest algorithm using a single core.

    ...
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=1, cv=cv)

The complete example is listed below.

    # example of tuning model hyperparameters with a single core
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['max_features'] = [1, 2, 3, 4, 5]
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=1, cv=cv)
    # record current time
    start = time()
    # perform search
    search.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example tests different values of the *max_features* configuration for random forest, where each configuration is evaluated using repeated k-fold cross-validation.

In this case, the grid search on a single core takes about 28.838 seconds.

How long does it take on your system? Share your results in the comments below.

28.838 seconds

We can now configure the grid search to use all available cores on the system, in this case, eight cores.

    ...
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=8, cv=cv)

We can then evaluate how long this multi-core grid search takes to execute. The complete example is listed below.

    # example of tuning model hyperparameters with 8 cores
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['max_features'] = [1, 2, 3, 4, 5]
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=8, cv=cv)
    # record current time
    start = time()
    # perform search
    search.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

Running the example reports execution time for the grid search.

In this case, we see a speedup by a factor of about four, from roughly 28.838 seconds to around 7.418 seconds.

How long does it take on your system? Share your results in the comments below.

7.418 seconds

Intuitively, we would expect that making the grid search multi-core should be the focus and not model training.

Nevertheless, we can divide the number of cores between model training and the grid search to see if it offers a benefit for this model on this dataset.

    ...
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=4)
    ...
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=4, cv=cv)

The complete example of multi-core model training and multi-core hyperparameter tuning is listed below.

    # example of multi-core model training and hyperparameter tuning
    from time import time
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
    # define the model
    model = RandomForestClassifier(n_estimators=100, n_jobs=4)
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid
    grid = dict()
    grid['max_features'] = [1, 2, 3, 4, 5]
    # define grid search
    search = GridSearchCV(model, grid, n_jobs=4, cv=cv)
    # record current time
    start = time()
    # perform search
    search.fit(X, y)
    # record current time
    end = time()
    # report execution time
    result = end - start
    print('%.3f seconds' % result)

In this case, we do see a decrease in execution time compared to the single core case, but not as much benefit as assigning all cores to the grid search process.

How long does it take on your system? Share your results in the comments below.

14.148 seconds

This section lists some general recommendations when using multiple cores for machine learning.

- Confirm the number of cores available on your system.
- Consider using an AWS EC2 instance with many cores to get an immediate speed up.
- Check the API documentation to see if the model/s you are using support multi-core training.
- Confirm multi-core training offers a measurable benefit on your system.
- When using k-fold cross-validation, it is probably better to assign cores to the resampling procedure and leave model training single core.
- When using hyperparameter tuning, it is probably better to make the search multi-core and leave the model training and evaluation single core.

Do you have any recommendations of your own?

This section provides more resources on the topic if you are looking to go deeper.

- How to optimize for speed, scikit-learn Documentation.
- Joblib: running Python functions as pipeline jobs
- timeit API.
- sklearn.ensemble.RandomForestClassifier API.
- sklearn.model_selection.cross_val_score API.
- sklearn.model_selection.GridSearchCV API.
- sklearn.model_selection.RandomizedSearchCV API.
- n_jobs scikit-learn argument.

In this tutorial, you discovered how to configure scikit-learn for multi-core machine learning.

Specifically, you learned:

- How to train machine learning models using multiple cores.
- How to make the evaluation of machine learning models parallel.
- How to use multiple cores to tune machine learning model hyperparameters.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.
