The post Error-Correcting Output Codes (ECOC) for Machine Learning appeared first on Machine Learning Mastery.

Many machine learning algorithms are designed for binary (two-class) classification. As such, these algorithms must either be modified for multi-class (more than two classes) classification problems or not used at all. The **Error-Correcting Output Codes** method is a technique that allows a multi-class classification problem to be reframed as multiple binary classification problems, allowing native binary classification models to be used directly.

Unlike one-vs-rest and one-vs-one methods that offer a similar solution by dividing a multi-class classification problem into a fixed number of binary classification problems, the error-correcting output codes technique allows each class to be encoded as an arbitrary number of binary classification problems. When an overdetermined representation is used, it allows the extra models to act as “error-correction” predictions that can result in better predictive performance.

In this tutorial, you will discover how to use error-correcting output codes for classification.

After completing this tutorial, you will know:

- Error-correcting output codes is a technique for using binary classification models on multi-class classification prediction tasks.
- How to fit, evaluate, and use error-correcting output codes classification models to make predictions.
- How to tune and evaluate different values for the number of bits per class hyperparameter used by error-correcting output codes.

Let’s get started.

This tutorial is divided into three parts; they are:

- Error-Correcting Output Codes
- Evaluate and Use ECOC Classifiers
- Tune Number of Bits Per Class

Classification tasks are those where a label is predicted for a given input example.

Binary classification tasks are those classification problems where the target contains two values, whereas multi-class classification problems are those that have more than two target class labels.

Many machine learning models have been developed for binary classification, although they may require modification to work with multi-class classification problems. For example, logistic regression and support vector machines were specifically designed for binary classification.

Several machine learning algorithms, such as SVM, were originally designed to solve only binary classification tasks.

— Page 133, Pattern Classification Using Ensemble Methods, 2010.

Rather than limiting the choice of algorithms or adapting the algorithms for multi-class problems, an alternative approach is to reframe the multi-class classification problem as multiple binary classification problems. Two common methods that can be used to achieve this include the one-vs-rest (OvR) and one-vs-one (OvO) techniques.

- **OvR**: splits a multi-class problem into one binary problem per class.
- **OvO**: splits a multi-class problem into one binary problem per pair of classes.

Once split into subtasks, a binary classification model can be fit on each task and the model with the largest response can be taken as the prediction.
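Both strategies are available as meta-estimators in scikit-learn. The sketch below is illustrative only: the dataset and base model choices are arbitrary, and it simply confirms how many binary models each strategy fits for a three-class problem.

```python
# Sketch: one-vs-rest and one-vs-one wrappers in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# a small synthetic three-class dataset for illustration
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)
# OvR fits one binary classifier per class (3 models here)
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
# OvO fits one binary classifier per pair of classes (3 pairs here)
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))
```

With three classes, both wrappers happen to fit three binary models; with four classes, OvR would fit four and OvO would fit six.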

Both the OvR and OvO may be thought of as a type of ensemble learning model given that multiple separate models are fit for a predictive modeling task and used in concert to make a prediction. In both cases, the prediction of the “*ensemble members*” is a simple winner take all approach.

… convert the multiclass task into an ensemble of binary classification tasks, whose results are then combined.

— Page 134, Pattern Classification Using Ensemble Methods, 2010.

For more on one-vs-rest and one-vs-one models, see the tutorial:

A related approach is to prepare a binary encoding (e.g. a bitstring) to represent each class in the problem. Each bit in the string can be predicted by a separate binary classification model. Encodings of arbitrary length can be chosen for a given multi-class classification problem.

To be clear, each model receives the full input pattern and only predicts one position in the output string. During training, each model can be trained to produce the correct 0 or 1 output for the binary classification task. A prediction can then be made for new examples by using each model to make a prediction for the input to create the binary string, then compare the binary string to each class’s known encoding. The class encoding that has the smallest distance to the prediction is then chosen as the output.

A codeword of length l is ascribed to each class. Commonly, the size of the codewords has more bits than needed in order to uniquely represent each class.

— Page 138, Pattern Classification Using Ensemble Methods, 2010.

It is an interesting approach that allows the class representation to be more elaborate than is required (perhaps overdetermined) as compared to a one-hot encoding and introduces redundancy into the representation and modeling of the problem. This is intentional as the additional bits in the representation act like error-correcting codes to fix, correct, or improve the prediction.

… the idea is that the redundant “error-correcting” bits allow for some inaccuracies, and can improve performance.

— Page 606, The Elements of Statistical Learning, 2016.

This gives the technique its name: error-correcting output codes, or ECOC for short.

Error-Correcting Output Codes (ECOC) is a simple yet powerful approach to deal with a multi-class problem based on the combination of binary classifiers.

— Page 90, Ensemble Methods, 2012.

Care can be taken to ensure that each encoded class has a very different binary string encoding. A suite of different encoding schemes has been explored as well as specific methods for constructing the encodings to ensure they are sufficiently far apart in the encoding space. Interestingly, random encodings have been found to work perhaps just as well.

… analyzed the ECOC approach, and showed that random code assignment worked as well as the optimally constructed error-correcting codes

— Page 606, The Elements of Statistical Learning, 2016.

For a detailed review of the various different encoding schemes and methods for mapping predicted strings to encoded classes, I recommend Chapter 6 “*Error Correcting Output Codes*” of the book “Pattern Classification Using Ensemble Methods“.

The scikit-learn library provides an implementation of ECOC via the OutputCodeClassifier class.

The class takes as an argument the model to use to fit each binary classifier, and any machine learning model can be used. In this case, we will use a logistic regression model, intended for binary classification.

The class also provides the “*code_size*” argument that specifies the size of the encoding for the classes as a multiple of the number of classes, e.g. the number of bits to encode for each class label.

For example, if we wanted an encoding with bit strings with a length of 6 bits, and we had three classes, then we can specify the coding size as 2:

- encoding_length = code_size * num_classes
- encoding_length = 2 * 3
- encoding_length = 6

The example below demonstrates how to define an instance of the *OutputCodeClassifier* with 2 bits per class, using a LogisticRegression model for each bit in the encoding.

```python
...
# define the binary classification model
model = LogisticRegression()
# define the ecoc model
ecoc = OutputCodeClassifier(model, code_size=2, random_state=1)
```

Although there are many sophisticated ways to construct the encoding for each class, the *OutputCodeClassifier* class selects a random bit string encoding for each class, at least at the time of writing.
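The random-codebook idea can be sketched directly with NumPy. This mimics the concept only, not scikit-learn's internal implementation, and `min_hamming` is a hypothetical helper for inspecting how well-separated the resulting codewords are.

```python
# Sketch: draw a random binary codebook and measure the smallest pairwise
# Hamming distance between class codewords (larger is better separated).
import numpy as np

rng = np.random.default_rng(1)
n_classes, code_size = 3, 2
length = code_size * n_classes  # 6 bits per codeword
codebook = rng.integers(0, 2, size=(n_classes, length))

def min_hamming(book):
    # smallest number of differing bits over all pairs of codewords
    dists = [int(np.sum(book[i] != book[j]))
             for i in range(len(book)) for j in range(i + 1, len(book))]
    return min(dists)

print(codebook.shape)
print(min_hamming(codebook))
```

A purpose-built code would maximize this minimum distance; as the quote above notes, in practice random assignments often perform comparably.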

We can explore the use of the *OutputCodeClassifier* on a synthetic multi-class classification problem.

We can use the make_classification() function to define a multi-class classification problem with 1,000 examples, 20 input features, and three classes.

The example below demonstrates how to create the dataset and summarize the number of rows, columns, and classes in the dataset.

```python
# multi-class classification dataset
from collections import Counter
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
# summarize the dataset
print(X.shape, y.shape)
# summarize the number of classes
print(Counter(y))
```

Running the example creates the dataset and reports the number of rows and columns, confirming the dataset was created as expected.

The number of examples in each class is then reported, showing a nearly equal number of cases for each of the three configured classes.

```
(1000, 20) (1000,)
Counter({2: 335, 1: 333, 0: 332})
```

Next, we can evaluate an error-correcting output codes model on the dataset.

We will use logistic regression with 2 bits per class, as defined above. The model will then be evaluated using repeated stratified k-fold cross-validation with three repeats and 10 folds. We will summarize the performance of the model using the mean and standard deviation of classification accuracy across all repeats and folds.

```python
...
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(ecoc, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```

Tying this together, the complete example is listed below.

```python
# evaluate error-correcting output codes for multi-class classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
# define the binary classification model
model = LogisticRegression()
# define the ecoc model
ecoc = OutputCodeClassifier(model, code_size=2, random_state=1)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(ecoc, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
```

Running the example defines the model and evaluates it on our synthetic multi-class classification dataset using the defined test procedure.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a mean classification accuracy of about 76.6 percent.

```
Accuracy: 0.766 (0.037)
```

We may choose to use this as our final model.

This requires that we fit the model on all available data and use it to make predictions on new data.

The example below provides a full example of how to fit and use an error-correcting output model as a final model.

```python
# use error-correcting output codes model as a final model and make a prediction
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
# define the binary classification model
model = LogisticRegression()
# define the ecoc model
ecoc = OutputCodeClassifier(model, code_size=2, random_state=1)
# fit the model on the whole dataset
ecoc.fit(X, y)
# make a single prediction
row = [[0.04339387, 2.75542632, -3.79522705, -0.71310994, -3.08888853, -1.2963487, -1.92065166, -3.15609907, 1.37532356, 3.61293237, 1.00353523, -3.77126962, 2.26638828, -10.22368666, -0.35137382, 1.84443763, 3.7040748, 2.50964286, 2.18839505, -2.31211692]]
yhat = ecoc.predict(row)
print('Predicted Class: %d' % yhat[0])
```

Running the example fits the ECOC model on the entire dataset and uses the model to predict the class label for a single row of data.

In this case, we can see that the model predicted the class label 0.

```
Predicted Class: 0
```

Now that we are familiar with how to fit and use the ECOC model, let’s take a closer look at how to configure it.

The key hyperparameter for the ECOC model is the encoding of class labels.

This includes properties such as:

- The choice of representation (bits, real numbers, etc.)
- The encoding of each class label (random, etc.)
- The length of representation (number of bits, etc.)
- How predictions are mapped to classes (distance, etc.)

The OutputCodeClassifier scikit-learn implementation does not currently provide a lot of control over these elements.

The element it does give control over is the number of bits used to encode each class label.

In this section, we can perform a manual grid search across different numbers of bits per class label and compare the results. This provides a template that you can adapt and use on your own project.

First, we can define a function to create and return the dataset.

```python
# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
	return X, y
```

We can then define a function that will create a collection of models to evaluate.

Each model will be an instance of the *OutputCodeClassifier* using a LogisticRegression for each binary classification problem. We will configure the *code_size* of each model to be different, with values ranging from 1 to 20.

```python
# get a list of models to evaluate
def get_models():
	models = dict()
	for i in range(1, 21):
		# create model
		model = LogisticRegression()
		# create error correcting output code classifier
		models[str(i)] = OutputCodeClassifier(model, code_size=i, random_state=1)
	return models
```

We can evaluate each model using repeated stratified k-fold cross-validation, as we did in the previous section, to give a sample of classification accuracy scores.

```python
# evaluate a given model using cross-validation
def evaluate_model(model):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores
```

We can report the mean and standard deviation of the scores for each configuration and plot the distributions as box and whisker plots side by side to visually compare the results.

```python
...
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Tying this all together, the complete example of comparing ECOC classification with a grid of the number of bits per class is listed below.

```python
# compare the number of bits per class for error-correcting output code classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for i in range(1, 21):
		# create model
		model = LogisticRegression()
		# create error correcting output code classifier
		models[str(i)] = OutputCodeClassifier(model, code_size=i, random_state=1)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Running the example first evaluates each model configuration and reports the mean and standard deviation of the accuracy scores.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that perhaps 5 or 6 bits per class results in the best performance with reported mean accuracy scores of about 78.2 percent and 78.0 percent respectively. We also see good results for 9, 13, 17, and 20 bits per class, with perhaps 17 bits per class giving the best result of about 78.5 percent.

```
>1 0.545 (0.032)
>2 0.766 (0.037)
>3 0.776 (0.036)
>4 0.769 (0.035)
>5 0.782 (0.037)
>6 0.780 (0.037)
>7 0.776 (0.039)
>8 0.775 (0.036)
>9 0.782 (0.038)
>10 0.779 (0.036)
>11 0.770 (0.033)
>12 0.777 (0.037)
>13 0.781 (0.037)
>14 0.779 (0.039)
>15 0.771 (0.033)
>16 0.769 (0.035)
>17 0.785 (0.034)
>18 0.776 (0.038)
>19 0.776 (0.034)
>20 0.780 (0.038)
```

A figure is created showing the box and whisker plots for the accuracy scores for each model configuration.

We can see that besides a value of 1, the number of bits per class delivers similar results in terms of spread and mean accuracy scores that cluster around 77 percent. This suggests that the approach is reasonably stable across configurations.

This section provides more resources on the topic if you are looking to go deeper.

- Ensemble Methods, 2012.
- Pattern Classification Using Ensemble Methods, 2010.
- The Elements of Statistical Learning, 2016.

- sklearn.multiclass.OutputCodeClassifier API.
- Error-Correcting Output-Codes, scikit-learn documentation.

In this tutorial, you discovered how to use error-correcting output codes for classification.

Specifically, you learned:

- Error-correcting output codes is a technique for using binary classification models on multi-class classification prediction tasks.
- How to fit, evaluate, and use error-correcting output codes classification models to make predictions.
- How to tune and evaluate different values for the number of bits per class hyperparameter used by error-correcting output codes.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Why Use Ensemble Learning? appeared first on Machine Learning Mastery.

Ensembles are predictive models that combine predictions from two or more other models.

Ensemble learning methods are popular and the go-to technique when the best performance on a predictive modeling project is the most important outcome.

Nevertheless, they are not always the most appropriate technique to use, and beginners in the field of applied machine learning often have the expectation that ensembles, or a specific ensemble method, are always the best method to use.

Ensembles offer two specific benefits on a predictive modeling project, and it is important to know what these benefits are and how to measure them to ensure that using an ensemble is the right decision on your project.

In this tutorial, you will discover the benefits of using ensemble methods for machine learning.

After reading this tutorial, you will know:

- A minimum benefit of using ensembles is to reduce the spread in the average skill of a predictive model.
- A key benefit of using ensembles is to improve the average prediction performance over any contributing member in the ensemble.
- The mechanism for improved performance with ensembles is often the reduction in the variance component of prediction errors made by the contributing models.

Let’s get started.

This tutorial is divided into four parts; they are:

- Ensemble Learning
- Use Ensembles to Improve Robustness
- Bias, Variance, and Ensembles
- Use Ensembles to Improve Performance

An ensemble is a machine learning model that combines the predictions from two or more models.

The models that contribute to the ensemble, referred to as ensemble members, may be the same type or different types and may or may not be trained on the same training data.

The predictions made by the ensemble members may be combined using statistics, such as the mode or mean, or by more sophisticated methods that learn how much to trust each member and under what conditions.
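As a small illustration of the simple statistical combinations mentioned above (the prediction values here are made up), the mode and mean can be computed directly:

```python
# Sketch: combining ensemble member predictions with simple statistics.
import numpy as np
from scipy import stats

# three members' class predictions for four samples (classification: mode)
class_preds = np.array([[0, 1, 2, 1],
                        [0, 1, 1, 1],
                        [0, 2, 2, 0]])
combined = stats.mode(class_preds, axis=0, keepdims=False).mode
print(combined)  # majority vote per sample

# three members' numeric predictions for two samples (regression: mean)
reg_preds = np.array([[2.0, 3.0], [2.5, 3.5], [3.0, 2.5]])
print(reg_preds.mean(axis=0))  # average prediction per sample
```

The more sophisticated alternatives, such as stacking, replace these fixed statistics with a learned combination.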

The study of ensemble methods really picked up in the 1990s, and that decade was when papers on the most popular and widely used methods were published, such as the core bagging, boosting, and stacking methods.

In the late 2000s, adoption of ensembles picked up due in part to their huge success in machine learning competitions, such as the Netflix prize and later competitions on Kaggle.

Over the last couple of decades, multiple classifier systems, also called ensemble systems have enjoyed growing attention within the computational intelligence and machine learning community.

— Page 1, Ensemble Machine Learning, 2012.

Ensemble methods greatly increase computational cost and complexity. This increase comes from the expertise and time required to train and maintain multiple models rather than a single model. This forces the question:

**Why should we consider using an ensemble?**

There are two main reasons to use an ensemble over a single model, and they are related; they are:

- **Performance**: An ensemble can make better predictions and achieve better performance than any single contributing model.
- **Robustness**: An ensemble reduces the spread or dispersion of the predictions and model performance.

Ensembles are used to achieve better predictive performance on a predictive modeling problem than a single predictive model. The way this is achieved can be understood as the model reducing the variance component of the prediction error by adding bias (i.e. in the context of the bias-variance trade-off).

Originally developed to reduce the variance—thereby improving the accuracy—of an automated decision-making system …

— Page 1, Ensemble Machine Learning, 2012.

Another important and less discussed benefit of ensemble methods is improved robustness or reliability in the average performance of a model.

These are both important concerns on a machine learning project and sometimes we may prefer one or both properties from a model.

Let’s take a closer look at these two properties in order to better understand the benefits of using ensemble learning on a project.

On a predictive modeling project, we often evaluate multiple models or modeling pipelines and choose one that performs well or best as our final model.

The algorithm or pipeline is then fit on all available data and used to make predictions on new data.

We have an idea of how well the model will perform on average from our test harness, typically estimated using repeated k-fold cross-validation as a gold standard. The problem is, average performance might not be sufficient.

An average accuracy or error of a model is a summary of the expected performance, when in fact, some models performed better and some models performed worse on different subsets of the data.

The standard deviation summarizes the dispersion or spread of observations around the mean. For an accuracy or error measure for a model, it can give you an idea of the spread of the model’s behavior.

Looking at the minimum and maximum model performance scores will give you an idea of the worst and best performance you might expect from the model, and this might not be acceptable for your application.

The simplest ensemble is to fit the model multiple times on the training dataset and combine the predictions using a summary statistic, such as the mean for regression or the mode for classification. Importantly, each model needs to be slightly different due to the stochastic learning algorithm, differences in the composition of the training dataset, or differences in the model itself.

This will reduce the spread in the predictions made by the model. The mean performance will probably be about the same, although the worst- and best-case performance will be brought closer to the mean performance.

In effect, it smooths out the expected performance of the model.
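A tiny simulation makes the effect concrete. The numbers here are entirely synthetic and illustrative: each "model" predicts the truth plus random noise, and averaging several of them keeps roughly the same mean while shrinking the spread.

```python
# Simulation sketch: averaging several noisy models reduces the spread of
# predictions without changing the expected value. Synthetic numbers only.
import numpy as np

rng = np.random.default_rng(7)
true_value = 10.0
n_models, n_trials = 10, 5000

# a single model's prediction = truth + noise, repeated over many trials
single = true_value + rng.normal(0, 2.0, size=n_trials)
# an ensemble averages 10 such noisy predictions per trial
ensemble = (true_value + rng.normal(0, 2.0, size=(n_models, n_trials))).mean(axis=0)

print(round(single.std(), 2), round(ensemble.std(), 2))
```

For independent noise, the spread of the average shrinks by roughly a factor of the square root of the number of models, which is the "smoothing out" described above.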

We can refer to this as the “*robustness*” in the expected performance of the model; it is a minimum benefit of using an ensemble method.

An ensemble may or may not improve modeling performance over any single contributing member, as discussed further below, but at minimum, it should reduce the spread in the average performance of the model.

For more on this topic, see the tutorial:

Machine learning models for classification and regression learn a mapping function from inputs to outputs.

This mapping is learned from examples from the problem domain, the training dataset, and is evaluated on data not used during training, the test dataset.

The errors made by a machine learning model are often described in terms of two properties: the **bias** and the **variance**.

The bias is a measure of how closely the model can capture the mapping function between inputs and outputs. It captures the rigidity of the model: the strength of the assumption the model has about the functional form of the mapping between inputs and outputs.

The variance of the model is the amount the performance of the model changes when it is fit on different training data. It captures the impact that the specifics of the data have on the model.

Variance refers to the amount by which [the model] would change if we estimated it using a different training data set.

— Page 34, An Introduction to Statistical Learning with Applications in R, 2014.

The bias and the variance of a model’s performance are connected.

Ideally, we would prefer a model with low bias and low variance, although in practice, this is very challenging. In fact, this could be described as the goal of applied machine learning for a given predictive modeling problem.

Reducing the bias can often easily be achieved by increasing the variance. Conversely, reducing the variance can easily be achieved by increasing the bias.

This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance […] or a method with very low variance but high bias …

— Page 36, An Introduction to Statistical Learning with Applications in R, 2014.

Some models naturally have a high bias or a high variance, which can often be reduced or increased using hyperparameters that change the learning behavior of the algorithm.

Ensembles provide a way to reduce the variance of the predictions; that is, the amount of error in the predictions that can be attributed to “*variance*.”

This is not always the case, but when it is, this reduction in variance, in turn, leads to improved predictive performance.

Empirical and theoretical evidence show that some ensemble techniques (such as bagging) act as a variance reduction mechanism, i.e., they reduce the variance component of the error. Moreover, empirical results suggest that other ensemble techniques (such as AdaBoost) reduce both the bias and the variance parts of the error.

— Page 39, Pattern Classification Using Ensemble Methods, 2010.
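The bagging-as-variance-reduction claim can be illustrated with a rough comparison on synthetic data (illustrative only, not a benchmark): a single decision tree versus a bagged ensemble of fifty trees.

```python
# Sketch: bagging as a variance-reduction mechanism. A single decision tree
# has high variance; bagging many trees typically improves and stabilizes
# the cross-validation score. Synthetic data, illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=1)
# a single high-variance tree
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
# fifty trees fit on bootstrap samples, predictions combined by vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=1)
bag_scores = cross_val_score(bag, X, y, cv=5)
print(round(tree_scores.mean(), 3), round(bag_scores.mean(), 3))
```

On data like this, the bagged ensemble usually scores higher on average than the single tree, consistent with the quoted variance-reduction argument.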

Using ensembles to reduce the variance properties of prediction errors leads to the key benefit of using ensembles in the first place: to improve predictive performance.

Reducing the variance element of the prediction error improves predictive performance.

We explicitly use ensemble learning to seek better predictive performance, such as lower error for regression or higher accuracy for classification.

… there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles.

— Page 2, Ensemble Methods in Data Mining, 2010.

This is the **primary use of ensemble learning methods** and the benefit demonstrated through the use of ensembles by the majority of winners of machine learning competitions, such as the Netflix prize and competitions on Kaggle.

In the Netflix Prize, a contest ran for two years in which the first team to submit a model improving on Netflix’s internal recommendation system by 10% would win $1,000,000. […] the final edge was obtained by weighing contributions from the models of up to 30 competitors.

— Page 8, Ensemble Methods in Data Mining, 2010.

This benefit has also been demonstrated with academic competitions, such as top solutions for the famous ImageNet dataset in computer vision.

An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task.

— Deep Residual Learning for Image Recognition, 2015.

When used in this way, an ensemble should only be adopted if it performs better on average than any contributing member of the ensemble. If this is not the case, then the contributing member that performs better should be used instead.

Consider the distribution of expected scores calculated by a model on a test harness, such as repeated k-fold cross-validation, as we did above when considering the “*robustness*” offered by an ensemble. An ensemble that reduces the variance in the error, in effect, will shift the distribution rather than simply shrink the spread of the distribution.

This can result in a better average performance as compared to any single model.

This is not always the case, and having this expectation is a common mistake made by beginners.

It is possible, and even common, for the performance of an ensemble to perform no better than the best-performing member of the ensemble. This can happen if the ensemble has one top-performing model and the other members do not offer any benefit or the ensemble is not able to harness their contribution effectively.

It is also possible for an ensemble to perform worse than the best-performing member of the ensemble. This too is common and typically involves one top-performing model whose predictions are made worse by one or more poor-performing models, where the ensemble is not able to harness their contributions effectively.

As such, it is important to test a suite of ensemble methods and tune their behavior, just as we do for any individual machine learning model.

This section provides more resources on the topic if you are looking to go deeper.

- How to Reduce Variance in a Final Machine Learning Model
- How to Develop a Horizontal Voting Deep Learning Ensemble to Reduce Variance

- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.
- Ensemble Methods in Data Mining, 2010.

In this post, you discovered the benefits of using ensemble methods for machine learning.

Specifically, you learned:

- A minimum benefit of using ensembles is to reduce the spread in the average skill of a predictive model.
- A key benefit of using ensembles is to improve the average prediction performance over any contributing member in the ensemble.
- The mechanism for improved performance with ensembles is often the reduction in the variance component of prediction errors made by the contributing models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


]]>The post A Gentle Introduction to Ensemble Learning appeared first on Machine Learning Mastery.

]]>This includes choosing a book to read based on reviews, choosing a course of action based on the advice of multiple medical doctors, and determining guilt.

Often, decision making by a group of individuals results in a better outcome than a decision made by any one member of the group. This is generally referred to as the wisdom of the crowd.

We can achieve a similar result by combining the predictions of multiple machine learning models for regression and classification predictive modeling problems. This is referred to generally as ensemble machine learning, or simply **ensemble learning**.

In this post, you will discover a gentle introduction to ensemble learning.

After reading this post, you will know:

- Many decisions we make involve the opinions or votes of other people.
- The ability of groups of people to make better decisions than individuals is called the wisdom of the crowd.
- Ensemble machine learning involves combining predictions from multiple skillful models.

Let’s get started.

This tutorial is divided into three parts; they are:

- Making Important Decisions
- Wisdom of Crowds
- Ensemble Machine Learning

Consider important decisions you make in your life.

For example:

- What book to purchase and read next.
- What university to attend.

Candidate books are those that sound interesting, but the book we purchase might have the most favorable reviews. Candidate universities are those that offer the courses we’re interested in, but we might choose one based on the feedback from friends and acquaintances that have first-hand experience.

We might trust the reviews and star ratings because each individual that contributed a review was (hopefully) unaffiliated with the book and independent of the other people leaving a review. When this is not the case, trust in the outcome is questionable and trust in the system is shaken, which is why Amazon works hard to delete fake reviews for books.

Also, consider important decisions we make more personally.

For example, medical treatment for an illness.

We take advice from an expert, but we seek a second, third, and even more opinions to confirm we are taking the best course of action.

The advice from the second and third opinion may or may not match the first opinion, but we weigh it heavily because it is provided dispassionately, objectively, and independently. If the doctors colluded on their opinion, then we would feel like the process of seeking a second and third opinion has failed.

… whenever we are faced with making a decision that has some important consequence, we often seek the opinions of different “experts” to help us make that decision …

— Page 2, Ensemble Machine Learning, 2012.

Finally, consider decisions we make as a society.

For example:

- Who should represent a geographical area in a government.
- Whether someone is guilty of a crime.

The democratic election of representatives is based (in some form) on the independent votes of citizens.

Making decisions based on the input of multiple people or experts has been a common practice in human civilization and serves as the foundation of a democratic society.

— Page v, Ensemble Methods, 2012.

An individual’s guilt of a serious crime may be determined by a jury of independent peers, often sequestered to enforce the independence of their interpretation. Cases may also be appealed at multiple levels, providing second, third, and more opinions on the outcome.

The judicial system in many countries, whether based on a jury of peers or a panel of judges, is also based on ensemble-based decision making.

— Pages 1-2, Ensemble Machine Learning, 2012.

These are all examples of an outcome arrived at through the combination of lower-level opinions, votes, or decisions.

… ensemble-based decision making is nothing new to us; as humans, we use such systems in our daily lives so often that it is perhaps second nature to us.

— Page 1, Ensemble Machine Learning, 2012.

In each case, we can see that there are properties of the lower-level decisions that are critical for the outcome to be useful, such as a belief in their independence and that each has some validity on their own.

This approach to decision making is so common, it has a name.

This approach to decision making when using humans that make the lower-level decisions is often referred to as the “wisdom of the crowd.”

It refers to the case where the opinion calculated from the aggregate of a group of people is often more accurate, useful, or correct than the opinion of any individual in the group.

A famous case of this from more than 100 years ago, and often cited, is that of a contest at a fair in Plymouth, England to estimate the weight of an ox. Individuals made their guess and the person whose guess was closest to the actual weight won the meat.

The statistician Francis Galton collected all of the guesses afterward and calculated the average of the guesses.

… he added all the contestants’ estimates, and calculated the mean of the group’s guesses. That number represented, you could say, the collective wisdom of the Plymouth crowd. If the crowd were a single person, that was how much it would have guessed the ox weighed.

— Page xiii, The Wisdom of Crowds, 2004.

He found that the mean of the guesses made by the contestants was very close to the actual weight. That is, taking the average value of all the numerical weights from the 800 participants was an accurate way of determining the true weight.

The crowd had guessed that the ox, after it had been slaughtered and dressed, would weigh 1,197 pounds. After it had been slaughtered and dressed, the ox weighed 1,198 pounds. In other words, the crowd’s judgment was essentially perfect.

— Page xiii, The Wisdom of Crowds, 2004.

This example is given at the beginning of James Surowiecki’s 2004 book titled “The Wisdom of Crowds” that explores the ability of groups of humans to make decisions and predictions that are often better than the members of the group.

This intelligence, or what I’ll call “the wisdom of crowds,” is at work in the world in many different guises.

— Page xiv, The Wisdom of Crowds, 2004.

The book motivates the preference to average the guesses, votes, and opinions of groups of people when making some important decisions instead of searching for and consulting a single expert.

… we feel the need to “chase the expert.” The argument of this book is that chasing the expert is a mistake, and a costly one at that. We should stop hunting and ask the crowd (which, of course, includes the geniuses as well as everyone else) instead. Chances are, it knows.

— Page xv, The Wisdom of Crowds, 2004.

The book goes on to highlight a number of properties of any system that makes decisions based on groups of people, summarized nicely in Lior Rokach’s 2010 book titled “Pattern Classification Using Ensemble Methods” (page 22), as:

- **Diversity of opinion**: Each member should have private information, even if it is just an eccentric interpretation of the known facts.
- **Independence**: Members’ opinions are not determined by the opinions of those around them.
- **Decentralization**: Members are able to specialize and draw conclusions based on local knowledge.
- **Aggregation**: Some mechanism exists for turning private judgments into a collective decision.

As a decision-making system, the approach is not always the most effective (e.g. stock market bubbles, fads, etc.), but can be effective in a range of different domains where the outcomes are important.

We can use this approach to decision making in applied machine learning.

Applied machine learning often involves fitting and evaluating models on a dataset.

Given that we cannot know which model will perform best on the dataset beforehand, this may involve a lot of trial and error until we find a model that performs well or best for our project.

This is akin to making a decision using a single expert. Perhaps the best expert we can find.

A complementary approach is to prepare multiple different models, then combine their predictions. This is called an ensemble machine learning model, or simply an ensemble, and the process of finding a well-performing ensemble model is referred to as “**ensemble learning**“.

Ensemble methodology imitates our second nature to seek several opinions before making a crucial decision.

— Page vii, Pattern Classification Using Ensemble Methods, 2010.

This is akin to making a decision using the opinions from multiple experts.

The most common type of ensemble involves training multiple versions of the same machine learning model in a way that ensures that each ensemble member is different (e.g. decision trees fit on different subsamples of the training dataset), then combining the predictions using averaging or voting.
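This "same model, different data" approach can be sketched with a minimal, from-scratch bagging-style ensemble using only the Python standard library. The 1-D dataset, the decision-stump members, and the helper names (`fit_stump`, `ensemble_predict`) are all illustrative assumptions, not an example from the post itself.

```python
import random
from collections import Counter

random.seed(1)

# toy 1-D dataset: class 0 clusters near 0.3, class 1 clusters near 0.7
data = [(random.gauss(0.3, 0.1), 0) for _ in range(50)] + \
       [(random.gauss(0.7, 0.1), 1) for _ in range(50)]

def fit_stump(sample):
    # a decision stump: pick the threshold that best separates the sample
    best_t, best_acc = 0.5, 0.0
    for t in [i / 100 for i in range(100)]:
        acc = sum((x > t) == (y == 1) for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# each member is fit on a different bootstrap (sampled with replacement),
# which ensures the members differ from one another
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]

def ensemble_predict(x):
    # combine the members' predictions by majority vote
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

print(ensemble_predict(0.25), ensemble_predict(0.75))  # prints: 0 1
```

In practice you would use a library implementation (e.g. bagged decision trees), but the structure is the same: resample, fit one member per sample, then vote or average.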

A less common, although just as effective, approach involves training different algorithms on the same data (e.g. a decision tree, a support vector machine, and a neural network) and combining their predictions.
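The "different algorithms, same data" approach amounts to a voting ensemble over heterogeneous members. A minimal sketch is below; the three toy "models" (a threshold rule, a nearest-centroid rule, and a deliberately miscalibrated linear rule) are hypothetical stand-ins for real, trained classifiers.

```python
import statistics

# three dissimilar toy "models" for the same 1-D two-class task
def stump(x):             # simple threshold rule
    return int(x > 0.5)

def nearest_centroid(x):  # distance to assumed class centers 0.3 and 0.7
    return int(abs(x - 0.7) < abs(x - 0.3))

def noisy_linear(x):      # a linear score with a deliberately shifted cutoff
    return int(2.0 * x - 0.9 > 0)

def vote(x):
    # hard majority vote across the three heterogeneous members
    return statistics.mode([stump(x), nearest_centroid(x), noisy_linear(x)])

# near x = 0.47 the miscalibrated member errs, but it is outvoted
print(vote(0.40), vote(0.80), vote(0.47))  # prints: 0 1 0
```

The last call shows the error-correcting effect: one member's mistake is absorbed because the other two disagree with it.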

Like combining the opinions of humans in a crowd, the effectiveness of the ensemble relies on each model having some skill (better than random) and some independence from the other models. This latter point is often interpreted as meaning that the model is skillful in a different way from other models in the ensemble.

The hope is that the ensemble results in a better performing model than any contributing member.

The core principle is to weigh several individual pattern classifiers, and combine them in order to reach a classification that is better than the one obtained by each of them separately.

— Page vii, Pattern Classification Using Ensemble Methods, 2010.

At worst, the ensemble limits the worst case of predictions by reducing the variance of the predictions. Model performance can vary with the training data (and the stochastic nature of the learning algorithm in some cases), resulting in better or worse performance for any specific model.

… the goal of ensemble systems is to create several classifiers with relatively fixed (or similar) bias and then combining their outputs, say by averaging, to reduce the variance.

— Page 2, Ensemble Machine Learning, 2012.

An ensemble can smooth this out and ensure that predictions made are closer to the average performance of contributing members. Further, reducing the variance in predictions often results in a lift in the skill of the ensemble. This comes at the added computational cost of fitting and maintaining multiple models instead of a single model.
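The variance-reduction effect of averaging can be demonstrated with a small simulation, assuming each model's prediction is the true value plus independent noise (an idealized setup; real model errors are rarely fully independent):

```python
import random
import statistics

random.seed(2)

true_value = 10.0

def model_prediction():
    # one model's prediction: the truth plus that model's own error
    return true_value + random.gauss(0, 1.0)

# spread of single-model predictions vs. the mean of 10-model ensembles
singles = [model_prediction() for _ in range(1000)]
ensembles = [statistics.mean(model_prediction() for _ in range(10))
             for _ in range(1000)]

print(round(statistics.stdev(singles), 2))    # near 1.0
print(round(statistics.stdev(ensembles), 2))  # near 1/sqrt(10), about 0.32
```

Averaging n independent, equally noisy predictions shrinks the standard deviation of the combined prediction by a factor of about sqrt(n), which is the mechanism behind the "smoothing" described above.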

Although ensemble predictions will have a lower variance, they are not guaranteed to have better performance than any single contributing member.

… researchers in the computational intelligence and machine learning community have studied schemes that share such a joint decision procedure. These schemes are generally referred to as ensemble learning, which is known to reduce the classifiers’ variance and improve the decision system’s robustness and accuracy.

— Page v, Ensemble Methods, 2012.

Sometimes, the best performing model, e.g. the best expert, is sufficiently superior compared to other models that combining its predictions with other models can result in worse performance.

As such, selecting models, even ensemble models, still requires carefully controlled experiments on a robust test harness.

This section provides more resources on the topic if you are looking to go deeper.

- The Wisdom of Crowds, 2004.
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.

- Ensemble learning, Wikipedia.
- Ensemble learning, Scholarpedia.
- Wisdom of the crowd, Wikipedia.
- The Wisdom of Crowds, Wikipedia.

In this post, you discovered a gentle introduction to ensemble learning.

Specifically, you learned:

- Many decisions we make involve the opinions or votes of other people.
- The ability of groups of people to make better decisions than individuals is called the wisdom of the crowd.
- Ensemble machine learning involves combining predictions from multiple skillful models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Ensemble Learning appeared first on Machine Learning Mastery.

]]>The post 6 Books on Ensemble Learning appeared first on Machine Learning Mastery.

]]>The effect can be both improved predictive performance and lower variance of the predictions made by the model.

Ensemble methods are covered in most textbooks on machine learning; nevertheless, there are books dedicated to the topic.

In this post, you will discover the top books on the topic of ensemble machine learning.

After reading this post, you will know:

- Books on ensemble learning, including their table of contents and where to learn more about them.
- Sections and chapters on ensemble learning in the most popular and common machine learning textbooks.
- Book recommendations for machine learning practitioners interested in ensemble learning.

Let’s get started.

The books dedicated to the topic of ensemble learning that we will cover are as follows:

- Supervised and Unsupervised Ensemble Methods and their Applications, 2008.
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Learning, 2019.
- Ensemble Methods in Data Mining, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.

There are also some books from Packt, but I won’t be reviewing them; they are:

- Hands-On Ensemble Learning with R, 2018.
- Hands-On Ensemble Learning with Python, 2019.
- Ensemble Machine Learning Cookbook, 2019.

**Did I miss a book on ensemble learning?**

Let me know in the comments below.

**Have you read any of these books on ensemble learning?**

What did you think? Let me know in the comments.

Let’s take a closer look at these books, including their author, table of contents, and where to learn more.

The full title of this book is “Supervised and Unsupervised Ensemble Methods and their Applications” and it was edited by Oleg Okun and Giorgio Valentini and published in 2008.

This book is a collection of academic papers by a range of different authors on the topic of applications of ensemble learning.

The book includes nine chapters divided into two parts, assembling contributions to the applications of supervised and unsupervised ensembles.

— Page VIII, Supervised and Unsupervised Ensemble Methods and their Applications, 2008.

- Part I: Ensembles of Clustering Methods and Their Applications
- Chapter 01: Cluster Ensemble Methods: From Single Clusterings to Combined Solutions
- Chapter 02: Random Subspace Ensembles for Clustering Categorical Data
- Chapter 03: Ensemble Clustering with a Fuzzy Approach
- Chapter 04: Collaborative Multi-Strategical Clustering for OBject-Oriented Image Analysis

- Part II: Ensembles of Classification Methods and Their Applications
- Chapter 05: Intrusion Detection in Computer Systems Using Multiple Classifier Systems
- Chapter 06: Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification
- Chapter 07: Multivariate Time Series Classification via Stacking of Univariate Classifiers
- Chapter 08: Gradient Boosting GARCH and Neural Networks for Time Series Prediction
- Chapter 09: Cascading with VDM and Binary Decision Trees for Nominal Data

I generally would not recommend this book to machine learning practitioners unless one of the applications covered by the book is directly related to your current project.

You can learn more about this book here:

The full title of this book is “Pattern Classification Using Ensemble Methods” and it was written by Lior Rokach and published in 2010.

This book provides a technical introduction to the topic of ensemble machine learning written for students and academics.

Throughout the book, special emphasis was put on the extensive use of illustrative examples. Accordingly, in addition to ensemble theory, the reader is also provided with an abundance of artificial as well as real-world applications from a wide range of fields. The data referred to in this book, as well as most of the Java implementations of the presented algorithms, can be obtained via the Web.

— Page viii, Pattern Classification Using Ensemble Methods, 2010.

- Chapter 01: Introduction to Pattern Classification
- Chapter 02: Introduction to Ensemble Learning
- Chapter 03: Ensemble Classification
- Chapter 04: Ensemble Diversity
- Chapter 05: Ensemble Selection
- Chapter 06: Error Correcting Output Codes
- Chapter 07: Evaluating Ensembles of Classifiers

I like the level of this book. It is technical, but not overly so, and stays grounded in the concerns of using ensemble algorithms on supervised predictive modeling projects. I think it is a good textbook on ensemble learning for practitioners.

You can learn more about this book here:

The full title of this book is “Ensemble Learning: Pattern Classification Using Ensemble Methods” and it was written by Lior Rokach and published in 2019.

This is a direct update to the book “Pattern Classification Using Ensemble Methods” and given a different title.

The first edition of this book was published a decade ago. The book was well-received by the machine learning and data science communities and was translated into Chinese. […] The second edition aims to update the previously presented material on the fundamental areas, and to present new findings in the field; more than a third of this edition is comprised of new material.

— Page vii, Ensemble Learning: Pattern Classification Using Ensemble Methods, 2019.

- Chapter 01: Introduction to Machine Learning
- Chapter 02: Classification and Regression Trees
- Chapter 03: Introduction to Ensemble Learning
- Chapter 04: Ensemble Classification
- Chapter 05: Gradient Boosting Machines
- Chapter 06: Ensemble Diversity
- Chapter 07: Ensemble Selection
- Chapter 08: Error Correcting Output Codes
- Chapter 09: Evaluating Ensemble Classifiers

This is a great textbook on ensemble learning for students and practitioners and is preferred over “*Pattern Classification Using Ensemble Methods*” if you must choose between the two.

You can learn more about this book here:

The full title of this book is “Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions” and it was written by Giovanni Seni and John Elder and published in 2010.

This is a technical book on ensembles, although concepts are demonstrated with full examples in R.

This book is aimed at novice and advanced analytic researchers and practitioners – especially in Engineering, Statistics, and Computer Science. Those with little exposure to ensembles will learn why and how to employ this breakthrough method, and advanced practitioners will gain insight into building even more powerful models. Throughout, snippets of code in R are provided to illustrate the algorithms described and to encourage the reader to try the technique.

— Page i, Ensemble Methods in Data Mining, 2010.

- Chapter 01: Ensembles Discovered
- Chapter 02: Predictive Learning and Decision Trees
- Chapter 03: Model Complexity, Model Selection and Regularization
- Chapter 04: Importance Sampling and the Classic Ensemble Methods
- Chapter 05: Rule Ensembles and Interpretation Statistics
- Chapter 06: Ensemble Complexity
- Appendix A: AdaBoost Equivalence to FSF Procedure
- Appendix B: Gradient Boosting and Robust Loss Functions

I believe this is the first book I purchased on ensemble learning years ago. It is a good crash course in ensemble learning for practitioners, especially those already using R. It may be a little too mathematical for most practitioners; nevertheless, I think it might be a good smaller substitute for the above textbook on ensemble methods.

You can learn more about this book here:

The full title of this book is “Ensemble Methods: Foundations and Algorithms” and it was written by Zhi-Hua Zhou and published in 2012.

This is another focused textbook on the topic of ensemble learning targeted at students and academics.

This book provides researchers, students and practitioners with an introduction to ensemble methods. The book consists of eight chapters which naturally constitute three parts.

— Page vii, Ensemble Methods: Foundations and Algorithms, 2012.

- Chapter 01: Introduction
- Chapter 02: Boosting
- Chapter 03: Bagging
- Chapter 04: Combination Methods
- Chapter 05: Diversity
- Chapter 06: Ensemble Pruning
- Chapter 07: Clustering Ensembles
- Chapter 08: Advanced Topics

This book is well written and covers the main methods with good references. I think it’s another great jump-start on the basics of ensemble methods as long as the reader is comfortable with a little math. I liked the algorithm descriptions and worked examples.

You can learn more about this book here:

The full title of this book is “Ensemble Machine Learning: Methods and Applications” and it was edited by Cha Zhang and Yunqian Ma and published in 2012.

This book is a collection of academic papers written by a range of authors on the topic of applications of ensemble machine learning.

Despite the great success of ensemble learning methods recently, we found very few books that were dedicated to this topic, and even fewer that provided insights about how such methods shall be applied in real-world applications. The primary goal of this book is to fill the existing gap in the literature and comprehensively cover the state-of-the-art ensemble learning methods, and provide a set of applications that demonstrate the various usages of ensemble learning methods in the real world.

— Page v, Ensemble Machine Learning: Methods and Applications, 2012.

- Chapter 01: Ensemble Learning
- Chapter 02: Boosting Algorithms: A Review of Methods, Theory, and Applications
- Chapter 03: Boosting Kernel Estimators
- Chapter 04: Targeted Learning
- Chapter 05: Random Forests
- Chapter 06: Ensemble Learning by Negative Correlation Learning
- Chapter 07: Ensemble Nystrom
- Chapter 08: Object Detection
- Chapter 09: Classifier Boosting for Human Activity Recognition
- Chapter 10: Discriminative Learning for Anatomical Structure Detection and Segmentation
- Chapter 11: Random Forest for Bioinformatics

Like other collections of papers, I would generally not recommend this book unless you are an academic or one of the chapters is directly related to your current machine learning project. Nevertheless, many of the chapters provide a solid and compact introduction to ensemble methods and how to use them on specific applications.

You can learn more about this book here:

Many machine learning textbooks have sections on ensemble learning.

In this section, we will take a quick tour of some of the more popular textbooks and the relevant sections on ensemble learning.

The book “An Introduction to Statistical Learning with Applications in R” published in 2016 provides a solid introduction to boosting and bagging for decision trees in chapter 8.

- Section 8.2: Bagging, Random Forests, Boosting

The book “Applied Predictive Modeling” published in 2013 covers the most popular ensemble algorithms with examples in R, with a focus on the ensembles of decision trees.

- Chapter 8: Regression Trees and Rule-Based Model
- Chapter 14: Classification Trees and Rule-Based Models

The book “Data Mining: Practical Machine Learning Tools and Techniques” published in 2016 provides a chapter dedicated to ensemble learning and covers a range of popular techniques, including boosting, bagging, and stacking.

- Chapter 12: Ensemble Learning.

The book “Machine Learning: A Probabilistic Perspective” published in 2012 provides a number of sections on algorithms that perform ensembling, as well as a dedicated section on the topic focused on stacking and error-correcting output codes.

- Section 16.2: Classification and regression trees (CART)
- Section 16.4: Boosting
- Section 16.6: Ensemble learning

The book “The Elements of Statistical Learning” published in 2016 covers the key ensemble learning algorithms as well as the theory for ensemble learning generally.

- Chapter 8: Model Inference and Averaging
- Chapter 10: Boosting and Additive Trees
- Chapter 15: Random Forests
- Chapter 16: Ensemble Learning

**Did I miss your favorite machine learning textbook that has a section on ensemble learning?**

Let me know in the comments below.

I have a copy of each of these books as I like to read about a given topic from multiple perspectives.

If you are looking for a solid textbook dedicated to the topic of ensemble learning, I would recommend one of the following:

- Ensemble Methods: Foundations and Algorithms, 2012.
- Ensemble Learning: Pattern Classification Using Ensemble Methods, 2019.

A close runner-up is “*Ensemble Methods in Data Mining*” that mixes theory and examples in R.

Also, I recommend “*Pattern Classification Using Ensemble Methods*” if you cannot get your hands on the more recent “*Ensemble Learning: Pattern Classification Using Ensemble Methods*“.

In this post, you discovered a suite of books on the topic of ensemble machine learning.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 6 Books on Ensemble Learning appeared first on Machine Learning Mastery.

]]>The post Softmax Activation Function with Python appeared first on Machine Learning Mastery.

]]>The most common use of the softmax function in applied machine learning is in its use as an activation function in a neural network model. Specifically, the network is configured to output N values, one for each class in the classification task, and the softmax function is used to normalize the outputs, converting them from weighted sum values into probabilities that sum to one. Each value in the output of the softmax function is interpreted as the probability of membership for each class.

In this tutorial, you will discover the softmax activation function used in neural network models.

After completing this tutorial, you will know:

- Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
- Softmax can be thought of as a softened version of the argmax function that returns the index of the largest value in a list.
- How to implement the softmax function from scratch in Python and how to convert the output into a class label.

Let’s get started.

This tutorial is divided into three parts; they are:

- Predicting Probabilities With Neural Networks
- Max, Argmax, and Softmax
- Softmax Activation Function

Neural network models can be used to model classification predictive modeling problems.

Classification problems are those that involve predicting a class label for a given input. A standard approach to modeling classification problems is to use a model to predict the probability of class membership. That is, given an example, what is the probability of it belonging to each of the known class labels?

- For a binary classification problem, a Binomial probability distribution is used. This is achieved using a network with a single node in the output layer that predicts the probability of an example belonging to class 1.
- For a multi-class classification problem, a Multinomial probability distribution is used. This is achieved using a network with one node for each class in the output layer, and the sum of the predicted probabilities equals one.

A neural network model requires an activation function in the output layer of the model to make the prediction.

There are different activation functions to choose from; let’s look at a few.

One approach to predicting class membership probabilities is to use a linear activation.

A linear activation function simply outputs the weighted sum of the input to the node, the quantity that serves as input to any activation function. Because no additional transformation is performed, it is often referred to as “*no activation function*.”

Recall that a probability or a likelihood is a numeric value between 0 and 1.

Given that no transformation is performed on the weighted sum of the input, it is possible for the linear activation function to output any numeric value. This makes the linear activation function inappropriate for predicting probabilities for either the binomial or multinomial case.

Another approach to predicting class membership probabilities is to use a sigmoid activation function.

This function is also called the logistic function. Regardless of the input, the function always outputs a value between 0 and 1. The form of the function is an S-shape between 0 and 1, with the middle of the “*S*” at an output of 0.5.

This allows very large values given as the weighted sum of the input to be output as 1.0 and very small or negative values to be mapped to 0.0.
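As a sketch of this squashing behavior, the logistic function can be implemented directly from its definition, 1 / (1 + exp(-x)); the function name `sigmoid` here is just an illustrative choice:

```python
from math import exp

def sigmoid(x):
    # logistic function: squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + exp(-x))

print(sigmoid(0.0))              # prints: 0.5 (the middle of the "S")
print(round(sigmoid(10.0), 4))   # large inputs approach 1.0
print(round(sigmoid(-10.0), 4))  # very negative inputs approach 0.0
```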

The sigmoid activation is an ideal activation function for a binary classification problem where the output is interpreted as a Binomial probability distribution.

The sigmoid activation function can also be used as an activation function for multi-class classification problems where classes are non-mutually exclusive. These are often referred to as a multi-label classification rather than multi-class classification.

The sigmoid activation function is not appropriate for multi-class classification problems with mutually exclusive classes where a multinomial probability distribution is required.

Instead, an alternate activation is required called the **softmax function**.

The maximum, or “*max*,” mathematical function returns the largest numeric value for a list of numeric values.

We can implement this using the *max()* Python function; for example:

# example of the max of a list of numbers
# define data
data = [1, 3, 2]
# calculate the max of the list
result = max(data)
print(result)

Running the example returns the largest value “3” from the list of numbers.

3

The argmax, or “*arg max*,” mathematical function returns the index in the list that contains the largest value.

Think of it as the meta version of max: one level of indirection above max, pointing to the position in the list that has the max value rather than the value itself.

We can implement this using the argmax() NumPy function; for example:

# example of the argmax of a list of numbers
from numpy import argmax
# define data
data = [1, 3, 2]
# calculate the argmax of the list
result = argmax(data)
print(result)

Running the example returns index “1”, the array position that holds the largest value in the list, “3”.

1

The softmax, or “*soft max*,” mathematical function can be thought to be a probabilistic or “*softer*” version of the argmax function.

The term softmax is used because this activation function represents a smooth version of the winner-takes-all activation model in which the unit with the largest input has output +1 while all other units have output 0.

— Page 238, Neural Networks for Pattern Recognition, 1995.

From a probabilistic perspective, the *argmax()* result of 1 from the previous section can be read as giving full weight (a probability of 1.0) to index 1 and no weight to index 0 and index 2, the positions that do not hold the largest value in the list [1, 3, 2]:

[0, 1, 0]

What if we were less sure and wanted to express the argmax probabilistically, with likelihoods?

This can be achieved by scaling the values in the list and converting them into probabilities such that all values in the returned list sum to 1.0.

This can be achieved by calculating the exponent of each value in the list and dividing it by the sum of the exponent values.

- probability = exp(value) / sum(exp(v) for v in list)

For example, we can turn the first value “1” in the list [1, 3, 2] into a probability as follows:

- probability = exp(1) / (exp(1) + exp(3) + exp(2))
- probability = 2.718281828459045 / (2.718281828459045 + 20.085536923187668 + 7.38905609893065)
- probability = 2.718281828459045 / 30.19287485057736
- probability = 0.09003057317038046

We can demonstrate this for each value in the list [1, 3, 2] in Python as follows:

# transform values into probabilities
from math import exp
# calculate each probability
p1 = exp(1) / (exp(1) + exp(3) + exp(2))
p2 = exp(3) / (exp(1) + exp(3) + exp(2))
p3 = exp(2) / (exp(1) + exp(3) + exp(2))
# report probabilities
print(p1, p2, p3)
# report sum of probabilities
print(p1 + p2 + p3)

Running the example converts each value in the list into a probability and reports the values, then confirms that all probabilities sum to the value 1.0.

We can see that most weight is put on index 1 (67 percent) with less weight on index 2 (24 percent) and even less on index 0 (9 percent).

0.09003057317038046 0.6652409557748219 0.24472847105479767
1.0

This is the softmax function.

We can implement it as a function that takes a list of numbers and returns the softmax or multinomial probability distribution for the list.

The example below implements the function and demonstrates it on our small list of numbers.

# example of a function for calculating softmax for a list of numbers
from numpy import exp

# calculate the softmax of a vector
def softmax(vector):
    e = exp(vector)
    return e / e.sum()

# define data
data = [1, 3, 2]
# convert list of numbers to a list of probabilities
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))

Running the example reports roughly the same numbers with minor differences in precision.

[0.09003057 0.66524096 0.24472847]
1.0

Finally, we can use the built-in softmax() SciPy function to calculate the softmax for an array or list of numbers, as follows:

# example of calculating the softmax for a list of numbers
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))

Running the example, again, we get very similar results with very minor differences in precision.

[0.09003057 0.66524096 0.24472847]
0.9999999999999997

Now that we are familiar with the softmax function, let’s look at how it is used in a neural network model.

The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution.

That is, softmax is used as the activation function for multi-class classification problems where class membership is required on more than two class labels.

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

— Page 184, Deep Learning, 2016.

The function can be used as an activation function for a hidden layer in a neural network, although this is less common. It may be used when the model internally needs to choose or weight multiple different inputs at a bottleneck or concatenation layer.

Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.

— Page 196, Deep Learning, 2016.

In the Keras deep learning library with a three-class classification task, use of softmax in the output layer may look as follows:

...
model.add(Dense(3, activation='softmax'))

By definition, the softmax activation will output one value for each node in the output layer. The output values will represent (or can be interpreted as) probabilities and the values sum to 1.0.

When modeling a multi-class classification problem, the data must be prepared. The target variable containing the class labels is first label encoded, meaning that each class label is assigned an integer from 0 to N-1, where N is the number of class labels.

The label encoded (or integer encoded) target variables are then one-hot encoded. This is a probabilistic representation of the class label, much like the softmax output. A vector is created with a position for each class label: every position is marked 0 (impossible) except the position of the class label, which is marked 1 (certain).

For example, three class labels will be integer encoded as 0, 1, and 2, then encoded to vectors as follows:

- Class 0: [1, 0, 0]
- Class 1: [0, 1, 0]
- Class 2: [0, 0, 1]

This is called a one-hot encoding.
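As an illustrative sketch (not using any particular library's encoder), integer encoded labels can be one-hot encoded by indexing into a NumPy identity matrix:

```python
# one-hot encode integer class labels via an identity matrix
from numpy import eye

labels = [0, 1, 2, 1]  # integer encoded class labels
num_classes = 3
# row i of the identity matrix is the one-hot vector for class i
onehot = eye(num_classes)[labels]
print(onehot)
```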

It represents the expected multinomial probability distribution for each class used to correct the model under supervised learning.

The softmax function will output a probability of class membership for each class label and attempt to best approximate the expected target for a given input.

For example, if the integer encoded class 1 was expected for one example, the target vector would be:

- [0, 1, 0]

The softmax output might look as follows, which puts the most weight on class 1 and less weight on the other classes.

- [0.09003057 0.66524096 0.24472847]

The error between the expected and predicted multinomial probability distribution is often calculated using cross-entropy, and this error is then used to update the model. This is called the cross-entropy loss function.

For more on cross-entropy for calculating the difference between probability distributions, see the tutorial:

We may want to convert the probabilities back into an integer encoded class label.

This can be achieved using the *argmax()* function that returns the index of the list with the largest value. Given that the class labels are integer encoded from 0 to N-1, the argmax of the probabilities will always be the integer encoded class label.

- class integer = argmax([0.09003057 0.66524096 0.24472847])
- class integer = 1
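This reverse mapping can be sketched as follows; the string class names are hypothetical, for illustration only:

```python
# convert a softmax output back into an integer encoded class label
from numpy import argmax

probabilities = [0.09003057, 0.66524096, 0.24472847]
class_index = argmax(probabilities)
print(class_index)  # -> 1
# map back to a (hypothetical) string label if needed
class_names = ['red', 'green', 'blue']
print(class_names[class_index])  # -> green
```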

This section provides more resources on the topic if you are looking to go deeper.

- Neural Networks for Pattern Recognition, 1995.
- Neural Networks: Tricks of the Trade: Tricks of the Trade, 2nd Edition, 2012.
- Deep Learning, 2016.

In this tutorial, you discovered the softmax activation function used in neural network models.

Specifically, you learned:

- Linear and Sigmoid activation functions are inappropriate for multi-class classification tasks.
- Softmax can be thought of as a softened version of the argmax function that returns the index of the largest value in a list.
- How to implement the softmax function from scratch in Python and how to convert the output into a class label.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Softmax Activation Function with Python appeared first on Machine Learning Mastery.

The post How to Develop LARS Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

Lasso Regression is a popular type of regularized linear regression that includes an L1 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

**Least Angle Regression** or **LARS** for short provides an alternate, efficient way of fitting a Lasso regularized regression model that does not require any hyperparameters.

In this tutorial, you will discover how to develop and evaluate LARS Regression models in Python.

After completing this tutorial, you will know:

- LARS Regression provides an alternate way to train a Lasso regularized linear regression model that adds a penalty to the loss function during training.
- How to evaluate a LARS Regression model and use a final model to make predictions for new data.
- How to configure the LARS Regression model for a new dataset automatically using a cross-validation version of the estimator.

Let’s get started.

This tutorial is divided into three parts; they are:

- LARS Regression
- Example of LARS Regression
- Tuning LARS Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples) or with more input predictors (*p*) than samples (*n*) (so-called *p >> n* problems).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

A popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)

Because some coefficients can go all the way to zero, input features are effectively removed from the model, making the penalty act as a type of automatic feature selection method.
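For example, the L1 penalty of a hypothetical coefficient vector can be computed directly:

```python
# compute the L1 penalty: the sum of absolute coefficient values
coefficients = [0.5, -1.2, 0.0, 3.0]  # hypothetical model coefficients
l1_penalty = sum(abs(beta) for beta in coefficients)
print(l1_penalty)  # -> 4.7
```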

… a consequence of penalizing the absolute values is that some parameters are actually set to 0 for some value of lambda. Thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection.

— Page 125, Applied Predictive Modeling, 2013.

This penalty can be added to the cost function for linear regression and is referred to as Least Absolute Shrinkage And Selection Operator (LASSO), or more commonly, “*Lasso*” (with title case) for short.

The Lasso trains the model using a least-squares loss training procedure.

**Least Angle Regression**, LAR or LARS for short, is an alternative approach to solving the optimization problem of fitting the penalized model. Technically, LARS is a forward stepwise version of feature selection for regression that can be adapted for the Lasso model.

Unlike the Lasso, it does not require a hyperparameter that controls the weighting of the penalty in the loss function. Instead, the weighting is discovered automatically by LARS.

… least angle regression (LARS), is a broad framework that encompasses the lasso and similar models. The LARS model can be used to fit lasso models more efficiently, especially in high-dimensional problems.

— Page 126, Applied Predictive Modeling, 2013.
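As a sketch of this behavior, the lars_path() function in scikit-learn returns the full sequence of alpha values and coefficients visited by the algorithm; the small synthetic dataset below is for illustration only:

```python
# trace the full LARS/lasso regularization path on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# small synthetic regression problem
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
# alphas: penalty values along the path; coefs: one column of coefficients per alpha
alphas, active, coefs = lars_path(X, y, method='lasso')
print(coefs.shape)
print(alphas[0] >= alphas[-1])  # alphas decrease along the path
```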

Now that we are familiar with LARS penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the LARS Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2  3      4      5  ...  8      9     10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2
[5 rows x 14 columns]

The scikit-learn Python machine learning library provides an implementation of the LARS penalized regression algorithm via the Lars class.

...
# define model
model = Lars()

We can evaluate the LARS Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

# evaluate a lars regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lars
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lars()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the LARS Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.432.

Mean MAE: 3.432 (0.552)

We may decide to use the LARS Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

# make a prediction with a lars regression model on the dataset
from pandas import read_csv
from sklearn.linear_model import Lars
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lars()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat)

Running the example fits the model and makes a prediction for the new row of data.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

Predicted: 29.904

Next, we can look at configuring the model hyperparameters.

The LARS training algorithm automatically discovers the best value for the lambda hyperparameter used in the Lasso algorithm.

This hyperparameter is referred to as the “*alpha*” argument in the scikit-learn implementation of Lasso and LARS.

Nevertheless, the process of automatically discovering the best model and *alpha* hyperparameter is still based on a single training dataset.

An alternative approach is to fit the model on multiple subsets of the training dataset and choose the best internal model configuration across the folds, in this case, the value of *alpha*. Generally, this is referred to as a cross-validation estimator.

The scikit-learn libraries offer a cross-validation version of the LARS for finding a more robust value for *alpha* via the LarsCV class.

The example below demonstrates how to fit a *LarsCV* model and report the *alpha* value found via cross-validation.

# use the automatically configured lars regression algorithm
from pandas import read_csv
from sklearn.linear_model import LarsCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = LarsCV(cv=cv, n_jobs=-1)
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)

Running the example fits the *LarsCV* model using repeated cross-validation and reports an optimal *alpha* value found across the runs.

alpha: 0.001623

This version of the LARS model may prove more robust in practice.

We can evaluate it using the same procedure we did in the previous section, although in this case, each model fit is based on the hyperparameters found via repeated k-fold cross-validation internally (e.g. cross-validation of a cross-validation estimator).

The complete example is listed below.

# evaluate a lars cross-validation regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import LarsCV
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = LarsCV(cv=cv, n_jobs=-1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example will evaluate the cross-validated estimation of model hyperparameters using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that we achieved slightly better results with 3.374 vs. 3.432 in the previous section.

Mean MAE: 3.374 (0.558)

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop and evaluate LARS Regression models in Python.

Specifically, you learned:

- LARS Regression provides an alternate way to train a Lasso regularized linear regression model that adds a penalty to the loss function during training.
- How to evaluate a LARS Regression model and use a final model to make predictions for new data.
- How to configure the LARS Regression model for a new dataset automatically using a cross-validation version of the estimator.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Nearest Shrunken Centroids With Python appeared first on Machine Learning Mastery.

Nearest Centroids is a classification algorithm that predicts a class label for new examples based on which class-based centroid from the training dataset the example is closest to.

The **Nearest Shrunken Centroids** algorithm is an extension that involves shifting class-based centroids toward the centroid of the entire training dataset and removing those input variables that are less useful at discriminating the classes.

As such, the Nearest Shrunken Centroids algorithm performs an automatic form of feature selection, making it appropriate for datasets with very large numbers of input variables.

In this tutorial, you will discover the Nearest Shrunken Centroids classification machine learning algorithm.

After completing this tutorial, you will know:

- The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Nearest Centroids Algorithm
- Nearest Centroids With Scikit-Learn
- Tuning Nearest Centroid Hyperparameters

Nearest Centroids is a classification machine learning algorithm.

The algorithm involves first summarizing the training dataset into a set of centroids (centers), then using the centroids to make predictions for new examples.

For each class, the centroid of the data is found by taking the average value of each predictor (per class) in the training set. The overall centroid is computed using the data from all of the classes.

— Page 307, Applied Predictive Modeling, 2013.

A centroid is the geometric center of a data distribution, such as the mean. In multiple dimensions, this is the mean value along each dimension, forming the center point of the distribution across all variables.

The Nearest Centroids algorithm assumes that the centroids in the input feature space are different for each target label. The training data is split into groups by class label, then the centroid for each group of data is calculated. Each centroid is simply the mean value of each of the input variables. If there are two classes, then two centroids or points are calculated; three classes give three centroids, and so on.

The centroids then represent the “*model*.” Given new examples, such as those in the test set or new data, the distance between a given row of data and each centroid is calculated and the closest centroid is used to assign a class label to the example.
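A minimal from-scratch sketch of this procedure (the function names here are ours, for illustration):

```python
# minimal nearest-centroid classifier: fit centroids, predict by closest
from numpy import array

def fit_centroids(X, y):
    # one centroid (per-feature mean) for each class label
    return {label: X[y == label].mean(axis=0) for label in set(y.tolist())}

def predict_nearest(centroids, row):
    # assign the class whose centroid has the smallest squared Euclidean distance
    distances = {label: ((row - centroid) ** 2).sum() for label, centroid in centroids.items()}
    return min(distances, key=distances.get)

X = array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = array([0, 0, 1, 1])
centroids = fit_centroids(X, y)
print(predict_nearest(centroids, array([0.9, 1.1])))  # -> 0
```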

Distance measures such as Euclidean distance are used for numerical data, or Hamming distance for categorical data. In either case, it is best practice to scale input variables via normalization or standardization prior to training the model. This ensures that input variables with large values do not dominate the distance calculation.

An extension to the nearest centroid method for classification is to shrink the centroids of each input variable towards the centroid of the entire training dataset. Those variables that are shrunk down to the value of the data centroid can then be removed as they do not help to discriminate between the class labels.

As such, the amount of shrinkage applied to the centroids is a hyperparameter that can be tuned for the dataset and used to perform an automatic form of feature selection. Thus, it is appropriate for a dataset with a large number of input variables, some of which may be irrelevant or noisy.

Consequently, the nearest shrunken centroid model also conducts feature selection during the model training process.

— Page 307, Applied Predictive Modeling, 2013.

This approach is referred to as “*Nearest Shrunken Centroids*” and was first described by Robert Tibshirani, et al. in their 2002 paper titled “Diagnosis Of Multiple Cancer Types By Shrunken Centroids Of Gene Expression.”
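The shrinkage idea can be sketched in a much simplified form; the real method described by Tibshirani et al. standardizes the differences by within-class variance, whereas this version simply soft-thresholds the raw difference between a class centroid and the overall centroid:

```python
# soft-threshold a class centroid toward the overall data centroid
from numpy import array, sign, maximum, absolute

def shrink_centroid(class_centroid, overall_centroid, threshold):
    delta = class_centroid - overall_centroid
    # shrink each per-variable difference toward zero by the threshold
    delta = sign(delta) * maximum(absolute(delta) - threshold, 0.0)
    return overall_centroid + delta

overall = array([0.0, 0.0])
class_c = array([0.3, 2.0])
# the first variable's difference (0.3) is within the threshold, so it collapses to 0
print(shrink_centroid(class_c, overall, 0.5))  # -> [0.  1.5]
```

A variable whose difference collapses to zero for every class no longer helps discriminate the classes and can be dropped, which is the feature selection effect described above.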

The Nearest Shrunken Centroids algorithm is available in the scikit-learn Python machine learning library via the NearestCentroid class.

The class allows the configuration of the distance metric used in the algorithm via the “*metric*” argument, which defaults to ‘*euclidean*‘ for the Euclidean distance metric.

This can be changed to other built-in metrics such as ‘*manhattan*.’

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean')

By default, no shrinkage is used, but shrinkage can be specified via the “*shrink_threshold*” argument, which takes a floating point value between 0 and 1.

...
# create the nearest centroid model
model = NearestCentroid(metric='euclidean', shrink_threshold=0.5)

We can demonstrate the Nearest Shrunken Centroids with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 20) (1000,)

We can fit and evaluate a Nearest Shrunken Centroids model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration of Euclidean distance and no shrinkage.

...
# create the nearest centroid model
model = NearestCentroid()

The complete example of evaluating the Nearest Shrunken Centroids model for the synthetic binary classification task is listed below.

# evaluate a nearest centroid model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Nearest Shrunken Centroids algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 71 percent.

Mean Accuracy: 0.711 (0.055)

We may decide to use the Nearest Shrunken Centroids as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a nearest centroid model on the dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# fit model
model.fit(X, y)
# define new data
row = [2.47475454,0.40165523,1.68081787,2.88940715,0.91704519,-3.07950644,4.39961206,0.72464273,-4.86563631,-6.06338084,-1.22209949,-0.4699618,1.01222748,-0.6899355,-0.53000581,6.86966784,-3.27211075,-6.59044146,-2.21290585,-3.139579]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat)

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 0

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Nearest Shrunken Centroid method must be configured for your specific dataset.

Perhaps the most important hyperparameter is the shrinkage, controlled via the “*shrink_threshold*” argument. It is a good idea to test values between 0 and 1 on a grid with a spacing such as 0.1 or 0.01.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

# grid search shrinkage for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that we achieved slightly better results than the default, with 71.4 percent vs 71.1 percent. We can see that the search found a *shrink_threshold* value of 0.53.

Mean Accuracy: 0.714
Config: {'shrink_threshold': 0.53}

The other key configuration is the distance measure used, which can be chosen based on the distribution of the input variables.

Any of the built-in distance measures can be used; common distance measures include:

- ‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’

For more on how these distance measures are calculated, see the tutorial:

Given that our input variables are numeric, our dataset only supports ‘*euclidean*‘ and ‘*manhattan*.’

We can include these metrics in our grid search; the complete example is listed below.

# grid search shrinkage and distance metric for nearest centroid
from numpy import arange
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import NearestCentroid
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model
model = NearestCentroid()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['shrink_threshold'] = arange(0, 1.01, 0.01)
grid['metric'] = ['euclidean', 'manhattan']
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

In this case, we can see that we get slightly better accuracy of 75 percent using no shrinkage and the manhattan instead of the euclidean distance measure.

Mean Accuracy: 0.750
Config: {'metric': 'manhattan', 'shrink_threshold': 0.0}

A good extension to these experiments would be to add data normalization or standardization to the data as part of a modeling Pipeline.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the Nearest Shrunken Centroids classification machine learning algorithm.

Specifically, you learned:

- The Nearest Shrunken Centroids is a simple linear machine learning algorithm for classification.
- How to fit, evaluate, and make predictions with the Nearest Shrunken Centroids model with Scikit-Learn.
- How to tune the hyperparameters of the Nearest Shrunken Centroids algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Nearest Shrunken Centroids With Python appeared first on Machine Learning Mastery.


Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Lasso Regression** is a popular type of regularized linear regression that includes an L1 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task. This penalty allows some coefficient values to go to the value of zero, allowing input variables to be effectively removed from the model, providing a type of automatic feature selection.

In this tutorial, you will discover how to develop and evaluate Lasso Regression models in Python.

After completing this tutorial, you will know:

- Lasso Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Lasso Regression model and use a final model to make predictions for new data.
- How to configure the Lasso Regression model for a new dataset via grid search and automatically.

Let’s get started.

This tutorial is divided into three parts; they are:

- Lasso Regression
- Example of Lasso Regression
- Tuning Lasso Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2
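
This loss is just the sum of squared residuals; a minimal sketch in numpy (the toy values are made up for illustration):

```python
# sum squared error between expected and predicted values
from numpy import array

y = array([3.0, -0.5, 2.0])     # expected target values
yhat = array([2.5, 0.0, 2.0])   # predictions
loss = ((y - yhat) ** 2).sum()
print(loss)  # 0.25 + 0.25 + 0.0 = 0.5
```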

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or with fewer samples (*n*) than input predictors (*p*) or variables (so-called *p >> n* problems).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

A popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)

An L1 penalty minimizes the size of all coefficients and allows any coefficient to go to the value of zero, effectively removing input features from the model.

This acts as a type of automatic feature selection.

… a consequence of penalizing the absolute values is that some parameters are actually set to 0 for some value of lambda. Thus the lasso yields models that simultaneously use regularization to improve the model and to conduct feature selection.

— Page 125, Applied Predictive Modeling, 2013.

This penalty can be added to the cost function for linear regression and is referred to as Least Absolute Shrinkage And Selection Operator regularization (LASSO), or more commonly, “*Lasso*” (with title case) for short.

A popular alternative to ridge regression is the least absolute shrinkage and selection operator model, frequently called the lasso.

— Page 124, Applied Predictive Modeling, 2013.

A hyperparameter called “*lambda*” controls the weighting of the penalty in the loss function. A default value of 1.0 gives full weighting to the penalty; a value of 0 excludes the penalty. Very small values of *lambda*, such as 1e-3 or smaller, are common.

- lasso_loss = loss + (lambda * l1_penalty)
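
Putting the loss and the L1 penalty together, the lasso loss can be sketched directly in numpy (the coefficients and data values below are arbitrary, for illustration only):

```python
# lasso loss = sum squared error + lambda * sum of absolute coefficients
from numpy import array, abs as np_abs

beta = array([0.5, 0.0, -2.0])   # model coefficients
lam = 0.1                        # penalty weighting (lambda)
y = array([3.0, -0.5, 2.0])      # expected target values
yhat = array([2.5, 0.0, 2.0])    # predictions

sse = ((y - yhat) ** 2).sum()    # 0.5
l1_penalty = np_abs(beta).sum()  # 2.5
lasso_loss = sse + lam * l1_penalty
print(lasso_loss)  # 0.5 + 0.1 * 2.5 = 0.75
```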

Now that we are familiar with Lasso penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Lasso Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

```
(506, 14)
        0     1     2  3      4      5  ...  8      9     10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575 ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421 ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185 ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998 ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147 ...  3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
```

The scikit-learn Python machine learning library provides an implementation of the Lasso penalized regression algorithm via the Lasso class.

Confusingly, the *lambda* term can be configured via the “*alpha*” argument when defining the class. The default value is 1.0 or a full penalty.

```python
...
# define model
model = Lasso(alpha=1.0)
```

We can evaluate the Lasso Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

```python
# evaluate a lasso regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lasso(alpha=1.0)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the Lasso Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.711.

Mean MAE: 3.711 (0.549)

We may decide to use the Lasso Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

```python
# make a prediction with a lasso regression model on the dataset
from pandas import read_csv
from sklearn.linear_model import Lasso
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lasso(alpha=1.0)
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat)
```

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 30.998

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameter of *alpha=1.0* is appropriate for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best for our dataset.

One approach would be to grid search *alpha* values from perhaps 1e-5 to 100 on a log-10 scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. We will try the latter in this case.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

```python
# grid search hyperparameters for lasso regression
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Lasso
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
```

Running the example will evaluate each combination of configurations using repeated cross-validation.

You might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.

In this case, we can see that we achieved slightly better results than the default, with a MAE of about 3.379 vs. 3.711. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the model assigned an *alpha* weight of 0.01 to the penalty.

MAE: -3.379
Config: {'alpha': 0.01}
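
Because the L1 penalty can drive coefficients exactly to zero, it can be informative to count how many inputs a fitted Lasso model has effectively removed. The sketch below uses a synthetic dataset from *make_regression* with only a few informative inputs (illustrative only, not the housing data; the number of zeroed coefficients depends on the data and the *alpha* value):

```python
# count how many coefficients the L1 penalty drove to zero
from numpy import count_nonzero
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
# synthetic regression data where only 5 of 20 inputs are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=1)
# fit a lasso model with a full penalty
model = Lasso(alpha=1.0)
model.fit(X, y)
# coefficients exactly at zero correspond to removed inputs
n_zero = len(model.coef_) - count_nonzero(model.coef_)
print('Coefficients set to zero: %d of %d' % (n_zero, len(model.coef_)))
```

On data like this, inputs that carry no signal tend to receive coefficients of exactly zero, which is the automatic feature selection described above.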

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the LassoCV class.

To use the class, the model is fit on the training dataset as per normal and the hyperparameters are tuned automatically during the training process. The fit model can then be used to make a prediction.

By default, the model will test 100 *alpha* values. We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “*alphas*” argument.

The example below demonstrates this.

```python
# use the automatically configured lasso regression algorithm
from numpy import arange
from pandas import read_csv
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = LassoCV(alphas=arange(0, 1, 0.01), cv=cv, n_jobs=-1)
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)
```

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

In this case, we can see that the model chose the hyperparameter of alpha=0.0. This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected.

alpha: 0.000000

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop and evaluate Lasso Regression models in Python.

Specifically, you learned:

- Lasso Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Lasso Regression model and use a final model to make predictions for new data.
- How to configure the Lasso Regression model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop LASSO Regression Models in Python appeared first on Machine Learning Mastery.


Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Ridge Regression** is a popular type of regularized linear regression that includes an L2 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

In this tutorial, you will discover how to develop and evaluate Ridge Regression models in Python.

After completing this tutorial, you will know:

- Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
- How to configure the Ridge Regression model for a new dataset via grid search and automatically.

Let’s get started.

**Update Oct/2020**: Updated code in the grid search procedure to match description.

This tutorial is divided into three parts; they are:

- Ridge Regression
- Example of Ridge Regression
- Tuning Ridge Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (yhat) and the expected target values (y).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or with fewer samples (*n*) than input predictors (*p*) or variables (so-called *p >> n* problems).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values (*beta*). This is called an L2 penalty.

- l2_penalty = sum j=0 to p beta_j^2

An L2 penalty minimizes the size of all coefficients, although it does not allow any coefficient to reach exactly zero, so no predictor is removed from the model.

The effect of this penalty is that the parameter estimates are only allowed to become large if there is a proportional reduction in SSE. In effect, this method shrinks the estimates towards 0 as the lambda penalty becomes large (these techniques are sometimes called “shrinkage methods”).

— Page 123, Applied Predictive Modeling, 2013.

This penalty can be added to the cost function for linear regression and is referred to as Tikhonov regularization (named after Andrey Tikhonov), or Ridge Regression more generally.

A hyperparameter called “*lambda*” controls the weighting of the penalty in the loss function. A default value of 1.0 fully weights the penalty; a value of 0 excludes the penalty. Very small values of *lambda*, such as 1e-3 or smaller, are common.

- ridge_loss = loss + (lambda * l2_penalty)
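
The ridge loss can be sketched directly in numpy (arbitrary toy values, for illustration only):

```python
# ridge loss = sum squared error + lambda * sum of squared coefficients
from numpy import array

beta = array([0.5, 0.0, -2.0])   # model coefficients
lam = 0.1                        # penalty weighting (lambda)
y = array([3.0, -0.5, 2.0])      # expected target values
yhat = array([2.5, 0.0, 2.0])    # predictions

sse = ((y - yhat) ** 2).sum()    # 0.5
l2_penalty = (beta ** 2).sum()   # 4.25
ridge_loss = sse + lam * l2_penalty
print(ridge_loss)  # 0.5 + 0.1 * 4.25 = 0.925
```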

Now that we are familiar with Ridge penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Ridge Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

```python
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
```

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

```
(506, 14)
        0     1     2  3      4      5  ...  8      9     10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575 ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421 ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185 ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998 ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147 ...  3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
```

The scikit-learn Python machine learning library provides an implementation of the Ridge Regression algorithm via the Ridge class.

Confusingly, the lambda term can be configured via the “*alpha*” argument when defining the class. The default value is 1.0 or a full penalty.

```python
...
# define model
model = Ridge(alpha=1.0)
```

We can evaluate the Ridge Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

```python
# evaluate a ridge regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge(alpha=1.0)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the Ridge Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a MAE of about 3.382.

Mean MAE: 3.382 (0.519)

We may decide to use the Ridge Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

```python
# make a prediction with a ridge regression model on the dataset
from pandas import read_csv
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge(alpha=1.0)
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat)
```

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 30.253

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameter of *alpha=1.0* is appropriate for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best for our dataset.

One approach would be to grid search *alpha* values from perhaps 1e-5 to 100 on a log scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. We will try the latter in this case.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

```python
# grid search hyperparameters for ridge regression
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
```

Running the example will evaluate each combination of configurations using repeated cross-validation.

In this case, we can see that we achieved slightly better results than the default, with a MAE of about 3.379 vs. 3.382. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the model assigned an *alpha* weight of 0.51 to the penalty.

MAE: -3.379
Config: {'alpha': 0.51}

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the RidgeCV class.

To use this class, it is fit on the training dataset and used to make a prediction. During the training process, it automatically tunes the hyperparameter values.

By default, the model will only test the *alpha* values (0.1, 1.0, 10.0). We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “*alphas*” argument.

The example below demonstrates this.

```python
# use the automatically configured ridge regression algorithm
from numpy import arange
from pandas import read_csv
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)
```

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

In this case, we can see that the model chose the identical hyperparameter of *alpha=0.51* that we found via our manual grid search.

alpha: 0.510000

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to develop and evaluate Ridge Regression models in Python.

Specifically, you learned:

- Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
- How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
- How to configure the Ridge Regression model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Ridge Regression Models in Python appeared first on Machine Learning Mastery.

The post How to Develop Elastic Net Regression Models in Python appeared first on Machine Learning Mastery.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

**Elastic net** is a popular type of regularized linear regression that combines two popular penalties, specifically the L1 and L2 penalty functions.

In this tutorial, you will discover how to develop Elastic Net regularized regression in Python.

After completing this tutorial, you will know:

- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.

Let’s get started.

This tutorial is divided into three parts; they are:

- Elastic Net Regression
- Example of Elastic Net Regression
- Tuning Elastic Net Hyperparameters

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable. With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (*yhat*) and the expected target values (*y*).

- loss = sum i=0 to n (y_i – yhat_i)^2

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (*samples*) or with fewer samples (*n*) than input predictors (*p*) or variables (so-called *p >> n* problems).

One approach to addressing the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values. This is called an L2 penalty. An L2 penalty minimizes the size of all coefficients, although it does not allow any coefficient to reach exactly zero, so no predictor is removed from the model.

- l2_penalty = sum j=0 to p beta_j^2

Another popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.

- l1_penalty = sum j=0 to p abs(beta_j)

Elastic net is a penalized linear regression model that includes both the L1 and L2 penalties during training.

Using the terminology from “The Elements of Statistical Learning,” a hyperparameter “*alpha*” is provided to assign how much weight is given to each of the L1 and L2 penalties. Alpha is a value between 0 and 1: it weights the contribution of the L1 penalty, while one minus alpha weights the L2 penalty.

- elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)

For example, an alpha of 0.5 would provide a 50 percent contribution of each penalty to the loss function. An alpha value of 0 gives all weight to the L2 penalty and a value of 1 gives all weight to the L1 penalty.

The parameter alpha determines the mix of the penalties, and is often pre-chosen on qualitative grounds.

— Page 663, The Elements of Statistical Learning, 2016.

The benefit is that elastic net allows a balance of both penalties, which can result in better performance than a model with either one or the other penalty on some problems.

Another hyperparameter called “*lambda*” controls the weighting of the sum of both penalties in the loss function. A default value of 1.0 gives full weighting to the penalty; a value of 0 excludes the penalty. Very small values of *lambda*, such as 1e-3 or smaller, are common.

- elastic_net_loss = loss + (lambda * elastic_net_penalty)
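
Combining the pieces, the elastic net loss can be sketched in numpy (arbitrary toy values, for illustration only):

```python
# elastic net loss = sse + lambda * (alpha * l1 + (1 - alpha) * l2)
from numpy import array, abs as np_abs

beta = array([0.5, 0.0, -2.0])   # model coefficients
alpha = 0.5                      # mix between the L1 and L2 penalties
lam = 0.1                        # overall penalty weighting (lambda)
y = array([3.0, -0.5, 2.0])      # expected target values
yhat = array([2.5, 0.0, 2.0])    # predictions

sse = ((y - yhat) ** 2).sum()    # 0.5
l1 = np_abs(beta).sum()          # 2.5
l2 = (beta ** 2).sum()           # 4.25
penalty = alpha * l1 + (1 - alpha) * l2
loss = sse + lam * penalty
print(loss)  # 0.5 + 0.1 * (0.5 * 2.5 + 0.5 * 4.25) = 0.8375
```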

Now that we are familiar with elastic net penalized regression, let’s look at a worked example.

In this section, we will demonstrate how to use the Elastic Net regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total).

We can also see that all input variables are numeric.

The scikit-learn Python machine learning library provides an implementation of the Elastic Net penalized regression algorithm via the ElasticNet class.

Confusingly, the *alpha* hyperparameter can be set via the “*l1_ratio*” argument that controls the contribution of the L1 and L2 penalties and the *lambda* hyperparameter can be set via the “*alpha*” argument that controls the contribution of the sum of both penalties to the loss function.

By default, an equal balance of 0.5 is used for “*l1_ratio*” and a full weighting of 1.0 is used for alpha.

```python
...
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
```

We can evaluate the Elastic Net model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

```python
# evaluate an elastic net model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example evaluates the Elastic Net algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

In this case, we can see that the model achieved a MAE of about 3.682.

Mean MAE: 3.682 (0.530)

We may decide to use the Elastic Net as our final model and make predictions on new data.

We can fit the model on all available data and make a prediction by calling the *predict()* function, passing in a new row of data.

We can demonstrate this with a complete example, listed below.

```python
# make a prediction with an elastic net model on the dataset
from pandas import read_csv
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat)
```

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 31.047

Next, we can look at configuring the model hyperparameters.

How do we know that the default hyperparameters of alpha=1.0 and l1_ratio=0.5 are any good for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best.

One approach would be to grid search *l1_ratio* values between 0 and 1 with a separation of 0.1 or 0.01, and *alpha* values from perhaps 1e-5 to 100 on a log-10 scale, and discover what works best for a dataset.

```python
# grid search hyperparameters for the elastic net
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
```

You might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.

In this case, we can see that we achieved slightly better results than the default configuration: a MAE of 3.378 vs. 3.682. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the search assigned an alpha weight of 0.01 to the penalty and focuses almost exclusively on the L1 penalty, with an l1_ratio of 0.97.

MAE: -3.378
Config: {'alpha': 0.01, 'l1_ratio': 0.97}

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the ElasticNetCV class.

To use this class, it is first fit on the dataset, then used to make a prediction. It will automatically find appropriate hyperparameters.

By default, the model will test 100 alpha values and use a default l1_ratio of 0.5. We can specify our own lists of values to test via the “*l1_ratio*” and “*alphas*” arguments, as we did with the manual grid search.

The example below demonstrates this.

```python
# use automatically configured elastic net algorithm
from numpy import arange
from pandas import read_csv
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
ratios = arange(0, 1, 0.01)
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
model = ElasticNetCV(l1_ratio=ratios, alphas=alphas, cv=cv, n_jobs=-1)
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)
print('l1_ratio_: %f' % model.l1_ratio_)
```

Again, you might see some warnings that can be safely ignored, such as:

Objective did not converge. You might want to increase the number of iterations.

In this case, we can see that an alpha of 0.0 was chosen, removing both penalties from the loss function.

This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected.

alpha: 0.000000
l1_ratio_: 0.470000


In this tutorial, you discovered how to develop Elastic Net regularized regression in Python.

Specifically, you learned:

- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Elastic Net Regression Models in Python appeared first on Machine Learning Mastery.
