Regression is a modeling task that involves predicting a numeric value given an input.
Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.
Elastic net is a popular type of regularized linear regression that combines two popular penalties, specifically the L1 and L2 penalty functions.
In this tutorial, you will discover how to develop Elastic Net regularized regression in Python.
After completing this tutorial, you will know:
- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Elastic Net Regression
- Example of Elastic Net Regression
- Tuning Elastic Net Hyperparameters
Elastic Net Regression
Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.
With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (yhat) and the expected target values (y).
- loss = sum i=0 to n (y_i – yhat_i)^2
A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples) or with more input predictors (p) than samples (n) (so-called p >> n problems).
One approach to addressing the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.
One popular penalty is to penalize a model based on the sum of the squared coefficient values. This is called an L2 penalty. An L2 penalty minimizes the size of all coefficients, but it does not allow any coefficient to be removed from the model.
- l2_penalty = sum j=0 to p beta_j^2
Another popular penalty is to penalize a model based on the sum of the absolute coefficient values. This is called the L1 penalty. An L1 penalty minimizes the size of all coefficients and allows some coefficients to be minimized to the value zero, which removes the predictor from the model.
- l1_penalty = sum j=0 to p abs(beta_j)
Elastic net is a penalized linear regression model that includes both the L1 and L2 penalties during training.
Using the terminology from “The Elements of Statistical Learning,” a hyperparameter “alpha” is provided to assign how much weight is given to each of the L1 and L2 penalties. Alpha is a value between 0 and 1: it weights the contribution of the L1 penalty, and one minus alpha weights the L2 penalty.
- elastic_net_penalty = (alpha * l1_penalty) + ((1 – alpha) * l2_penalty)
For example, an alpha of 0.5 would provide a 50 percent contribution of each penalty to the loss function. An alpha value of 0 gives all weight to the L2 penalty and a value of 1 gives all weight to the L1 penalty.
The parameter alpha determines the mix of the penalties, and is often pre-chosen on qualitative grounds.
— Page 663, The Elements of Statistical Learning, 2016.
The benefit is that elastic net allows a balance of both penalties, which can result in better performance than a model with either one or the other penalty on some problems.
Another hyperparameter is provided called “lambda” that controls the weighting of the sum of both penalties to the loss function. A default value of 1.0 gives the penalty full weight; a value of 0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.
- elastic_net_loss = loss + (lambda * elastic_net_penalty)
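To make these equations concrete, below is a minimal NumPy sketch of the penalized loss, using hypothetical arrays for the targets, predictions, and coefficients. Note this mirrors the equations above, not scikit-learn’s internal objective, which scales the terms differently.

# sketch of the elastic net loss, assuming numpy arrays y, yhat and beta
import numpy as np

def elastic_net_loss(y, yhat, beta, alpha=0.5, lam=1.0):
    # sum squared error between expected and predicted values
    loss = np.sum((y - yhat) ** 2)
    # L1 penalty: sum of the absolute coefficient values
    l1_penalty = np.sum(np.abs(beta))
    # L2 penalty: sum of the squared coefficient values
    l2_penalty = np.sum(beta ** 2)
    # alpha weights the L1 penalty; one minus alpha weights the L2 penalty
    elastic_net_penalty = (alpha * l1_penalty) + ((1 - alpha) * l2_penalty)
    # lambda weights the contribution of the combined penalty
    return loss + (lam * elastic_net_penalty)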
Now that we are familiar with elastic net penalized regression, let’s look at a worked example.
Example of Elastic Net Regression
In this section, we will demonstrate how to use the Elastic Net regression algorithm.
First, let’s introduce a standard regression dataset. We will use the housing dataset.
The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.
Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
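For reference, the naive baseline can be reproduced with scikit-learn’s DummyRegressor, which here always predicts the mean of the training targets; a minimal sketch is below (your exact score may differ slightly).

# evaluate a naive baseline model on the housing dataset
from numpy import absolute, mean
from pandas import read_csv
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score, RepeatedKFold
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# baseline model that always predicts the mean target value
model = DummyRegressor(strategy='mean')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = absolute(cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1))
print('Baseline MAE: %.3f' % mean(scores))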
The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.
No need to download the dataset; we will download it automatically as part of our worked examples.
The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 columns in total).
We can also see that all input variables are numeric.
(506, 14)
         0     1     2  3      4      5  ...  8      9     10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
The scikit-learn Python machine learning library provides an implementation of the Elastic Net penalized regression algorithm via the ElasticNet class.
Confusingly, the alpha hyperparameter can be set via the “l1_ratio” argument that controls the contribution of the L1 and L2 penalties and the lambda hyperparameter can be set via the “alpha” argument that controls the contribution of the sum of both penalties to the loss function.
By default, an equal balance of 0.5 is used for “l1_ratio” and a full weighting of 1.0 is used for alpha.
...
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
We can evaluate the Elastic Net model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.
# evaluate an elastic net model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
Running the example evaluates the Elastic Net algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.
Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.
In this case, we can see that the model achieved a MAE of about 3.682.
Mean MAE: 3.682 (0.530)
We may decide to use the Elastic Net as our final model and make predictions on new data.
This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.
We can demonstrate this with a complete example, listed below.
# make a prediction with an elastic net model on the dataset
from pandas import read_csv
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat)
Running the example fits the model and makes a prediction for the new row of data.
Predicted: 31.047
Next, we can look at configuring the model hyperparameters.
Tuning Elastic Net Hyperparameters
How do we know that the default hyperparameters of alpha=1.0 and l1_ratio=0.5 are any good for our dataset?
We don’t.
Instead, it is good practice to test a suite of different configurations and discover what works best.
One approach would be to grid search l1_ratio values between 0 and 1 with a separation of 0.1 or 0.01, and alpha values from perhaps 1e-5 to 100 on a log-10 scale, and discover what works best for a dataset.
The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.
# grid search hyperparameters for the elastic net
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
Running the example will evaluate each combination of configurations using repeated cross-validation.
You might see some warnings that can be safely ignored, such as:
Objective did not converge. You might want to increase the number of iterations.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.
In this case, we can see that we achieved slightly better results than the default: a MAE of about 3.378 vs. 3.682. Ignore the sign; the library makes the MAE negative for optimization purposes.
We can see that the search assigned an alpha weight of 0.01 to the penalty and an l1_ratio of 0.97, focusing almost exclusively on the L1 penalty.
MAE: -3.378
Config: {'alpha': 0.01, 'l1_ratio': 0.97}
The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the ElasticNetCV class.
To use this class, it is first fit on the dataset, then used to make a prediction. It will automatically find appropriate hyperparameters.
By default, the model will test 100 alpha values and use a default ratio. We can specify our own lists of values to test via the “l1_ratio” and “alphas” arguments, as we did with the manual grid search.
The example below demonstrates this.
# use automatically configured elastic net algorithm
from numpy import arange
from pandas import read_csv
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
ratios = arange(0, 1, 0.01)
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
model = ElasticNetCV(l1_ratio=ratios, alphas=alphas, cv=cv, n_jobs=-1)
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)
print('l1_ratio_: %f' % model.l1_ratio_)
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.
Again, you might see some warnings that can be safely ignored, such as:
Objective did not converge. You might want to increase the number of iterations.
In this case, we can see that an alpha of 0.0 was chosen, removing both penalties from the loss function.
This is different from what we found via our manual grid search, perhaps due to the systematic way in which configurations were searched or selected.
alpha: 0.000000
l1_ratio_: 0.470000
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
- The Elements of Statistical Learning, 2016.
APIs
- sklearn.linear_model.ElasticNet API. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html
- sklearn.linear_model.ElasticNetCV API. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html
Summary
In this tutorial, you discovered how to develop Elastic Net regularized regression in Python.
Specifically, you learned:
- Elastic Net is an extension of linear regression that adds regularization penalties to the loss function during training.
- How to evaluate an Elastic Net model and use a final model to make predictions for new data.
- How to configure the Elastic Net model for a new dataset via grid search and automatically.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Dear Dr Jason,
Thank you again for your instructive tutorials.
I have a question please on the MAE. Please bear with me.
With correlation, the values are bounded between -1 and 1, so we know the boundaries. If the value is 0.85 there is a strong positive correlation. Likewise, a correlation of -0.85 is indicative of a strong negative correlation.
With the MAE, we just see a figure; there is no obvious upper or lower bound. Is there an ideal MAE?
Thank you,
Anthony of Sydney
Yes, ideal MAE is 0.0 (zero error).
A good MAE is relative to a naive model:
https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance
Dear Dr Jason,
Thank you for drawing attention to the abovementioned page “how-to-know-if-a-model-has-good-performance”.
From the topic, “…what we mean when we talk about model skill being relative, not absolute, it is relative to the skill of the baseline method….”
That is, you need to compare the MAE with that of a baseline model: you need to compare two models.
Question: so what is the definition of a ‘baseline’ model?
Thank you,
Anthony of Sydney
The definition of baseline models for each problem type is listed here:
https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance
For regression, predict the mean value, or use this:
https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html
Dear Dr Jason,
Thank you
Anthony of Sydney
You’re welcome.
Dear Dr Jason,
Thank you again for the reply.
I did some minor experimentation and research, and found a ‘hierarchy’ of linear regression models:
Do all 1 to 4 and look for the maximum score.
Then, if the regression model uses particular parameters, do either:
Am I on the right track?
Thank you,
Anthony of Sydney
Yes, although Elastic net can simulate all 4.
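For example, here is a sketch of how the ElasticNet class can be configured to approximate each case (note that scikit-learn recommends LinearRegression instead of alpha=0):

from sklearn.linear_model import ElasticNet
# ordinary linear regression: no penalty at all
ols = ElasticNet(alpha=0.0)
# ridge regression: L2 penalty only
ridge = ElasticNet(alpha=1.0, l1_ratio=0.0)
# lasso regression: L1 penalty only
lasso = ElasticNet(alpha=1.0, l1_ratio=1.0)
# elastic net: a blend of both penalties
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)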
Dear Dr Jason,
Thank you again,
Anthony of Sydney
You’re welcome.
Thank you very much for this great article.
Can we use PCA and Standard Scaler while using ElasticNet?
Thanks in advance
Yes.
Hi Jason,
Can elastic net be applied to a classification problem?
No, it is a regression technique.
How can I get the significance values of the coefficients? I understand I can get the coefficients themselves using model.coef_
I’m trained in data science. My understanding is I need the coefficient value itself, the standard error, and the Degrees of Freedom. I think I can get the coefficient value, and the Degrees of Freedom, but how do I get the standard error?
A complete answer would be nice.
I actually use z whitened x predictor terms. So my standard errors should all be the same… but I don’t know how to extract them. Maybe it’s something like model.se_?
You may have to use a different API to fit the model and develop an analysis, perhaps scipy.
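One option that stays within scikit-learn is to estimate the standard errors empirically via the bootstrap; this is a sketch of a general resampling idea, not a built-in feature (there is no model.se_ attribute):

# bootstrap estimate of coefficient standard errors (a sketch)
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.utils import resample

def bootstrap_coef_se(X, y, n_boot=1000, seed=1):
    rng = np.random.RandomState(seed)
    coefs = []
    for _ in range(n_boot):
        # resample rows with replacement and refit the model
        Xb, yb = resample(X, y, random_state=rng)
        coefs.append(ElasticNet(alpha=1.0, l1_ratio=0.5).fit(Xb, yb).coef_)
    # the standard deviation across fits approximates the standard error
    return np.std(coefs, axis=0)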
I’m trying to use this in a pipeline to extract the best alphas and lambdas; can you assist?
This is the subsection of code I’m working with
estimators = []
estimators.append(('standardize', ZCA()))
#estimators.append(('ElasticNetCV', ElasticNetCV(cv=10, random_state=0)))
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
ratios = arange(0, 1, 0.01)
alphas = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
estimators.append(('ElasticNetCV', ElasticNetCV(l1_ratio=ratios, alphas=alphas, cv=cv, n_jobs=-1)))
model = Pipeline(estimators)
model.fit(X, y)
#print('alpha: %f' % model.alphas)
#print('l1_ratio_: %f' % model.l1_ratio)
I was able to figure it out
https://github.com/thistleknot/python-ml/blob/master/code/ElasticNetCV.ipynb