How to Develop Ridge Regression Models in Python

By Jason Brownlee on October 11, 2020 in Python Machine Learning

Regression is a modeling task that involves predicting a numeric value given an input.

Linear regression is the standard algorithm for regression that assumes a linear relationship between inputs and the target variable. An extension to linear regression involves adding penalties to the loss function during training that encourage simpler models that have smaller coefficient values. These extensions are referred to as regularized linear regression or penalized linear regression.

Ridge Regression is a popular type of regularized linear regression that includes an L2 penalty. This has the effect of shrinking the coefficients for those input variables that do not contribute much to the prediction task.

In this tutorial, you will discover how to develop and evaluate Ridge Regression models in Python.

After completing this tutorial, you will know:

Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
How to configure the Ridge Regression model for a new dataset via grid search and automatically.

Let’s get started.

Update Oct/2020: Updated code in the grid search procedure to match description.

Photo by Susanne Nilsson, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Ridge Regression
Example of Ridge Regression
Tuning Ridge Hyperparameters

Ridge Regression

Linear regression refers to a model that assumes a linear relationship between input variables and the target variable.

With a single input variable, this relationship is a line, and with higher dimensions, this relationship can be thought of as a hyperplane that connects the input variables to the target variable. The coefficients of the model are found via an optimization process that seeks to minimize the sum squared error between the predictions (yhat) and the expected target values (y).

loss = sum i=1 to n (y_i - yhat_i)^2
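
As a minimal sketch, the sum squared error can be computed directly with NumPy. The function name sum_squared_error and the three example values are made up for illustration and do not come from the housing dataset.

```python
# minimal sketch: sum squared error between expected targets and predictions
from numpy import array

def sum_squared_error(y, yhat):
    # square each residual and add them up
    return float(((y - yhat) ** 2).sum())

# hypothetical values, for illustration only
y = array([3.0, 5.0, 7.0])
yhat = array([2.5, 5.0, 8.0])
print(sum_squared_error(y, yhat))
```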

A problem with linear regression is that estimated coefficients of the model can become large, making the model sensitive to inputs and possibly unstable. This is particularly true for problems with few observations (samples) or fewer samples (n) than input predictors (p) or variables (so-called p >> n problems).

One approach to address the stability of regression models is to change the loss function to include additional costs for a model that has large coefficients. Linear regression models that use these modified loss functions during training are referred to collectively as penalized linear regression.

One popular penalty is to penalize a model based on the sum of the squared coefficient values (beta). This is called an L2 penalty.

l2_penalty = sum j=1 to p beta_j^2

An L2 penalty shrinks the size of all coefficients, although it does not remove any coefficients from the model, because it does not drive their values to exactly zero.

The effect of this penalty is that the parameter estimates are only allowed to become large if there is a proportional reduction in SSE. In effect, this method shrinks the estimates towards 0 as the lambda penalty becomes large (these techniques are sometimes called “shrinkage methods”).

— Page 123, Applied Predictive Modeling, 2013.
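
The shrinkage effect can be sketched on a small synthetic dataset. The dataset from make_regression() and the alpha values below are assumptions chosen for illustration, and alpha is the scikit-learn name for the lambda penalty weighting. The size (L2 norm) of the fitted coefficients should fall as the penalty grows, without any coefficient reaching exactly zero.

```python
# minimal sketch: coefficients shrink as the penalty weighting grows
from numpy.linalg import norm
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=30, n_features=10, noise=5.0, random_state=1)
norms = list()
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    # fit a ridge model with this penalty weighting
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # record the overall size of the coefficient vector
    norms.append(norm(model.coef_))
    print('alpha=%g, coefficient norm=%.3f' % (alpha, norms[-1]))
# check whether any coefficient was set to exactly zero by the largest penalty
print('Any zero coefficients: %s' % (model.coef_ == 0).any())
```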

This penalty can be added to the cost function for linear regression and is referred to as Tikhonov regularization (after the author), or Ridge Regression more generally.

A hyperparameter called “lambda” controls the weighting of the penalty in the loss function. A default value of 1.0 will fully weight the penalty; a value of 0 excludes the penalty. Very small values of lambda, such as 1e-3 or smaller, are common.

ridge_loss = loss + (lambda * l2_penalty)
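
The ridge loss can be sketched in a few lines of NumPy. The example below is an illustration on synthetic data, not the housing dataset. It omits the intercept term for simplicity and uses the closed-form minimizer of this loss, (X^T X + lambda I)^-1 X^T y, which we then compare against the scikit-learn Ridge class with fit_intercept=False. The names ridge_loss, lam, and beta are our own.

```python
# minimal sketch: ridge loss and its closed-form minimizer on synthetic data
from numpy import eye
from numpy import allclose
from numpy.linalg import solve
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

def ridge_loss(X, y, beta, lam):
    # sum squared error plus the weighted L2 penalty
    residuals = y - X.dot(beta)
    return float((residuals ** 2).sum() + lam * (beta ** 2).sum())

# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=1)
lam = 1.0
# closed-form solution without an intercept term
beta = solve(X.T.dot(X) + lam * eye(X.shape[1]), X.T.dot(y))
# scikit-learn minimizes the same loss when the intercept is disabled
model = Ridge(alpha=lam, fit_intercept=False)
model.fit(X, y)
print('Same coefficients: %s' % allclose(beta, model.coef_))
# moving away from the minimizer should not reduce the loss
print('Loss at minimizer: %.3f' % ridge_loss(X, y, beta, lam))
print('Loss at scaled coefficients: %.3f' % ridge_loss(X, y, beta * 1.1, lam))
```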

Now that we are familiar with Ridge penalized regression, let’s look at a worked example.

Example of Ridge Regression

In this section, we will demonstrate how to use the Ridge Regression algorithm.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]

The scikit-learn Python machine learning library provides an implementation of the Ridge Regression algorithm via the Ridge class.

Confusingly, the lambda term can be configured via the “alpha” argument when defining the class. The default value is 1.0 or a full penalty.

...
# define model
model = Ridge(alpha=1.0)
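
To build some intuition for the “alpha” argument, the sketch below compares ridge coefficients against plain linear regression on a synthetic dataset. The dataset and the alpha values are assumptions for illustration. With a tiny alpha the two models should be nearly identical, and the gap should widen as alpha grows.

```python
# minimal sketch: effect of the alpha argument relative to linear regression
from numpy.linalg import norm
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=1)
# unpenalized reference model
linear = LinearRegression()
linear.fit(X, y)
gaps = list()
for alpha in [1e-8, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # distance between the ridge and linear regression coefficients
    gaps.append(norm(model.coef_ - linear.coef_))
    print('alpha=%g, distance from linear regression=%.6f' % (alpha, gaps[-1]))
```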

We can evaluate the Ridge Regression model on the housing dataset using repeated 10-fold cross-validation and report the average mean absolute error (MAE) on the dataset.

# evaluate a ridge regression model on the dataset
from numpy import mean
from numpy import std
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge(alpha=1.0)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Ridge Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Your specific results may vary slightly given differences in library versions or numerical precision; the cross-validation splits themselves are fixed by the random_state argument.

In this case, we can see that the model achieved a MAE of about 3.382.

Mean MAE: 3.382 (0.519)
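
A score like this is easier to interpret next to a naive baseline evaluated with the same test harness. The sketch below uses a synthetic dataset so that it runs without a download, and it uses the scikit-learn DummyRegressor class, predicting the median, as the naive model. Both choices are assumptions for illustration and the scores will not match the housing results.

```python
# minimal sketch: compare ridge regression to a naive baseline
from numpy import mean
from numpy import absolute
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=200, n_features=13, noise=20.0, random_state=1)
# same evaluation procedure for both models
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
models = [('naive', DummyRegressor(strategy='median')), ('ridge', Ridge(alpha=1.0))]
results = dict()
for name, model in models:
    scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)
    # force scores to be positive
    results[name] = mean(absolute(scores))
    print('%s MAE: %.3f' % (name, results[name]))
```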

We may decide to use the Ridge Regression as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a ridge regression model on the dataset
from pandas import read_csv
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge(alpha=1.0)
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Predicted: 30.253
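
If we want to see how a prediction is formed, the fit model exposes its learned values via the coef_ and intercept_ attributes. The sketch below, on a synthetic dataset chosen for illustration, reproduces a prediction by hand as the intercept plus the weighted sum of the inputs.

```python
# minimal sketch: reproduce a ridge prediction from the learned coefficients
from numpy import dot
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=1)
# fit model
model = Ridge(alpha=1.0)
model.fit(X, y)
# use the first row as the "new" data
row = X[0]
# prediction from the library
yhat = model.predict([row])[0]
# prediction by hand: intercept plus weighted sum of inputs
manual = model.intercept_ + dot(model.coef_, row)
print('Predicted: %.3f, Manual: %.3f' % (yhat, manual))
```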

Next, we can look at configuring the model hyperparameters.

Tuning Ridge Hyperparameters

How do we know that the default hyperparameter of alpha=1.0 is appropriate for our dataset?

We don’t.

Instead, it is good practice to test a suite of different configurations and discover what works best for our dataset.

One approach would be to grid search alpha values from perhaps 1e-5 to 100 on a log scale and discover what works best for a dataset. Another approach would be to test values between 0.0 and 1.0 with a grid separation of 0.01. We will try the latter in this case.

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

# grid search hyperparameters for ridge regression
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model
model = Ridge()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary slightly given differences in library versions or numerical precision; the cross-validation splits themselves are fixed by the random_state argument.

In this case, we can see that we achieved slightly better results than the default: 3.379 vs. 3.382. Ignore the sign; the library makes the MAE negative for optimization purposes.

We can see that the grid search selected an alpha value of 0.51 for the penalty weighting.

MAE: -3.379
Config: {'alpha': 0.51}
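
The other approach mentioned earlier, a grid of alpha values from 1e-5 to 100 on a log scale, can be sketched with the NumPy logspace() function. The example uses a synthetic dataset so that it runs without a download. This is an assumption for illustration, so the chosen alpha and score will not match the housing results.

```python
# minimal sketch: grid search alpha on a log scale
from numpy import logspace
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=100, n_features=13, noise=20.0, random_state=1)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid: 1e-5, 1e-4, ..., 10, 100
grid = dict()
grid['alpha'] = logspace(-5, 2, 8)
# define and perform the search
search = GridSearchCV(Ridge(), grid, scoring='neg_mean_absolute_error', cv=cv)
results = search.fit(X, y)
# summarize, flipping the sign to report a positive MAE
print('MAE: %.3f' % -results.best_score_)
print('Config: %s' % results.best_params_)
```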

The scikit-learn library also provides a built-in version of the algorithm that automatically finds good hyperparameters via the RidgeCV class.

To use this class, it is fit on the training dataset and used to make a prediction. During the training process, it automatically tunes the hyperparameter values.

By default, the model will only test the alpha values (0.1, 1.0, 10.0). We can change this to a grid of values between 0 and 1 with a separation of 0.01, as we did in the previous example, by setting the “alphas” argument.

The example below demonstrates this.

# use the automatically configured ridge regression algorithm
from numpy import arange
from pandas import read_csv
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')
# fit model
model.fit(X, y)
# summarize chosen configuration
print('alpha: %f' % model.alpha_)

Running the example fits the model and discovers the hyperparameters that give the best results using cross-validation.

Your specific results may vary slightly given differences in library versions or numerical precision; the cross-validation splits themselves are fixed by the random_state argument.

In this case, we can see that the model chose the identical hyperparameter of alpha=0.51 that we found via our manual grid search.

alpha: 0.510000
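
Once fit, the RidgeCV model can be used directly to make predictions with the chosen alpha. The sketch below uses a synthetic dataset and a small log-scale grid of strictly positive alpha values, both assumptions for illustration. It leaves the “cv” argument at its default, in which case the class uses an efficient form of leave-one-out cross-validation.

```python
# minimal sketch: fit RidgeCV and use it to make a prediction
from numpy import logspace
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
# synthetic stand-in data (an assumption, not the housing dataset)
X, y = make_regression(n_samples=100, n_features=13, noise=20.0, random_state=1)
# candidate alpha values: 1e-3, 1e-2, ..., 100
alphas = logspace(-3, 2, 6)
# define model, leaving cv at its default
model = RidgeCV(alphas=alphas)
# fit model, which also selects alpha
model.fit(X, y)
print('alpha: %f' % model.alpha_)
# make a prediction for one row using the tuned model
row = X[0]
yhat = model.predict([row])
print('Predicted: %.3f' % yhat[0])
```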

Summary

In this tutorial, you discovered how to develop and evaluate Ridge Regression models in Python.

Specifically, you learned:

Ridge Regression is an extension of linear regression that adds a regularization penalty to the loss function during training.
How to evaluate a Ridge Regression model and use a final model to make predictions for new data.
How to configure the Ridge Regression model for a new dataset via grid search and automatically.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

22 Responses to How to Develop Ridge Regression Models in Python

Asad Mumtaz October 9, 2020 at 3:09 pm #

Hi Jason,

Another simple, to-the-point article as always.

There is a sentence under the Ridge Regression section:
“This is particularly true for problems with few observations (samples) or more samples (n) than input predictors (p) or variables (so-called p >> n problems).”

Unless I am wrong, I believe this should have instead read “…less samples (n) than input predictors (p)…”?

Reply
- Jason Brownlee October 10, 2020 at 6:58 am #
  
  Thanks! Fixed.
  
  Reply
  - Tom Weichle August 4, 2021 at 9:33 am #
    
    Hi Jason, I also noticed this error in your tutorial for Lasso and Elastic Net regression.
    
    Reply
    - Jason Brownlee August 5, 2021 at 5:14 am #
      
      Thanks.
      
      Reply
Ramesh Ravula October 9, 2020 at 4:38 pm #

L2 penalty looks different from L2 regularization. Are they really different? What is the difference?

Reply
- Jason Brownlee October 10, 2020 at 6:59 am #
  
  Same thing. L2 of model weights/coefficient added to loss.
  
  In neural nets we call it weight decay:
  https://machinelearningmastery.com/weight-regularization-to-reduce-overfitting-of-deep-learning-models/
  
  Reply
fabou October 11, 2020 at 2:43 am #

Hi,

if :

grid[‘alpha’] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]

then :

Config: {‘alpha’: 0.51}

is not possible as 0.51 is not in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]

Reply
- Jason Brownlee October 11, 2020 at 6:58 am #
  
  Thanks, looks like I pasted the wrong version of the code in the tutorial. Fixed!
  
  Reply
Lola November 16, 2020 at 2:03 am #

Hi, is there more information for kernalised ridge regression?

Reply
- Jason Brownlee November 16, 2020 at 6:30 am #
  
  Yes, right here:
  https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html
  
  Reply
munio November 24, 2020 at 1:18 am #

hello, Thank you for this best tutorial for the topic, that I found:)

I have a question.
How to tune further the parameters in Ridge? My prediction is somehow ‘shifted’ in relation to ground truth data. Do you think that the reason is not-normalized data?
Thx,

Reply
- Jason Brownlee November 24, 2020 at 6:20 am #
  
  Perhaps some of these suggestions will help:
  https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
  
  Reply
Sam McDonald December 11, 2020 at 4:07 pm #

Is MAE in % here?

Reply
- Jason Brownlee December 12, 2020 at 6:22 am #
  
  MAE (mean absolute error) is the average error; it is not a percentage.
  
  Reply
Steven May 13, 2021 at 5:52 am #

So, how do you get it to output the end equation for you once it has fit the model appropriately?

Reply
- Jason Brownlee May 13, 2021 at 6:08 am #
  
  We generally do not, e.g. this is applied ML, we want a model for use in software, not an equation.
  
  Nevertheless, I suspect you can retrieve the coefficients from the fit model and determine how they are used to make predictions by reading the open-source code library.
  
  Reply
pree July 7, 2021 at 7:44 pm #

Why Ridge with Tensorflow or Keras give me a different result with sklearn at high alpha(2000)?

make_regression Dataset

X, y, coef = make_regression(
n_samples=100,
n_features=n_features,
n_informative=n_features,
n_targets=1,
noise=0,
effective_rank=1000,
coef=True,
)

Reply
- Jason Brownlee July 8, 2021 at 6:06 am #
  
  Perhaps this will help:
  https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
  
  Reply
Sharomonk July 15, 2021 at 9:20 pm #

Thanks for doing this. I ran same dataset for LinearRegression and got a better prediction on same defined new data, why?

Reply
- Jason Brownlee July 16, 2021 at 5:23 am #
  
  Perhaps linear regression is better suited to your dataset.
  
  Reply
Gel December 3, 2021 at 7:37 am #

Hi I just have a question about the data you applied Ridge Regression to. Was it normalized to begin with? If not, why did you not normalize the data? Thanks!

Reply
- Adrian Tam December 8, 2021 at 6:47 am #
  
  Ridge regression does not necessarily need normalized data, but normalization usually provides a more stable result.
  
  Reply
