Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn

By Jason Brownlee on August 28, 2020 in Python Machine Learning

Spot-checking is a way of discovering which algorithms perform well on your machine learning problem.

You cannot know beforehand which algorithms are best suited to your problem. You must trial a number of methods and focus attention on those that prove the most promising.

In this post you will discover 7 machine learning algorithms that you can use when spot checking your regression problem in Python with scikit-learn.

Let’s get started.

Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Algorithms Overview

We are going to take a look at 7 regression algorithms that you can spot check on your dataset.

4 Linear Machine Learning Algorithms:

Linear Regression
Ridge Regression
LASSO Linear Regression
Elastic Net Regression

3 Nonlinear Machine Learning Algorithms:

K-Nearest Neighbors
Classification and Regression Trees
Support Vector Machines

Each recipe is demonstrated on the Boston House Price dataset. This is a regression problem where all attributes are numeric (update: download data from here).

Each recipe is complete and standalone. This means that you can copy and paste it into your own project and start using it immediately.

A test harness with 10-fold cross validation is used to demonstrate how to spot check each machine learning algorithm, and mean squared error is used to indicate algorithm performance. Note that the reported mean squared error values are negated. This is a convention of the cross_val_score() function: scikit-learn scoring treats larger values as better, so error metrics are returned as negative numbers and values closer to zero indicate a better fit.
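As a minimal sketch of how to read these negated scores, the example below flips the sign back to get an ordinary mean squared error and takes the square root to get an error in the units of the output variable. The synthetic dataset from make_regression() is an assumption made only to keep the example self-contained; it is not the Boston House Price data.

```python
# Sketch: interpreting the negated MSE returned by cross_val_score().
# The synthetic dataset is an assumption used to keep this self-contained.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

kfold = KFold(n_splits=10)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold,
                         scoring='neg_mean_squared_error')

# Scores are negated errors: every value is <= 0 and closer to zero is better.
mse_per_fold = -scores
rmse_per_fold = np.sqrt(mse_per_fold)

print("Mean negated MSE: %.3f" % scores.mean())
print("Mean MSE: %.3f (std %.3f)" % (mse_per_fold.mean(), mse_per_fold.std()))
print("Mean RMSE: %.3f" % rmse_per_fold.mean())
```

Reporting the standard deviation across folds alongside the mean gives a rough idea of how stable the estimate is.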

The recipes assume that you know about each machine learning algorithm and how to use them. We will not go into the API or parameterization of each algorithm.

Linear Machine Learning Algorithms

This section provides examples of how to use 4 different linear machine learning algorithms for regression in Python with scikit-learn.

1. Linear Regression

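Linear regression assumes a linear relationship between the input variables and the output variable and, strictly speaking, that the residuals (rather than the inputs themselves) have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity).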
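One rough way to look for collinearity before fitting is to inspect the pairwise correlations between input columns. The sketch below does this on made-up data in which one column is deliberately built from another; the data and the 0.9 cut-off are illustrative assumptions, not values from this post.

```python
# Sketch: flagging highly correlated input columns (possible collinearity).
# The data and the 0.9 threshold are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(7)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = 2.0 * a + rng.normal(scale=0.1, size=200)  # nearly a rescaled copy of a
X = np.column_stack([a, b, c])

# correlation between columns (rowvar=False treats columns as variables)
corr = np.corrcoef(X, rowvar=False)

threshold = 0.9
n_features = X.shape[1]
pairs = [(i, j, corr[i, j])
         for i in range(n_features)
         for j in range(i + 1, n_features)
         if abs(corr[i, j]) > threshold]

for i, j, r in pairs:
    print("columns %d and %d: correlation %.3f" % (i, j, r))
```

Pairwise correlation only catches collinearity between two columns at a time, so treat it as a first check rather than a complete diagnosis.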

You can construct a linear regression model using the LinearRegression class.

# Linear Regression
import pandas
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = LinearRegression()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example provides an estimate of mean squared error.

-34.7052559445

2. Ridge Regression

Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model, measured as the sum of the squared coefficient values (the squared L2-norm).
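To make the effect of the penalty concrete, the sketch below fits scikit-learn's Ridge class with three values of its alpha parameter (the penalty strength) and prints the sum of squared coefficients for each. The synthetic dataset and the particular alpha values are assumptions chosen for illustration.

```python
# Sketch: a larger ridge penalty (alpha) shrinks the coefficients.
# Synthetic data and the alpha values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=7)

sum_squares = {}
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    # the quantity the ridge penalty acts on: sum of squared coefficients
    sum_squares[alpha] = float(np.sum(model.coef_ ** 2))
    print("alpha=%g  sum of squared coefficients=%.3f" % (alpha, sum_squares[alpha]))
```

The printed sums decrease as alpha grows, which is the shrinkage that the penalty is designed to produce.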

You can construct a ridge regression model by using the Ridge class.

# Ridge Regression
import pandas
from sklearn import model_selection
from sklearn.linear_model import Ridge
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = Ridge()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-34.0782462093

3. LASSO Regression

The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model, measured as the sum of the absolute coefficient values (the L1-norm).
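A practical consequence of the L1 penalty is that it can drive some coefficients to exactly zero, which is where the "selection" in the name comes from. The sketch below counts zero coefficients for a few values of the Lasso class's alpha parameter; the synthetic dataset (10 inputs, only 3 of them informative) and the alpha values are assumptions for illustration.

```python
# Sketch: the L1 penalty can set coefficients to exactly zero.
# Synthetic data and the alpha values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 10 inputs, only 3 of which actually drive the output
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=7)

zero_counts = {}
for alpha in [0.1, 10.0, 1000.0]:
    model = Lasso(alpha=alpha)
    model.fit(X, y)
    zero_counts[alpha] = int(np.sum(model.coef_ == 0))
    print("alpha=%g  coefficients equal to zero: %d of %d"
          % (alpha, zero_counts[alpha], X.shape[1]))
```

With a very large penalty every coefficient is removed and the model predicts only the mean of the output, so alpha usually needs tuning.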

You can construct a LASSO model by using the Lasso class.

# Lasso Regression
import pandas
from sklearn import model_selection
from sklearn.linear_model import Lasso
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = Lasso()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-34.4640845883

4. ElasticNet Regression

ElasticNet is a form of regularized regression that combines the properties of both ridge regression and LASSO regression. It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the squared L2-norm (sum of squared coefficient values) and the L1-norm (sum of absolute coefficient values).
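In scikit-learn the balance between the two penalties is set with the l1_ratio parameter of the ElasticNet class. The sketch below tries a few values on synthetic data (an assumption made for illustration) and checks that l1_ratio=1.0 gives the same coefficients as Lasso with the same alpha.

```python
# Sketch: l1_ratio mixes the L1 and L2 penalties in ElasticNet.
# Synthetic data and the parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=7)

# 1.0 is a pure L1 penalty; values towards 0.0 put more weight on L2
for l1_ratio in [0.1, 0.5, 1.0]:
    model = ElasticNet(alpha=1.0, l1_ratio=l1_ratio)
    model.fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print("l1_ratio=%.1f  zero coefficients: %d" % (l1_ratio, n_zero))

# with l1_ratio=1.0 the ElasticNet penalty reduces to the LASSO penalty
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
same = bool(np.allclose(enet_as_lasso.coef_, lasso.coef_))
print("Matches Lasso when l1_ratio=1.0:", same)
```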

You can construct an ElasticNet model using the ElasticNet class.

# ElasticNet Regression
import pandas
from sklearn import model_selection
from sklearn.linear_model import ElasticNet
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = ElasticNet()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-31.1645737142

Nonlinear Machine Learning Algorithms

This section provides examples of how to use 3 different nonlinear machine learning algorithms for regression in Python with scikit-learn.

1. K-Nearest Neighbors

K-Nearest Neighbors (or KNN) locates the K most similar instances in the training dataset for a new data instance. From the K neighbors, a mean or median output variable is taken as the prediction. Of note is the distance metric used (the metric argument). The Minkowski distance is used by default, which is a generalization of both the Euclidean distance (used when all inputs have the same scale) and Manhattan distance (for when the scales of the input variables differ). With the default setting of p=2 it is equivalent to the Euclidean distance.
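The value of K and the distance metric are set when the model is constructed, before it is passed to cross_val_score(). The sketch below compares the defaults with two alternative configurations; the synthetic dataset and the choice of K=3 are assumptions made for illustration.

```python
# Sketch: configuring K and the distance metric for KNeighborsRegressor.
# Synthetic data and the chosen settings are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
kfold = KFold(n_splits=10)

# n_neighbors is K; p selects the Minkowski distance
# (p=1 is Manhattan, p=2 is Euclidean)
configs = {
    'K=5, Euclidean (defaults)': KNeighborsRegressor(),
    'K=3, Euclidean': KNeighborsRegressor(n_neighbors=3, p=2),
    'K=3, Manhattan': KNeighborsRegressor(n_neighbors=3, p=1),
}

knn_results = {}
for name, model in configs.items():
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    knn_results[name] = scores.mean()
    print("%-26s %.3f" % (name, knn_results[name]))
```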

You can construct a KNN model for regression using the KNeighborsRegressor class.

# KNN Regression
import pandas
from sklearn import model_selection
from sklearn.neighbors import KNeighborsRegressor
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = KNeighborsRegressor()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-107.28683898

2. Classification and Regression Trees

Decision trees, or Classification and Regression Trees (CART, as they are known), use the training data to select the best points at which to split the data in order to minimize a cost metric. The default cost metric for regression decision trees is the mean squared error, specified in the criterion parameter.
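Tree construction in scikit-learn involves some randomness (for example, ties between equally good splits are broken at random), so scores can differ slightly from run to run. A minimal sketch of one way to make a run repeatable is to fix the random_state argument of DecisionTreeRegressor; the synthetic dataset, the evaluate_tree() helper, and the seed value of 7 are assumptions for illustration.

```python
# Sketch: fixing random_state makes decision tree results repeatable.
# Synthetic data, the helper name, and the seed are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
kfold = KFold(n_splits=10)

def evaluate_tree(seed):
    """Return the per-fold negated MSE for a tree built with the given seed."""
    model = DecisionTreeRegressor(random_state=seed)
    return cross_val_score(model, X, y, cv=kfold,
                           scoring='neg_mean_squared_error')

first = evaluate_tree(7)
second = evaluate_tree(7)
repeatable = bool(np.array_equal(first, second))

print("Mean negated MSE: %.3f" % first.mean())
print("Same scores on both runs:", repeatable)
```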

You can create a CART model for regression using the DecisionTreeRegressor class.

# Decision Tree Regression
import pandas
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = DecisionTreeRegressor()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-35.4906027451

3. Support Vector Machines

Support Vector Machines (SVM) were developed for binary classification. The technique has been extended to the prediction of real-valued outputs, where it is called Support Vector Regression (SVR). Like the SVM classifier, SVR in scikit-learn is built upon the LIBSVM library.
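SVR with its default RBF kernel is based on distances between rows, so it can be sensitive to the scale of the inputs. The sketch below is one possible way to check this, not part of the original recipes: it evaluates SVR with and without a StandardScaler step in a Pipeline so that scaling is fit inside each cross validation fold. The synthetic dataset and the column scale factors are assumptions.

```python
# Sketch: evaluating SVR with and without input standardization.
# Synthetic data and the column scale factors are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)
# give the columns very different scales, as in many real datasets
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])

kfold = KFold(n_splits=10)
candidates = [
    ('SVR', SVR()),
    # the pipeline fits the scaler on the training folds only
    ('StandardScaler + SVR', make_pipeline(StandardScaler(), SVR())),
]

svr_results = {}
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    svr_results[name] = scores.mean()
    print("%-22s %.3f" % (name, svr_results[name]))
```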

You can create an SVM model for regression using the SVR class.

# SVM Regression
import pandas
from sklearn import model_selection
from sklearn.svm import SVR
# load the whitespace-delimited dataset (the file has no header row)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, sep=r'\s+', names=names)
array = dataframe.values
# split into input (X) and output (Y) columns
X = array[:,0:13]
Y = array[:,13]
# 10-fold cross validation; the folds are not shuffled, so no random_state is set
kfold = model_selection.KFold(n_splits=10)
model = SVR()
# negated mean squared error: values closer to zero are better
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())

Running the example provides an estimate of the mean squared error.

-91.0478243332

Summary

In this post you discovered machine learning recipes for regression in Python using scikit-learn.

Specifically, you learned about:

4 Linear Machine Learning Algorithms:

Linear Regression
Ridge Regression
LASSO Linear Regression
Elastic Net Regression

3 Nonlinear Machine Learning Algorithms:

K-Nearest Neighbors
Classification and Regression Trees
Support Vector Machines
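
The seven recipes can also be run in a single loop so that the results are easy to compare side by side. The sketch below is one way to do that; it uses a synthetic dataset from make_regression() with 13 inputs as a stand-in (an assumption, so the printed numbers will not match the Boston House Price results above).

```python
# Sketch: spot-checking all seven algorithms in one loop.
# The synthetic dataset is an assumption standing in for your own data.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

models = [
    ('LinearRegression', LinearRegression()),
    ('Ridge', Ridge()),
    ('Lasso', Lasso()),
    ('ElasticNet', ElasticNet()),
    ('KNeighborsRegressor', KNeighborsRegressor()),
    ('DecisionTreeRegressor', DecisionTreeRegressor()),
    ('SVR', SVR()),
]

kfold = KFold(n_splits=10)
spot_check = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold,
                             scoring='neg_mean_squared_error')
    spot_check[name] = (scores.mean(), scores.std())

# negated MSE: the value closest to zero is the best, so sort descending
ranked = sorted(spot_check.items(), key=lambda item: item[1][0], reverse=True)
for name, (mean, std) in ranked:
    print("%-22s %12.3f (%.3f)" % (name, mean, std))
```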

Do you have any questions about regression machine learning algorithms or this post? Ask your questions in the comments and I will do my best to answer them.

15 Responses to Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn

Joe Dorocak October 30, 2016 at 5:11 am #

Hi, Jason.

I got a message when I ran [LinearRegression, mean_squared_error] that said, “Scoring method mean_squared_error was renamed to neg_mean_squared_error in version 0.18 and will be removed in 0.20.”

Love and peace,
Joe

- Jason Brownlee October 30, 2016 at 8:57 am #
  
  Yes, I need to update the examples for v0.18 of sklearn.
  
Joe Dorocak October 30, 2016 at 5:31 am #

Hi Jason.
Here’s the Values i got.

Scoring: neg_mean_squared_error

Dataset  Group    Model    Value
-------  -------  -------  --------
Boston   Linear   LinRegr  -34.705
Boston   Linear   Ridge    -34.078
Boston   Linear   Lasso    -34.464
Boston   Linear   ElastN   -31.165
Boston   Non-Lin  KNRegr   -107.287
Boston   Clas+Tr  DecTrRg  -36.348
Boston   SpVMs    SVR      -91.048

Thanks for everything.

Love and peace,

Joe

Aniket Saxena October 25, 2017 at 3:51 am #

As we have multiple regression algorithms, so my question is, if we want to spot check the accurate algorithm which we have to use in order to solve the problem, so is it necessary to have checked all the algorithms one by one before making predictions to the problem or is there any best alternative you know for this linear work(any fast and accurate work to identify accurate algorithm for the problem)?

- Jason Brownlee October 25, 2017 at 6:53 am #
  
  When spot checking, we would evaluate each algorithm on the data, one at a time.
  
Aniket Saxena January 15, 2018 at 4:17 pm #

Hello Jason,

I have got these solutions to above problem:-

1. KNRegressor: -107.286839 as mean and 79.839530 as standard deviation.
2. RANSAC (robustness regression): -213.964101 as mean and 354.784695 as standard deviation
So, my question is, should I take into account KNregressor or RANSAC as albeit I have got a huge standard deviation in RANSAC, I also got big mean squared error(negative) which indicates better performance?

- Jason Brownlee January 16, 2018 at 7:31 am #
  
  Algorithm selection really comes down to the goals of your project.
  
jbo May 11, 2018 at 8:34 pm #

hi Jason :
What is the difference between Multitask Lasso and Elastic net , both are regularizing using L1/L2? And when to use Multi-task with Lasso or Elastic Net and its benefits ?

- Jason Brownlee May 12, 2018 at 6:31 am #
  
  I don’t know about multitask, but from memory, Lasso does L1, ridge does L2 and elastic net does L1 and L2.
  
Billa September 14, 2018 at 11:41 pm #

Hi Joson

How to set hyper parameters in model_selection.cross_val_score?

In the given example for KNN, I would like to pass hyper parameter k value but unable to do that.

Can you please help me on it.

- Jason Brownlee September 15, 2018 at 6:09 am #
  
  You configure the model first, e.g. when you define it.
  
Santosh Khanal October 20, 2020 at 9:01 am #

I follow your tutorials and reading sections regarding Machine Learning. Actually, I am working on Genetic Programming. Do you have any idea if there are some materials and tutorials on Genetic Programming. I have found a lot in your page and materials which has helped me to learn machine learning. I followed you on ANN but I am working on GP.

- Jason Brownlee October 20, 2020 at 1:39 pm #
  
  This might be a good place to start:
  https://amzn.to/2TarBAE
  
  - Santosh Khanal October 25, 2020 at 1:43 pm #
    
    Thank you! I have gone through the book you mentioned. I have grasped some basics theories and fundamentals behind GP. I was wondering if there are some other materials like the frameworks such that we have in ANN. The open-source frameworks in other ML models make us easy to work on. I rarely find such a framework which could make me build and work on GP models.
    
    - Jason Brownlee October 26, 2020 at 6:47 am #
      
      Sorry, I’m not aware of good modern libraries for genetic programming.
      
