Spot-checking is a way of discovering which algorithms perform well on your machine learning problem.
You cannot know which algorithms are best suited to your problem beforehand. You must trial a number of methods and focus attention on those that prove themselves the most promising.
In this post, you will discover 6 machine learning algorithms that you can use when spot-checking your classification problem in Python with scikit-learn.
Let’s get started.
- Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
- Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.
Algorithm Spot Checking
You cannot know which algorithm will work best on your dataset beforehand.
You must use trial and error to discover a shortlist of algorithms that do well on your problem, which you can then double down on and tune further. I call this process spot-checking.
The question is not:
What algorithm should I use on my dataset?
Instead it is:
What algorithms should I spot check on my dataset?
You can guess at what algorithms might do well on your dataset, and this can be a good starting point.
I recommend trying a mixture of algorithms and seeing which ones are good at picking out the structure in your data (a minimal sketch follows the list below).
- Try a mixture of algorithm representations (e.g. instances and trees).
- Try a mixture of learning algorithms (e.g. different algorithms for learning the same type of representation).
- Try a mixture of modeling types (e.g. linear and nonlinear functions or parametric and nonparametric).
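To make this concrete, here is a minimal sketch that spot checks one model of each flavor in a single loop, using the same dataset and 10-fold cross-validation harness as the recipes later in this post (the particular trio of models is an illustrative choice, not a recommendation):

import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression   # linear, parametric
from sklearn.neighbors import KNeighborsClassifier    # instance-based, nonparametric
from sklearn.tree import DecisionTreeClassifier       # tree-based, nonlinear

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = pandas.read_csv(url, names=names).values
X, Y = array[:, 0:8], array[:, 8]

# evaluate each model with the same 10-fold cross-validation harness
kfold = model_selection.KFold(n_splits=10)
for name, model in [('LR', LogisticRegression()),
                    ('KNN', KNeighborsClassifier()),
                    ('CART', DecisionTreeClassifier())]:
    scores = model_selection.cross_val_score(model, X, Y, cv=kfold)
    print('%s: %.3f' % (name, scores.mean()))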
Let’s get specific. In the next section, we will look at algorithms that you can use to spot check on your next machine learning project in Python.
Algorithms Overview
We are going to take a look at 6 classification algorithms that you can spot check on your dataset.
2 Linear Machine Learning Algorithms:
- Logistic Regression
- Linear Discriminant Analysis
4 Nonlinear Machine Learning Algorithms:
- K-Nearest Neighbors
- Naive Bayes
- Classification and Regression Trees
- Support Vector Machines
Each recipe is demonstrated on the Pima Indians Onset of Diabetes dataset. This is a binary classification problem where all attributes are numeric.
The dataset is available at the URL used in each recipe below:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
Each recipe is complete and standalone. This means that you can copy and paste it into your own project and start using it immediately.
Each recipe uses a test harness of 10-fold cross-validation to demonstrate how to spot check a machine learning algorithm, and mean classification accuracy is used to indicate algorithm performance.
The recipes assume that you know about each machine learning algorithm and how to use it. We will not go into the API or parameterization of each algorithm.
Linear Machine Learning Algorithms
This section demonstrates minimal recipes for how to use two linear machine learning algorithms: logistic regression and linear discriminant analysis.
1. Logistic Regression
Logistic regression is a linear method that models the probability of class membership for binary classification problems. It tends to perform well when the numeric input variables have a roughly Gaussian distribution, although it makes no strict distributional assumption.
You can construct a logistic regression model using the LogisticRegression class.
# Logistic Regression Classification
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# load the dataset directly from the URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# split the columns into inputs (X) and the class label (Y)
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
# note: newer scikit-learn versions require shuffle=True when random_state is set
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = LogisticRegression()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.76951469583
2. Linear Discriminant Analysis
Linear Discriminant Analysis or LDA is a statistical technique for binary and multi-class classification. It too assumes a Gaussian distribution for the numerical input variables.
You can construct an LDA model using the LinearDiscriminantAnalysis class.
# LDA Classification
import pandas
from sklearn import model_selection
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = LinearDiscriminantAnalysis()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.773462064252
Nonlinear Machine Learning Algorithms
This section demonstrates minimal recipes for how to use 4 nonlinear machine learning algorithms.
1. K-Nearest Neighbors
K-Nearest Neighbors (or KNN) uses a distance metric to find the K most similar instances in the training data for a new instance, and takes the most common class among those neighbors as the prediction (for regression, the mean outcome is used instead).
You can construct a KNN model using the KNeighborsClassifier class.
# KNN Classification
import pandas
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = KNeighborsClassifier()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.726555023923
2. Naive Bayes
Naive Bayes calculates the probability of each class and the conditional probability of each class given each input value. These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption).
When working with real-valued data, a Gaussian distribution is assumed so that the probabilities for the input variables can be estimated easily using the Gaussian Probability Density Function.
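For reference, the Gaussian Probability Density Function behind this estimate can be written in a few lines (a standalone illustration of the math, not part of scikit-learn's API):

import math

def gaussian_pdf(x, mean, stdev):
    # probability density of x under a Gaussian with the given mean and stdev
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# e.g. density of the value 1.0 under a standard normal distribution
print(gaussian_pdf(1.0, mean=0.0, stdev=1.0))  # ~0.2420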
You can construct a Naive Bayes model using the GaussianNB class.
# Gaussian Naive Bayes Classification
import pandas
from sklearn import model_selection
from sklearn.naive_bayes import GaussianNB
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = GaussianNB()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.75517771702
3. Classification and Regression Trees
Classification and Regression Trees (CART, or just decision trees) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (such as the Gini index, sketched below).
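To make the cost function concrete, here is a small, hypothetical helper for illustration only (scikit-learn computes this internally) that scores a candidate split with the Gini index:

def gini_index(groups, classes):
    # groups: lists of class labels, one list per branch of the candidate split
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        # sum of squared class proportions within this group
        score = sum((group.count(c) / size) ** 2 for c in classes)
        # weight the group's impurity by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini

# a perfect split scores 0.0; a worst-case two-class split scores 0.5
print(gini_index([[0, 0], [1, 1]], classes=[0, 1]))  # 0.0
print(gini_index([[0, 1], [1, 0]], classes=[0, 1]))  # 0.5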
You can construct a CART model using the DecisionTreeClassifier class.
# CART Classification
import pandas
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = DecisionTreeClassifier()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.692600820232
4. Support Vector Machines
Support Vector Machines (or SVM) seek a line (a hyperplane, in more than two dimensions) that best separates two classes. The training instances closest to that line are called support vectors and determine where it is placed. SVM has been extended to support multiple classes.
Of particular importance is the choice of kernel function via the kernel parameter. The powerful Radial Basis Function (RBF) kernel is used by default.
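For example (a short sketch of standard kernel options in scikit-learn):

from sklearn.svm import SVC

model_rbf = SVC()                          # Radial Basis Function kernel (the default)
model_linear = SVC(kernel='linear')        # linear kernel
model_poly = SVC(kernel='poly', degree=3)  # polynomial kernel of degree 3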
You can construct an SVM model using the SVC class.
# SVM Classification
import pandas
from sklearn import model_selection
from sklearn.svm import SVC
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = SVC()
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
Running the example prints the mean estimated accuracy.
0.651025290499
Summary
In this post, you discovered 6 machine learning algorithms that you can use to spot-check your classification problem in Python using scikit-learn.
Specifically, you learned how to spot-check:
2 Linear Machine Learning Algorithms
- Logistic Regression
- Linear Discriminant Analysis
4 Nonlinear Machine Learning Algorithms
- K-Nearest Neighbors
- Naive Bayes
- Classification and Regression Trees
- Support Vector Machines
Do you have any questions about spot checking machine learning algorithms or about this post? Ask your questions in the comments section below and I will do my best to answer them.
I am pretty good at missing things while reading documentation, so I obviously missed that you could do this with sklearn.
This was awesome. Thank you
You’re welcome vachar.
How do I use random forest for prediction? I am doing my final year project with machine learning and using random forest, but I am facing problems with it. Can you please suggest a way?
Yes, fit the model on all of your training data and call y = model.predict(X) to predict on new input data.
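A minimal sketch of that answer on the same Pima Indians dataset (here the first five rows merely stand in for new, unseen data):

import pandas
from sklearn.ensemble import RandomForestClassifier

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = pandas.read_csv(url, names=names).values
X, Y = array[:, 0:8], array[:, 8]

# fit the model on all of the training data
model = RandomForestClassifier()
model.fit(X, Y)
# predict for "new" input data (the first 5 rows are placeholders here)
print(model.predict(X[:5]))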
While building a model using spot-check algorithms, you are using cross-validation as well, isn't it?
And do we have to calculate the cross-validation score on X and Y, or on X_train and y_train?
After doing a train-test split, can we do the cross-validation as well?
You could perform cross-validation on the training set and hold the test set back for validation:
https://machinelearningmastery.com/difference-test-validation-datasets/
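A minimal sketch of that pattern, assuming the same dataset and a logistic regression model purely for illustration:

import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = pandas.read_csv(url, names=names).values
X, Y = array[:, 0:8], array[:, 8]

# hold back a test set, then cross-validate on the training portion only
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, Y, test_size=0.2, random_state=7)
model = LogisticRegression()
scores = model_selection.cross_val_score(model, X_train, y_train, cv=10)
print('CV mean: %.3f' % scores.mean())

# finally, fit on the full training set and confirm on the held-back test set
model.fit(X_train, y_train)
print('Test: %.3f' % model.score(X_test, y_test))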
Thank you, Jason, for the response. So, if I do a train-test split, I then follow that up with cross-validation on X_train and Y_train.
And if I do not split the dataset, I perform the cross-validation on X and Y, isn't it?
Correct.
Thank you so much, Jason.
You’re welcome Jeff.
Hi Jason, I am attempting to run some of these models on my dataset, but I am receiving the error 'ValueError: could not convert string to float:' relating to my attributes with an object data type. Is there an easy way to solve this? Thanks.
Perhaps confirm that you have loaded your data as numeric, and if not, convert it to numeric.
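For example, one common approach is to encode string columns as integers before modeling (a sketch with a small made-up dataframe; your own column names will differ):

import pandas
from sklearn.preprocessing import LabelEncoder

# a small made-up dataframe with one object (string) column
df = pandas.DataFrame({'color': ['red', 'blue', 'red'], 'value': [1.0, 2.0, 3.0]})

# encode the string column as integers so models receive only numeric input
df['color'] = LabelEncoder().fit_transform(df['color'])
print(df.dtypes)

pandas.get_dummies() is an alternative when a one hot encoding is preferred.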
I've got more of a methodological question.
Going through social media, websites, job descriptions and the like, it seems that a Data Scientist is expected to understand which algorithm (or approach in general) can be applied to which problem. That is fair, but doesn't it differ a bit from the spot-checking method?
I mean, spot-checking is kind of like throwing everything at the problem and seeing what works. What about rationale and rigor? Is that rigor even relevant to business problems?
Thanks for your attention.
We cannot know which algorithm will be good or best for a given dataset without experimentation. It is intractable. The same goes for how to best configure an algorithm for a problem. Also intractable.
I think the job descriptions mean that you need to know what algorithms can be used for classification, regression, clustering, time series, etc.
I always get 0.0.
I tried all the code in this post, only changing the URL and names, and also X and Y to fit my data.
This is my data:
https://raw.githubusercontent.com/Yazeedot/yazeedot.github.io/master/assets/survey.csv
I recommend this process when working through new predictive modeling problems:
https://machinelearningmastery.com/start-here/#process
Hi, seed is not defined in the KNN example.
This is the seed for the pseudorandom number generator.
You can learn more about random numbers in Python here:
https://machinelearningmastery.com/introduction-to-random-number-generators-for-machine-learning/
Thanks for your email and course. I have learnt a lot in a short time. I liked the daily guides the most. Since there is a lot to read out there, your course made me confident in Python as well as in ML concepts.
Thanks! Well done on making it through the course.
I see you're only using accuracy to determine which subset of models to proceed with. Is this sufficient, or would you want to factor in the precision-recall curve or ROC curve? Also, is there any useful methodology for selecting the cut-off point for which models to proceed with, or do you simply go with the top 3 performers? Thanks!
You should use a measure that captures what is important to your project.
How can I oversample with SMOTE and spot check with the oversampled data?
I hope to cover the topic.
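In the meantime, a rough sketch using the third-party imbalanced-learn library (not covered in this post; assumed to be installed, e.g. via pip install imbalanced-learn):

import pandas
from imblearn.over_sampling import SMOTE
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = pandas.read_csv(url, names=names).values
X, Y = array[:, 0:8], array[:, 8]

# oversample the minority class (fit_resample in recent releases;
# very old releases named it fit_sample)
X_res, Y_res = SMOTE().fit_resample(X, Y)

# spot check on the balanced data; strictly, SMOTE should be applied
# inside each fold (e.g. via an imblearn Pipeline) to avoid leakage
scores = model_selection.cross_val_score(DecisionTreeClassifier(), X_res, Y_res, cv=10)
print(scores.mean())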
Hi Jason, thanks for your post, but I'm a little confused. I want to know the difference between:
Evaluate the Performance of Machine Learning Algorithms
Metrics To Evaluate Machine Learning Algorithms in Python
Spot-Check Classification Machine Learning Algorithms
I'm trying to build a random forest classifier. Should I use those methods together? And could I use a random forest classifier in any of them?
Thanks.
Perhaps try it and see?
Hi Jason and thanks for the code.
I have an error when running cross-validation with the LDA algorithm which I didn't face when running Logistic Regression.
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
I searched Stack Overflow for the above error and they recommend using X.todense().
But how can I apply this step, since we have a pipeline with OneHotEncoder and then the model?
Sorry to hear that, perhaps these tips will help:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
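For this particular error, one common workaround is to make the encoder produce dense output, or to add a densifying step to the pipeline. A sketch under those assumptions (note that the dense-output argument is named sparse in older scikit-learn releases and sparse_output from 1.2 onward):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# option 1: ask the encoder for dense output directly, e.g.
#   OneHotEncoder(sparse=False)         # scikit-learn < 1.2
#   OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2

# option 2: add a step that densifies whatever the encoder produces
densify = FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)
pipeline = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('densify', densify),
    ('model', LinearDiscriminantAnalysis()),
])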
It's amazing!! I was just a beginner and this helps a lot!!!
Thanks, I’m happy to hear that!