The Naive Bayes algorithm is simple and effective and should be one of the first methods you try on a classification problem.

In this tutorial you are going to learn about the Naive Bayes algorithm including how it works and how to implement it from scratch in Python.

**Update**: Check out the follow-up on tips for using the naive bayes algorithm titled: “Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm“.**Update March/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

## About Naive Bayes

The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.

Naive bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method.

The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we have a probability of a data instance belonging to that class.

To make a prediction we can calculate probabilities of the instance belonging to each class and select the class value with the highest probability.

Naive bases is often described using categorical data because it is easy to describe and calculate using ratios. A more useful version of the algorithm for our purposes supports numeric attributes and assumes the values of each numerical attribute are normally distributed (fall somewhere on a bell curve). Again, this is a strong assumption, but still gives robust results.

## Get your FREE Algorithms Mind Map

I've created a handy mind map of 60+ algorithms organized by type.

Download it, print it and use it.

Also get exclusive access to the machine learning algorithms email mini-course.

## Predict the Onset of Diabetes

The test problem we will use in this tutorial is the Pima Indians Diabetes problem.

This problem is comprised of 768 observations of medical details for Pima indians patents. The records describe instantaneous measurements taken from the patient such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).

This is a standard dataset that has been studied a lot in machine learning literature. A good prediction accuracy is 70%-76%.

Below is a sample from the *pima-indians.data.csv* file to get a sense of the data we will be working with (update: download from here).

1 2 3 4 5 |
6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1 |

## Naive Bayes Algorithm Tutorial

This tutorial is broken down into the following steps:

**Handle Data**: Load the data from CSV file and split it into training and test datasets.**Summarize Data**: summarize the properties in the training dataset so that we can calculate probabilities and make predictions.**Make a Prediction**: Use the summaries of the dataset to generate a single prediction.**Make Predictions**: Generate predictions given a test dataset and a summarized training dataset.**Evaluate Accuracy**: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.**Tie it Together**: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

### 1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.

We also need to convert the attributes that were loaded as strings into numbers that we can work with them. Below is the **loadCsv()** function for loading the Pima indians dataset.

1 2 3 4 5 6 7 |
import csv def loadCsv(filename): lines = csv.reader(open(filename, "rb")) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [float(x) for x in dataset[i]] return dataset |

We can test this function by loading the pima indians dataset and printing the number of data instances that were loaded.

1 2 3 |
filename = 'pima-indians-diabetes.data.csv' dataset = loadCsv(filename) print('Loaded data file {0} with {1} rows').format(filename, len(dataset)) |

Running this test, you should see something like:

1 |
Loaded data file pima-indians-diabetes.data.csv rows |

Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the data set randomly into train and datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).

Below is the **splitDataset()** function that will split a given dataset into a given split ratio.

1 2 3 4 5 6 7 8 9 |
import random def splitDataset(dataset, splitRatio): trainSize = int(len(dataset) * splitRatio) trainSet = [] copy = list(dataset) while len(trainSet) < trainSize: index = random.randrange(len(copy)) trainSet.append(copy.pop(index)) return [trainSet, copy] |

We can test this out by defining a mock dataset with 5 instances, split it into training and testing datasets and print them out to see which data instances ended up where.

1 2 3 4 |
dataset = [[1], [2], [3], [4], [5]] splitRatio = 0.67 train, test = splitDataset(dataset, splitRatio) print('Split {0} rows into train with {1} and test with {2}').format(len(dataset), train, test) |

Running this test, you should see something like:

1 |
Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]] |

### 2. Summarize Data

The naive bayes model is comprised of a summary of the data in the training dataset. This summary is then used when making predictions.

The summary of the training data collected involves the mean and the standard deviation for each attribute, by class value. For example, if there are two class values and 7 numerical attributes, then we need a mean and standard deviation for each attribute (7) and class value (2) combination, that is 14 attribute summaries.

These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.

We can break the preparation of this summary data down into the following sub-tasks:

- Separate Data By Class
- Calculate Mean
- Calculate Standard Deviation
- Summarize Dataset
- Summarize Attributes By Class

#### Separate Data By Class

The first task is to separate the training dataset instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class and sort the entire dataset of instances into the appropriate lists.

The **separateByClass()** function below does just this.

1 2 3 4 5 6 7 8 |
def separateByClass(dataset): separated = {} for i in range(len(dataset)): vector = dataset[i] if (vector[-1] not in separated): separated[vector[-1]] = [] separated[vector[-1]].append(vector) return separated |

You can see that the function assumes that the last attribute (-1) is the class value. The function returns a map of class values to lists of data instances.

We can test this function with some sample data, as follows:

1 2 3 |
dataset = [[1,20,1], [2,21,0], [3,22,1]] separated = separateByClass(dataset) print('Separated instances: {0}').format(separated) |

Running this test, you should see something like:

1 |
Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]} |

#### Calculate Mean

We need to calculate the mean of each attribute for a class value. The mean is the central middle or central tendency of the data, and we will use it as the middle of our gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the variation of spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences for each attribute value from the mean. Note we are using the N-1 method, which subtracts 1 from the number of attribute values when calculating the variance.

1 2 3 4 5 6 7 8 |
import math def mean(numbers): return sum(numbers)/float(len(numbers)) def stdev(numbers): avg = mean(numbers) variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1) return math.sqrt(variance) |

We can test this by taking the mean of the numbers from 1 to 5.

1 2 |
numbers = [1,2,3,4,5] print('Summary of {0}: mean={1}, stdev={2}').format(numbers, mean(numbers), stdev(numbers)) |

Running this test, you should see something like:

1 |
Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008 |

#### Summarize Dataset

Now we have the tools to summarize a dataset. For a given list of instances (for a class value) we can calculate the mean and the standard deviation for each attribute.

The zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for the attribute.

1 2 3 4 |
def summarize(dataset): summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)] del summaries[-1] return summaries |

We can test this **summarize()** function with some test data that shows markedly different mean and standard deviation values for the first and second data attributes.

1 2 3 |
dataset = [[1,20,0], [2,21,1], [3,22,0]] summary = summarize(dataset) print('Attribute summaries: {0}').format(summary) |

Running this test, you should see something like:

1 |
Attribute summaries: [(2.0, 1.0), (21.0, 1.0)] |

#### Summarize Attributes By Class

We can pull it all together by first separating our training dataset into instances grouped by class. Then calculate the summaries for each attribute.

1 2 3 4 5 6 |
def summarizeByClass(dataset): separated = separateByClass(dataset) summaries = {} for classValue, instances in separated.iteritems(): summaries[classValue] = summarize(instances) return summaries |

We can test this **summarizeByClass()** function with a small test dataset.

1 2 3 |
dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]] summary = summarizeByClass(dataset) print('Summary by class value: {0}').format(summary) |

Running this test, you should see something like:

1 2 3 |
Summary by class value: {0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)], 1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)]} |

### 3. Make Prediction

We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.

We can divide this part into the following tasks:

- Calculate Gaussian Probability Density Function
- Calculate Class Probabilities
- Make a Prediction
- Estimate Accuracy

#### Calculate Gaussian Probability Density Function

We can use a Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation for the attribute estimated from the training data.

Given that the attribute summaries where prepared for each attribute and class value, the result is the conditional probability of a given attribute value given a class value.

See the references for the details of this equation for the Gaussian probability density function. In summary we are plugging our known details into the Gaussian (attribute value, mean and standard deviation) and reading off the likelihood that our attribute value belongs to the class.

In the **calculateProbability()** function we calculate the exponent first, then calculate the main division. This lets us fit the equation nicely on two lines.

1 2 3 4 |
import math def calculateProbability(x, mean, stdev): exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2)))) return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent |

We can test this with some sample data, as follows.

1 2 3 4 5 |
x = 71.5 mean = 73 stdev = 6.2 probability = calculateProbability(x, mean, stdev) print('Probability of belonging to this class: {0}').format(probability) |

Running this test, you should see something like:

1 |
Probability of belonging to this class: 0.0624896575937 |

#### Calculate Class Probabilities

Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all of the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class.

We combine probabilities together by multiplying them. In the **calculateClassProbabilities()** below, the probability of a given data instance is calculated by multiplying together the attribute probabilities for each class. the result is a map of class values to probabilities.

1 2 3 4 5 6 7 8 9 |
def calculateClassProbabilities(summaries, inputVector): probabilities = {} for classValue, classSummaries in summaries.iteritems(): probabilities[classValue] = 1 for i in range(len(classSummaries)): mean, stdev = classSummaries[i] x = inputVector[i] probabilities[classValue] *= calculateProbability(x, mean, stdev) return probabilities |

We can test the **calculateClassProbabilities()** function.

1 2 3 4 |
summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]} inputVector = [1.1, '?'] probabilities = calculateClassProbabilities(summaries, inputVector) print('Probabilities for each class: {0}').format(probabilities) |

Running this test, you should see something like:

1 |
Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05} |

#### Make a Prediction

Now that can calculate the probability of a data instance belonging to each class value, we can look for the largest probability and return the associated class.

The **predict()** function belong does just that.

1 2 3 4 5 6 7 8 |
def predict(summaries, inputVector): probabilities = calculateClassProbabilities(summaries, inputVector) bestLabel, bestProb = None, -1 for classValue, probability in probabilities.iteritems(): if bestLabel is None or probability > bestProb: bestProb = probability bestLabel = classValue return bestLabel |

We can test the **predict()** function as follows:

1 2 3 4 |
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]} inputVector = [1.1, '?'] result = predict(summaries, inputVector) print('Prediction: {0}').format(result) |

Running this test, you should see something like:

1 |
Prediction: A |

### 4. Make Predictions

Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test dataset. The **getPredictions()** will do this and return a list of predictions for each test instance.

1 2 3 4 5 6 |
def getPredictions(summaries, testSet): predictions = [] for i in range(len(testSet)): result = predict(summaries, testSet[i]) predictions.append(result) return predictions |

We can test the **getPredictions()** function.

1 2 3 4 |
summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]} testSet = [[1.1, '?'], [19.1, '?']] predictions = getPredictions(summaries, testSet) print('Predictions: {0}').format(predictions) |

Running this test, you should see something like:

1 |
Predictions: ['A', 'B'] |

### 5. Get Accuracy

The predictions can be compared to the class values in the test dataset and a classification accuracy can be calculated as an accuracy ratio between 0& and 100%. The **getAccuracy()** will calculate this accuracy ratio.

1 2 3 4 5 6 |
def getAccuracy(testSet, predictions): correct = 0 for x in range(len(testSet)): if testSet[x][-1] == predictions[x]: correct += 1 return (correct/float(len(testSet))) * 100.0 |

We can test the **getAccuracy()** function using the sample code below.

1 2 3 4 |
testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']] predictions = ['a', 'a', 'a'] accuracy = getAccuracy(testSet, predictions) print('Accuracy: {0}').format(accuracy) |

Running this test, you should see something like:

1 |
Accuracy: 66.6666666667 |

### 6. Tie it Together

Finally, we need to tie it all together.

Below provides the full code listing for Naive Bayes implemented from scratch in Python.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 |
# Example of Naive Bayes implemented from Scratch in Python import csv import random import math def loadCsv(filename): lines = csv.reader(open(filename, "rb")) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [float(x) for x in dataset[i]] return dataset def splitDataset(dataset, splitRatio): trainSize = int(len(dataset) * splitRatio) trainSet = [] copy = list(dataset) while len(trainSet) < trainSize: index = random.randrange(len(copy)) trainSet.append(copy.pop(index)) return [trainSet, copy] def separateByClass(dataset): separated = {} for i in range(len(dataset)): vector = dataset[i] if (vector[-1] not in separated): separated[vector[-1]] = [] separated[vector[-1]].append(vector) return separated def mean(numbers): return sum(numbers)/float(len(numbers)) def stdev(numbers): avg = mean(numbers) variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1) return math.sqrt(variance) def summarize(dataset): summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)] del summaries[-1] return summaries def summarizeByClass(dataset): separated = separateByClass(dataset) summaries = {} for classValue, instances in separated.iteritems(): summaries[classValue] = summarize(instances) return summaries def calculateProbability(x, mean, stdev): exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2)))) return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent def calculateClassProbabilities(summaries, inputVector): probabilities = {} for classValue, classSummaries in summaries.iteritems(): probabilities[classValue] = 1 for i in range(len(classSummaries)): mean, stdev = classSummaries[i] x = inputVector[i] probabilities[classValue] *= calculateProbability(x, mean, stdev) return probabilities def predict(summaries, inputVector): probabilities = calculateClassProbabilities(summaries, inputVector) bestLabel, bestProb = None, -1 for classValue, probability in probabilities.iteritems(): if bestLabel is None or probability > bestProb: bestProb = probability bestLabel = classValue return bestLabel def getPredictions(summaries, testSet): predictions = [] for i in range(len(testSet)): result = predict(summaries, testSet[i]) predictions.append(result) return predictions def getAccuracy(testSet, predictions): correct = 0 for i in range(len(testSet)): if testSet[i][-1] == predictions[i]: correct += 1 return (correct/float(len(testSet))) * 100.0 def main(): filename = 'pima-indians-diabetes.data.csv' splitRatio = 0.67 dataset = loadCsv(filename) trainingSet, testSet = splitDataset(dataset, splitRatio) print('Split {0} rows into train={1} and test={2} rows').format(len(dataset), len(trainingSet), len(testSet)) # prepare model summaries = summarizeByClass(trainingSet) # test model predictions = getPredictions(summaries, testSet) accuracy = getAccuracy(testSet, predictions) print('Accuracy: {0}%').format(accuracy) main() |

Running the example provides output like the following:

1 2 |
Split 768 rows into train=514 and test=254 rows Accuracy: 76.3779527559% |

## Implementation Extensions

This section provides you with ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.

You have implemented your own version of Gaussian Naive Bayes in python from scratch.

You can extend the implementation further.

**Calculate Class Probabilities**: Update the example to summarize the probabilities of a data instance belonging to each class as a ratio. This can be calculated as the probability of a data instance belonging to one class, divided by the sum of the probabilities of the data instance belonging to each class. For example an instance had the probability of 0.02 for class A and 0.001 for class B, the likelihood of the instance belonging to class A is (0.02/(0.02+0.001))*100 which is about 95.23%.**Log Probabilities**: The conditional probabilities for each class given an attribute value are small. When they are multiplied together they result in very small values, which can lead to floating point underflow (numbers too small to represent in Python). A common fix for this is to combine the log of the probabilities together. Research and implement this improvement.**Nominal Attributes**: Update the implementation to support nominal attributes. This is much similar and the summary information you can collect for each attribute is the ratio of category values for each class. Dive into the references for more information.**Different Density Function**(*bernoulli*or*multinomial*): We have looked at Gaussian Naive Bayes, but you can also look at other distributions. Implement a different distribution such as multinomial, bernoulli or kernel naive bayes that make different assumptions about the distribution of attribute values and/or their relationship with the class value.

## Resources and Further Reading

This section will provide some resources that you can use to learn more about the Naive Bayes algorithm in terms of both theory of how and why it works and practical concerns for implementing it in code.

### Problem

More resources for learning about the problem of predicting the onset of diabetes.

- Pima Indians Diabetes Data Set: This page provides access to the dataset files, describes the attributes and lists papers that use the dataset.
- Dataset File: The dataset file.
- Dataset Summary: Description of the dataset attributes.
- Diabetes Dataset Results: The accuracy of many standard algorithms on this dataset.

### Code

This section links to open source implementations of Naive Bayes in popular machine learning libraries. Review these if you are considering implementing your own version of the method for operational use.

- Naive Bayes in Scikit-Learn: Implementation of naive bayes in the scikit-learn library.
- Naive Bayes documentation: Scikit-Learn documentation and sample code for Naive Bayes

### Books

You may have one or more books on applied machine learning. This section highlights the sections or chapters in common applied books on machine learning that refer to Naive Bayes.

- Applied Predictive Modeling, page 353
- Data Mining: Practical Machine Learning Tools and Techniques, page 94
- Machine Learning for Hackers, page 78
- An Introduction to Statistical Learning: with Applications in R, page 138
- Machine Learning: An Algorithmic Perspective, page 171
- Machine Learning in Action, page 61 (Chapter 4)
- Machine Learning, page 177 (chapter 6)

## Next Step

Take action.

Follow the tutorial and implement Naive Bayes from scratch. Adapt the example to another problem. Follow the extensions and improve upon the implementation.

Leave a comment and share your experiences.

**Update**: Check out the follow-up on tips for using the naive bayes algorithm titled: “Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm“

Statistical methods should be developed from scratch because of misunderstandings. Thank you.

Jason … Thank you so much… you are too good.

This is a wonderful article. Your blog is one of those blogs that I visit everyday. Thanks for sharing this stuff. I had a question about the programming language that should be used for building these algorithms from scratch. I know that Python is widely used because it’s easy to write code by importing useful libraries that are already available. Nevertheless, I am a C++ guy. Although I am a beginner in practical ML, I have tried to write efficient codes before I started learning and implementing ML. Now I am aware of the complexities involved in coding if you’re using C++: more coding is to be done than what is required in Python. Considering that, what language is your preference and under what situations? I know that it’s lame to ask about preferences of programming language as it is essentially a personal choice. But still I’d like you to share your take on this. Also try to share the trade-offs while choosing these programming languages.

Thank you.

Thanks Anurag

I am getting error while I try to implement this in my own dataset.

for classValue, classSummaries in summaries.iteritems():

AttributeError: ‘list’ object has no attribute ‘iteritems’

When I try to run it,with your csv file,it says

ataset = list(lines)

Error: iterator should return strings, not bytes (did you open the file in text mode?)

What to do?

The code was written for Python 2.7. Perhaps you’re using Python 3?

i m using python 3.6 do you have the updated code?

The version of naive bayes in the book supports Py2.7 and Py3.

https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/

use items instead of iteritems. when you write erorro message to google, you can find lots of resolve like stackoverflow.

Hi Jason, found your website and read it in one day. Thank you, it really helped me to understand ML and what to do.

I did the 2 examples here and I think I will take a look at scikit-learn now.

I have a personal project that I want to use ML, and I’ll keep you posted on the progress.

One small note on this post, is on the “1. Handle data” you refer to iris.data from previous post.

Thank you so much, example is really good to show how to do it. Please keep it coming.

Thanks for the kind words Alcides.

Fixed the reference to the iris dataset.

hi jason im not able to run the code and get the Output it says “No such file or directory: ‘pima-indians-diabetes.data.csv'”

Ensure you are running the code from the command line and that the data file and code file are in the same directory.

is thetre any data file available

Hi Jason, still one more note on your post, is on the “1. Handle data” the flower measures that you refer to iris.data.

Thanks, fixed!

Great Article. Learned a Lot. Thanks. Thanks.

thank u

You’re welcome!

Thanks for your nice article. I really appreciate the step by step instructions.

hello sir, plz tell me how to compare the data set using naive Bayes algorithm.

Why does the accuracy change every time you run this code?

when i tried running this code every time it gave me different accuracy percentage in the range from 70-78%

Why is it so?

Why is it not giving a constant accuracy percent?

As Splitting of dataset into testdata and traindata is done using a random function accuracy varies.

I’m extremely new to this concept so please help me with this query.

The algorithm splits the dataset into the same top 67% and bottom 33% every single time.

The test-set is same on every single run.

So even though we use a random function on the top 67%(training set) to randomly index them.

A calculation like

((4+2)/6) and ((2+4)/6) will yield same result every-time.How is this yielding different result?

Is this something to do with the order of calculation in the Gaussian probability density function?

well.. the math calculations would come under to the view if you had the same sample taken again and again..

but the point here is.. we are taking the random data itself.. like.. the variables will not be same in every row right.. so if u change the rows randomly… at the end the accuracy will change… taken that into consideration.. you can take average of accuracy for each run to get what is the “UN-fluctuated accuracy” just in case if you want…

hope it helps..

Hey nice article – one question – why do you use the N-1 in the STD Deviation Process?

Thanks!

N-1 is the sample (corrected or unbiased) standard deviation. See this wiki page.

Hey! Thanks a ton! This was very useful.

It would be great if you give an idea on how other metrics like precision and recall can be calculated.

Thanks!

Sir

When I am running the same code in IDLE (python 2.7) – the code is working fine, but when I run the same code in eclipse. the error coming is:

1) warning – unused variable dataset

2) undefined variable dataset in for loop

Why this difference.

Great post, Jason.

For a MBA/LLM, it makes naive bayes very easy to understand and to implement in legal coding. Looking forward to read more. Best, Melvin

Jason,

Great example. Thanks! One nit: “calculateProbability” is not a good name for a function which actually calculates Gaussian probability density – pdf value may be greater than 1.

Cheers,

-Igor

Good point, thanks!

Hi Jason,

Fantastic post. I really learnt a lot. However I do have a question? Why don’ t you use the P(y) value in your calculateClassProbabilities() ?

If I understood the model correctly, everything is based on the bayes theorem :

P(y|x1….xn) = P(x1…..xn|y) * P(y) / P(x1……xn)

P(x1……xn) will be a constant so we can get rid of it.

Your post explain very well how to calculate P(x1……xn|y) (assumption made that x1…..xn are all independent we then have

P(x1……xn|y) = P(x1|y) * …. P(xn|y) )

How about p(y) ? I assume that we should calculate the frequency of the observation y in the training set and then multiply it to probabilities[classValue] so that we have :

P(y|x1…..xn) = frequency(classValue) * probabilities[classValue]

Otherwise let’ s assume that in a training set of 500 lines, we have two class 0 and 1 but observed 100 times 0 et 400 times 1. If we do not compute the frequency, then the probability may be biased, right ? Did I misunderstand something ? Hopefully my post is clear. I really hope that you will reply because I am a bit confused.

Thanks

Alex

I have the same question – why is multiplying by p(y) is omitted?

No Answer yet – no one on internet has answer to this.

Just don’t want to accept answers without understanding it.

yeah，I have the same question too, maybe the P(y) is nessary ,but why the accuracy is not so low when P(y) is missing? is it proving that bayes model is powerful?

hi,

I believe this is because P(y) = 1 as classes are already segregated before calculating P(x1…xn|Y).

Can experts comment on this please?

There is huge bug in this implementation;

First of all the implementation using GaussianNB gives totally a different answer.

Why is no one is replying even after 2 months of this.

My concern is, there are so many more bad bayesians in a wrong concept.

My lead read this article and now he thinks I am wrong

At least the parameters are correct – something wrong with calculating probs.

def SplitXy(Xy):

Xy10=Xy[0:8]

Xy10 = Xy;

#print Xy10

#print “========”

zXy10=list(zip(*Xy10))

y= zXy10[-1]

del zXy10[-1]

z1=zip(*zXy10)

X=[list(t) for t in z1]

return X,y

from sklearn.naive_bayes import GaussianNB

X,y = SplitXy(trainingSet)

Xt,yt = SplitXy(testSet)

model = GaussianNB()

model.fit(X, y)

### Compare the models built by Python

print (“Class: 0”)

for i,j in enumerate(model.theta_[0]):

print (“({:8.2f} {:9.2f} {:7.2f} )”.format(j, model.sigma_[0][i], sqrt(model.sigma_[0][i])) , end=””)

print (“==> “, summaries[0][i])

print (“Class: 1”)

for i,j in enumerate(model.theta_[1]):

print (“({:8.2f} {:9.2f} {:7.2f} )”.format(j, model.sigma_[1][i], sqrt(model.sigma_[1][i])) , end=””)

print (“==> “, summaries[1][i])

”’

Class: 0

( 3.18 9.06 3.01 )==> (3.1766467065868262, 3.0147673799630748)

( 109.12 699.16 26.44 )==> (109.11976047904191, 26.481293163857107)

( 68.71 286.46 16.93 )==> (68.712574850299404, 16.950414098038465)

( 19.74 228.74 15.12 )==> (19.742514970059879, 15.146913806453629)

( 68.64 10763.69 103.75 )==> (68.640718562874255, 103.90387227315443)

( 30.71 58.05 7.62 )==> (30.710778443113771, 7.630215185470916)

( 0.42 0.09 0.29 )==> (0.42285928143712581, 0.29409299864249266)

( 30.66 118.36 10.88 )==> (30.658682634730539, 10.895778423248444)

Class: 1

( 4.76 12.44 3.53 )==> (4.7611111111111111, 3.5365037952376928)

( 139.17 1064.54 32.63 )==> (139.17222222222222, 32.71833930500929)

( 69.27 525.24 22.92 )==> (69.272222222222226, 22.98209907114023)

( 22.64 309.59 17.60 )==> (22.638888888888889, 17.644143437447358)

( 101.13 20409.91 142.86 )==> (101.12777777777778, 143.2617649699204)

( 34.99 57.18 7.56 )==> (34.99388888888889, 7.5825893182809425)

( 0.54 0.14 0.37 )==> (0.53544444444444439, 0.3702077209795522)

( 36.73 112.86 10.62 )==> (36.727777777777774, 10.653417924304598)

”’

Hello,

Thanks for this article , it is very helpful . I just have a remark about the probabilty that you are calculating which is P(x|Ck) and then you make predictions, the result will be biased since you don’t multiply by P(Ck) , P(x) can be omitted since it’s only a normalisation constant.

Thanks a lot for this tutorial, Jason.

I have a quick question if you can help.

In the

`separateByClass()`

definition, I could not understand how`vector[-1]`

is a right usage when`vector`

is an`int`

type object.If I try the same commands one by one outside the function, the line of code with

`vector[-1]`

obviously throws a`TypeError: 'int' object has no attribute '__getitem__'`

.Then how is it working inside the function?

I am sorry for my ignorance. I am new to python. Thank you.

Hello Jason! I just wanted to leave a message to say thank you for the website. I am preparing for a job in this field and it has helped me so much. Keep up the amazing work!! 🙂

You’re welcome! Thanks for leaving such a kind comment, you really made my day 🙂

Hi Jason,

Very easy to follow your classifier. I try it and works well on your data, but is important to note that it works just on numerical databases, so maybe one have to transform your data from categorical to numerical format.

Another thing, when I transformed one database, sometimes the algorithm find division by zero error, although I avoided to use that number on features and classes.

Any suggestion Jason?

Thanks, Jaime

Thanks

It is by far the best material I’ve found , please continue helping the community!

Thanks eduardo!

Hello.

This is all well explained, and depicts well the steps of machine learning. But the way you calculate your P(y|X) here is false, and may lead to unwanted error.

Here, in theory, using the Bayes law, we know that : P(y|X) = P(y).P(X|y)/P(X). As we want to maximize P(y|X) with a given X, we can ignore P(X) and pick the result for the maximized value of P(y).P(X|y)

2 points remain inconsistent :

– First, you pick a gaussian distribution to estimate P(X|y). But here, you calculateProbability calculates the DENSITY of the function to the specific points X, y, with associated mean and deviation, and not the actual probability.

– The second point is that you don’t take into consideration the calculation of P(y) to estimate P(y|X). Your model (with the correct probability calculation) may work only if all samples have same amount in every value of y (considering y is discret), or if you are lucky enough.

Anyway, despite those mathematical issue, this is a good work, and a god introduction to machine learning.

Thanks Jason for all this great material. One thing that i adore from you is the intellectual honesty, the spirit of collaboration and the parsimony.

In my opinion you are one of the best didactics exponents in the ML.

Thanks to Thibroll too. But i would like to have a real example of the problem in R, python or any other language.

Regards,

Emmanuel:.

Hi Jason,

I have trying to get started with machine learning and your article has given me the much needed first push towards that. Thank you for your efforts! 🙂

i need this code in java.. please help me//

I am working with this code – tweaking it here or there – have found it very helpful as I implement a NB from scratch. I am trying to take the next step and add in categorical data. Any suggestions on where I can head to get ideas for how to add this? Or any particular functions/methods in Python you can recommend? I’ve brought in all the attributes and split them into two datasets for continuous vs. categorical so that I can work on them separately before bringing their probabilities back together. I’ve got the categorical in the same dictionary where the key is the class and the values are lists of attributes for each instance. I’m not sure how to go through the values to count frequencies and then how to store this back up so that I have the attribute values along with their frequencies/probabilities. A dictionary within a dictionary? Should I be going in another direction and not using a similar format?

Thank you Jason, this tutorial is helping me with my implementation of NB algorithm for my PhD Dissertation. Very elaborate.

Hi! thank you! Have you tried to do the same for the textual datasets, for example 20Newsgroups http://qwone.com/~jason/20Newsgroups/ ? Would appreciate some hints or ideas )

Great article, but as others pointed out there are some mathematical mistakes like using the probability density function for single value probabilities.

Thank you for this amazing article!! I implemented the same for wine and MNIST data set and these tutorials helped me so much!! 🙂

I got an error with the first print statement, because your parenthesis are closing the call to print (which returns None) before you’re calling format, so instead of

print(‘Split {0} rows into train with {1} and test with {2}’).format(len(dataset), train, test)

it should be

print(‘Split {0} rows into train with {1} and test with {2}’.format(len(dataset), train, test))

Anyway, thanks for this tutorial, it was really useful, cheers!

Sincere gratitude for this most excellent site. Yes, I never learn until I write code for the algorithm. It is such an important exercise, to get concepts embedded into one’s brain. Brilliant effort, truly !

Just to test the algorithm, i change the class of few of the data to something else i.e 3 or 4, (last digit in a line) and i get divide by zero error while calculating the variance. I am not sure why. does it mean that this particular program works only for 2 classess? cant see anything which restricts it to that.

Hi, I’m a student in Japan.

It seems to me that you are calculating p(X1|Ck)*p(X2|Ck)*…*p(Xm|Ck) and choosing Ck such that this value would be maximum.

However, when I looked in the Wikipedia, you are supposed to calculate p(X1|Ck)*p(X2|Ck)*…*p(Xm|Ck)*p(Ck).

I don’t understand when you calculated p(Ck).

Would you tell me about it?

Had the same thought, where’s the prior calculated?

Its seems it is a bug of this implementation.

I looked for the implementation of scikitlearn and they seem to have the right formula,

reference:

https://github.com/scikit-learn/scikit-learn/blob/e41c4d5e3944083328fd69aeacb590cbb78484da/sklearn/naive_bayes.py#L432

You can add the prior easily. I left it out because it was a constant for this dataset.

This is the same question as Alex Ubot above.

Calculating the parameters are correct.

but prediction implementation is incorrect.

Unfortunately this article comes up high and everyone is learning incorrect way of doing things I think

Really nice tutorial. Can you post a detailed implementation of RandomForest as well ? It will be very helpful for us if you do so.

Thanks!!

Great idea Swapnil.

You may find this post useful:

http://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/

thanks Jason…very nice tutorial.

Hi,

I was interested in this Naive Bayes example and downloaded the .csv data and the code to process it.

However, when I try to run it in Pycharm IDE using Python 3.5 I get no end of run-time errors.

Has anyone else run the code successfully? And if so, what IDE/environment did they use?

Thanks

Gary

Hi Gary,

You might want to run it using Python 2.7.

Hi,

Thanks for the excellent tutorial. I’ve attempted to implement the same in Go.

Here is a link for anyone that’s interested interested.

https://github.com/sudsred/gBay

This is AWESOME!!! Thank you Jason.

Where can I find more of this?

That can be implemented in any language because there’re no special libraries involved.

Hello,

there is some errors in “def splitDataset”

in machine learning algorithm , split a dataset into trainning and testing must be done without repetition (duplication) , so the index = random.randrange(len(copy)) generate duplicate data

for example ” index = 0 192 1 2 0 14 34 56 1 ………

the spliting method must be done without duplication of data.

This is a highly informative and detailed explained article. Although I think that this is suitable for Python 2.x versions for 3.x, we don’t have ‘iteritems’ function in a dict object, we currently have ‘items’ in dict object. Secondly, format function is called on lot of print functions, which should have been on string in the print funciton but it has been somehow called on print function, which throws an error, can you please look into it.

hey Jason

thanks for such a great tutorial im newbie to the concept and want to try naive bayes approach on movie-review on the review of a single movie that i have collected in a text file

can you please provide some hint on the topic how to load my file and perform positve or negative review on it

You might find some benefit in this tutorial as a template:

http://machinelearningmastery.com/machine-learning-in-python-step-by-step/

Would you please help me how i can implement naive bayes to predict student performance using their marks and feedback

I’m sorry, I am happy to answer your questions, but I cannot help you with your project. I just don’t have the capacity.

Hey Jason,

Thanks a lot for such a nice article, helped a lot in understanding the implementations,

i have a problem while running the script.

I get the below error

”

if (vector[-1] not in separated):

IndexError: list index out of range

”

can you please help me in getting it right?

Thanks Vinay.

Check that the data was loaded successfully. Perhaps there are empty lines or columns in your loaded data?

Hi Jason,

Thank you for the wonderful article. U have used the ‘?'(testSet = [[1.1, ‘?’], [19.1, ‘?’]]) in the test set. can u please tell me what it specifies

please send me a code in text classification using naive bayes classifier in python . the data set classifies +ve,-ve or neutral

Hi jeni, sorry I don’t have such an example prepared.

I am a newbie to ML and I found your website today. It is one of the greatest ML resources available on the Internet. I bookmarked it and thanks for everything Jason and I will visit your website everyday going forward.

Thanks, I’m glad you like it.

def predict(summaries, inputVector):

probabilities = calculateClassProbabilities(summaries, inputVector)

bestLabel, bestProb = None, -1

for classValue, probability in probabilities.iteritems():

if bestLabel is None or probability > bestProb:

bestProb = probability

bestLabel = classValue

return bestLabel

why is the prediction different for these

summaries = {‘A’ : [(1, 0.5)], ‘B’: [(20, 5.0)]} –predicts A

summaries = {‘0’ : [(1, 0.5)], ‘1’: [(20, 5.0)]} — predicts 0

summaries = {0 : [(1, 0.5)], 1: [(20, 5.0)]} — predicts 1

Hi, can someone please explain the code snippet below:

def separateByClass(dataset):

separated = {}

for i in range(len(dataset)):

vector = dataset[i]

if (vector[-1] not in separated):

separated[vector[-1]] = []

separated[vector[-1]].append(vector)

return separated

What do curly brackets mean in separated = {}?

vector[-1] ?

Massive thanks!

Curly brackets are a set or dictionary in Python, you can learn more about sets in Python here:

https://docs.python.org/3/tutorial/datastructures.html#sets

I am trying to create an Android app which works as follows:

1) On opening the App, the user types a data in a textbox & clicks on search

2) The app then searches about the entered data via internet and returns some answer (Using machine learning algorithms)

I have a dataset of around 17000 things.

Can you suggest the approach? Python/Java/etc…? Which technology to use for implementing machine learning algorithm & for connecting to dataset? How to include the dataset so that android app size is not increased?

Basically, I am trying to implement an app described in a research paper.

I can implement machine learning(ML) algorithms in Python on my laptop for simple ML examples. But, I want to develop an Android app in which the data entered by user is checked from a web site and then from a “data set (using ML)” and result is displayed in app based on both the comparisons. The problem is that the data is of 40 MB & how to reflect the ML results from laptop to android app?? By the way, the dataset is also available online. Shall I need a server? Or, can I use localhost using WAMP server?

Which python server should I use? I would also need to check the data entered from a live website. Can I connect my Android app to live server and localhost simultaneously? Is such a scenario obvious for my app? What do you suggest? Is Anaconda software sufficient?

Sorry I cannot make any good suggestions, I think you need to talk to some app engineers, not ML people.

Hello Jason,

lines = csv.reader(open(filename, “rb”))

IOError: [Errno 2] No such file or directory: ‘pima-indians-diabetes.data.csv’

I have the csv file downloaded and its in the same folder as my code.

What should I do about this?

Hi Roy,

Confirm that the file name in your directory exactly matches the expectation of the script.

Confirm you are running from the command line and both the script and the data file are in the same directory and you are running from that directory.

If using Python 3, consider changing the ‘rb’ to ‘rt’ (text instead of binary).

Does that help?

Hi Jason. Great Tutorial! But, why did you leave the P(Y) in calculateClassProbability() ? The prediction produces in my machine is fine… But some people up there have mentioned it too that what you actually calculate is probability density function. And you don’t even answer their question.

Hi Jason,

Can you please help me fixing below error, The split is working but accuracy giving error

Split 769 rows into train=515 and test=254 rows

Traceback (most recent call last):

File “indian.py”, line 100, in

main()

File “indian.py”, line 95, in main

summaries = summarizeByClass(trainingSet)

File “indian.py”, line 45, in summarizeByClass

separated = separateByClass(dataset)

File “indian.py”, line 26, in separateByClass

if (vector[-1] not in separated):

IndexError: list index out of range

Class wise selection of training and testing data

For Example

In Iris Dataset : Species Column we have classes called Setosa, versicolor and virginica

I want to select 80% of data from each class values.

Advance thanks

Shankru

shankar286@gmail.Com

You can take a random sample or use a stratified sample to ensure the same mixture of classes in train and test sets.

error in Naive Bayes code

IndexError:list index out of range

hai guys

i am velmurugan iam studying annauniversity tindivanam

i have a code for summarization for english description in java

Hi Jason,

This example works. really good for Naive Bayes, but I was wondering what the approach would be like for joint probability distributions. Given a dataset, how to construct a bayesian network in Python or R?

Hello Jason,

Joelon here. I am new to python and machine learning. I keep getting a run-time error after compiling the above script. Is it possible I send you screenshots of the error so we walk through step by step?

What problem are you getting exactly?

Hello Jason !

Thank you for this tutorial.

I have a question: what if our x to predict is a vector? How can we calculate the probability to be in a class (in the function calculateProbability for example) ?

Thank you

Not sure I understand Bill. Perhaps you can give an example?

HI Jason,

Thanks for such a wonderful article. Your efforts are priceless. One quick question about handling cases with single value probabilities. Which part of code requires any smoothening.

Sorry, I’m not sure I understand your question. Perhaps you can restate it?

Hello Jason,

First of all I thank you very much for such a nice tutorial.

I have a quick question for you if you could find some of your precious time to answer it.

Question: Why is that summarizeByClass(dataset) works only with a particular pattern of the dataset like the dataset in your example and does not work with the different pattern like my example dataset = [[2,3,1], [9,7,3], [12,9,0],[29,0,0]]

I guess it has to work for all the different possible datasets.

Thanks,

Mohammed Ehteramuddin

It should work with any dataset as long as the last variable is the class variable.

Oh! you mean the last variable of the dataset (input) can not be other that the two values that we desire to classify the data into, in our case it should either be 0 or 1.

Thanks you very much.

Yes.

What is

`correct`

variable refers in getAccuracy function? Can you elaborate it more?Sorry that was wrong question.

Edit:

Ideally the Gaussian Naive Bayes has lambda (threshold) value to set boundary. I was wondering which part of this code include the threshold?

I do not include threshold.

Can you provide an extension to the data. I can not downloand it for some reason. Thank you!

Here is a direct link to the data file:

https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

Sometimes the UCI ML repo will go down for a few hours.

I trying running the program but it keeps saying

File “nb.py”, line 102, in

main()

File “nb.py”, line 92, in main

dataset = loadCsv(filename)

File “nb.py”, line 11, in loadCsv

dataset[i] = [float(x) for x in dataset[i]]

ValueError: invalid literal for float(): 1,85,66,29,0,26.6,0.351,31,0

Hi Jason…!

Thank You so much for coding the problem in a very clear way. As I am new in machine learning and also I have not used Python before therefore I feel some difficulties in modifying it. I want to include the serial number of in data set and then show which testing example (e.g. example # 110) got how much probability form class 0 and for 1.

You might be better off using Weka:

http://machinelearningmastery.com/start-here/#weka

Hi Jason,

I encountered a problem at the beginning. After loading the file and running this test:

filename = ‘pima-indians-diabetes.data.csv’

dataset = loadCsv(filename)

print(‘Loaded data file {0} with {1} rows’).format(filename, len(dataset))

I get the Error: “iterator should return strings, not bytes (did you open the file in text mode?)”

btw i am using python 3.6

thank you for the help

Change the loading of the file from binary to text (e.g. ‘rt’)

This code is not working with Python 3.16 :S

3.6

Thanks Marcus.

Is there a way you can try to fix and make it work with 3.6 maybe Jason?

Jason can you explain what this code does?

Yes, see here:

http://machinelearningmastery.com/naive-bayes-for-machine-learning/

I run the code in python 3.6.3 and here are the corrections:

—————————————————————————-

1. change the “rb” to “rt”

2. print(‘Split {0} rows into train={1} and test={2} rows’.format(len(dataset), len(trainingSet), len(testSet)))

3. change for each .iteritems() to .items()

4. print(‘Accuracy: {0}%’.format(accuracy))

Here are some of the results:

Split 768 rows into train=514 and test=254 rows

Accuracy: 71.25984251968504%

Split 768 rows into train=514 and test=254 rows

Accuracy: 76.77165354330708%

Thanks for the tips.

The code in this tutorial is riddled with error after error… The string output formatting isn’t even done right for gods sakes!

This was written in 2014, the python documentation has changed drastically as the versions have been updated

Hey Jason, I really enjoyed the read as it was very thorough and even for a beginner programmer like myself understandable overall. However, like Marcus asked, is it at all possible for you to rewrite or point out how to edit the parts that have new syntax in python 3?

Also, this version utilized the Gaussian probability density function, how would we use the other ones, would the math be different or the code?

Hi Jason, thank you for this post it’s super informative, I just started college and this is really easy to follow! I was wondering how this could be done with a mixture of binary and categorical data. For example, if I wanted to create a model to determine whether or not a car would break down and one category had a list of names of 10 car parts while another category simply asked if the car overheated (yes or no). Thanks again!

Absolutely.

Would the categorical data have to be preprocessed?

No. Naive Bayes is much simpler on categorical data:

http://machinelearningmastery.com/naive-bayes-tutorial-for-machine-learning/

Hi! Thanks for this helpful article. I had a quick question: in your calculateProbability() function, should the denominator be multiplied by the variance instead of the standard deviation?

i.e. should

`return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent`

instead be:

`return (1 / (math.sqrt(2 * math.pi) * math.pow(stdev, 2))) * exponent`

Hi Jason. I visited ur site several times. This is really helpful. I look for SVM implementation in python from scratch like the way you implemented Naive Bayes here. Can u provide me SVM code??

Yes, I provide it in my book here:

https://machinelearningmastery.com/machine-learning-algorithms-from-scratch/

Thank you so much Mr. Brownlee. I would like to ask your permission if I can show my students your implementations? They are very easy to understand and follow. Thank you very much again 🙂

No problem as long as you credit the source and link back to my website.

thank you very much 🙂

Very good basic tutorial. Thank you.

Thanks.

Hi Jason,

I would have exported my model using joblib and as I have converted the categorical data to numeric in the training data-set to develop the model and now I have no clue on how to convert the a new categorical data to predict using the trained model.

New data must be prepared using the same methods as data used to train the model.

Sometimes this might mean keeping objects/coefficients used to scale or encode input data along with the model.

where/what is learning in this code. I think it is just naive bayes classification. Please specify the learning.

Good question, the probabilities are “learned” from data.

Great example. You are doing a great work thanks. Please am working on this example but i am confused on how to determine attribute relevance analysis. That is how do i determine which attribute is (will) be relevant for my model.

Perhaps you could look at the independent probabilities for each variable?

thanks very much. Grateful

Can you please send me the code for pedestrian detection using HOG and NAIVE BAYES?

Sorry, I cannot.

what does this block of code do

while len(trainSet) < trainSize:

index = random.randrange(len(copy))

trainSet.append(copy.pop(index))

return [trainSet, copy]

Selects random rows from the dataset copy and adds them to the training set.

Jason:

I am very happy with this implementation! I used it as inspiration for an R counterpart. I am unclear about one thing. I understand the training set mean and sd are parameters used to evaluate the test set, but I don’t know why that works lol.

How does evaluating the GPDF with the training set data and the test set instance attributes “train” the model? I may be confusing myself by interpreting “train” too literally.

I think of train as repetitiously doing something multiple times with improvement from each iteration, and these iterations ultimately produce some catalyst for higher predictions. It seems that there is only one iteration of defining the training set’s mean and sd. Not sure if this question makes sense and I apologize if that is the case.

Any help is truly, genuinely appreciated!

Scott Bishop

Here, the training data provides the basis for estimating the probabilities used by the model.

Hi Jason,

Sometimes I am getting the “ZeroDivisionError: float division by zero” when I am running the program

is there a way to get the elements that fall under each class after the classification is done

Do you mean a confusion matrix:

https://machinelearningmastery.com/confusion-matrix-machine-learning/

I added this function:

There is probably a more elegant way to write this code, but I’m new to Python 🙂

The returning array lets you calculate all sorts of criteria, such as sensitivity, specifity, predictive value, likelihood ratio etc.

Whoops I realized I mixed something up. FP and FN have to be the other way round since the outer if-clause checks the true condition. I hope no one has copied that and got in trouble…

Anyways, looks like the confusion matrix lives up to its name.

Hello Jason,

first of all, thanks for this blog. I learned a lot both on python (which I am pretty new to) and also this specific classifier.

I tested this algorithm on a sample database with cholesterol, age and heart desease, and got better results than with a logistic regression.

However, since age is clearly not normally distributed, I am not sure if this result is even legit.

Could you explain how I can change the calculateProbability function to a different distribution?

Oh, and also: How can I add code tags to my replies so that it becomes more readable?

This code is for learning purposes, I would recommend using the built-in functions in scikit-learn for real projects:

https://machinelearningmastery.com/start-here/#python

Hi DR. Jason,

It is a very good post.

I did not see the K Fold Cross Validation in this post like I saw in your post of Neural Network from scratch. Does it mean that Naive Bayes does not need K Fold Cross Validation? Or does not work with K Fold CV?

It is because I am trying to use K Fold CV with Naive Bayes from scratch but I find it difficult since we need to split data by class to make some calculations, we find there two datasets if we have two class classification dataset (but we have one K Fold CV function).

I am facing serious difficulties to understand K Fold CV it seams that the implementation from scratch depends on the classifier we are using.

If you have some answer or tips to this approach (validation on Naive Bayes with K Fold CV – from scratch – ) please let me know.

Regards.

It is a good idea to use CV to evaluate algorithms including naive bayes.

If the variance is equal to 0, how to deal with？

If the variance is 0, then all data in your sample has the same value.

if ‘stdev’ is 0 , how to deal with it?

If stdev is 0 it means all values are the same and that feature should be removed.

How can I retrain my model, without training it from scratch again?

Like if I got some new label for instances, how to approach it then?

Many models can be updated with new data.

With naive bayes, you can keep track of the frequency and likelihood of obs for discrete values or the PDF/mass function for continuous values and update them with new data.

Very good Program, its working correctly, how to construct the bayesian network taking this pima diabetes csv file.

Thanks.

Hey can I ask why didn’t you use a MLE estimate for the prior?

MLE?

It’s a Navie-bayes-classifiation but insted of using Navies-bayes therom we are using Gaussian Probability Density Function Why ?????

To summarise a probability distribution.

why input vector is [1.1,”?”] ? and it works fine I try with [1.1] . Why did we choose the number 1.1 as the parameter for input vector?

It seems that your code is using GaussianNB without prior. The prior is obtained with MLE estimator of the training set. It is simply a constant multiplied by the computed likelihood probability. I tried both (with/without prior) and found that predicting with prior would give better results most of the time.

Thanks.

Jason, why do I get the error messate

AttributeError: ‘dict’ object has no attribute ‘iteritems’ when I am trying the run the

# Split the dataset by class values, returns a dictionary in the Naïve Bayes example in Chapter 12 of the Algorithms from Scratch in Python tutorial?

Thanks

Ken

Sounds like you are using Python 3 for Python 2.7 code.

Jason, I figured it out. In Py 3.X does not understand the .interitem function so when I changed that to Py 2.7 .item it worked fine. Version difference between 3.X and 2.7

Thanks

Glad to hear it.

Hi,

Thank you for the helpful post.

When i ran the code in python 3.6 i encountered the

“ZeroDivisionError: float division by zero” ERROR

any advise on this?

The code was written for Py2.7

thank you Jason for your awesome post.

I’m happy it helped.

I want to calculate the F1 Score. I can do this for binary classification problem. I am confused how to do it for multi class classification. I have calculated the confusion matrix for my data set. my data set contains three different classValue.

Kindly suggest.

The sklearn library may support this, more here:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Thanks for the awesome work. P.S external link to the weka for naive bayes shown 404.

Kind Regards

Thanks.

Hi Sir

Great example. You are doing a great great work thanks. kindly also upload similar post related to hidden markov model

Thanks for the suggestion.

Thanks for posting this nice algorithm explained. Nevertheless I struggled a lot until I found out that it is a Gaussian Naive Bayes version. I expected calculations of probabilities counting the prior, posterior, etc. It took me a while to figure it out. I have learnt a lot through it, though 🙂

Thanks.

How we fix this error in python 2.7

return sum(numbers) / (len(numbers))

TypeError: unsupported operand type(s) for +: ‘int’ and ‘str’

thank you for your awesome post…

Hi Jason. I am getting this error while running your code

File “C:/Users/Mustajab/Desktop/ML Assignment/Naive Bayes.py”, line 191, in loadCsv

dataset = list(lines)

Error: iterator should return strings, not bytes (did you open the file in text mode?)

How can I fix this error?

P(A/B) = (P(B/A)P(A)) / P(B))

This is the formula..

In RHS of the formula we use only numerator because denominators are same for every class and doen’t add any extra info as to determine which probability is bigger.

But I don’t thing the numerator is correctly implemented in the function.

Specifically P(A) needs to be multiplied with the product of the conditional probabilities of the individual features belonging to a particular class.

P(A) here is our class probability. That term is not multiplied to the final product inside the calculateClassProbabilities() function.

Something seems off here. Your inputs to the calculateProbability() function are x=71.5, mean=73.0, stdev=6.2. Some simple math will tell you that x is 0.24 standard deviations away from the mean. In a Gaussian distribution, you can’t get much closer than that. Yet the probability of belonging to that distribution is only 6%? Shouldn’t this be something more like 94%?

hello Jason, I try to run this code with my own file but I get “ValueError: could not convert string to float: wtdeer”. do you know How can I fix it ?

thank you so much

Perhaps try using the scikit-learn library as a starting point:

https://machinelearningmastery.com/start-here/#python

Nice tutorials Jason, however needs your attention for the typo in print statements, hope it will be fixed soon.

Thanks anyways once again for providing such a nice explanation!

Thanks.

File “C:/Users/user/Desktop/ss.py”, line 9, in

dataset = loadCsv(filename)

File “C:/Users/user/Desktop/ss.py”, line 4, in loadCsv

dataset = list(lines)

_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

solve this error

jason i have a question i want to do prediction heart disease and the result will be like this for example there are 9 heart disease heart failure heart stroke and more it will predicate the data and generate result like you have heart stroke disease so i have a question which classifier is best and

This is a common question that I answer here:

https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

in code program, can you show to me which one that P(H|E), (P(E│H), P(H) ), (P(E)) as we know that it is formula of naive bayes classifier. would you like to tell me? because I need that reference for show to my lecture. thank you

Jason,

I love your ‘building from scratch way’ of approaching machine learning algorithm, this is equally important as ‘understanding the algorithm’. Your implementations fulfilled the ‘building part’ which is sometimes understated in college classes.

Thanks.

Very thorough step by step guide and your blog is one of the best out there for coding in machine learning.Naive bayes is a very strong algorithm but i find that its best usage is usually with other classifiers and voting between them and get the winner for the final classifier

Great tip!

im getting an error

could not convert string to float: id

in the load Csv function

I believe the example requires Python 2. Perhaps confirm your version of Python?

Why does this model predict values all 1 or all 0?

The model predicts probabilities and class values.

Thanks for the in-depth tutorial.

I re-implemented the code, but was able to get mostly 60+% accuracy. The best so far was 70% and rather a fluke. Your 76.8% result seems a bit out of my reach. The train/test data sets are randomly selected so it’s hard to be exact. I am just wondering if 60+% accuracy is to be expected or I am doing something wrong.

My bad, ignore my post. Mistake in code. 76% is the average result.

Well done.

Double check that you copied all of the code exactly?

it is not working could u help me pla

when i execute the code it shows the following error, would you help me Please?

Split{0}rows into train={1} and test={2} rows

—————————————————————————

AttributeError Traceback (most recent call last)

in

98 accuracy = getAccuracy(testSet, predictions)

99 print(‘Accuracy: {0}%’).format(accuracy)

–> 100 main()

in main()

91 dataset = loadCsv(data)

92 trainingSet, testSet = splitDataset(dataset, splitRatio)

—> 93 print(‘Split{0}rows into train={1} and test={2} rows’).format(len(dataset), len(trainingSet), len(testSet))

94 # prepare model

95 summaries = summarizeByClass(trainingSet)

AttributeError: ‘NoneType’ object has no attribute ‘format’

Sorry to hear that, perhaps try the implementation provided by sklearn:

https://scikit-learn.org/stable/modules/naive_bayes.html

One question:

to calculate bayes function is used the (Prior probability * Density Function)/Total probability but in your algorithm you only calculate the Density Function and use it to make the predictions. Why? Im confuse.

thanks for listening.

I realized that the denominator can be omitted, but and about the prior probability? Shoudn’t I compute too?

Yes, in general it is a good idea.

I removed the prior because it was a constant in this case.

It was a constant because of the dataset?

Correct, even number of observations for each class, e.g. fixed prior.

Thank you for the explanation. 🙂

You’re welcome.

Hi, Great example. You are doing great work thanks.

I have a question, can you help me to the modification your code for calculating precision and the recall so.

Sorry, I cannot change the code for you.

You can access a range of metrics here:

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

ok, thank you so much.

I’m afraid your responses to others show you’ve fundamentally misunderstood the prior. You do need to include the prior in your calculations, because the prior is different for each class, and depends on your training data–it’s the fraction of cases that fall into that class, i.e 500/768 for an outcome of 0 and 268/700 for an outcome of 1, if we used the entire data set. Image a case where you had one feature variable and its normal distributions were identical for each class–you’d still need to account for the ratio between the different classes in when a prediction.

The only way you’d leave out the prior would be if each class had an equal number of data points in the training data, but the likelihood of getting 257 (=floor(768 * 0.67)/2) of each class in this instance is essentially zero.

It’s easy to check this is true–just fit scikit-learn’s GaussianNB on your training data and check its score on it and your test data. If you don’t include the prior for each class, your results won’t match.

Thanks Matthew.