Last Updated on August 15, 2020
Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.
In this post you will discover the Naive Bayes algorithm for classification. After reading this post, you will know:
- The representation used by naive Bayes that is actually stored when a model is written to a file.
- How a learned model can be used to make predictions.
- How you can learn a naive Bayes model from training data.
- How to best prepare your data for the naive Bayes algorithm.
- Where to go for more information on naive Bayes.
This post is written for developers and does not assume any background in statistics or probability, although knowing a little probability wouldn’t hurt.
Let’s get started.
Quick Introduction to Bayes’ Theorem
In machine learning we are often interested in selecting the best hypothesis (h) given data (d).
In a classification problem, our hypothesis (h) may be the class to assign for a new data instance (d).
One of the easiest ways of selecting the most probable hypothesis given the data that we have is to use our prior knowledge about the problem. Bayes’ Theorem provides a way that we can calculate the probability of a hypothesis given our prior knowledge.
Bayes’ Theorem is stated as:
P(h|d) = (P(d|h) * P(h)) / P(d)
- P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
- P(d|h) is the probability of data d given that the hypothesis h was true.
- P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
- P(d) is the probability of the data (regardless of the hypothesis).
You can see that we are interested in calculating the posterior probability P(h|d) from the prior probability P(h) together with P(d) and P(d|h).
After calculating the posterior probability for a number of different hypotheses, you can select the hypothesis with the highest probability. This is the most probable hypothesis and may formally be called the maximum a posteriori (MAP) hypothesis.
This can be written as:
MAP(h) = max(P(h|d))
MAP(h) = max((P(d|h) * P(h)) / P(d))
MAP(h) = max(P(d|h) * P(h))
The P(d) is a normalizing term which allows us to calculate the probability. We can drop it when we are interested in the most probable hypothesis as it is constant and only used to normalize.
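To see why P(d) can be dropped, here is a minimal sketch in Python. The hypothesis names and all of the probability values are made up for illustration, and the sketch assumes the three hypotheses are the only possibilities, so that P(d) is the sum of P(d|h) * P(h) over them.

```python
# Hypothetical likelihoods P(d|h) and priors P(h) for three hypotheses.
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.25}
prior = {"h1": 0.50, "h2": 0.20, "h3": 0.30}

# Unnormalized scores: P(d|h) * P(h)
scores = {h: likelihood[h] * prior[h] for h in prior}

# P(d) is the sum of the scores over all hypotheses.
p_d = sum(scores.values())

# Normalized posteriors: P(h|d) = P(d|h) * P(h) / P(d)
posterior = {h: s / p_d for h, s in scores.items()}

# The winning hypothesis is the same with or without dividing by P(d).
map_unnormalized = max(scores, key=scores.get)
map_normalized = max(posterior, key=posterior.get)
print(map_unnormalized, map_normalized)
```

Dividing every score by the same positive constant changes the values but not their order, which is why the normalizing term is only needed when you want actual probabilities.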
Back to classification, if we have an even (that is, equal) number of instances in each class in our training data, then the probability of each class (e.g. P(h)) will be equal. Again, this would be a constant term in our equation and we could drop it so that we end up with:
MAP(h) = max(P(d|h))
This is a useful exercise, because when reading up further on Naive Bayes you may see all of these forms of the theorem.
Naive Bayes Classifier
Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.
It is called naive Bayes or idiot Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to calculate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, and the probability is calculated as P(d1|h) * P(d2|h) and so on.
This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.
Representation Used By Naive Bayes Models
The representation for naive Bayes is probabilities.
A list of probabilities is stored to file for a learned naive Bayes model. This includes:
- Class Probabilities: The probabilities of each class in the training dataset.
- Conditional Probabilities: The conditional probabilities of each input value given each class value.
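As a rough sketch of what such a stored model might look like, the Python below holds both kinds of probabilities in a nested dictionary and serializes it with the standard json module. The attribute name, class names, key names and numbers are all hypothetical, and JSON is just one convenient choice of format, not something the algorithm requires.

```python
import json

# Hypothetical learned model; every name and number here is made up.
model = {
    "class_probabilities": {"go-out": 0.5, "stay-home": 0.5},
    "conditional_probabilities": {
        "weather": {
            "go-out": {"sunny": 0.8, "rainy": 0.2},
            "stay-home": {"sunny": 0.4, "rainy": 0.6},
        }
    },
}

# Serialize the model to a JSON string (this is what would be written to a file).
serialized = json.dumps(model, indent=2)

# Loading it back recovers the same probabilities.
restored = json.loads(serialized)
print(restored["conditional_probabilities"]["weather"]["go-out"]["sunny"])
```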
Learn a Naive Bayes Model From Data
Learning a naive Bayes model from your training data is fast.
Training is fast because only the probability of each class and the probability of each input (x) value given each class need to be calculated. No coefficients need to be fitted by optimization procedures.
Calculating Class Probabilities
The class probabilities are simply the frequency of instances that belong to each class divided by the total number of instances.
For example in a binary classification the probability of an instance belonging to class 1 would be calculated as:
P(class=1) = count(class=1) / (count(class=0) + count(class=1))
In the simplest case each class would have the probability of 0.5 or 50% for a binary classification problem with the same number of instances in each class.
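The counting above can be sketched in a few lines of Python. The list of labels is made up for illustration, and Counter is just one convenient way to do the counting.

```python
from collections import Counter

# Hypothetical class labels for ten training instances.
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]

counts = Counter(labels)
total = len(labels)

# P(class=c) = count(class=c) / total number of instances
class_probabilities = {c: n / total for c, n in counts.items()}
print(class_probabilities)
```

With five instances of each class in this made-up list, both classes come out at 0.5.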
Calculating Conditional Probabilities
The conditional probabilities are the frequency of each attribute value for a given class value divided by the frequency of instances with that class value.
For example, if a “weather” attribute had the values “sunny” and “rainy” and the class attribute had the class values “go-out” and “stay-home”, then the conditional probabilities of each weather value for each class value could be calculated as:
- P(weather=sunny|class=go-out) = count(instances with weather=sunny and class=go-out) / count(instances with class=go-out)
- P(weather=sunny|class=stay-home) = count(instances with weather=sunny and class=stay-home) / count(instances with class=stay-home)
- P(weather=rainy|class=go-out) = count(instances with weather=rainy and class=go-out) / count(instances with class=go-out)
- P(weather=rainy|class=stay-home) = count(instances with weather=rainy and class=stay-home) / count(instances with class=stay-home)
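The four calculations above can be sketched in Python as follows. The eight training instances are invented for illustration; only the counting logic matters.

```python
from collections import Counter

# Hypothetical training data: (weather, class) pairs.
data = [
    ("sunny", "go-out"),
    ("sunny", "go-out"),
    ("rainy", "go-out"),
    ("sunny", "go-out"),
    ("rainy", "stay-home"),
    ("rainy", "stay-home"),
    ("sunny", "stay-home"),
    ("rainy", "stay-home"),
]

class_counts = Counter(label for _, label in data)
pair_counts = Counter(data)

# P(weather=w | class=c) = count(weather=w and class=c) / count(class=c)
conditional = {
    (w, c): pair_counts[(w, c)] / class_counts[c]
    for w in ("sunny", "rainy")
    for c in class_counts
}
for (w, c), p in conditional.items():
    print(f"P(weather={w}|class={c}) = {p:.2f}")
```

Note that for each class the probabilities of the weather values sum to 1, because the division is by the count for that class rather than by the total number of instances.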
Make Predictions With a Naive Bayes Model
Given a naive Bayes model, you can make predictions for new data using Bayes’ Theorem.
MAP(h) = max(P(d|h) * P(h))
Using our example above, if we had a new instance with the weather of sunny, we can calculate:
go-out = P(weather=sunny|class=go-out) * P(class=go-out)
stay-home = P(weather=sunny|class=stay-home) * P(class=stay-home)
We can choose the class that has the largest calculated value. We can turn these values into probabilities by normalizing them as follows:
P(go-out|weather=sunny) = go-out / (go-out + stay-home)
P(stay-home|weather=sunny) = stay-home / (go-out + stay-home)
If we had more input variables we could extend the above example. For example, pretend we have a “car” attribute with the values “working” and “broken”. We can multiply this probability into the equation.
For example below is the calculation for the “go-out” class label with the addition of the car input variable set to “working”:
go-out = P(weather=sunny|class=go-out) * P(car=working|class=go-out) * P(class=go-out)
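Putting the pieces together, here is a minimal sketch of prediction with two categorical inputs. The probabilities are made up rather than learned from data, and the function name predict is just a label for this example.

```python
# Hypothetical learned probabilities; the numbers are made up for illustration.
class_prob = {"go-out": 0.5, "stay-home": 0.5}
cond_prob = {
    "weather": {
        "go-out": {"sunny": 0.8, "rainy": 0.2},
        "stay-home": {"sunny": 0.4, "rainy": 0.6},
    },
    "car": {
        "go-out": {"working": 0.8, "broken": 0.2},
        "stay-home": {"working": 0.2, "broken": 0.8},
    },
}

def predict(instance):
    """Return the most probable class and the normalized class probabilities."""
    scores = {}
    for c, prior in class_prob.items():
        score = prior  # start from P(class=c)
        for attribute, value in instance.items():
            score *= cond_prob[attribute][c][value]  # multiply in P(value|class=c)
        scores[c] = score
    total = sum(scores.values())
    probabilities = {c: s / total for c, s in scores.items()}
    return max(scores, key=scores.get), probabilities

label, probabilities = predict({"weather": "sunny", "car": "working"})
print(label, probabilities)
```

Adding another input variable only means adding another entry to cond_prob; the loop multiplies it into the score in the same way.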
Gaussian Naive Bayes
Naive Bayes can be extended to real-valued attributes, most commonly by assuming a Gaussian distribution.
This extension of naive Bayes is called Gaussian Naive Bayes. Other functions can be used to estimate the distribution of the data, but the Gaussian (or Normal distribution) is the easiest to work with because you only need to estimate the mean and the standard deviation from your training data.
Representation for Gaussian Naive Bayes
Above, we calculated the probabilities for input values for each class using a frequency. With real-valued inputs, we can calculate the mean and standard deviation of input values (x) for each class to summarize the distribution.
This means that in addition to the probabilities for each class, we must also store the mean and standard deviations for each input variable for each class.
Learn a Gaussian Naive Bayes Model From Data
This is as simple as calculating the mean and standard deviation values of each input variable (x) for each class value.
mean(x) = 1/n * sum(x)
Where n is the number of instances and x are the values for an input variable in your training data.
We can calculate the standard deviation using the following equation:
standard deviation(x) = sqrt(1/n * sum((xi-mean(x))^2))
This is the square root of the average squared difference of each value of x from the mean value of x, where n is the number of instances, sqrt() is the square root function, sum() is the sum function, xi is a specific value of the x variable for the i’th instance and mean(x) is described above, and ^2 is the square.
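As a small sketch of these two calculations in Python, the code below summarizes a single input variable separately for each class value. The class names and the numbers are made up for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def standard_deviation(values):
    # Square root of the average squared difference from the mean (divides by n).
    m = mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / len(values))

# Hypothetical real-valued inputs (x) grouped by class value.
x_by_class = {
    "go-out": [20.0, 22.0, 24.0],
    "stay-home": [10.0, 12.0, 14.0, 16.0],
}

# One (mean, standard deviation) pair per class for this input variable.
summaries = {c: (mean(xs), standard_deviation(xs)) for c, xs in x_by_class.items()}
print(summaries)
```

This version divides by n, which matches statistics.pstdev in the Python standard library. Many implementations divide by n-1 instead (statistics.stdev); the difference is small unless a class has very few instances.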
Make Predictions With a Gaussian Naive Bayes Model
Probabilities of new x values are calculated using the Gaussian Probability Density Function (PDF).
When making predictions these parameters can be plugged into the Gaussian PDF with a new input for the variable, and in return the Gaussian PDF will provide a density value that is used as an estimate of how likely that new input value is for that class.
pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x-mean)^2/(2*sd^2)))
Where pdf(x, mean, sd) is the Gaussian PDF, sqrt() is the square root, mean and sd are the mean and standard deviation calculated above, PI is the numerical constant, exp() raises the numerical constant e (Euler’s number) to the given power and x is the input value for the input variable.
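A minimal Python sketch of the Gaussian PDF is shown below. The function name gaussian_pdf is my own choice for this example; note that the difference (x - mean) is squared as a whole before dividing.

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, sd):
    """Gaussian probability density of x for the given mean and standard deviation."""
    exponent = exp(-((x - mean) ** 2) / (2 * sd ** 2))
    return (1 / (sqrt(2 * pi) * sd)) * exponent

# The density peaks at the mean and falls off symmetrically on either side.
print(gaussian_pdf(1.0, 1.0, 1.0))
print(gaussian_pdf(2.0, 1.0, 1.0))
print(gaussian_pdf(0.0, 1.0, 1.0))
```

The returned value is a density, not a probability, so it can be larger than 1 when the standard deviation is small. That is fine for prediction, because only the comparison between classes matters.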
We can then plug in the probabilities into the equation above to make predictions with real-valued inputs.
For example, adapting one of the above calculations with numerical values for weather and car:
go-out = P(pdf(weather)|class=go-out) * P(pdf(car)|class=go-out) * P(class=go-out)
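The calculation above might be sketched in Python as follows. Here weather and car are treated as numerical inputs (think of a temperature and a made-up car reliability score); the means, standard deviations and class probabilities are all invented for illustration.

```python
from math import sqrt, pi, exp

def gaussian_pdf(x, mean, sd):
    return (1 / (sqrt(2 * pi) * sd)) * exp(-((x - mean) ** 2) / (2 * sd ** 2))

# Hypothetical model: class probabilities plus (mean, sd) per input per class.
class_prob = {"go-out": 0.5, "stay-home": 0.5}
summaries = {
    "go-out": {"weather": (25.0, 3.0), "car": (0.9, 0.1)},
    "stay-home": {"weather": (12.0, 4.0), "car": (0.4, 0.2)},
}

def predict(instance):
    """Return the class with the largest score, and the score for every class."""
    scores = {}
    for c, prior in class_prob.items():
        score = prior  # start from P(class=c)
        for name, x in instance.items():
            m, sd = summaries[c][name]
            score *= gaussian_pdf(x, m, sd)  # density of x under this class
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = predict({"weather": 23.0, "car": 0.8})
print(label, scores)
```

The only change from the categorical case is that each looked-up conditional probability is replaced by a call to the Gaussian PDF with that class's mean and standard deviation.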
Best Prepare Your Data For Naive Bayes
- Categorical Inputs: Naive Bayes assumes label attributes such as binary, categorical or nominal.
- Gaussian Inputs: If the input variables are real-valued, a Gaussian distribution is assumed. In which case the algorithm will perform better if the univariate distributions of your data are Gaussian or near-Gaussian. This may require removing outliers (e.g. values that are more than 3 or 4 standard deviations from the mean).
- Classification Problems: Naive Bayes is a classification algorithm suitable for binary and multiclass classification.
- Log Probabilities: The calculation of the likelihood of different class values involves multiplying a lot of small numbers together. This can lead to an underflow of numerical precision. As such it is good practice to use a log transform of the probabilities to avoid this underflow.
- Kernel Functions: Rather than assuming a Gaussian distribution for numerical input values, more complex distributions can be used such as a variety of kernel density functions.
- Update Probabilities: When new data becomes available, you can simply update the probabilities of your model. This can be helpful if the data changes frequently.
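To make the log probabilities tip concrete, here is a small Python sketch of the underflow problem. The 200 identical probabilities are made up; real models would have varied values, but the effect is the same.

```python
from math import log

# 200 hypothetical conditional probabilities, each very small.
probabilities = [1e-5] * 200

# Multiplying them directly underflows to 0.0 in double precision.
product = 1.0
for p in probabilities:
    product *= p

# Summing the logs keeps a usable (negative) score instead.
log_score = sum(log(p) for p in probabilities)
print(product, log_score)
```

When using logs, take the log of the prior and of every conditional probability (or density) and add them all together. The scores will be negative; compare them as they are, sign included, and pick the class with the largest score. Because the log is an increasing function, this gives the same prediction as comparing the original products.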
Two other posts on Naive Bayes that you might find interesting are:
- How To Implement Naive Bayes From Scratch in Python
- Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm
I love books. Below are some good general machine learning books for developers that cover naive Bayes:
- Data Mining: Practical Machine Learning Tools and Techniques, page 88
- Applied Predictive Modeling, page 353
- Artificial Intelligence: A Modern Approach, page 808
- Machine Learning, chapter 6
In this post you discovered the Naive Bayes algorithm for classification. You learned about:
- Bayes’ Theorem and how to calculate it in practice.
- Naive Bayes algorithm including representation, making predictions and learning the model.
- The adaptation of Naive Bayes for real-valued input data called Gaussian Naive Bayes.
- How to prepare data for Naive Bayes.
Do you have any questions about naive Bayes or about this post? Leave a comment and ask your question, I will do my best to answer it.
Hi, how to understand this statement:
*if we have an even number of instances in each class in our training data, then the probability of each class (e.g. P(h)) will be equal*?
Can you give some clear and concise examples on this?
Sure, if you have two classes “red” and “blue” and 50 examples in each class, then the classes have an equal number of observations.
The probability of any random sample drawn from the population belonging to one class or another is 0.5.
I hope that helps.
Yes, I see now. Here I got the word “even” as any even number like 4, 6, or 8. My bad!
For a classification based on multiple features, is it necessary to use a multivariate Gaussian distribution to decide class labels, or is it sufficient to assume that each feature, given that the class is yi, follows its own Gaussian distribution and then simply multiply the likelihoods of the features together? If the answer is yes, can you possibly direct me towards some references?
Thanks in advance!
It depends on the problem. Naive Bayes will assume independent gaussian distributions.
How do I use naive Bayes to detect and prevent SQL injection attacks? What I mean is: when an attacker injects malicious code, is that code saved on the server, with the algorithm working on it afterwards? Or what happens in fact?
It would depend on how you frame your problem:
Hey, for what it concerns the Gaussian Naive Bayes Model, how do you evaluate the conditional probability P(pdf(weather)|class=go-out)? The pdf(x, mean, sd) returns the probability density function which is not a probability (sometimes the value can be higher than 1). Can you please clarify this?
Hey, did you have any time to give a look at the question?
You are right, the pdf doesn’t give a probability (it is the limit of the probability of a small interval around the value divided by the interval width, as the width goes to 0), but since the goal in naive Bayes is to find the argmax, we can use the pdf for this, as more probable values have a higher pdf.
Hi Dr. Jason,
what is the best machine learning algorithm for text classification?
Try a suite and see what works best for your specific data.
CNNs with a word embedding do very well.
Hi Dr. Jason,
1) Is there an article which explains generative vs discriminative models? Or could you give some intuitive explanation of these two models.
2) Articles explaining HMM, CRF and MEMM?
I don’t have posts on those topics, I hope to cover them in the future.
Hi Dr. Jason, Thank you for getting back to me. Will look forward for those topics. Thank you.
Hi, how does it become a lazy algorithm? And can you tell me some applications that use naive Bayes?
You can apply it on most predictive modeling problems and compare the results to other algorithms to see if it should be used or not.
hello sir, I am doing social media analysis for predicting social crime (cyberbully, hacking etc..)
how we use naïve Bayes algorithm for multi-class that we set, for predicting social crime.
(for example particular tweet or post belonging to specified social crime class and how we train model for identifying and predicting social crime)
I would encourage you to use the sklearn implementation.
Hello sir, I want to use the naive Bayes algorithm on blood donor data to predict which donors will donate blood next time, according to their age. Sir, is this kind of prediction possible?
I would recommend this process to work through your problem:
Thanks for the informative blog Jason. I have a question. In the definition of MAP you have mentioned that the likelihood is multiplied by the probability of the hypothesis. Based on your example, I understand that this hypothesis is some output class probability. I was referring to some other blogs and they mentioned that to compute MAP you need to multiply by the prior probability of the model parameters. Please find the links below.
Does this mean MAP is a generic formula from Bayes’ rule that can be used with both output class priors and model parameter priors? I guess I am missing something very fundamental here.
Is MAP used for both model
Yes you should include the prior, I excluded it here because it was the same for each class.
I just can’t get enough of your posts man! I don’t have a question, but I really wanted to leave this comment to encourage you to keep writing such articles that are easy to understand and interpret.
Keep up the good work!
hello Jason sir thank you for the post.
I have a question sir:
1. do i have to use log for all the likelihood and also for the prior probabilities to get prediction.
(background : I have heart risk dataset and it contains both discrete and continuous data I have calculated likelihood for discrete data but when I am calculating for continuous data I get likelihoods to be zero and now I knew it was because of underflow of numerical precision. now, I am confused; do I have to transform all discrete and continuous likelihood to log and also prior ?(same above que))…
2.or should I only transform continuous data to log ? (and if I do it, will it not change the prediction?).
3.and lastly, if the log transformation is done the value is -ve so, should I only take magnitude or the sign also for the prediction?
I need help in this. Thank you sir
I think your formula for pdf has a mistake
pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x-mean^2)/(2*sd^2)))
pdf(x, mean, sd) = (1 / (sqrt(2 * PI) * sd)) * exp(-((x-mean)^2/(2*sd^2)))
otherwise there might be overflow issues
What python modules that implement naive bayes algo?
The scikit-learn library:
Wow the Best explanation for Naive Bayes I have ever seen.. You are awesome:-)
Hi, could you explain the correlation between the standard deviation of an attribute and the naive Bayes prediction? I have a dataset in which one attribute has a very low standard deviation and the accuracy is very poor, but on the same dataset a decision tree gives a much better result than naive Bayes.
I don’t believe it is related.
Hi, suppose I want to find the KL divergence between two data blocks. Can I apply Gaussian NB to find the probability distributions for these blocks? I have two multivariate data blocks (one dependent and multiple independent variables) and I want to find the distance between the probability distributions of these two blocks. How do I find it? Please guide.
Divergence is for one variable or one discrete event. Not sure about multivariate KL divergence off the cuff, sorry.
Well written article Dr. Jason. I didn’t see such a clear-cut effort to explain Bayesian as of yet.
However, can you clarify the ‘P’ in the following statement:
If it means probability, then how come we get probability of a density value? Because as per your article, the pdf will itself provides a value which will be compared with another pdf value. So the use of ‘P’ is not clear to me. Thanks in advance!
You should understand better if you go through the implementation: https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/
But I agree with you. The point here is to say, we are computing the probability for each feature and multiply it together. But how to get the probability depends on the pdf of that feature. Gaussian Naive Bayes assumes they are in normal distribution.