# Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

Naive Bayes is a simple and powerful technique that you should be testing and using on your classification problems.

It is simple to understand, gives good results, and is fast both to train and to make predictions. For these reasons alone you should take a closer look at the algorithm.

In a recent blog post, you learned how to implement the Naive Bayes algorithm from scratch in Python.

In this post you will learn tips and tricks to get the most from the Naive Bayes algorithm.

## 1. Missing Data

Naive Bayes can handle missing data.

Attributes are handled separately by the algorithm at both model construction time and prediction time.

As such, if a data instance has a missing value for an attribute, it can be ignored while preparing the model, and ignored when a probability is calculated for a class value.
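As a rough sketch of the prediction-time half of this idea, the snippet below skips any attribute whose value is missing when scoring a class. It assumes Gaussian attributes, a missing value encoded as `None`, and a hypothetical hand-written `model` of per-class `(mean, stdev)` pairs; scores are summed in log space to keep the numbers well behaved.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log density of a Gaussian with the given mean and standard deviation."""
    var = stdev ** 2
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def class_log_scores(model, priors, row):
    """Score each class, skipping attributes whose value is None (missing)."""
    scores = {}
    for label, summaries in model.items():
        score = math.log(priors[label])
        for value, (mean, stdev) in zip(row, summaries):
            if value is None:
                continue  # a missing attribute contributes nothing to the score
            score += gaussian_log_pdf(value, mean, stdev)
        scores[label] = score
    return scores

# Hypothetical model: per-class (mean, stdev) for two attributes.
model = {"a": [(1.0, 0.5), (10.0, 2.0)], "b": [(3.0, 0.5), (20.0, 2.0)]}
priors = {"a": 0.5, "b": 0.5}

# The second attribute is missing, so only the first one is used.
scores = class_log_scores(model, priors, [1.1, None])
prediction = max(scores, key=scores.get)
print(prediction)  # a
```

The same skip applies at training time: leave the missing value out when you compute the mean and standard deviation (or the counts) for that attribute.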

## 2. Use Log Probabilities

Probabilities are often small numbers. To calculate joint probabilities, you need to multiply probabilities together. When you multiply one small number by another small number, you get a very small number.

It is possible to get into difficulty with the precision of your floating point values, such as underflow. To avoid this problem, work in the log probability space (take the logarithm of your probabilities).

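This works because to make a prediction in Naive Bayes we need to know which class has the larger probability (rank) rather than what the specific probability was. The logarithm is a monotonically increasing function, so the class with the largest log probability is also the class with the largest probability, and the product of probabilities becomes a sum of log probabilities.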
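A small illustration of the underflow problem, using made-up values: 200 per-attribute probabilities of 0.01 each. The direct product falls below the range of a 64-bit float, while the sum of logs stays perfectly usable.

```python
import math

probabilities = [0.01] * 200  # 200 small per-attribute probabilities (made up)

# Direct product: 0.01 ** 200 is 1e-400, far below the float64 range.
product = 1.0
for p in probabilities:
    product *= p

# Log space: a sum of moderately sized negative numbers.
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # 0.0
print(log_sum)  # about -921.03
```

Once the product has collapsed to 0.0 for every class, there is nothing left to rank. The log sums remain distinct, so you can still pick the class with the largest one.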

## 3. Use Other Distributions

To use Naive Bayes with categorical attributes, you calculate the frequency of each observed value within each class.

To use Naive Bayes with real-valued attributes, you can summarize the density of the attribute using a Gaussian distribution. Alternatively you can use another functional form that better describes the distribution of the data, such as an exponential.

Don’t constrain yourself to the distributions used in examples of the Naive Bayes algorithm. Choose distributions that best characterize your data and prediction problem.
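One way to choose between candidate distributions is to fit each one to an attribute and compare the log-likelihood of the data under each fit. The sketch below does this for a Gaussian and an exponential, using synthetic waiting times drawn from an exponential distribution. The data, seed and function names are illustrative assumptions.

```python
import math
import random

def gaussian_log_likelihood(data):
    """Log-likelihood of data under a Gaussian fitted by maximum likelihood."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in data
    )

def exponential_log_likelihood(data):
    """Log-likelihood of data under an exponential fitted by maximum likelihood."""
    rate = len(data) / sum(data)  # the fitted rate is 1 / mean
    return sum(math.log(rate) - rate * x for x in data)

random.seed(1)
waiting_times = [random.expovariate(0.5) for _ in range(1000)]

gaussian_ll = gaussian_log_likelihood(waiting_times)
exponential_ll = exponential_log_likelihood(waiting_times)

# The higher (less negative) log-likelihood indicates the better fit.
print(gaussian_ll, exponential_ll)
```

For this skewed, strictly positive data the exponential fit scores higher, which suggests using it in place of the Gaussian for that attribute.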

## 4. Use Probabilities For Feature Selection

Feature selection is the selection of those data attributes that best characterize a predicted variable.

In Naive Bayes, the probabilities for each attribute are calculated independently from the training dataset. This means you can include or exclude an attribute from the model without recalculating anything for the others. You can use a search algorithm to explore different combinations of attributes and evaluate how well each combination predicts the output variable.
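Here is a minimal sketch of one such search, a greedy forward selection. It assumes binary categorical attributes, add-one smoothing, and accuracy on a held-out set as the score. The tiny dataset is invented so that attribute 0 predicts the class and attribute 1 is noise.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels, features):
    """Count classes and, per class, the values of each selected feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, label) -> value counts
    for row, label in zip(rows, labels):
        for f in features:
            value_counts[(f, label)][row[f]] += 1
    return class_counts, value_counts

def predict(model, row, features):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for f in features:
            # Add-one smoothing, assuming each attribute has 2 possible values.
            score += math.log((value_counts[(f, label)][row[f]] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(train_set, test_set, features):
    model = train(*train_set, features)
    rows, labels = test_set
    hits = sum(predict(model, r, features) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def forward_select(train_set, test_set, n_features):
    """Greedily add the feature that most improves held-out accuracy."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(accuracy(train_set, test_set, selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best:
            break  # no remaining feature improves the score
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best

train_set = ([(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 1)], [1, 1, 0, 0, 1, 0])
test_set = ([(1, 1), (0, 0), (1, 0), (0, 1)], [1, 0, 1, 0])

selected, best = forward_select(train_set, test_set, n_features=2)
print(selected, best)  # [0] 1.0
```

Greedy forward selection is only one option. Backward elimination or an exhaustive search over small attribute sets would fit the same pattern.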

## 5. Segment The Data

Is there a well-defined subset of your data that responds well to the Naive Bayes probabilistic approach?

Identifying and separating out segments that are easily handled by a simple probabilistic approach like Naive Bayes can give you increased performance and let you focus on the elements of the problem that are more difficult to model.

Explore different subsets, such as the average or popular cases that are very likely handled well by Naive Bayes.

## 6. Re-compute Probabilities

Calculating the probabilities for each attribute is very fast.

This benefit of Naive Bayes means that you can re-calculate the probabilities as the data changes. This may be monthly, daily, even hourly.

This is something that may be unthinkable for other algorithms, but should be tested when using Naive Bayes if there is some temporal drift in the problem being modeled.
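For categorical attributes, keeping the model current can be as cheap as adding to running counts. The sketch below assumes a hypothetical `CountModel` class that stores raw counts and derives probabilities on demand, so each new batch is folded in without revisiting old data.

```python
from collections import Counter, defaultdict

class CountModel:
    """Running counts for a categorical Naive Bayes model (illustrative only)."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute index, label) -> counts

    def update(self, rows, labels):
        """Fold a new batch of labelled rows into the existing counts."""
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for index, value in enumerate(row):
                self.value_counts[(index, label)][value] += 1

    def conditional(self, index, value, label):
        """P(attribute[index] == value | label), from the current counts."""
        return self.value_counts[(index, label)][value] / self.class_counts[label]

model = CountModel()
model.update([("sunny",), ("sunny",)], ["play", "play"])
before = model.conditional(0, "sunny", "play")

# A later batch arrives and shifts the estimate.
model.update([("rainy",), ("rainy",)], ["play", "play"])
after = model.conditional(0, "sunny", "play")

print(before, after)  # 1.0 0.5
```

If the problem drifts over time, you may prefer to rebuild the counts from a recent window of data instead, so that old observations stop influencing the probabilities.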

## 7. Use as a Generative Model

The Naive Bayes method characterizes the problem, which in turn can be used for making predictions about unseen data.

This probabilistic characterization can also be used to generate instances of the problem.

In the case of a numeric vector, the probability distributions can be sampled to create new fictitious vectors.

In the case of text (a very popular application of Naive Bayes), the model can be used to create fictitious input documents.

How might this be useful in your problem?

At the very least you can use the generative approach to help provide context for what the model has characterized.
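As a sketch of the numeric case, the snippet below first samples a class from the priors and then samples each attribute from that class's Gaussian. The `model` and `priors` values are invented for illustration, and Gaussian attributes are an assumption.

```python
import random

def sample_instance(model, priors, rng):
    """Draw a class from the priors, then draw each attribute for that class."""
    labels = list(priors)
    label = rng.choices(labels, weights=[priors[l] for l in labels])[0]
    vector = [rng.gauss(mean, stdev) for mean, stdev in model[label]]
    return label, vector

# Hypothetical model: per-class (mean, stdev) for two attributes.
model = {"a": [(1.0, 0.5), (10.0, 2.0)], "b": [(3.0, 0.5), (20.0, 2.0)]}
priors = {"a": 0.5, "b": 0.5}

rng = random.Random(42)
samples = [sample_instance(model, priors, rng) for _ in range(5)]
for label, vector in samples:
    print(label, [round(v, 2) for v in vector])
```

Because each attribute is sampled independently given the class, the generated vectors will not reproduce any correlations between attributes that exist in the real data. Looking at them is a quick way to see what the model has and has not captured.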

## 8. Remove Redundant Features

The performance of Naive Bayes can degrade if the data contains highly correlated features.

This is because highly correlated features are effectively counted twice in the model, over-inflating their importance.

Evaluate the correlation of attributes pairwise with each other using a correlation matrix and remove one feature from each of the most highly correlated pairs.

Nevertheless, always test your problem before and after such a change and stick with the form of the problem that leads to the better results.
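A minimal sketch of the pairwise check, using Pearson correlation and an assumed cut-off of 0.95. The three columns are invented: height in centimetres, the same height in inches, and an unrelated weight.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_features(columns, threshold=0.95):
    """Indices of columns highly correlated with a column that was already kept."""
    kept, dropped = [], []
    for j, column in enumerate(columns):
        if any(abs(pearson(columns[i], column)) > threshold for i in kept):
            dropped.append(j)
        else:
            kept.append(j)
    return dropped

height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]  # same information as height_cm
weight_kg = [80, 55, 90, 60, 70]

print(redundant_features([height_cm, height_in, weight_kg]))  # [1]
```

The threshold is a judgment call, and Pearson correlation only detects linear relationships, so treat the result as a list of candidates to test rather than a final answer.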

## 9. Parallelize Probability Calculation

The probabilities for each attribute are calculated independently. This is the independence assumption in the approach and the reason why it has its name “naive”.

You can exploit this assumption to further speed up the execution of the algorithm by calculating attribute probabilities in parallel.

Depending on the size of the dataset and your resources, you could do this using different CPUs, different machines or different clusters.
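The structure of such a computation might look like the sketch below, which summarizes each attribute column as a separate task. It uses a thread pool from the standard library only to show the shape of the code. For real CPU-bound speedups in CPython you would likely swap in a process pool or a distributed framework, and the toy columns here are far too small to benefit.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def summarize(column):
    """Per-attribute summary used by Gaussian Naive Bayes."""
    return mean(column), stdev(column)

# One list per attribute (for a single class); values are made up.
columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.5, 6.0]]

# Each column is summarized as an independent task.
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(summarize, columns))

print(summaries)
```

Because no task needs the result of another, the same pattern extends to one task per (class, attribute) pair.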

## 10. Less Data Than You Think

Naive Bayes does not need a lot of data to perform well.

It needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable.

Given that interactions between attributes are ignored in the model, we do not need examples of these interactions and therefore generally need less data than other algorithms, such as logistic regression.

Further, it is less likely to overfit the training data with a smaller sample size.

Try Naive Bayes if you do not have much training data.

## 11. Zero Observations Problem

Naive Bayes will not be reliable if the attribute distributions in new data differ significantly from those in the training dataset.

An important example of this is the case where a categorical attribute has a value that was not observed in training. In this case, the model will assign a probability of 0 to that value, which zeroes out the joint probability for every class and leaves the model unable to make a prediction.

These cases should be checked for and handled differently. After such cases have been resolved (an answer is known), the probabilities should be recalculated and the model updated.
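One common way to handle such cases is Laplace (add-one) smoothing, which a reader also mentions in the comments. The sketch below adds a pseudo-count `alpha` to every possible value so that an unseen value gets a small non-zero probability. The counts and the choice of `alpha=1.0` are illustrative assumptions.

```python
from collections import Counter

def smoothed_probability(counts, value, n_values, alpha=1.0):
    """P(value | class) with additive smoothing over n_values possible values."""
    total = sum(counts.values())
    return (counts[value] + alpha) / (total + alpha * n_values)

# Counts of a "colour" attribute for one class; "blue" was never observed.
counts = Counter({"red": 3, "green": 1})
colours = ["red", "green", "blue"]

for colour in colours:
    print(colour, smoothed_probability(counts, colour, n_values=len(colours)))
# red 4/7, green 2/7, blue 1/7
```

The smoothed probabilities still sum to 1, and the unseen value no longer wipes out the whole product.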

## 12. It Works Anyway

An interesting point about Naive Bayes is that even when the independence assumption is violated and there are clear known relationships between attributes, it works anyway.

Importantly, this is one of the reasons why you need to spot check a variety of algorithms on a given problem, because the results can very likely surprise you.

## Summary

In this post you learned a lot about how to use and get more out of the Naive Bayes algorithm.

Do you have some tricks and tips for using Naive Bayes not covered in this post? Leave a comment.

### 38 Responses to Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

1. Vaibhav August 28, 2015 at 3:54 am #

Great post! Very informative. Thanks a lot. 🙂

2. Wade January 21, 2016 at 3:36 am #

Fantastic! Thank you

3. Virgil July 27, 2016 at 11:47 pm #

Do you have any suggestions to handle the zero-observation problem? Currently I am using Laplacian correction

4. Matt Birdsall August 30, 2016 at 4:07 am #

Small thing. I think you meant “join[t]” probabilities in section 2.

• Jason Brownlee August 30, 2016 at 8:28 am #

Thanks Matt, fixed.

• Ayele Sankura May 7, 2018 at 5:59 pm #

Hello Dr. Jason,
Thank you very much for your contribution for computing environment,
I’m following your information every day.
I have one unclear idea concerning naive Bayes, I used to train and test the algorithm with the same data-set but the accuracy is not 100%. why it doesn’t so?

The algorithm is implemented with python with help of your posts, and I used to train the algorithm with SEER cancer data after careful preparation; again I used for the testing the same file with the separate calling, the accuracy yet not 100%.
Thank you!

5. Matthew Teow February 16, 2017 at 4:56 pm #

Oh! I almost miss this, Thank God, found it now 🙂

• Jason Brownlee February 17, 2017 at 9:53 am #

I’m glad you found the post useful Matthew.

6. Joseph Woolf April 10, 2017 at 1:09 pm #

Thanks for the explanation on how to improve on Naive Bayes Jason. I read through your posts on Naive Bayes, but I’m not entirely sure on the disadvantages of using the algorithm. I searched around on possible disadvantages, such as poor estimator, works poorly with highly correlated features, etc., but is it possible that features with similar means and standard deviations could cause the algorithm to perform poorly?

• Jason Brownlee April 11, 2017 at 9:31 am #

Hi Joseph,

The main limitation is that the algorithm does not capture the joint distributions of input variables. That is, any interesting and useful interactions between input features and their contribution to the output variable. The so-called independence assumption of naive bayes.

You could contrive a dataset where the joint distribution of two contrived variables is needed to make accurate predictions and show when naive bayes falls down.

I hope that helps.

• Joseph Woolf May 7, 2017 at 6:21 am #

I apologize for the late reply. Thank you for the explanation. That did help.

• Jason Brownlee May 8, 2017 at 7:41 am #

I’m glad to hear it Joseph.

7. johnson August 10, 2017 at 4:02 pm #

Hi Jason:
I got stuck on a problem when doing text classification using naive Bayes:
There are more than 3000 samples in trainset and more than 750 samples in testset.
And the samples should be classified into 95 categories. And i got 39% accuracy finally.

How can I improve the accuracy? Increase training samples? Or decrease categories?

Thanks!

• Jason Brownlee August 10, 2017 at 4:45 pm #

With so many categories, I expect accuracy does not mean anything any more johnson.

Consider log loss or AUC instead?

• jhonson August 10, 2017 at 5:51 pm #

I searched google, someone suggests that use so called “one-against-many” scheme.
I tried, but when the number of categories grows, the accuracy declines very quickly.

Can you give an example of log loss or AUC?
Thanks!

one-against-many: you begin with a two-class classifier (Class A and ‘all else’) then the results in the ‘all else’ class are returned to the algorithm for classification into Class B and ‘all else’, etc.

• Jason Brownlee August 11, 2017 at 6:37 am #

A one-vs-all or similar is the structure of the model, not the performance measure.

If you are using Python, then sklearn offers implementations of a suite of metrics:
http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics

• johnson August 11, 2017 at 11:21 am #

Thanks.
I use python, and i tried metrics. I changed the number of category to 30, and got
this:

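| class | precision | recall | f1-score | support |
|---|---|---|---|---|
| 25 | 0.48 | 1.00 | 0.65 | 372 |
| 26 | 0.00 | 0.00 | 0.00 | 22 |
| 27 | 0.00 | 0.00 | 0.00 | 1 |
| 20 | 0.00 | 0.00 | 0.00 | 4 |
| 21 | 0.00 | 0.00 | 0.00 | 4 |
| 22 | 0.00 | 0.00 | 0.00 | 13 |
| 23 | 0.00 | 0.00 | 0.00 | 60 |
| 28 | 0.00 | 0.00 | 0.00 | 21 |
| 29 | 0.00 | 0.00 | 0.00 | 39 |
| 1 | 0.00 | 0.00 | 0.00 | 7 |
| 0 | 0.00 | 0.00 | 0.00 | 17 |
| 3 | 0.00 | 0.00 | 0.00 | 25 |
| 2 | 0.00 | 0.00 | 0.00 | 15 |
| 5 | 0.00 | 0.00 | 0.00 | 7 |
| 4 | 0.00 | 0.00 | 0.00 | 13 |
| 7 | 0.00 | 0.00 | 0.00 | 2 |
| 6 | 0.00 | 0.00 | 0.00 | 6 |
| 9 | 0.00 | 0.00 | 0.00 | 2 |
| 8 | 0.00 | 0.00 | 0.00 | 31 |
| 11 | 0.00 | 0.00 | 0.00 | 29 |
| 10 | 0.00 | 0.00 | 0.00 | 20 |
| 13 | 0.60 | 0.60 | 0.60 | 5 |
| 12 | 0.00 | 0.00 | 0.00 | 4 |
| 15 | 0.00 | 0.00 | 0.00 | 7 |
| 14 | 0.00 | 0.00 | 0.00 | 11 |
| 17 | 0.00 | 0.00 | 0.00 | 24 |
| 16 | 0.00 | 0.00 | 0.00 | 9 |
| 19 | 0.00 | 0.00 | 0.00 | 15 |
| 18 | 0.00 | 0.00 | 0.00 | 1 |
| avg / total | 0.23 | 0.48 | 0.31 | 786 |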

• Jason Brownlee August 12, 2017 at 6:45 am #

Hang in there!

• johnson August 10, 2017 at 6:19 pm #

why my comment always gets lost?

Thanks jason.

Use log loss or AUC in naive Bayes?

Could you give me an example for it?

Thanks a lot!

• Jason Brownlee August 11, 2017 at 6:39 am #

I moderate comments, I do it every 24 hours.

8. Ravi January 12, 2018 at 3:17 pm #

Awesome, to make the result more accurate. Thanks for putting all the loophole of Naive Bayes at a single place.

• Jason Brownlee January 13, 2018 at 5:27 am #

9. Vasanthi January 28, 2018 at 4:24 am #

It was tough initially. Now I have a clear picture of Naive Bayes. Can you give a practical example of removing redundant features?
Thank you .

• Jason Brownlee January 28, 2018 at 8:26 am #

10. Eliza March 22, 2018 at 11:06 pm #

Hello Mr.Jason
If I remove one feature that has the same value of another feature should I retrain the model ?

• Jason Brownlee March 23, 2018 at 6:07 am #

Yes.

11. Ravi August 8, 2018 at 8:28 am #

Great , It helped a lot.

• Jason Brownlee August 8, 2018 at 9:40 am #

12. prem August 13, 2018 at 7:02 pm #

Hi,

what could be the alpha value range in Naive Bayes algorithm for smoothing? Why can’t we apply Naive Bayes on negative values?

13. Ravi Gurnatham August 21, 2018 at 2:10 pm #

Hello Dr. Jason,

Can I apply log-loss metric on naive bayes model performance while using log-probabilities because if we use log-probabilities then it gives real values but log-loss expects values from [0,1].

• Jason Brownlee August 21, 2018 at 2:18 pm #

I don’t follow, what is the problem you are having exactly?

14. Mudi September 7, 2018 at 1:49 am #

Hi,
The model has feature vectors with both label 0 and 1. I want to use Naive Bayes to predict a feature given positive labels, say 0. How to do that?

15. Khoa Ng April 27, 2019 at 9:26 pm #

Hello, thank you for your great post.

I have an unrelated question about Naive Bayes: How can I predict unknown class with Naive Bayes? For example, only class A, B and C are trained. The data to be predicted has a class different than A, B and C. How can we calculate the probability to identify such case?

16. Richard June 10, 2019 at 7:08 am #

I was intrigued by the comment about numerical underflow, so I generated a series of non-zero pseudo-random numbers averaging 1% (range 0% to 2%) and compared the product to the exponential of the sum of the natural logarithms. When reaching about 133 of these values the product underflows using 64-bit (double-precision) floating point numbers, whilst the sum of the logarithms is fine.

The smallest non-zero 64-bit float is approx 2e-308, my random values had an average ln of -5.3, ln(2e-308)/5.3 = 133 and indeed the product then becomes zero at that count.

So this helps us quantify when this problem might occur and when the CPU overhead of using logarithms only becomes worthwhile (say over 100+ input variables for 64-bit floats, and only 16 inputs for 32-bit floats).

• Jason Brownlee June 10, 2019 at 7:39 am #

Nice one!

It’s almost a golden rule to work with log probs when modeling, and to add an epsilon when logging a probability.