Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

Last Updated on August 12, 2019

Naive Bayes is a simple and powerful technique that you should be testing and using on your classification problems.

It is simple to understand, gives good results and is fast to build a model and make predictions. For these reasons alone you should take a closer look at the algorithm.

In a recent blog post, you learned how to implement the Naive Bayes algorithm from scratch in Python.

In this post you will learn tips and tricks to get the most from the Naive Bayes algorithm.

Better Naive Bayes
Photo by Duncan Hull, some rights reserved

1. Missing Data

Naive Bayes can handle missing data.

Attributes are handled separately by the algorithm at both model construction time and prediction time.

As such, if a data instance has a missing value for an attribute, it can be ignored while preparing the model, and ignored when a probability is calculated for a class value.
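As a rough sketch of this idea, the snippet below skips missing values both when fitting an attribute and when scoring a class. It assumes Gaussian attributes and uses None to mark a missing value; the function names and the model parameters are hypothetical, chosen only for illustration.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)

def fit_attribute(values):
    """Estimate (mean, stdev) from the observed values only, ignoring None."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)
    return mean, math.sqrt(var)

def class_log_score(row, prior, params):
    """Log prior plus the log-likelihoods of the observed attributes only."""
    score = math.log(prior)
    for value, (mean, stdev) in zip(row, params):
        if value is None:  # missing value: contributes nothing to the score
            continue
        score += gaussian_log_pdf(value, mean, stdev)
    return score

# Hypothetical model: class -> (prior, [(mean, stdev) for each attribute]).
model = {
    "A": (0.5, [(1.0, 0.5), (10.0, 2.0)]),
    "B": (0.5, [(3.0, 0.5), (20.0, 2.0)]),
}

def predict(row):
    """Return the class with the highest log score for this row."""
    return max(model, key=lambda c: class_log_score(row, *model[c]))

complete = predict([1.1, 11.0])   # both attributes present
partial = predict([None, 19.0])   # first attribute missing
```

Because each attribute contributes its own term, dropping a term for a missing value leaves the rest of the calculation unchanged.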

2. Use Log Probabilities

Probabilities are often small numbers. To calculate joint probabilities, you need to multiply probabilities together. When you multiply one small number by another small number, you get a very small number.

It is possible to get into difficulty with the precision of your floating point values, such as numerical underflow. To avoid this problem, work in the log probability space (take the logarithm of your probabilities).

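This works because to make a prediction in Naive Bayes we only need to know which class has the largest probability (the ranking), not the specific probability value. The logarithm is monotonic, so it preserves that ranking, and the product of probabilities becomes a sum of log probabilities.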
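A small illustration of the problem and the fix, using made-up numbers: 200 hypothetical per-attribute likelihoods of 0.01 each. The direct product underflows to zero in double precision, while the sum of logs stays well within range. The predict helper is an illustrative name, not part of any library.

```python
import math

# 200 hypothetical per-attribute likelihoods of 0.01 each, for one class.
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p   # the true value, 1e-400, is too small for a 64-bit float

log_sum = sum(math.log(p) for p in probs)   # roughly -921, no underflow

def predict(log_priors, log_likelihoods):
    """Return the class with the largest summed log probability."""
    scores = {c: log_priors[c] + sum(log_likelihoods[c]) for c in log_priors}
    return max(scores, key=scores.get)

winner = predict(
    {"spam": math.log(0.4), "ham": math.log(0.6)},
    {"spam": [math.log(0.2), math.log(0.1)], "ham": [math.log(0.05), math.log(0.1)]},
)
```

Here product ends up as exactly 0.0, so every class would tie at zero, whereas the log scores can still be compared.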

3. Use Other Distributions

To use Naive Bayes with categorical attributes, you calculate the frequency of each attribute value within each class.

To use Naive Bayes with real-valued attributes, you can summarize the density of the attribute using a Gaussian distribution. Alternatively you can use another functional form that better describes the distribution of the data, such as an exponential.

Don’t constrain yourself to the distributions used in examples of the Naive Bayes algorithm. Choose distributions that best characterize your data and prediction problem.
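One way to compare candidate distributions, sketched below under simple assumptions: fit each form to the attribute values of one class and compare the log-likelihoods. The data values are invented for illustration, and in practice you would prefer to compare on held-out data rather than on the values used for fitting.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)

def exponential_log_pdf(x, rate):
    """Log density of an exponential distribution (defined for x >= 0)."""
    return math.log(rate) - rate * x

# Hypothetical skewed, non-negative attribute values for a single class.
values = [0.1, 0.2, 0.2, 0.4, 0.5, 0.9, 1.3, 2.8, 4.6]

mean = sum(values) / len(values)
stdev = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
rate = 1.0 / mean   # maximum-likelihood estimate of the exponential rate

gaussian_ll = sum(gaussian_log_pdf(v, mean, stdev) for v in values)
exponential_ll = sum(exponential_log_pdf(v, rate) for v in values)

# Keep whichever functional form explains the values better.
better = "exponential" if exponential_ll > gaussian_ll else "gaussian"
```

For these skewed values the exponential form gives the higher log-likelihood, so it would replace the Gaussian for this attribute.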

4. Use Probabilities For Feature Selection

Feature selection is the selection of those data attributes that best characterize a predicted variable.

In Naive Bayes, the probabilities for each attribute are calculated independently from the training dataset. You can use a search algorithm to explore the combination of the probabilities of different attributes together and evaluate their performance at predicting the output variable.
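The sketch below shows one possible version of such a search: an exhaustive search over attribute subsets, scored by accuracy on a small validation set. It assumes Gaussian attributes, and the dataset, function names and choice of exhaustive search are all illustrative. With many attributes a greedy or stochastic search would be more practical.

```python
import math
from itertools import combinations

def gaussian_log_pdf(x, mean, stdev):
    """Log of the Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)

def fit(rows, labels):
    """Per-class prior and per-attribute (mean, stdev), each estimated independently."""
    model = {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        params = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
            params.append((mean, math.sqrt(var)))
        model[c] = (len(members) / len(rows), params)
    return model

def accuracy(model, rows, labels, subset):
    """Accuracy when only the attributes in `subset` contribute to the score."""
    correct = 0
    for row, y in zip(rows, labels):
        scores = {}
        for c, (prior, params) in model.items():
            scores[c] = math.log(prior) + sum(
                gaussian_log_pdf(row[j], *params[j]) for j in subset)
        correct += max(scores, key=scores.get) == y
    return correct / len(rows)

# Hypothetical data: attribute 0 separates the classes, the others do not.
train_X = [[1.0, 5.0, 3.0], [1.2, 5.5, 7.0], [0.8, 4.5, 5.0], [1.1, 5.2, 1.0],
           [3.0, 5.1, 2.0], [3.2, 4.9, 6.0], [2.8, 5.3, 4.0], [3.1, 4.8, 8.0]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
valid_X = [[0.9, 5.0, 7.5], [1.3, 4.7, 2.0], [2.9, 5.4, 1.0], [3.3, 5.0, 5.0]]
valid_y = [0, 0, 1, 1]

model = fit(train_X, train_y)   # probabilities are computed once
n_attributes = len(train_X[0])

best_subset, best_accuracy = None, -1.0
for size in range(1, n_attributes + 1):
    for subset in combinations(range(n_attributes), size):
        acc = accuracy(model, valid_X, valid_y, subset)
        if acc > best_accuracy:   # strict: ties keep the smaller subset
            best_subset, best_accuracy = subset, acc
```

Note that the model is fitted only once. Because each attribute's probabilities are independent, trying a different subset only changes which terms are summed.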

5. Segment The Data

Is there a well-defined subset of your data that responds well to the Naive Bayes probabilistic approach?

Identifying and separating out segments that are easily handled by a simple probabilistic approach like Naive Bayes can give you increased performance and let you focus on the elements of the problem that are more difficult to model.

Explore different subsets, such as the average or popular cases that are very likely handled well by Naive Bayes.

6. Re-compute Probabilities

Calculating the probabilities for each attribute is very fast.

This benefit of Naive Bayes means that you can re-calculate the probabilities as the data changes. This may be monthly, daily, even hourly.

This is something that may be unthinkable for other algorithms, but should be tested when using Naive Bayes if there is some temporal drift in the problem being modeled.
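For categorical attributes, one simple way to make re-computation cheap is to keep the raw counts and derive the probabilities on demand. The class below is a hypothetical sketch of that idea, not a reference implementation. To follow temporal drift more closely you might also discard or down-weight old counts, which is not shown here.

```python
from collections import defaultdict

class CountingNB:
    """Categorical Naive Bayes stored as raw counts so it can be updated cheaply."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)   # (attribute index, value, class) -> count

    def update(self, rows, labels):
        """Fold a new batch of labelled rows into the existing counts."""
        for row, y in zip(rows, labels):
            self.class_counts[y] += 1
            for j, v in enumerate(row):
                self.value_counts[(j, v, y)] += 1

    def conditional(self, j, v, y):
        """P(attribute j == v | class y) from the current counts (class y must have been seen)."""
        return self.value_counts.get((j, v, y), 0) / self.class_counts[y]

nb = CountingNB()
nb.update([["sunny", "hot"], ["rainy", "mild"]], ["no", "yes"])
before = nb.conditional(0, "sunny", "no")   # 1 of 1 "no" rows is sunny

nb.update([["rainy", "hot"]], ["no"])       # a new batch arrives later
after = nb.conditional(0, "sunny", "no")    # 1 of 2 "no" rows is sunny
```

Each update touches only the counts for the new rows, so refreshing the model hourly costs little more than reading the new data.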

7. Use as a Generative Model

The Naive Bayes method characterizes the problem, which in turn can be used for making predictions about unseen data.

This probabilistic characterization can also be used to generate instances of the problem.

In the case of a numeric vector, the probability distributions can be sampled to create new fictitious vectors.

In the case of text (a very popular application of Naive Bayes), the model can be used to create fictitious input documents.

How might this be useful in your problem?

At the very least you can use the generative approach to help provide context for what the model has characterized.
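For the numeric case, sampling could look something like the sketch below. It assumes Gaussian attributes and a hypothetical fitted model (the priors, means and standard deviations are made up). A class is drawn from the priors, then each attribute is drawn independently from that class's distribution.

```python
import random

# Hypothetical fitted model: class -> (prior, [(mean, stdev) for each attribute]).
model = {
    "A": (0.7, [(1.0, 0.5), (10.0, 2.0)]),
    "B": (0.3, [(3.0, 0.5), (20.0, 2.0)]),
}

def sample_instance(model, rng):
    """Draw a class from the priors, then each attribute from that class's Gaussian."""
    classes = list(model)
    priors = [model[c][0] for c in classes]
    label = rng.choices(classes, weights=priors, k=1)[0]
    vector = [rng.gauss(mean, stdev) for mean, stdev in model[label][1]]
    return label, vector

rng = random.Random(42)   # seeded so the fictitious data is reproducible
samples = [sample_instance(model, rng) for _ in range(1000)]
```

Comparing such fictitious vectors with real ones is a quick way to see what the model has and has not captured. For example, any correlation between attributes in the real data will be absent from the samples.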

8. Remove Redundant Features

The performance of Naive Bayes can degrade if the data contains highly correlated features.

This is because highly correlated features are effectively counted twice in the model, over-inflating their importance.

Evaluate the correlation of attributes pairwise using a correlation matrix and, for each highly correlated pair, remove one of the two features.

Nevertheless, always test your problem before and after such a change and stick with the form of the problem that leads to the better results.
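A minimal sketch of the pairwise check, using Pearson correlation on numeric columns. The threshold of 0.9, the helper names and the toy data (a height recorded in both centimetres and inches, plus an unrelated column) are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def redundant_columns(rows, threshold=0.9):
    """Indices to drop: the later column of each highly correlated pair."""
    columns = list(zip(*rows))
    drop = set()
    for i in range(len(columns)):
        if i in drop:
            continue
        for j in range(i + 1, len(columns)):
            if j not in drop and abs(pearson(columns[i], columns[j])) >= threshold:
                drop.add(j)
    return sorted(drop)

# Hypothetical rows: height in cm, the same height in inches, an unrelated value.
rows = [[150, 59.1, 3], [160, 63.0, 9], [170, 66.9, 1], [180, 70.9, 7], [190, 74.8, 4]]
to_drop = redundant_columns(rows)
reduced = [[v for j, v in enumerate(row) if j not in to_drop] for row in rows]
```

Only one column of each correlated pair is dropped, so the information it carries stays in the model once rather than twice.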

9. Parallelize Probability Calculation

The probabilities for each attribute are calculated independently. This is the independence assumption in the approach and the reason why it has the name “naive”.

You can exploit this assumption to further speed up the execution of the algorithm by calculating attribute probabilities in parallel.

Depending on the size of the dataset and your resources, you could do this using different CPUs, different machines or different clusters.
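As a small sketch of the idea, the per-attribute summaries below are computed by a pool of workers from Python's standard concurrent.futures module. The data and the summarize helper are hypothetical. A thread pool is used here only to keep the example simple. For CPU-bound work in pure Python, a process pool (which offers the same map interface) is more likely to give a real speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def summarize(column):
    """Mean and sample standard deviation of one attribute."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
    return mean, math.sqrt(var)

# Hypothetical training rows for a single class, three attributes each.
rows = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
columns = list(zip(*rows))   # one tuple of values per attribute

# Each attribute is summarized independently, so the work can be farmed out.
with ThreadPoolExecutor(max_workers=4) as pool:
    params = list(pool.map(summarize, columns))
```

The results come back in attribute order, so params can be used exactly as if it had been computed in a plain loop.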

10. Less Data Than You Think

Naive Bayes does not need a lot of data to perform well.

It needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable.

Given that interactions between attributes are ignored in the model, we do not need examples of these interactions, and therefore generally need less data than other algorithms, such as logistic regression.

Further, it is less likely to overfit the training data with a smaller sample size.

Try Naive Bayes if you do not have much training data.

11. Zero Observations Problem

Naive Bayes will not be reliable if there are significant differences in the attribute distributions compared to the training dataset.

An important example of this is the case where a categorical attribute has a value that was not observed in training. In this case, the model will assign a probability of 0 to that value, which zeroes out the whole product for the class, and it will be unable to make a meaningful prediction.

These cases should be checked for and handled differently. After such cases have been resolved (an answer is known), the probabilities should be recalculated and the model updated.
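One common remedy, which this post does not itself prescribe, is additive (Laplace) smoothing: add a small pseudo-count alpha to every possible value so that no probability is exactly zero. The function below is a hypothetical sketch. It assumes you know how many distinct values the attribute can take.

```python
from collections import Counter

def smoothed_probability(value, observed, n_values, alpha=1.0):
    """P(value | class) with additive smoothing.

    observed: the values seen for this attribute within one class
    n_values: number of distinct values the attribute can take (assumed known)
    alpha:    pseudo-count added to every value; 0 disables smoothing
    """
    counts = Counter(observed)
    return (counts[value] + alpha) / (len(observed) + alpha * n_values)

# Hypothetical training values for one class; "green" was never observed.
observed = ["red", "red", "blue"]

unsmoothed = smoothed_probability("green", observed, n_values=3, alpha=0.0)
smoothed = smoothed_probability("green", observed, n_values=3)
```

With alpha set to 1, the unseen value gets a probability of 1/6 instead of 0, and the probabilities over the three possible values still sum to one.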

12. It Works Anyway

An interesting point about Naive Bayes is that even when the independence assumption is violated and there are clear known relationships between attributes, it works anyway.

Importantly, this is one of the reasons why you need to spot check a variety of algorithms on a given problem, because the results can very likely surprise you.


In this post you learned a lot about how to use and get more out of the Naive Bayes algorithm.

Do you have some tricks and tips for using Naive Bayes not covered in this post? Leave a comment.


52 Responses to Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

  1. Avatar
    Vaibhav August 28, 2015 at 3:54 am #

    Great post! Very informative. Thanks a lot. 🙂

  2. Avatar
    Wade January 21, 2016 at 3:36 am #

    Fantastic! Thank you

  3. Avatar
    Virgil July 27, 2016 at 11:47 pm #

    Do you have any suggestions to handle the zero-observation problem? Currently I am using Laplacian correction

  4. Avatar
    Matt Birdsall August 30, 2016 at 4:07 am #

    Small thing. I think you meant “join[t]” probabilities in section 2.

    • Avatar
      Jason Brownlee August 30, 2016 at 8:28 am #

      Thanks Matt, fixed.

      • Avatar
        Ayele Sankura May 7, 2018 at 5:59 pm #

        Hello Dr. Jason,
        Thank you very much for your contribution for computing environment,
        I’m following your information every day.
        I have one unclear idea concerning naive Bayes, I used to train and test the algorithm with the same data-set but the accuracy is not 100%. why it doesn’t so?

        The algorithm is implemented with python with help of your posts, and I used to train the algorithm with SEER cancer data after careful preparation; again I used for the testing the same file with the separate calling, the accuracy yet not 100%.
        Thank you!

  5. Avatar
    Matthew Teow February 16, 2017 at 4:56 pm #

    Oh! I almost miss this, Thank God, found it now 🙂

  6. Avatar
    Joseph Woolf April 10, 2017 at 1:09 pm #

    Thanks for the explanation on how to improve on Naive Bayes Jason. I read through your posts on Naive Bayes, but I’m not entirely sure on the disadvantages of using the algorithm. I searched around on possible disadvantages, such as poor estimator, works poorly with highly correlated features, etc., but is it possible that features with similar means and standard deviations could cause the algorithm to perform poorly?

    • Avatar
      Jason Brownlee April 11, 2017 at 9:31 am #

      Hi Joseph,

      The main limitation is that the algorithm does not capture the joint distributions of input variables. That is, any interesting and useful interactions between input features and their contribution to the output variable. The so-called independence assumption of naive bayes.

      You could contrive a dataset where the joint distribution of two contrived variables is needed to make accurate predictions and show when naive bayes falls down.

      I hope that helps.

      • Avatar
        Joseph Woolf May 7, 2017 at 6:21 am #

        I apologize for the late reply. Thank you for the explanation. That did help.

  7. Avatar
    johnson August 10, 2017 at 4:02 pm #

    Hi Jason:
    I get stuck with a problem when doing text classification using Naive Bayes:
    There are more than 3000 samples in the training set and more than 750 samples in the test set.
    The samples should be classified into 95 categories, and I got 39% accuracy in the end.

    How can I improve the accuracy? Increase the training samples, or decrease the categories?


    • Avatar
      Jason Brownlee August 10, 2017 at 4:45 pm #

      With so many categories, I expect accuracy does not mean anything any more johnson.

      Consider log loss or AUC instead?

      • Avatar
        jhonson August 10, 2017 at 5:51 pm #

        Thanks for your reply.
        I searched Google, and someone suggested using the so-called “one-against-many” scheme.
        I tried it, but as the number of categories grows, the accuracy declines very quickly.

        Can you give an example of log loss or AUC?

        one-against-many: you begin with a two-class classifier (Class A and ‘all else’) then the results in the ‘all else’ class are returned to the algorithm for classification into Class B and ‘all else’, etc.

        • Avatar
          Jason Brownlee August 11, 2017 at 6:37 am #

          A one-vs-all or similar is the structure of the model, not the performance measure.

          If you are using Python, then sklearn offers implementations of a suite of metrics:

          • Avatar
            johnson August 11, 2017 at 11:21 am #

            I use python, and i tried metrics. I changed the number of category to 30, and got

            class        precision  recall  f1-score  support

            25                0.48    1.00      0.65      372
            26                0.00    0.00      0.00       22
            27                0.00    0.00      0.00        1
            20                0.00    0.00      0.00        4
            21                0.00    0.00      0.00        4
            22                0.00    0.00      0.00       13
            23                0.00    0.00      0.00       60
            28                0.00    0.00      0.00       21
            29                0.00    0.00      0.00       39
            1                 0.00    0.00      0.00        7
            0                 0.00    0.00      0.00       17
            3                 0.00    0.00      0.00       25
            2                 0.00    0.00      0.00       15
            5                 0.00    0.00      0.00        7
            4                 0.00    0.00      0.00       13
            7                 0.00    0.00      0.00        2
            6                 0.00    0.00      0.00        6
            9                 0.00    0.00      0.00        2
            8                 0.00    0.00      0.00       31
            11                0.00    0.00      0.00       29
            10                0.00    0.00      0.00       20
            13                0.60    0.60      0.60        5
            12                0.00    0.00      0.00        4
            15                0.00    0.00      0.00        7
            14                0.00    0.00      0.00       11
            17                0.00    0.00      0.00       24
            16                0.00    0.00      0.00        9
            19                0.00    0.00      0.00       15
            18                0.00    0.00      0.00        1

            avg / total       0.23    0.48      0.31      786

          • Avatar
            Jason Brownlee August 12, 2017 at 6:45 am #

            Hang in there!

      • Avatar
        johnson August 10, 2017 at 6:19 pm #

        why my comment always gets lost?

        Thanks jason.

        Use log loss or AUC in Naive Bayes?

        Could you give me an example for it?

        Thanks a lot!

  8. Avatar
    Ravi January 12, 2018 at 3:17 pm #

    Awesome, to make the result more accurate. Thanks for putting all the loophole of Naive Bayes at a single place.

  9. Avatar
    Vasanthi January 28, 2018 at 4:24 am #

    It was tough initially. Now I have a clear picture of Naive Bayes. Can you give a practical example of removing redundant features?
    Thank you .

  10. Avatar
    Eliza March 22, 2018 at 11:06 pm #

    Hello Mr.Jason
    If I remove one feature that has the same value of another feature should I retrain the model ?

  11. Avatar
    Ravi August 8, 2018 at 8:28 am #

    Great , It helped a lot.

  12. Avatar
    prem August 13, 2018 at 7:02 pm #


    what could be the alpha value range in Naive Bayes algorithm for smoothing? Why can’t we apply Naive Bayes on negative values?

  13. Avatar
    Ravi Gurnatham August 21, 2018 at 2:10 pm #

    Hello Dr. Jason,

    Can I apply log-loss metric on naive bayes model performance while using log-probabilities because if we use log-probabilities then it gives real values but log-loss expects values from [0,1].

    • Avatar
      Jason Brownlee August 21, 2018 at 2:18 pm #

      I don’t follow, what is the problem you are having exactly?

  14. Avatar
    Mudi September 7, 2018 at 1:49 am #

    The model has feature vectors with both label 0 and 1. I want to use Naive Bayes to predict a feature given positive labels, say 0. How to do that?

  15. Avatar
    Khoa Ng April 27, 2019 at 9:26 pm #

    Hello, thank you for your great post.

    I have an unrelated question about Naive Bayes: How can I predict unknown class with Naive Bayes? For example, only class A, B and C are trained. The data to be predicted has a class different than A, B and C. How can we calculate the probability to identify such case?

    Thanks in advance

  16. Avatar
    Richard June 10, 2019 at 7:08 am #

    I was intrigued by the comment about numerical underflow, so I generated a series of non-zero pseudo-random numbers averaging 1% (range 0% to 2%) and compared the product to the exponential of the sum of the natural logarithms. When reaching about 133 of these values the product underflows using 64-bit (double-precision) floating point numbers, whilst the sum of the logarithms is fine.

    The smallest non-zero 64-bit float is approx 2e-308, my random values had an average ln of -5.3, ln(2e-308)/5.3 = 133 and indeed the product then becomes zero at that count.

    So this helps us quantify when this problem might occur and when the CPU overhead of using logarithms only becomes worthwhile (say over 100+ input variables for 64-bit floats, and only 16 inputs for 32-bit floats).

    • Avatar
      Jason Brownlee June 10, 2019 at 7:39 am #

      Nice one!

      It’s almost a golden rule to work with log probs when modeling, and to add an epsilon when logging a probability.

  17. Avatar
    vian December 3, 2020 at 9:38 pm #

    thx a lot, I have a question and hope for your help.
    I collected data with multiple features and multiclass some features are redundant in class and different from others, e.g. features value [2,2,2] for class A, [3,3,3] for class B, and so on.
    so the variance should be 0 so how Gaussian naive Bayes algorithm work
    I tried to run my program and it gave me a good performance but I don’t know how?

      • Avatar
        vian December 4, 2020 at 11:21 pm #

        thank you so much. I read it and it is so useful but still don’t have a solution to my problem.
        when I checked my program I found the default standard deviation given if it is 0 value but don’t know how right am i.. please help me if you have any idea..

        • Avatar
          Jason Brownlee December 5, 2020 at 8:07 am #

          If the standard deviation is zero it suggests the column has a single value and can probably be removed.

          • Avatar
            vian December 5, 2020 at 10:38 pm #

            no, the column with different values but static with a class and really it gave high accuracy.. please I have other questions 1- can I plot ROC with Gaussian naive Bayes multiclass ..2- can I change theta and epsilon of Gaussian Naive Bayes ..
            I appreciated your help thanks a lot.

          • Avatar
            Jason Brownlee December 6, 2020 at 7:02 am #

            No, ROC is for binary (2 class) classification problems.

            The mean and stdev for a variable used in naive bayes is only for real-valued variables and is estimated from the training dataset.

  18. Avatar
    vian December 6, 2020 at 8:28 am #

    ok Mr, Jason thank you so much

  19. Avatar
    Shubham August 14, 2021 at 4:51 pm #

    how to find most contributing attribute of a naive bayes model in python.

    • Avatar
      Adrian Tam August 14, 2021 at 11:31 pm #

      As always, by testing! Try varying some parameters in the model with the same input, and see how its performance metric varies.

  20. Avatar
    Kedar September 14, 2021 at 12:56 pm #

    Can we use Naive Bayes for a problem where we are getting input feature values one by one and we want to update the prediction as we get more information?

    • Avatar
      Adrian Tam September 14, 2021 at 1:32 pm #

      It sounds possible!

  21. Avatar
    Peter Utubor December 31, 2022 at 3:09 am #

    Thank you Jason for this fantastic write-up. I have a challenge: what if each feature has a different distribution (exponential, Gaussian, etc.)? How do I handle that in Naive Bayes?
