Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

Naive Bayes is a simple and powerful technique that you should be testing and using on your classification problems.

It is simple to understand, gives good results and is fast to build a model and make predictions. For these reasons alone you should take a closer look at the algorithm.

In a recent blog post, you learned how to implement the Naive Bayes algorithm from scratch in Python.

In this post you will learn tips and tricks to get the most from the Naive Bayes algorithm.

Better Naive Bayes
Photo by Duncan Hull, some rights reserved

1. Missing Data

Naive Bayes can handle missing data.

Attributes are handled separately by the algorithm at both model construction time and prediction time.

As such, if a data instance has a missing value for an attribute, it can be ignored while preparing the model, and ignored when a probability is calculated for a class value.
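As a rough illustration of the prediction-time half of this idea, here is a minimal sketch. It assumes a Gaussian model whose per-class summaries have already been computed; the names summaries, priors, gaussian_pdf and predict, and all of the numbers, are hypothetical and chosen only for the example. Missing values are represented as None.

```python
import math

# Hypothetical fitted model: {class: [(mean, stdev) for each attribute]}.
summaries = {
    "a": [(1.0, 0.5), (10.0, 2.0)],
    "b": [(3.0, 0.5), (20.0, 2.0)],
}
priors = {"a": 0.5, "b": 0.5}

def gaussian_pdf(x, mean, stdev):
    """Gaussian density at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def class_scores(row):
    """Score each class, skipping attributes whose value is None (missing)."""
    scores = {}
    for label, attr_summaries in summaries.items():
        score = priors[label]
        for value, (mean, stdev) in zip(row, attr_summaries):
            if value is None:
                continue  # a missing attribute simply contributes no term
            score *= gaussian_pdf(value, mean, stdev)
        scores[label] = score
    return scores

def predict(row):
    scores = class_scores(row)
    return max(scores, key=scores.get)
```

Calling predict([None, 19.0]) uses only the second attribute, and predict([1.1, None]) uses only the first. The same skip can be applied when computing the means and standard deviations during training.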

2. Use Log Probabilities

Probabilities are often small numbers. To calculate joint probabilities, you need to multiply probabilities together. When you multiply one small number by another small number, you get a very small number.

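It is possible to get into difficulty with the precision of your floating point values, such as numerical underflow, where the product becomes too small to represent and is rounded to zero. To avoid this problem, work in the log probability space (take the logarithm of your probabilities).

This works because, to make a prediction in Naive Bayes, we need to know which class has the largest probability (rank) rather than what the specific probability is. The logarithm is monotonically increasing, so it preserves that ranking, and the product of probabilities becomes a sum of log probabilities.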
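The following small sketch shows the effect. The two lists of 400 probabilities are made up for the example and stand in for the per-attribute probabilities of two classes; the helper names product and log_sum are likewise just illustrative.

```python
import math

# 400 hypothetical attribute probabilities for each of two classes.
probs_a = [0.01] * 400
probs_b = [0.02] * 400

def product(probs):
    """Multiply the probabilities directly."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_sum(probs):
    """Sum the log probabilities instead of multiplying."""
    return sum(math.log(p) for p in probs)

# Direct multiplication underflows, so both classes collapse to the same value.
naive_a, naive_b = product(probs_a), product(probs_b)

# In log space the two classes remain comparable.
log_a, log_b = log_sum(probs_a), log_sum(probs_b)
```

With direct multiplication both classes end up at 0.0 and cannot be ranked, while the log scores still show that the second class is the more probable one.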

3. Use Other Distributions

To use Naive Bayes with categorical attributes, you calculate the frequency of each attribute value within each class.

To use Naive Bayes with real-valued attributes, you can summarize the density of the attribute using a Gaussian distribution. Alternatively you can use another functional form that better describes the distribution of the data, such as an exponential.

Don’t constrain yourself to the distributions used in examples of the Naive Bayes algorithm. Choose distributions that best characterize your data and prediction problem.
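One possible way to choose between candidate distributions is to fit each one to an attribute and compare the log-likelihood of the data under each fit. The sketch below does this for a Gaussian and an exponential. The data is synthetic (drawn from an exponential on purpose), and the function names are assumptions made for the example rather than part of any library.

```python
import math
import random
import statistics

random.seed(1)
# Hypothetical skewed, non-negative attribute (e.g. waiting times) for one class.
values = [random.expovariate(1.0) for _ in range(500)]

def gaussian_log_likelihood(data):
    """Log-likelihood of the data under a Gaussian fitted by maximum likelihood."""
    mean, stdev = statistics.mean(data), statistics.pstdev(data)
    return sum(
        -0.5 * math.log(2 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2 * stdev ** 2)
        for x in data
    )

def exponential_log_likelihood(data):
    """Log-likelihood of the data under an exponential fitted by maximum likelihood."""
    rate = 1.0 / statistics.mean(data)
    return sum(math.log(rate) - rate * x for x in data)

scores = {
    "gaussian": gaussian_log_likelihood(values),
    "exponential": exponential_log_likelihood(values),
}
best = max(scores, key=scores.get)
```

Here the exponential fit scores higher, as you would expect for this data. On a real problem it is safer to compare the fits on held-out data, and to judge the final choice by the predictive performance of the whole model.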

4. Use Probabilities For Feature Selection

Feature selection is the selection of those data attributes that best characterize a predicted variable.

In Naive Bayes, the probabilities for each attribute are calculated independently from the training dataset. You can use a search algorithm to explore the combination of the probabilities of different attributes together and evaluate their performance at predicting the output variable.
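A greedy forward search is one simple search algorithm you could use here. The sketch below is only an illustration under several assumptions: the data is synthetic (attribute 0 is informative, attributes 1 and 2 are noise), all attributes are binary, accuracy on a validation set is the evaluation measure, and add-one smoothing is used to keep the logs finite. All function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

random.seed(7)

def make_row():
    """Synthetic row: attribute 0 matches the label 90% of the time, 1 and 2 are noise."""
    label = random.choice([0, 1])
    informative = label if random.random() < 0.9 else 1 - label
    return [informative, random.choice([0, 1]), random.choice([0, 1])], label

train = [make_row() for _ in range(400)]
valid = [make_row() for _ in range(200)]

def fit(rows):
    """Count class frequencies and per-attribute value frequencies for each class."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for x, label in rows:
        for i, value in enumerate(x):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict(model, x, features):
    """Predict using only the attribute indices listed in features."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i in features:
            # Add-one smoothing over the two possible values of each attribute.
            score += math.log((value_counts[(i, label)][x[i]] + 1) / (count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, rows, features):
    return sum(predict(model, x, features) == y for x, y in rows) / len(rows)

model = fit(train)  # probabilities are computed once; the search only recombines them
selected, remaining, best_acc = [], [0, 1, 2], 0.0
while remaining:
    candidate = max(remaining, key=lambda i: accuracy(model, valid, selected + [i]))
    candidate_acc = accuracy(model, valid, selected + [candidate])
    if candidate_acc <= best_acc:
        break  # no remaining attribute improves validation accuracy
    selected.append(candidate)
    remaining.remove(candidate)
    best_acc = candidate_acc
```

Because the per-attribute probabilities are independent, the model is fitted once and each candidate subset is evaluated simply by including or leaving out the corresponding terms.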

5. Segment The Data

Is there a well-defined subset of your data that responds well to the Naive Bayes probabilistic approach?

Identifying and separating out segments that are easily handled by a simple probabilistic approach like Naive Bayes can give you increased performance and let you focus on the elements of the problem that are more difficult to model.

Explore different subsets, such as the average or popular cases that are very likely handled well by Naive Bayes.

6. Re-compute Probabilities

Calculating the probabilities for each attribute is very fast.

This benefit of Naive Bayes means that you can re-calculate the probabilities as the data changes. This may be monthly, daily, even hourly.

This is something that may be unthinkable for other algorithms, but should be tested when using Naive Bayes if there is some temporal drift in the problem being modeled.
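For categorical attributes, one way to make this cheap is to store the raw counts and derive the probabilities from them on demand, so that a new batch of data only increments the counts. This is a sketch of that idea, not the only option (re-fitting from scratch on a recent window of data is just as valid). The CountModel class and the toy weather data are invented for the example.

```python
from collections import Counter, defaultdict

class CountModel:
    """Keeps raw counts so probabilities can be refreshed whenever new data arrives."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts

    def update(self, rows):
        """Fold a new batch of (attributes, label) pairs into the counts."""
        for x, label in rows:
            self.class_counts[label] += 1
            for i, value in enumerate(x):
                self.value_counts[(i, label)][value] += 1

    def conditional(self, i, value, label):
        """P(attribute i == value | label), computed from the current counts."""
        return self.value_counts[(i, label)][value] / self.class_counts[label]

model = CountModel()
model.update([(["sunny"], "play"), (["rainy"], "stay"), (["sunny"], "play")])
before = model.conditional(0, "rainy", "play")  # 0 of 2 "play" rows were rainy

model.update([(["rainy"], "play"), (["rainy"], "play")])  # e.g. the next day's data
after = model.conditional(0, "rainy", "play")  # now 2 of 4 "play" rows are rainy
```

If the problem drifts over time, you may also want to discard or down-weight old counts rather than accumulate them forever.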

7. Use as a Generative Model

The Naive Bayes method characterizes the problem, which in turn can be used for making predictions about unseen data.

This probabilistic characterization can also be used to generate instances of the problem.

In the case of a numeric vector, the probability distributions can be sampled to create new fictitious vectors.

In the case of text (a very popular application of Naive Bayes), the model can be used to create fictitious input documents.

How might this be useful in your problem?

At the very least you can use the generative approach to help provide context for what the model has characterized.
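For the numeric case, generating an instance could look like the sketch below: draw a class from the class priors, then draw each attribute independently from that class's distribution. The priors and the (mean, standard deviation) pairs are hypothetical values standing in for a fitted Gaussian model.

```python
import random

random.seed(3)

# Hypothetical fitted model: class priors and (mean, stdev) for each attribute.
priors = {"a": 0.7, "b": 0.3}
summaries = {
    "a": [(1.0, 0.5), (10.0, 2.0)],
    "b": [(3.0, 0.5), (20.0, 2.0)],
}

def sample_instance():
    """Draw a class from the priors, then each attribute independently given the class."""
    label = random.choices(list(priors), weights=list(priors.values()))[0]
    row = [random.gauss(mean, stdev) for mean, stdev in summaries[label]]
    return row, label

generated = [sample_instance() for _ in range(1000)]
```

Inspecting or plotting the generated rows next to the real data is a quick way to see what the model has and has not captured. Any correlation between attributes within a class, for example, will be absent from the generated data.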

8. Remove Redundant Features

The performance of Naive Bayes can degrade if the data contains highly correlated features.

This is because highly correlated features are effectively counted twice in the model, over-inflating their importance.

Evaluate the correlation of attributes pairwise with each other using a correlation matrix and remove one feature from each of the most highly correlated pairs.

Nevertheless, always test your problem before and after such a change and stick with the form of the problem that leads to the better results.
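As a small worked example of the pairwise check, the sketch below computes Pearson correlations between numeric columns and flags one feature from each highly correlated pair. The column names, the data and the 0.95 threshold are all assumptions made for the illustration; pick a threshold that suits your problem.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: height in cm and in inches carry the same information.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40.0, 23.0, 35.0, 51.0, 29.0],
}

def redundant_features(columns, threshold=0.95):
    """Return the later-listed feature of each pair whose |correlation| exceeds threshold."""
    names = list(columns)
    to_drop = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first in to_drop or second in to_drop:
                continue  # skip features that are already marked for removal
            if abs(pearson(columns[first], columns[second])) > threshold:
                to_drop.add(second)
    return to_drop

drop = redundant_features(columns)
```

In this toy data only height_in is flagged, because it is a rescaled copy of height_cm. Which member of a pair you keep is a judgment call.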

9. Parallelize Probability Calculation

The probabilities for each attribute are calculated independently. This is the independence assumption in the approach (attributes are treated as independent of one another given the class) and the reason why it has its name “naive”.

You can exploit this assumption to further speed up the execution of the algorithm by calculating attribute probabilities in parallel.

Depending on the size of the dataset and your resources, you could do this using different CPUs, different machines or different clusters.
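On a single machine, a minimal sketch of this might use Python's standard concurrent.futures module to summarize each attribute in a separate worker. The rows and the summarize helper are hypothetical. A thread pool is used here only to keep the example simple; for CPU-bound work on large columns, a process pool (ProcessPoolExecutor from the same module) is likely the better fit.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

# Hypothetical numeric training rows for a single class; each column is one attribute.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

def summarize(column):
    """Mean and sample standard deviation of one attribute."""
    return statistics.mean(column), statistics.stdev(column)

columns = list(zip(*rows))  # transpose the rows into per-attribute columns

# Each attribute is summarized on its own, so the work can be mapped over a pool.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(summarize, columns))

# The sequential version gives the same result, one attribute at a time.
sequential = [summarize(column) for column in columns]
```

For small datasets the overhead of the pool will outweigh any gain, so it is worth timing both versions on your own data.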

10. Less Data Than You Think

Naive Bayes does not need a lot of data to perform well.

It needs enough data to understand the probabilistic relationship of each attribute in isolation with the output variable.

Given that interactions between attributes are ignored in the model, we do not need examples of these interactions and therefore generally need less data than other algorithms, such as logistic regression.

Further, it is less likely to overfit the training data with a smaller sample size.

Try Naive Bayes if you do not have much training data.

11. Zero Observations Problem

Naive Bayes will not be reliable if the attribute distributions of new data differ significantly from those of the training dataset.

An important example of this is the case where a categorical attribute has a value that was not observed in training. In this case, the model will assign a 0 probability and be unable to make a prediction.

These cases should be checked for and handled differently. After such cases have been resolved (an answer is known), the probabilities should be recalculated and the model updated.
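One widely used way to handle such cases, and the one a reader mentions in the comments below, is Laplace (add-one) smoothing: a small pseudo-count is added to every possible value so that an unseen value receives a small non-zero probability instead of 0. The sketch below assumes you know the full set of possible values in advance; the toy data and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical training values of one categorical attribute for one class.
observed = ["red", "red", "blue"]
vocabulary = {"red", "blue", "green"}  # "green" never appears in training
counts = Counter(observed)

def unsmoothed(value):
    """Plain relative frequency; unseen values get probability 0."""
    return counts[value] / len(observed)

def smoothed(value, alpha=1.0):
    """Add alpha pseudo-counts to every value so unseen values get a non-zero probability."""
    return (counts[value] + alpha) / (len(observed) + alpha * len(vocabulary))
```

Here unsmoothed("green") is 0, which would zero out the whole class score, while smoothed("green") is 1/6 and the smoothed probabilities still sum to 1 over the three values.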

12. It Works Anyway

An interesting point about Naive Bayes is that even when the independence assumption is violated and there are clear known relationships between attributes, it works anyway.

Importantly, this is one of the reasons why you need to spot check a variety of algorithms on a given problem, because the results can very likely surprise you.

Summary

In this post you learned a lot about how to use and get more out of the Naive Bayes algorithm.

Do you have some tricks and tips for using Naive Bayes not covered in this post? Leave a comment.



34 Responses to Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

  1. Vaibhav August 28, 2015 at 3:54 am #

    Great post! Very informative. Thanks a lot. 🙂

  2. Wade January 21, 2016 at 3:36 am #

    Fantastic! Thank you

  3. Virgil July 27, 2016 at 11:47 pm #

    Do you have any suggestions to handle the zero-observation problem? Currently I am using Laplacian correction

  4. Matt Birdsall August 30, 2016 at 4:07 am #

    Small thing. I think you meant “join[t]” probabilities in section 2.

    • Jason Brownlee August 30, 2016 at 8:28 am #

      Thanks Matt, fixed.

      • Ayele Sankura May 7, 2018 at 5:59 pm #

        Hello Dr. Jason,
        Thank you very much for your contribution for computing environment,
        I’m following your information every day.
        I have one unclear idea concerning naive Bayes, I used to train and test the algorithm with the same data-set but the accuracy is not 100%. why it doesn’t so?

        The algorithm is implemented with python with help of your posts, and I used to train the algorithm with SEER cancer data after careful preparation; again I used for the testing the same file with the separate calling, the accuracy yet not 100%.
        Thank you!

  5. Matthew Teow February 16, 2017 at 4:56 pm #

    Oh! I almost miss this, Thank God, found it now 🙂

  6. Joseph Woolf April 10, 2017 at 1:09 pm #

    Thanks for the explanation on how to improve on Naive Bayes Jason. I read through your posts on Naive Bayes, but I’m not entirely sure on the disadvantages of using the algorithm. I searched around on possible disadvantages, such as poor estimator, works poorly with highly correlated features, etc., but is it possible that features with similar means and standard deviations could cause the algorithm to perform poorly?

    • Jason Brownlee April 11, 2017 at 9:31 am #

      Hi Joseph,

      The main limitation is that the algorithm does not capture the joint distributions of input variables. That is, any interesting and useful interactions between input features and their contribution to the output variable. The so-called independence assumption of naive bayes.

      You could contrive a dataset where the joint distribution of two contrived variables is needed to make accurate predictions and show when naive bayes falls down.

      I hope that helps.

      • Joseph Woolf May 7, 2017 at 6:21 am #

        I apologize for the late reply. Thank you for the explanation. That did help.

  7. johnson August 10, 2017 at 4:02 pm #

    Hi Jason:
    I get stuck with a problem when doing text classifications using naive bayes:
    There are more than 3000 samples in trainset and more than 750 samples in testset.
    And the samples should be classified into 95 categories. And i got 39% accuracy finally.

    How can i improve the accuracy. Increase training samples? or decrease categories?

    Thanks!

    • Jason Brownlee August 10, 2017 at 4:45 pm #

      With so many categories, I expect accuracy does not mean anything any more johnson.

      Consider log loss or AUC instead?

      • jhonson August 10, 2017 at 5:51 pm #

        Thanks for your reply.
        I searched google, someone suggests that use so called “one-against-many” scheme.
        I tried, but when categories grows, the accuracy declines very quickly.

        Can you give an example of log loss or AUC?
        Thanks!

        one-against-many: you begin with a two-class classifier (Class A and ‘all else’) then the results in the ‘all else’ class are returned to the algorithm for classification into Class B and ‘all else’, etc.

        • Jason Brownlee August 11, 2017 at 6:37 am #

          A one-vs-all or similar is the structure of the model, not the performance measure.

          If you are using Python, then sklearn offers implementations of a suite of metrics:
          http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics

          • johnson August 11, 2017 at 11:21 am #

            Thanks.
            I use python, and i tried metrics. I changed the number of category to 30, and got
            this:

            class      precision  recall  f1-score  support

            25              0.48    1.00      0.65      372
            26              0.00    0.00      0.00       22
            27              0.00    0.00      0.00        1
            20              0.00    0.00      0.00        4
            21              0.00    0.00      0.00        4
            22              0.00    0.00      0.00       13
            23              0.00    0.00      0.00       60
            28              0.00    0.00      0.00       21
            29              0.00    0.00      0.00       39
            1               0.00    0.00      0.00        7
            0               0.00    0.00      0.00       17
            3               0.00    0.00      0.00       25
            2               0.00    0.00      0.00       15
            5               0.00    0.00      0.00        7
            4               0.00    0.00      0.00       13
            7               0.00    0.00      0.00        2
            6               0.00    0.00      0.00        6
            9               0.00    0.00      0.00        2
            8               0.00    0.00      0.00       31
            11              0.00    0.00      0.00       29
            10              0.00    0.00      0.00       20
            13              0.60    0.60      0.60        5
            12              0.00    0.00      0.00        4
            15              0.00    0.00      0.00        7
            14              0.00    0.00      0.00       11
            17              0.00    0.00      0.00       24
            16              0.00    0.00      0.00        9
            19              0.00    0.00      0.00       15
            18              0.00    0.00      0.00        1

            avg / total     0.23    0.48      0.31      786

          • Jason Brownlee August 12, 2017 at 6:45 am #

            Hang in there!

      • johnson August 10, 2017 at 6:19 pm #

        why my comment always gets lost?

        Thanks jason.

        use log loss or AUC in naive bayes?

        Could you give me an example for it?

        Thanks a lot!

  8. Ravi January 12, 2018 at 3:17 pm #

    Awesome, to make the result more accurate. Thanks for putting all the loophole of Naive Bayes at a single place.

  9. Vasanthi January 28, 2018 at 4:24 am #

    It was tough initially. Now I got a clear pic of Naive Bayes. Can you give some practical example of Remove redundant feature.
    Thank you .

  10. Eliza March 22, 2018 at 11:06 pm #

    Hello Mr.Jason
    If I remove one feature that has the same value of another feature should I retrain the model ?

  11. Ravi August 8, 2018 at 8:28 am #

    Great , It helped a lot.

  12. prem August 13, 2018 at 7:02 pm #

    Hi,

    what could be the alpha value range in Naive Bayes algorithm for smoothing? Why can’t we apply Naive Bayes on negative values?

  13. Ravi Gurnatham August 21, 2018 at 2:10 pm #

    Hello Dr. Jason,

    Can I apply log-loss metric on naive bayes model performance while using log-probabilities because if we use log-probabilities then it gives real values but log-loss expects values from [0,1].

    • Jason Brownlee August 21, 2018 at 2:18 pm #

      I don’t follow, what is the problem you are having exactly?

  14. Mudi September 7, 2018 at 1:49 am #

    Hi,
    The model has feature vectors with both label 0 and 1. I want to use Naive Bayes to predict a feature given positive labels, say 0. How to do that?
