The post Naive Bayes Classifier From Scratch in Python appeared first on Machine Learning Mastery.

In this tutorial you are going to learn about the **Naive Bayes algorithm** including how it works and how to implement it from scratch in Python (without libraries).

We can use probability to make predictions in machine learning. Perhaps the most widely used example is called the Naive Bayes algorithm. Not only is it straightforward to understand, but it also achieves surprisingly good results on a wide range of problems.

After completing this tutorial you will know:

- How to calculate the probabilities required by the Naive Bayes algorithm.
- How to implement the Naive Bayes algorithm from scratch.
- How to apply Naive Bayes to a real-world predictive modeling problem.


Let’s get started.

- **Update Dec/2014**: Original implementation.
- **Update Oct/2019**: Rewrote the tutorial and code from the ground-up.

This section provides a brief overview of the Naive Bayes algorithm and the Iris flowers dataset that we will use in this tutorial.

Bayes’ Theorem provides a way that we can calculate the probability of a piece of data belonging to a given class, given our prior knowledge. Bayes’ Theorem is stated as:

- P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data.
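To make the formula concrete, here is a minimal worked sketch. The three input probabilities are made-up numbers chosen only to illustrate the arithmetic, and the helper name `posterior()` is hypothetical (it is not part of the tutorial code).

```python
# Worked example of Bayes' Theorem with assumed (made-up) probabilities.
def posterior(likelihood, prior, evidence):
    """Return P(class|data) = P(data|class) * P(class) / P(data)."""
    return likelihood * prior / evidence

p_data_given_class = 0.8   # P(data|class), assumed for illustration
p_class = 0.3              # P(class), assumed for illustration
p_data_given_other = 0.1   # P(data|not class), assumed for illustration

# P(data) from the law of total probability over the two possible classes
p_data = p_data_given_class * p_class + p_data_given_other * (1 - p_class)

p_class_given_data = posterior(p_data_given_class, p_class, p_data)
print(round(p_class_given_data, 4))  # 0.7742
```

Even though the class is fairly rare here (prior of 0.3), the data is much more likely under it, so the posterior ends up well above the prior.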

For an in-depth introduction to Bayes Theorem, see the tutorial:

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make their calculations tractable.

Rather than attempting to calculate the joint probability of every combination of attribute values, each attribute is assumed to be conditionally independent of the others given the class value.

This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

For an in-depth introduction to Naive Bayes, see the tutorial:

In this tutorial we will use the Iris Flower Species Dataset.

The Iris Flower Dataset involves predicting the flower species given measurements of iris flowers.

It is a multiclass classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

- Sepal length in cm.
- Sepal width in cm.
- Petal length in cm.
- Petal width in cm.
- Class

A sample of the first 5 rows is listed below.

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    ...

The baseline performance on the problem is approximately 33%.
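The 33% figure can be reproduced with a majority-class baseline, sketched below. The helper name `majority_class_accuracy()` is hypothetical, and the label list is built by hand on the assumption that the 150 balanced observations split into 50 rows for each of three species.

```python
from collections import Counter

# Accuracy (in percent) of always predicting the most common class.
def majority_class_accuracy(labels):
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels) * 100.0

# Assumed class balance: 50 rows per species, 150 rows in total
labels = ['Iris-setosa'] * 50 + ['Iris-versicolor'] * 50 + ['Iris-virginica'] * 50
print('%.1f%%' % majority_class_accuracy(labels))  # 33.3%
```

Any model worth keeping should beat this number comfortably.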

Download the dataset and save it into your current working directory with the filename *iris.csv*.

First we will develop each piece of the algorithm in this section, then we will tie all of the elements together into a working implementation applied to a real dataset in the next section.

This Naive Bayes tutorial is broken down into 5 parts:

- Step 1: Separate By Class.
- Step 2: Summarize Dataset.
- Step 3: Summarize Data By Class.
- Step 4: Gaussian Probability Density Function.
- Step 5: Class Probabilities.

These steps will provide the foundation that you need to implement Naive Bayes from scratch and apply it to your own predictive modeling problems.

**Note**: This tutorial assumes that you are using **Python 3**. If you need help installing Python, see this tutorial:

**Note**: if you are using **Python 2.7**, you must change all calls to the *items()* function on dictionary objects to *iteritems()*.

We will need to calculate the probability of data by the class they belong to, the so-called base rate.

This means that we will first need to separate our training data by class. A relatively straightforward operation.

We can create a dictionary object where each key is the class value and then add a list of all the records as the value in the dictionary.

Below is a function named *separate_by_class()* that implements this approach. It assumes that the last column in each row is the class value.

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated
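For comparison, the same grouping can be written more compactly with `collections.defaultdict` from the standard library. This is only an alternative sketch, not the version used in the rest of the tutorial; the name `separate_by_class_dd()` and the three-row dataset are made up for illustration.

```python
from collections import defaultdict

# Group rows by their last column (the class value) using a defaultdict
def separate_by_class_dd(dataset):
    separated = defaultdict(list)
    for row in dataset:
        # a missing key is created automatically with an empty list
        separated[row[-1]].append(row)
    return dict(separated)

rows = [[1.0, 2.0, 0], [3.0, 4.0, 1], [5.0, 6.0, 0]]
groups = separate_by_class_dd(rows)
print(groups)  # {0: [[1.0, 2.0, 0], [5.0, 6.0, 0]], 1: [[3.0, 4.0, 1]]}
```

The explicit `if` check in *separate_by_class()* does the same job and has no imports, which fits the from-scratch spirit of the tutorial.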

We can contrive a small dataset to test out this function.

    X1              X2              Y
    3.393533211     2.331273381     0
    3.110073483     1.781539638     0
    1.343808831     3.368360954     0
    3.582294042     4.67917911      0
    2.280362439     2.866990263     0
    7.423436942     4.696522875     1
    5.745051997     3.533989803     1
    9.172168622     2.511101045     1
    7.792783481     3.424088941     1
    7.939820817     0.791637231     1

We can plot this dataset and use separate colors for each class.

Putting this all together, we can test our *separate_by_class()* function on the contrived dataset.

    # Example of separating data by class value

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated

    # Test separating data by class
    dataset = [[3.393533211,2.331273381,0],
        [3.110073483,1.781539638,0],
        [1.343808831,3.368360954,0],
        [3.582294042,4.67917911,0],
        [2.280362439,2.866990263,0],
        [7.423436942,4.696522875,1],
        [5.745051997,3.533989803,1],
        [9.172168622,2.511101045,1],
        [7.792783481,3.424088941,1],
        [7.939820817,0.791637231,1]]
    separated = separate_by_class(dataset)
    for label in separated:
        print(label)
        for row in separated[label]:
            print(row)

Running the example sorts observations in the dataset by their class value, then prints the class value followed by all identified records.

    0
    [3.393533211, 2.331273381, 0]
    [3.110073483, 1.781539638, 0]
    [1.343808831, 3.368360954, 0]
    [3.582294042, 4.67917911, 0]
    [2.280362439, 2.866990263, 0]
    1
    [7.423436942, 4.696522875, 1]
    [5.745051997, 3.533989803, 1]
    [9.172168622, 2.511101045, 1]
    [7.792783481, 3.424088941, 1]
    [7.939820817, 0.791637231, 1]

Next we can start to develop the functions needed to collect statistics.

We need two statistics from a given set of data.

We’ll see how these statistics are used in the calculation of probabilities in a few steps. The two statistics we require from a given dataset are the mean and the standard deviation (average deviation from the mean).

The mean is the average value and can be calculated as:

- mean = sum(x) / count(x)

Where *x* is the list of values or a column we are looking at.

Below is a small function named *mean()* that calculates the mean of a list of numbers.

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

The sample standard deviation is calculated as the mean difference from the mean value. This can be calculated as:

- standard deviation = sqrt((sum i to N (x_i – mean(x))^2) / (N-1))

You can see that we square the difference between the mean and a given value, calculate the average squared difference from the mean, then take the square root to return the units back to their original value.

Below is a small function named *stdev()* that calculates the standard deviation of a list of numbers. You will notice that it calculates the mean. It might be more efficient to calculate the mean of a list of numbers once and pass it to the *stdev()* function as a parameter. You can explore this optimization later if you’re interested.

    from math import sqrt

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)
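One possible shape for the optimization mentioned above is sketched here: the mean becomes an optional argument, so it can be computed once and reused. The `avg=None` parameter is an assumption of this sketch and is not used elsewhere in the tutorial. The result is also cross-checked against `statistics.mean()` and `statistics.stdev()` from the standard library, which likewise uses the N-1 (sample) denominator.

```python
from math import sqrt
import statistics

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Sample standard deviation; accepts a precomputed mean to avoid recomputing it
def stdev(numbers, avg=None):
    if avg is None:
        avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / float(len(numbers) - 1)
    return sqrt(variance)

# X1 values of the class 0 rows from the contrived dataset
x1 = [3.393533211, 3.110073483, 1.343808831, 3.582294042, 2.280362439]
avg = mean(x1)
print(avg, stdev(x1, avg))
print(statistics.mean(x1), statistics.stdev(x1))
```

Both lines should agree up to floating point rounding, which is a quick sanity check that the hand-written functions implement the sample statistics.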

We require the mean and standard deviation statistics to be calculated for each input attribute or each column of our data.

We can do that by gathering all of the values for each column into a list and calculating the mean and standard deviation on that list. Once calculated, we can gather the statistics together into a list or tuple of statistics. Then, repeat this operation for each column in the dataset and return a list of tuples of statistics.

Below is a function named *summarize_dataset()* that implements this approach. It uses some Python tricks to cut down on the number of lines required.

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

The first trick is the use of the zip() function that will aggregate elements from each provided argument. We pass in the dataset to the *zip()* function with the * operator that separates the dataset (that is a list of lists) into separate lists for each row. The *zip()* function then iterates over each element of each row and returns a column from the dataset as a list of numbers. A clever little trick.

We then calculate the mean, standard deviation and count of rows in each column. A tuple is created from these 3 numbers and a list of these tuples is stored. We then remove the statistics for the class variable as we will not need these statistics.
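The *zip()* trick described above is easier to see on a tiny example. The three-row dataset below is made up for illustration. Note that each column comes back as a tuple, which works just as well as a list for *mean()* and *stdev()*.

```python
# Transpose rows into columns with zip(*rows)
rows = [[1.0, 10.0, 0],
        [2.0, 20.0, 1],
        [3.0, 30.0, 0]]

# zip(*rows) is equivalent to zip(rows[0], rows[1], rows[2])
columns = list(zip(*rows))
print(columns)  # [(1.0, 2.0, 3.0), (10.0, 20.0, 30.0), (0, 1, 0)]

# The last column holds the class values, which is why its summary is dropped
print(columns[-1])  # (0, 1, 0)
```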

Let’s test all of these functions on our contrived dataset from above. Below is the complete example.

    # Example of summarizing a dataset
    from math import sqrt

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

    # Test summarizing a dataset
    dataset = [[3.393533211,2.331273381,0],
        [3.110073483,1.781539638,0],
        [1.343808831,3.368360954,0],
        [3.582294042,4.67917911,0],
        [2.280362439,2.866990263,0],
        [7.423436942,4.696522875,1],
        [5.745051997,3.533989803,1],
        [9.172168622,2.511101045,1],
        [7.792783481,3.424088941,1],
        [7.939820817,0.791637231,1]]
    summary = summarize_dataset(dataset)
    print(summary)

Running the example prints out the list of tuples of statistics on each of the two input variables.

Interpreting the results, we can see that the mean value of X1 is 5.178333386499999 and the standard deviation of X1 is 2.7665845055177263.

[(5.178333386499999, 2.7665845055177263, 10), (2.9984683241, 1.218556343617447, 10)]

Now we are ready to use these functions on each group of rows in our dataset.

We require statistics from our training dataset organized by class.

Above, we have developed the *separate_by_class()* function to separate a dataset into rows by class. And we have developed *summarize_dataset()* function to calculate summary statistics for each column.

We can put all of this together and summarize the columns in the dataset organized by class values.

Below is a function named *summarize_by_class()* that implements this operation. The dataset is first split by class, then statistics are calculated on each subset. The results in the form of a list of tuples of statistics are then stored in a dictionary by their class value.

    # Split dataset by class then calculate statistics for each row
    def summarize_by_class(dataset):
        separated = separate_by_class(dataset)
        summaries = dict()
        for class_value, rows in separated.items():
            summaries[class_value] = summarize_dataset(rows)
        return summaries

Again, let’s test out all of these behaviors on our contrived dataset.

    # Example of summarizing data by class value
    from math import sqrt

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

    # Split dataset by class then calculate statistics for each row
    def summarize_by_class(dataset):
        separated = separate_by_class(dataset)
        summaries = dict()
        for class_value, rows in separated.items():
            summaries[class_value] = summarize_dataset(rows)
        return summaries

    # Test summarizing by class
    dataset = [[3.393533211,2.331273381,0],
        [3.110073483,1.781539638,0],
        [1.343808831,3.368360954,0],
        [3.582294042,4.67917911,0],
        [2.280362439,2.866990263,0],
        [7.423436942,4.696522875,1],
        [5.745051997,3.533989803,1],
        [9.172168622,2.511101045,1],
        [7.792783481,3.424088941,1],
        [7.939820817,0.791637231,1]]
    summary = summarize_by_class(dataset)
    for label in summary:
        print(label)
        for row in summary[label]:
            print(row)

Running this example calculates the statistics for each input variable and prints them organized by class value. Interpreting the results, we can see that the X1 values for rows for class 0 have a mean value of 2.7420144012.

    0
    (2.7420144012, 0.9265683289298018, 5)
    (3.0054686692, 1.1073295894898725, 5)
    1
    (7.6146523718, 1.2344321550313704, 5)
    (2.9914679790000003, 1.4541931384601618, 5)

There is one more piece we need before we start calculating probabilities.

Calculating the probability or likelihood of observing a given real-value like X1 is difficult.

One way we can do this is to assume that X1 values are drawn from a distribution, such as a bell curve or Gaussian distribution.

A Gaussian distribution can be summarized using only two numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate the probability of a given value. This piece of math is called a Gaussian Probability Density Function (or Gaussian PDF) and can be calculated as:

- f(x) = (1 / (sqrt(2 * PI) * sigma)) * exp(-((x-mean)^2 / (2 * sigma^2)))

Where *sigma* is the standard deviation for *x*, *mean* is the mean for *x* and *PI* is the value of pi.

Below is a function that implements this. I tried to split it up to make it more readable.

    # Calculate the Gaussian probability distribution function for x
    def calculate_probability(x, mean, stdev):
        exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (sqrt(2 * pi) * stdev)) * exponent

Let’s test it out to see how it works. Below are some worked examples.

    # Example of Gaussian PDF
    from math import sqrt
    from math import pi
    from math import exp

    # Calculate the Gaussian probability distribution function for x
    def calculate_probability(x, mean, stdev):
        exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (sqrt(2 * pi) * stdev)) * exponent

    # Test Gaussian PDF
    print(calculate_probability(1.0, 1.0, 1.0))
    print(calculate_probability(2.0, 1.0, 1.0))
    print(calculate_probability(0.0, 1.0, 1.0))

Running it prints the probability of some input values. You can see that when the value is 1 and the mean and standard deviation are both 1, our input is the most likely (top of the bell curve) and has a probability of about 0.399.

We can see that when we keep the statistics the same and change the x value to 1 standard deviation either side of the mean value (2 and 0 or the same distance either side of the bell curve) the probabilities of those input values are the same at 0.24.

    0.3989422804014327
    0.24197072451914337
    0.24197072451914337
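As a sanity check, the hand-written function can be compared with `statistics.NormalDist` from the standard library (this assumes Python 3.8 or later, where `NormalDist` was added). The sketch also shows that the returned value is a density rather than a true probability: with a small standard deviation it can be larger than 1.

```python
from math import sqrt, pi, exp
from statistics import NormalDist

# Gaussian probability density function, as defined in the tutorial
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

# Compare against the standard library for the three test values
reference = NormalDist(mu=1.0, sigma=1.0)
for x in (1.0, 2.0, 0.0):
    print(x, calculate_probability(x, 1.0, 1.0), reference.pdf(x))

# A narrow bell curve has a peak density above 1
narrow_peak = calculate_probability(0.0, 0.0, 0.1)
print(narrow_peak)
```

This does not matter for classification, since only the relative size of the values across classes is used.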

Now that we have all the pieces in place, let’s see how we can calculate the probabilities we need for the Naive Bayes classifier.

Now it is time to use the statistics calculated from our training data to calculate probabilities for new data.

Probabilities are calculated separately for each class. This means that we first calculate the probability that a new piece of data belongs to the first class, then calculate probabilities that it belongs to the second class, and so on for all the classes.

The probability that a piece of data belongs to a class is calculated as follows:

- P(class|data) = P(data|class) * P(class)

You may note that this is different from the Bayes Theorem described above.

The division has been removed to simplify the calculation.

This means that the result is no longer strictly a probability of the data belonging to a class. The value is still maximized, meaning that the calculation for the class that results in the largest value is taken as the prediction. This is a common implementation simplification as we are often more interested in the class prediction rather than the probability.

The input variables are treated separately, giving the technique its name “*naive*”. For the above example where we have 2 input variables, the calculation of the probability that a row belongs to the first class 0 can be calculated as:

- P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

Now you can see why we need to separate the data by class value. The Gaussian Probability Density function in the previous step is how we calculate the probability of a real value like X1 and the statistics we prepared are used in this calculation.

Below is a function named *calculate_class_probabilities()* that ties all of this together.

It takes a set of prepared summaries and a new row as input arguments.

First the total number of training records is calculated from the counts stored in the summary statistics. This is used in the calculation of the probability of a given class or *P(class)* as the ratio of rows with a given class of all rows in the training data.

Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and of that class. Probabilities are multiplied together as they are accumulated.

This process is repeated for each class in the dataset.

Finally a dictionary of probabilities is returned with one entry for each class.

    # Calculate the probabilities of predicting each class for a given row
    def calculate_class_probabilities(summaries, row):
        total_rows = sum([summaries[label][0][2] for label in summaries])
        probabilities = dict()
        for class_value, class_summaries in summaries.items():
            probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
            for i in range(len(class_summaries)):
                mean, stdev, count = class_summaries[i]
                probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
        return probabilities

Let’s tie this together with an example on the contrived dataset.

The example below first calculates the summary statistics by class for the training dataset, then uses these statistics to calculate the probability of the first record belonging to each class.

    # Example of calculating class probabilities
    from math import sqrt
    from math import pi
    from math import exp

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

    # Split dataset by class then calculate statistics for each row
    def summarize_by_class(dataset):
        separated = separate_by_class(dataset)
        summaries = dict()
        for class_value, rows in separated.items():
            summaries[class_value] = summarize_dataset(rows)
        return summaries

    # Calculate the Gaussian probability distribution function for x
    def calculate_probability(x, mean, stdev):
        exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (sqrt(2 * pi) * stdev)) * exponent

    # Calculate the probabilities of predicting each class for a given row
    def calculate_class_probabilities(summaries, row):
        total_rows = sum([summaries[label][0][2] for label in summaries])
        probabilities = dict()
        for class_value, class_summaries in summaries.items():
            probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
            for i in range(len(class_summaries)):
                mean, stdev, _ = class_summaries[i]
                probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
        return probabilities

    # Test calculating class probabilities
    dataset = [[3.393533211,2.331273381,0],
        [3.110073483,1.781539638,0],
        [1.343808831,3.368360954,0],
        [3.582294042,4.67917911,0],
        [2.280362439,2.866990263,0],
        [7.423436942,4.696522875,1],
        [5.745051997,3.533989803,1],
        [9.172168622,2.511101045,1],
        [7.792783481,3.424088941,1],
        [7.939820817,0.791637231,1]]
    summaries = summarize_by_class(dataset)
    probabilities = calculate_class_probabilities(summaries, dataset[0])
    print(probabilities)

Running the example prints the probabilities calculated for each class.

We can see that the probability of the first row belonging to the 0 class (0.0503) is higher than the probability of it belonging to the 1 class (0.0001). We would therefore correctly conclude that it belongs to the 0 class.

{0: 0.05032427673372075, 1: 0.00011557718379945765}
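Because the division by P(data) was dropped, the two numbers above do not sum to 1. If values that behave like probabilities are wanted, one option is to divide each score by the sum of all the scores, which plays the role of P(data) under the model's own assumptions. The helper `normalize()` below is a hypothetical addition, not part of the tutorial code, and the input scores are copied from the output above.

```python
# Rescale unnormalized class scores so that they sum to 1
def normalize(scores):
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}

# Scores printed by the example above
scores = {0: 0.05032427673372075, 1: 0.00011557718379945765}
probabilities = normalize(scores)
print(probabilities)
```

This gives roughly 0.998 for class 0 and 0.002 for class 1. The predicted class is unchanged, since dividing every score by the same positive number preserves their order.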

Now that we have seen how to implement the Naive Bayes algorithm, let’s apply it to the Iris flowers dataset.

This section applies the Naive Bayes algorithm to the Iris flowers dataset.

The first step is to load the dataset and convert the loaded data to numbers that we can use with the mean and standard deviation calculations. For this we will use the helper function *load_csv()* to load the file, *str_column_to_float()* to convert string numbers to floats and *str_column_to_int()* to convert the class column to integer values.

We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 150/5=30 records will be in each fold. We will use the helper functions *evaluate_algorithm()* to evaluate the algorithm with cross-validation and *accuracy_metric()* to calculate the accuracy of predictions.

A new function named *predict()* was developed to manage the calculation of the probabilities of a new row belonging to each class and selecting the class with the largest probability value.
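The selection step at the heart of *predict()* amounts to an argmax over the dictionary of class scores. A minimal sketch of just that step is shown below; the helper name `predict_from_scores()` is hypothetical, and it takes an already computed dictionary rather than the summaries and row used by the tutorial's *predict()*.

```python
# Pick the class label with the largest score
def predict_from_scores(probabilities):
    # max() iterates over the keys and ranks them by their associated value
    return max(probabilities, key=probabilities.get)

# Scores for the first row of the contrived dataset, taken from the earlier example
scores = {0: 0.05032427673372075, 1: 0.00011557718379945765}
print(predict_from_scores(scores))  # 0
```

On ties, `max()` returns the first maximal key it encounters, which matches an explicit loop that only replaces the best label on a strictly larger value.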

Another new function named *naive_bayes()* was developed to manage the application of the Naive Bayes algorithm, first learning the statistics from a training dataset and using them to make predictions for a test dataset.

If you would like more help with the data loading functions used below, see the tutorial:

If you would like more help with the way the model is evaluated using cross validation, see the tutorial:

The complete example is listed below.

    # Naive Bayes On The Iris Dataset
    from csv import reader
    from random import seed
    from random import randrange
    from math import sqrt
    from math import exp
    from math import pi

    # Load a CSV file
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

    # Split a dataset into k folds
    def cross_validation_split(dataset, n_folds):
        dataset_split = list()
        dataset_copy = list(dataset)
        fold_size = int(len(dataset) / n_folds)
        for _ in range(n_folds):
            fold = list()
            while len(fold) < fold_size:
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split

    # Calculate accuracy percentage
    def accuracy_metric(actual, predicted):
        correct = 0
        for i in range(len(actual)):
            if actual[i] == predicted[i]:
                correct += 1
        return correct / float(len(actual)) * 100.0

    # Evaluate an algorithm using a cross validation split
    def evaluate_algorithm(dataset, algorithm, n_folds, *args):
        folds = cross_validation_split(dataset, n_folds)
        scores = list()
        for fold in folds:
            train_set = list(folds)
            train_set.remove(fold)
            train_set = sum(train_set, [])
            test_set = list()
            for row in fold:
                row_copy = list(row)
                test_set.append(row_copy)
                row_copy[-1] = None
            predicted = algorithm(train_set, test_set, *args)
            actual = [row[-1] for row in fold]
            accuracy = accuracy_metric(actual, predicted)
            scores.append(accuracy)
        return scores

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

    # Split dataset by class then calculate statistics for each row
    def summarize_by_class(dataset):
        separated = separate_by_class(dataset)
        summaries = dict()
        for class_value, rows in separated.items():
            summaries[class_value] = summarize_dataset(rows)
        return summaries

    # Calculate the Gaussian probability distribution function for x
    def calculate_probability(x, mean, stdev):
        exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (sqrt(2 * pi) * stdev)) * exponent

    # Calculate the probabilities of predicting each class for a given row
    def calculate_class_probabilities(summaries, row):
        total_rows = sum([summaries[label][0][2] for label in summaries])
        probabilities = dict()
        for class_value, class_summaries in summaries.items():
            probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
            for i in range(len(class_summaries)):
                mean, stdev, _ = class_summaries[i]
                probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
        return probabilities

    # Predict the class for a given row
    def predict(summaries, row):
        probabilities = calculate_class_probabilities(summaries, row)
        best_label, best_prob = None, -1
        for class_value, probability in probabilities.items():
            if best_label is None or probability > best_prob:
                best_prob = probability
                best_label = class_value
        return best_label

    # Naive Bayes Algorithm
    def naive_bayes(train, test):
        summarize = summarize_by_class(train)
        predictions = list()
        for row in test:
            output = predict(summarize, row)
            predictions.append(output)
        return(predictions)

    # Test Naive Bayes on Iris Dataset
    seed(1)
    filename = 'iris.csv'
    dataset = load_csv(filename)
    for i in range(len(dataset[0])-1):
        str_column_to_float(dataset, i)
    # convert class column to integers
    str_column_to_int(dataset, len(dataset[0])-1)
    # evaluate algorithm
    n_folds = 5
    scores = evaluate_algorithm(dataset, naive_bayes, n_folds)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Running the example prints the classification accuracy on each cross-validation fold as well as the mean accuracy score.

We can see that the mean accuracy of about 95% is dramatically better than the baseline accuracy of 33%.

    Scores: [93.33333333333333, 96.66666666666667, 100.0, 93.33333333333333, 93.33333333333333]
    Mean Accuracy: 95.333%

We can fit the model on the entire dataset and then use the model to make predictions for new observations (rows of data).

The model is just the set of summary statistics (mean, standard deviation and count per column, per class) calculated via the *summarize_by_class()* function.

    ...
    # fit model
    model = summarize_by_class(dataset)

Once calculated, we can use them in a call to the predict() function with a row representing our new observation to predict the class label.

    ...
    # predict the label
    label = predict(model, row)

We also might like to know the class label (string) for a prediction. We can update the str_column_to_int() function to print the mapping of string class names to integers so we can interpret the prediction by the model.

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
            print('[%s] => %d' % (value, i))
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

Tying this together, a complete example of fitting the Naive Bayes model on the entire dataset and making a single prediction for a new observation is listed below.

    # Make Predictions with Naive Bayes On The Iris Dataset
    from csv import reader
    from math import sqrt
    from math import exp
    from math import pi

    # Load a CSV file
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
            print('[%s] => %d' % (value, i))
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

    # Split the dataset by class values, returns a dictionary
    def separate_by_class(dataset):
        separated = dict()
        for i in range(len(dataset)):
            vector = dataset[i]
            class_value = vector[-1]
            if (class_value not in separated):
                separated[class_value] = list()
            separated[class_value].append(vector)
        return separated

    # Calculate the mean of a list of numbers
    def mean(numbers):
        return sum(numbers)/float(len(numbers))

    # Calculate the standard deviation of a list of numbers
    def stdev(numbers):
        avg = mean(numbers)
        variance = sum([(x-avg)**2 for x in numbers]) / float(len(numbers)-1)
        return sqrt(variance)

    # Calculate the mean, stdev and count for each column in a dataset
    def summarize_dataset(dataset):
        summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
        del(summaries[-1])
        return summaries

    # Split dataset by class then calculate statistics for each row
    def summarize_by_class(dataset):
        separated = separate_by_class(dataset)
        summaries = dict()
        for class_value, rows in separated.items():
            summaries[class_value] = summarize_dataset(rows)
        return summaries

    # Calculate the Gaussian probability distribution function for x
    def calculate_probability(x, mean, stdev):
        exponent = exp(-((x-mean)**2 / (2 * stdev**2 )))
        return (1 / (sqrt(2 * pi) * stdev)) * exponent

    # Calculate the probabilities of predicting each class for a given row
    def calculate_class_probabilities(summaries, row):
        total_rows = sum([summaries[label][0][2] for label in summaries])
        probabilities = dict()
        for class_value, class_summaries in summaries.items():
            probabilities[class_value] = summaries[class_value][0][2]/float(total_rows)
            for i in range(len(class_summaries)):
                mean, stdev, _ = class_summaries[i]
                probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
        return probabilities

    # Predict the class for a given row
    def predict(summaries, row):
        probabilities = calculate_class_probabilities(summaries, row)
        best_label, best_prob = None, -1
        for class_value, probability in probabilities.items():
            if best_label is None or probability > best_prob:
                best_prob = probability
                best_label = class_value
        return best_label

    # Make a prediction with Naive Bayes on Iris Dataset
    filename = 'iris.csv'
    dataset = load_csv(filename)
    for i in range(len(dataset[0])-1):
        str_column_to_float(dataset, i)
    # convert class column to integers
    str_column_to_int(dataset, len(dataset[0])-1)
    # fit model
    model = summarize_by_class(dataset)
    # define a new record
    row = [5.7,2.9,4.2,1.3]
    # predict the label
    label = predict(model, row)
    print('Data=%s, Predicted: %s' % (row, label))

Running the example first prints the mapping of class labels to integers and then fits the model on the entire dataset.

Then a new observation is defined (in this case I took a row from the dataset), and a predicted label is calculated. In this case our observation is predicted as belonging to class 1, which we know is “Iris-versicolor”.

```
[Iris-virginica] => 0
[Iris-versicolor] => 1
[Iris-setosa] => 2
Data=[5.7, 2.9, 4.2, 1.3], Predicted: 1
```

This section lists extensions to the tutorial that you may wish to explore.

- **Log Probabilities**: The conditional probabilities for each class given an attribute value are small. When they are multiplied together they result in very small values, which can lead to floating point underflow (numbers too small to represent in Python). A common fix for this is to add the log of the probabilities together. Research and implement this improvement.
- **Nominal Attributes**: Update the implementation to support nominal attributes. This is very similar, and the summary information you can collect for each attribute is the ratio of category values for each class. Dive into the references for more information.
- **Different Density Function** (*bernoulli* or *multinomial*): We have looked at Gaussian Naive Bayes, but you can also look at other distributions. Implement a different distribution such as multinomial, bernoulli or kernel naive bayes that make different assumptions about the distribution of attribute values and/or their relationship with the class value.
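As a starting point for the log probabilities extension, the idea can be sketched as follows. This is a minimal sketch and not part of the tutorial implementation: the function names *log_gaussian_pdf()* and *class_log_scores()* and the toy summary values are made up for illustration, and it assumes the same summaries layout produced by *summarize_by_class()*, i.e. a list of (mean, stdev, count) tuples for each class.

```python
# sketch: Naive Bayes scoring with summed log probabilities
from math import log
from math import pi

# Log of the Gaussian probability density function for x
def log_gaussian_pdf(x, mean, stdev):
    return -((x - mean) ** 2) / (2 * stdev ** 2) - log(stdev) - 0.5 * log(2 * pi)

# Sum the log prior and log likelihoods instead of multiplying probabilities
def class_log_scores(summaries, row):
    total_rows = sum(summaries[label][0][2] for label in summaries)
    scores = dict()
    for class_value, class_summaries in summaries.items():
        scores[class_value] = log(class_summaries[0][2] / float(total_rows))
        for i, (mean, stdev, _) in enumerate(class_summaries):
            scores[class_value] += log_gaussian_pdf(row[i], mean, stdev)
    return scores

# hypothetical summaries for two classes and two attributes
summaries = {
    0: [(1.0, 0.5, 10), (2.0, 0.5, 10)],
    1: [(5.0, 0.5, 10), (6.0, 0.5, 10)],
}
row = [1.1, 2.1]
scores = class_log_scores(summaries, row)
best_label = max(scores, key=scores.get)
print('Log scores=%s, Predicted: %s' % (scores, best_label))
```

Because the log is a monotonic function, the class with the largest summed log score is the same class that would have the largest product of probabilities.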

If you try any of these extensions, let me know in the comments below.

- A Gentle Introduction to Bayes Theorem for Machine Learning
- How to Develop a Naive Bayes Classifier from Scratch in Python
- Naive Bayes Tutorial for Machine Learning
- Naive Bayes for Machine Learning
- Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

- Section 13.6 Naive Bayes, page 353, Applied Predictive Modeling, 2013.
- Section 4.2, Statistical modeling, page 88, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, 2005.

In this tutorial you discovered how to implement the Naive Bayes algorithm from scratch in Python.

Specifically, you learned:

- How to calculate the probabilities required by the Naive interpretation of Bayes Theorem.
- How to use probabilities to make predictions on new data.
- How to apply Naive Bayes to a real-world predictive modeling problem.

Take action!

- Follow the tutorial and implement Naive Bayes from scratch.
- Adapt the example to another dataset.
- Follow the extensions and improve upon the implementation.

Leave a comment and share your experiences.


The post How to Calculate the Divergence Between Probability Distributions appeared first on Machine Learning Mastery.

It is often desirable to quantify the difference between probability distributions for a given random variable.

This occurs frequently in machine learning, when we may be interested in calculating the difference between an actual and observed probability distribution.

This can be achieved using techniques from information theory, such as the Kullback-Leibler Divergence (KL divergence), or relative entropy, and the Jensen-Shannon Divergence that provides a normalized and symmetrical version of the KL divergence. These scoring methods can be used as shortcuts in the calculation of other widely used methods, such as mutual information for feature selection prior to modeling, and cross-entropy used as a loss function for many different classifier models.

In this post, you will discover how to calculate the divergence between probability distributions.

After reading this post, you will know:

- Statistical distance is the general idea of calculating the difference between statistical objects like different probability distributions for a random variable.
- Kullback-Leibler divergence calculates a score that measures the divergence of one probability distribution from another.
- Jensen-Shannon divergence extends KL divergence to calculate a symmetrical score and distance measure of one probability distribution from another.

Discover Bayesian optimization, naive Bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

**Update Oct/2019**: Added a description of the alternative form of the equation (thanks Ori).

This tutorial is divided into three parts; they are:

- Statistical Distance
- Kullback-Leibler Divergence
- Jensen-Shannon Divergence

There are many situations where we may want to compare two probability distributions.

Specifically, we may have a single random variable and two different probability distributions for the variable, such as a true distribution and an approximation of that distribution.

In situations like this, it can be useful to quantify the difference between the distributions. Generally, this is referred to as the problem of calculating the statistical distance between two statistical objects, e.g. probability distributions.

One approach is to calculate a distance measure between the two distributions. This can be challenging as it can be difficult to interpret the measure.

Instead, it is more common to calculate a divergence between two probability distributions. A divergence is like a measure but is not symmetrical. This means that a divergence is a scoring of how one distribution differs from another, where calculating the divergence for distributions P and Q would give a different score from Q and P.

Divergence scores are an important foundation for many different calculations in information theory and more generally in machine learning. For example, they provide shortcuts for calculating scores such as mutual information (information gain) and cross-entropy used as a loss function for classification models.

Divergence scores are also used directly as tools for understanding complex modeling problems, such as approximating a target probability distribution when optimizing generative adversarial network (GAN) models.

Two commonly used divergence scores from information theory are Kullback-Leibler Divergence and Jensen-Shannon Divergence.

We will take a closer look at both of these scores in the following section.


The Kullback-Leibler Divergence score, or KL divergence score, quantifies how much one probability distribution differs from another probability distribution.

The KL divergence between two distributions Q and P is often stated using the following notation:

- KL(P || Q)

Where the “||” operator indicates “*divergence*”, or P's divergence from Q.

KL divergence can be calculated as the negative sum of probability of each event in P multiplied by the log of the probability of the event in Q over the probability of the event in P.

- KL(P || Q) = – sum x in X P(x) * log(Q(x) / P(x))

The value within the sum is the divergence for a given event.

This is the same as the positive sum of probability of each event in P multiplied by the log of the probability of the event in P over the probability of the event in Q (e.g. the terms in the fraction are flipped). This is the more common implementation used in practice.

- KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))

The intuition for the KL divergence score is that when the probability for an event from P is large, but the probability for the same event in Q is small, there is a large divergence. When the probability from P is small and the probability from Q is large, there is also a large divergence, but not as large as the first case.

It can be used to measure the divergence between discrete and continuous probability distributions, where in the latter case the integral of the events is calculated instead of the sum of the probabilities of the discrete events.
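The continuous case can be illustrated with a rough numerical sketch. The example below approximates the integral with a simple midpoint sum for two Gaussian densities; the function names, the integration range and the number of steps are arbitrary choices made for illustration. For two Gaussians with the same standard deviation, the KL divergence has the known closed form (mean1 - mean2)^2 / (2 * stdev^2), which is 0.5 nats for the values used here, so the approximation can be checked against it.

```python
# sketch: approximate the KL divergence between two continuous distributions
from math import exp
from math import log
from math import pi
from math import sqrt

# Gaussian probability density function
def gaussian_pdf(x, mean, stdev):
    return exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (sqrt(2 * pi) * stdev)

# approximate the integral of p(x) * log(p(x) / q(x)) with a midpoint sum
def kl_divergence_continuous(p_pdf, q_pdf, lower, upper, steps=20000):
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width
        px, qx = p_pdf(x), q_pdf(x)
        if px > 0.0:
            total += px * log(px / qx) * width
    return total

# P is a standard Gaussian, Q is the same shape shifted by one unit
p_pdf = lambda x: gaussian_pdf(x, 0.0, 1.0)
q_pdf = lambda x: gaussian_pdf(x, 1.0, 1.0)
kl_pq = kl_divergence_continuous(p_pdf, q_pdf, -10.0, 10.0)
print('KL(P || Q): %.3f nats' % kl_pq)
```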

One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy.

— Page 57, Machine Learning: A Probabilistic Perspective, 2012.

The log can be base-2 to give units in “*bits*,” or the natural logarithm base-e with units in “*nats*.” When the score is 0, it suggests that both distributions are identical, otherwise the score is positive.
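A short sketch can make the units concrete. It uses two made-up distributions and a hypothetical *log_fn* argument to switch between the base-2 and natural logarithm; dividing a score in nats by ln(2) converts it to bits, and the divergence of a distribution from itself is zero.

```python
# sketch: KL divergence in bits and nats
from math import log
from math import log2

# calculate the kl divergence with a configurable logarithm
def kl_divergence(p, q, log_fn=log2):
    return sum(p[i] * log_fn(p[i] / q[i]) for i in range(len(p)))

# made-up distributions over three events
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
kl_bits = kl_divergence(p, q, log2)
kl_nats = kl_divergence(p, q, log)
print('KL(P || Q): %.3f bits' % kl_bits)
print('KL(P || Q): %.3f nats' % kl_nats)
# converting nats to bits gives the same score as using log2 directly
print('nats / ln(2): %.3f bits' % (kl_nats / log(2)))
# identical distributions have zero divergence
kl_pp = kl_divergence(p, p, log2)
print('KL(P || P): %.3f bits' % kl_pp)
```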

Importantly, the KL divergence score is not symmetrical, for example:

- KL(P || Q) != KL(Q || P)

It is named for the two authors of the method Solomon Kullback and Richard Leibler, and is sometimes referred to as “*relative entropy*.”

This is known as the relative entropy or Kullback-Leibler divergence, or KL divergence, between the distributions p(x) and q(x).

— Page 55, Pattern Recognition and Machine Learning, 2006.

If we are attempting to approximate an unknown probability distribution, then the target probability distribution from data is P and Q is our approximation of the distribution.

In this case, the KL divergence summarizes the number of additional bits (i.e. calculated with the base-2 logarithm) required to represent an event from the random variable. The better our approximation, the less additional information is required.

… the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution q to encode the data instead of the true distribution p.

— Page 58, Machine Learning: A Probabilistic Perspective, 2012.
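The “*extra bits*” interpretation can be checked with a small sketch. The average number of bits needed when events from P are encoded using Q is the cross-entropy of Q relative to P, and the number needed with the true distribution is the entropy of P; the difference between the two is the KL divergence. The distributions and helper names below are made up for illustration.

```python
# sketch: KL divergence as cross-entropy minus entropy
from math import log2

# entropy of P in bits
def entropy(p):
    return -sum(p[i] * log2(p[i]) for i in range(len(p)))

# cross-entropy of Q relative to P in bits
def cross_entropy(p, q):
    return -sum(p[i] * log2(q[i]) for i in range(len(p)))

# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i] / q[i]) for i in range(len(p)))

# made-up target distribution P and approximation Q
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
h_p = entropy(p)
h_pq = cross_entropy(p, q)
extra_bits = h_pq - h_p
kl_pq = kl_divergence(p, q)
print('H(P): %.3f bits' % h_p)
print('H(P, Q): %.3f bits' % h_pq)
print('H(P, Q) - H(P): %.3f bits' % extra_bits)
print('KL(P || Q): %.3f bits' % kl_pq)
```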

We can make the KL divergence concrete with a worked example.

Consider a random variable with three events as different colors. We may have two different probability distributions for this variable; for example:

```python
...
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
```

We can plot a bar chart of these probabilities to compare them directly as probability histograms.

The complete example is listed below.

```python
# plot of distributions
from matplotlib import pyplot
# define distributions
events = ['red', 'green', 'blue']
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print('P=%.3f Q=%.3f' % (sum(p), sum(q)))
# plot first distribution
pyplot.subplot(2,1,1)
pyplot.bar(events, p)
# plot second distribution
pyplot.subplot(2,1,2)
pyplot.bar(events, q)
# show the plot
pyplot.show()
```

Running the example creates a histogram for each probability distribution, allowing the probabilities for each event to be directly compared.

We can see that indeed the distributions are different.

Next, we can develop a function to calculate the KL divergence between the two distributions.

We will use log base-2 to ensure the result has units in bits.

```python
# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))
```

We can then use this function to calculate the KL divergence of P from Q, as well as the reverse, Q from P.

```python
# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)
```

Tying this all together, the complete example is listed below.

```python
# example of calculating the kl divergence between two mass functions
from math import log2

# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))

# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q)
kl_pq = kl_divergence(p, q)
print('KL(P || Q): %.3f bits' % kl_pq)
# calculate (Q || P)
kl_qp = kl_divergence(q, p)
print('KL(Q || P): %.3f bits' % kl_qp)
```

Running the example first calculates the divergence of P from Q as just under 2 bits, then Q from P as just over 2 bits.

This is intuitive if we consider where each distribution places its probability. P assigns a large probability to an event (blue) where Q is small, which gives a large divergence of P from Q. Q assigns an even larger probability to an event (red) where P is small, so the divergence of Q from P is slightly larger in this second case.

```
KL(P || Q): 1.927 bits
KL(Q || P): 2.022 bits
```

If we change *log2()* to the natural logarithm *log()* function, the result is in nats, as follows:

```
# KL(P || Q): 1.336 nats
# KL(Q || P): 1.401 nats
```

The SciPy library provides the kl_div() function for calculating the KL divergence, although with a different definition than the one used here. It also provides the rel_entr() function for calculating the relative entropy, which matches the definition of KL divergence here. This is odd as “*relative entropy*” is often used as a synonym for “*KL divergence*.”

Nevertheless, we can calculate the KL divergence using the rel_entr() SciPy function and confirm that our manual calculation is correct.

The *rel_entr()* function takes lists of probabilities across all events from each probability distribution as arguments and returns a list of divergences for each event. These can be summed to give the KL divergence. The calculation uses the natural logarithm instead of log base-2 so the units are in nats instead of bits.

The complete example using SciPy to calculate KL(P || Q) and KL(Q || P) for the same probability distributions used above is listed below:

```python
# example of calculating the kl divergence (relative entropy) with scipy
from scipy.special import rel_entr
# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# calculate (P || Q)
kl_pq = rel_entr(p, q)
print('KL(P || Q): %.3f nats' % sum(kl_pq))
# calculate (Q || P)
kl_qp = rel_entr(q, p)
print('KL(Q || P): %.3f nats' % sum(kl_qp))
```

Running the example, we can see that the calculated divergences match our manual calculation of about 1.3 nats and about 1.4 nats for KL(P || Q) and KL(Q || P) respectively.

```
KL(P || Q): 1.336 nats
KL(Q || P): 1.401 nats
```
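The difference between the two SciPy definitions comes down to two extra terms. As documented by SciPy, for positive inputs rel_entr(x, y) calculates x * log(x / y) for each event, whereas kl_div(x, y) calculates x * log(x / y) - x + y. The sketch below re-implements both definitions in plain Python rather than calling SciPy (the helper names are made up for illustration). The per-event values differ, but because both distributions sum to one the extra terms cancel in the sum.

```python
# sketch: per-event definitions behind rel_entr() and kl_div(), in plain Python
from math import log

# relative entropy term for one event: x * log(x / y)
def rel_entr_term(x, y):
    return x * log(x / y)

# kl_div-style term for one event: x * log(x / y) - x + y
def kl_div_term(x, y):
    return x * log(x / y) - x + y

# define distributions
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
rel = [rel_entr_term(p[i], q[i]) for i in range(len(p))]
kld = [kl_div_term(p[i], q[i]) for i in range(len(p))]
print('first event: %.3f vs %.3f' % (rel[0], kld[0]))
print('sum of rel_entr terms: %.3f nats' % sum(rel))
print('sum of kl_div terms: %.3f nats' % sum(kld))
```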

The Jensen-Shannon divergence, or JS divergence for short, is another way to quantify the difference (or similarity) between two probability distributions.

It uses the KL divergence to calculate a normalized score that is symmetrical. This means that the divergence of P from Q is the same as Q from P, or stated formally:

- JS(P || Q) == JS(Q || P)

The JS divergence can be calculated as follows:

- JS(P || Q) = 1/2 * KL(P || M) + 1/2 * KL(Q || M)

Where M is calculated as:

- M = 1/2 * (P + Q)

And *KL()* is calculated as the KL divergence described in the previous section.

It is more useful as a measure as it provides a smoothed and normalized version of KL divergence, with scores between 0 (identical) and 1 (maximally different), when using the base-2 logarithm.

The square root of the score gives a quantity referred to as the Jensen-Shannon distance, or JS distance for short.
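The bounds of the score can be checked with a small sketch that uses made-up distributions at the two extremes. It assumes the usual convention that an event with zero probability contributes nothing to the KL sum, which is why such terms are skipped; the list-based helpers here are illustrative only.

```python
# sketch: bounds of the js divergence with the base-2 logarithm
from math import log2
from math import sqrt

# kl divergence, skipping events with zero probability in p
def kl_divergence(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# js divergence for two lists of probabilities
def js_divergence(p, q):
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# identical distributions
js_same = js_divergence([0.5, 0.5], [0.5, 0.5])
# distributions with no events in common
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
print('identical: %.3f bits, distance %.3f' % (js_same, sqrt(js_same)))
print('disjoint: %.3f bits, distance %.3f' % (js_disjoint, sqrt(js_disjoint)))
```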

We can make the JS divergence concrete with a worked example.

First, we can define a function to calculate the JS divergence that uses the *kl_divergence()* function prepared in the previous section.

```python
# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))

# calculate the js divergence
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

We can then test this function using the same probability distributions used in the previous section.

First, we will calculate the JS divergence score for the distributions, then calculate the square root of the score to give the JS distance between the distributions. For example:

```python
...
# calculate JS(P || Q)
js_pq = js_divergence(p, q)
print('JS(P || Q) divergence: %.3f bits' % js_pq)
print('JS(P || Q) distance: %.3f' % sqrt(js_pq))
```

This can then be repeated for the reverse case to show that the divergence is symmetrical, unlike the KL divergence.

```python
...
# calculate JS(Q || P)
js_qp = js_divergence(q, p)
print('JS(Q || P) divergence: %.3f bits' % js_qp)
print('JS(Q || P) distance: %.3f' % sqrt(js_qp))
```

Tying this together, the complete example of calculating the JS divergence and JS distance is listed below.

```python
# example of calculating the js divergence between two mass functions
from math import log2
from math import sqrt
from numpy import asarray

# calculate the kl divergence
def kl_divergence(p, q):
    return sum(p[i] * log2(p[i]/q[i]) for i in range(len(p)))

# calculate the js divergence
def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# define distributions
p = asarray([0.10, 0.40, 0.50])
q = asarray([0.80, 0.15, 0.05])
# calculate JS(P || Q)
js_pq = js_divergence(p, q)
print('JS(P || Q) divergence: %.3f bits' % js_pq)
print('JS(P || Q) distance: %.3f' % sqrt(js_pq))
# calculate JS(Q || P)
js_qp = js_divergence(q, p)
print('JS(Q || P) divergence: %.3f bits' % js_qp)
print('JS(Q || P) distance: %.3f' % sqrt(js_qp))
```

Running the example shows that the JS divergence between the distributions is about 0.4 bits and that the distance is about 0.6.

We can see that the calculation is symmetrical, giving the same score and distance measure for JS(P || Q) and JS(Q || P).

```
JS(P || Q) divergence: 0.420 bits
JS(P || Q) distance: 0.648
JS(Q || P) divergence: 0.420 bits
JS(Q || P) distance: 0.648
```

The SciPy library provides an implementation of the JS distance via the jensenshannon() function.

It takes arrays of probabilities across all events from each probability distribution as arguments and returns the JS distance score, not a divergence score. We can use this function to confirm our manual calculation of the JS distance.

The complete example is listed below.

```python
# calculate the jensen-shannon distance metric
from scipy.spatial.distance import jensenshannon
from numpy import asarray
# define distributions
p = asarray([0.10, 0.40, 0.50])
q = asarray([0.80, 0.15, 0.05])
# calculate JS(P || Q)
js_pq = jensenshannon(p, q, base=2)
print('JS(P || Q) Distance: %.3f' % js_pq)
# calculate JS(Q || P)
js_qp = jensenshannon(q, p, base=2)
print('JS(Q || P) Distance: %.3f' % js_qp)
```

Running the example, we can confirm the distance score matches our manual calculation of 0.648, and that the distance calculation is symmetrical as expected.

```
JS(P || Q) Distance: 0.648
JS(Q || P) Distance: 0.648
```

This section provides more resources on the topic if you are looking to go deeper.

- Machine Learning: A Probabilistic Perspective, 2012.
- Pattern Recognition and Machine Learning, 2006.

- How to Choose Loss Functions When Training Deep Learning Neural Networks
- Loss and Loss Functions for Training Deep Learning Neural Networks

- Statistical distance, Wikipedia.
- Divergence (statistics), Wikipedia.
- Kullback-Leibler divergence, Wikipedia.
- Jensen-Shannon divergence, Wikipedia.

In this post, you discovered how to calculate the divergence between probability distributions.

Specifically, you learned:

- Statistical distance is the general idea of calculating the difference between statistical objects like different probability distributions for a random variable.
- Kullback-Leibler divergence calculates a score that measures the divergence of one probability distribution from another.
- Jensen-Shannon divergence extends KL divergence to calculate a symmetrical score and distance measure of one probability distribution from another.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Information Gain and Mutual Information for Machine Learning appeared first on Machine Learning Mastery.

Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.

It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.

Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable. In this slightly different usage, the calculation is referred to as mutual information between the two random variables.

In this post, you will discover information gain and mutual information in machine learning.

After reading this post, you will know:

- Information gain is the reduction in entropy or surprise by transforming a dataset and is often used in training decision trees.
- Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
- Mutual information calculates the statistical dependence between two variables and is the name given to information gain when applied to variable selection.

Discover Bayesian optimization, naive Bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is Information Gain?
- Worked Example of Calculating Information Gain
- Examples of Information Gain in Machine Learning
- What Is Mutual Information?
- How Are Information Gain and Mutual Information Related?

Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable.

A larger information gain suggests a lower entropy group or groups of samples, and hence less surprise.

Information quantifies how surprising an event is from a random variable in bits. Entropy quantifies how much information there is in a random variable, or more specifically, the probability distribution for the events of the random variable.

A larger entropy suggests lower probability events or more surprise, whereas a lower entropy suggests larger probability events with less surprise.

We can think about the entropy of a dataset in terms of the probability distribution of observations in the dataset belonging to one class or another, e.g. two classes in the case of a binary classification dataset.

One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability).

— Page 58, Machine Learning, 1997.

For example, in a binary classification problem (two classes), we can calculate the entropy of the data sample as follows:

- Entropy = -(p(0) * log(p(0)) + p(1) * log(p(1)))

A dataset with a 50/50 split of samples for the two classes would have a maximum entropy (maximum surprise) of 1 bit, whereas an imbalanced dataset with a split of 10/90 would have a smaller entropy as there would be less surprise for a randomly drawn example from the dataset.

We can demonstrate this with an example of calculating the entropy for this imbalanced dataset in Python. The complete example is listed below.

```python
# calculate the entropy for a dataset
from math import log2
# proportion of examples in each class
class0 = 10/100
class1 = 90/100
# calculate entropy
entropy = -(class0 * log2(class0) + class1 * log2(class1))
# print the result
print('entropy: %.3f bits' % entropy)
```

Running the example, we can see that entropy of the dataset for binary classification is less than 1 bit. That is, less than one bit of information is required to encode the class label for an arbitrary example from the dataset.

entropy: 0.469 bits

In this way, entropy can be used as a calculation of the purity of a dataset, e.g. how balanced the distribution of classes happens to be.

An entropy of 0 bits indicates a dataset containing one class; an entropy of 1 or more bits suggests maximum entropy for a balanced dataset (depending on the number of classes), with values in between indicating levels between these extremes.
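These extremes can be checked with a small sketch that generalizes the entropy calculation to any number of classes. The function below is illustrative only; it takes a list of class proportions and skips classes with a proportion of zero, following the usual convention that they contribute nothing to the sum.

```python
# sketch: entropy for any number of classes
from math import log2

# calculate the entropy in bits from a list of class proportions
def entropy(proportions):
    return sum(-p * log2(p) for p in proportions if p > 0)

# a dataset containing a single class
single = entropy([1.0])
# balanced datasets with two and four classes
balanced2 = entropy([0.5, 0.5])
balanced4 = entropy([0.25, 0.25, 0.25, 0.25])
print('one class: %.3f bits' % single)
print('two balanced classes: %.3f bits' % balanced2)
print('four balanced classes: %.3f bits' % balanced4)
```

The maximum entropy for a balanced dataset grows with the number of classes, which is why the upper extreme is described as 1 or more bits.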

Information gain provides a way to use entropy to calculate how a change to the dataset impacts the purity of the dataset, e.g. the distribution of classes. A smaller entropy suggests more purity or less surprise.

… information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.

— Page 57, Machine Learning, 1997.

For example, we may wish to evaluate the impact on purity by splitting a dataset *S* by a random variable with a range of values.

This can be calculated as follows:

- IG(S, a) = H(S) – H(S | a)

Where *IG(S, a)* is the information gain for the dataset *S* for the random variable *a*, *H(S)* is the entropy for the dataset before any change (described above) and *H(S | a)* is the conditional entropy for the dataset given the variable *a*.

This calculation describes the gain in the dataset *S* for the variable a. It is the number of bits saved when transforming the dataset.

The conditional entropy can be calculated by splitting the dataset into groups for each observed value of a and calculating the sum of the ratio of examples in each group out of the entire dataset multiplied by the entropy of each group.

- H(S | a) = sum v in a Sa(v)/S * H(Sa(v))

Where *Sa(v)/S* is the ratio of the number of examples in the dataset where variable *a* has the value *v* to the total number of examples, and *H(Sa(v))* is the entropy of the group of samples where variable *a* has the value *v*.

This might sound a little confusing.

We can make the calculation of information gain concrete with a worked example.


In this section, we will make the calculation of information gain concrete with a worked example.

We can define a function to calculate the entropy of a group of samples based on the ratio of samples that belong to class 0 and class 1.

```python
# calculate the entropy for the split in the dataset
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))
```

Now, consider a dataset with 20 examples, 13 for class 0 and 7 for class 1. We can calculate the entropy for this dataset, which will have less than 1 bit.

```python
...
# split of the main dataset
class0 = 13 / 20
class1 = 7 / 20
# calculate entropy before the change
s_entropy = entropy(class0, class1)
print('Dataset Entropy: %.3f bits' % s_entropy)
```

Now consider that one of the variables in the dataset has two unique values, say “*value1*” and “*value2*.” We are interested in calculating the information gain of this variable.

Let’s assume that if we split the dataset by value1, we have a group of eight samples, seven for class 0 and one for class 1. We can then calculate the entropy of this group of samples.

```python
...
# split 1 (split via value1)
s1_class0 = 7 / 8
s1_class1 = 1 / 8
# calculate the entropy of the first group
s1_entropy = entropy(s1_class0, s1_class1)
print('Group1 Entropy: %.3f bits' % s1_entropy)
```

Now, let’s assume that we split the dataset by value2; we have a group of 12 samples with six in each class. We would expect this group to have an entropy of 1 bit.

```python
...
# split 2 (split via value2)
s2_class0 = 6 / 12
s2_class1 = 6 / 12
# calculate the entropy of the second group
s2_entropy = entropy(s2_class0, s2_class1)
print('Group2 Entropy: %.3f bits' % s2_entropy)
```

Finally, we can calculate the information gain for this variable based on the groups created for each value of the variable and the calculated entropy.

The first value resulted in a group of eight examples from the dataset, and the second value resulted in a group with the remaining 12 samples in the dataset. Therefore, we have everything we need to calculate the information gain.

In this case, information gain can be calculated as:

- Entropy(Dataset) – (Count(Group1) / Count(Dataset) * Entropy(Group1) + Count(Group2) / Count(Dataset) * Entropy(Group2))

Or:

- Entropy(13/20, 7/20) – (8/20 * Entropy(7/8, 1/8) + 12/20 * Entropy(6/12, 6/12))

Or in code:

```python
...
# calculate the information gain
gain = s_entropy - (8/20 * s1_entropy + 12/20 * s2_entropy)
print('Information Gain: %.3f bits' % gain)
```

Tying this all together, the complete example is listed below.

```python
# calculate the information gain
from math import log2

# calculate the entropy for the split in the dataset
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))

# split of the main dataset
class0 = 13 / 20
class1 = 7 / 20
# calculate entropy before the change
s_entropy = entropy(class0, class1)
print('Dataset Entropy: %.3f bits' % s_entropy)
# split 1 (split via value1)
s1_class0 = 7 / 8
s1_class1 = 1 / 8
# calculate the entropy of the first group
s1_entropy = entropy(s1_class0, s1_class1)
print('Group1 Entropy: %.3f bits' % s1_entropy)
# split 2 (split via value2)
s2_class0 = 6 / 12
s2_class1 = 6 / 12
# calculate the entropy of the second group
s2_entropy = entropy(s2_class0, s2_class1)
print('Group2 Entropy: %.3f bits' % s2_entropy)
# calculate the information gain
gain = s_entropy - (8/20 * s1_entropy + 12/20 * s2_entropy)
print('Information Gain: %.3f bits' % gain)
```

First, the entropy of the dataset is calculated at just under 1 bit. Then the entropy for the first and second groups are calculated at about 0.5 and 1 bits respectively.

Finally, the information gain for the variable is calculated as 0.117 bits. That is, the gain to the dataset by splitting it via the chosen variable is 0.117 bits.

```
Dataset Entropy: 0.934 bits
Group1 Entropy: 0.544 bits
Group2 Entropy: 1.000 bits
Information Gain: 0.117 bits
```
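The same calculation can be generalized so that it works directly on a column of values and a column of class labels rather than on hand-computed ratios. The sketch below is one possible way to do this, not a reference implementation: the function names and the reconstructed lists are made up for illustration, with the lists arranged to match the counts in the worked example (eight samples for value1 and 12 samples for value2).

```python
# sketch: information gain calculated from raw values and class labels
from math import log2
from collections import Counter

# entropy in bits of a list of class labels
def entropy(labels):
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())

# information gain of a variable: H(S) - H(S | a)
def information_gain(values, labels):
    total = len(labels)
    # group the class labels by the observed value of the variable
    groups = dict()
    for value, label in zip(values, labels):
        groups.setdefault(value, list()).append(label)
    # weighted sum of the entropy of each group
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - conditional

# value1: 7 samples of class 0 and 1 of class 1; value2: 6 samples of each class
values = ['value1'] * 8 + ['value2'] * 12
labels = [0] * 7 + [1] * 1 + [0] * 6 + [1] * 6
gain = information_gain(values, labels)
print('Dataset Entropy: %.3f bits' % entropy(labels))
print('Information Gain: %.3f bits' % gain)
```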

Perhaps the most popular use of information gain in machine learning is in decision trees.

An example is the Iterative Dichotomiser 3 algorithm, or ID3 for short, used to construct a decision tree.

Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree.

— Page 58, Machine Learning, 1997.

The information gain is calculated for each variable in the dataset. The variable that has the largest information gain is selected to split the dataset. Generally, a larger gain indicates a smaller entropy or less surprise.

Note that minimizing the entropy is equivalent to maximizing the information gain …

— Page 547, Machine Learning: A Probabilistic Perspective, 2012.

The process is then repeated on each created group, excluding the variable that was already chosen. This stops once a desired depth to the decision tree is reached or no more splits are possible.

The process of selecting a new attribute and partitioning the training examples is now repeated for each non terminal descendant node, this time using only the training examples associated with that node. Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree.

— Page 60, Machine Learning, 1997.
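The attribute selection step described above can be sketched in a few lines. This is a simplified illustration of a single selection step, not a full ID3 implementation: the tiny dataset, the column layout (class label in the last column) and the function names are all made up for this example.

```python
# sketch: choose the attribute with the largest information gain
from math import log2
from collections import Counter

# entropy in bits of a list of class labels
def entropy(labels):
    total = len(labels)
    return sum(-(count / total) * log2(count / total) for count in Counter(labels).values())

# information gain from splitting the rows on one column
def information_gain(rows, column):
    labels = [row[-1] for row in rows]
    groups = dict()
    for row in rows:
        groups.setdefault(row[column], list()).append(row[-1])
    conditional = sum(len(group) / len(rows) * entropy(group) for group in groups.values())
    return entropy(labels) - conditional

# select the candidate column with the largest information gain
def best_attribute(rows, columns):
    return max(columns, key=lambda column: information_gain(rows, column))

# made-up dataset: two attributes followed by the class label
rows = [
    ['sunny', 'weak', 0],
    ['sunny', 'strong', 0],
    ['rainy', 'weak', 1],
    ['rainy', 'strong', 1],
]
candidates = [0, 1]
best = best_attribute(rows, candidates)
# the chosen attribute is excluded from further splits along this path
remaining = [column for column in candidates if column != best]
print('Best attribute: %d, remaining: %s' % (best, remaining))
```

In this made-up dataset the first attribute separates the classes perfectly, so it has the largest gain and is selected first.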

Information gain can be used as a split criterion in most modern implementations of decision trees, such as the implementation of the Classification and Regression Tree (CART) algorithm in the scikit-learn Python machine learning library in the DecisionTreeClassifier class for classification.

This can be achieved by setting the criterion argument to “*entropy*” when configuring the model; for example:

```python
# example of a decision tree trained with information gain
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion='entropy')
...
```

Information gain can also be used for feature selection prior to modeling.

It involves calculating the information gain between the target variable and each input variable in the training dataset. The Weka machine learning workbench provides an implementation of information gain for feature selection via the InfoGainAttributeEval class.

In this context of feature selection, information gain may be referred to as “*mutual information*” and calculates the statistical dependence between two variables. An example of using information gain (mutual information) for feature selection is the mutual_info_classif() scikit-learn function.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

A quantity called mutual information measures the amount of information one can obtain from one random variable given another.

— Page 310, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

The mutual information between two random variables *X* and *Y* can be stated formally as follows:

- I(X ; Y) = H(X) – H(X | Y)

Where *I(X ; Y)* is the mutual information for *X* and *Y*, *H(X)* is the entropy for *X* and *H(X | Y)* is the conditional entropy for *X* given *Y*. The result has the units of bits.

Mutual information is a measure of dependence or “*mutual dependence*” between two random variables. As such, the measure is symmetrical, meaning that *I(X ; Y) = I(Y ; X)*.
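We can make the definition concrete with a small sketch. The joint probability table below is contrived for illustration and the helper functions are hypothetical names, not part of any library. It calculates *H(X) – H(X | Y)* and *H(Y) – H(Y | X)* and shows that the two directions agree.

```python
# sketch: mutual information from a contrived joint distribution
from math import log2

# assumed joint probabilities p(x, y) for two binary variables
# rows index x, columns index y
joint = [[0.4, 0.1],
         [0.1, 0.4]]

def entropy(probs):
    # Shannon entropy in bits, skipping zero-probability events
    return sum(-p * log2(p) for p in probs if p > 0)

def conditional_entropy(table):
    # H(row variable | column variable): weighted entropy of each column
    h = 0.0
    for j in range(len(table[0])):
        column = [row[j] for row in table]
        p_col = sum(column)
        if p_col > 0:
            h += p_col * entropy([p / p_col for p in column])
    return h

def transpose(table):
    # swap the roles of the row and column variables
    return [list(col) for col in zip(*table)]

# marginal distributions for each variable
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# I(X ; Y) = H(X) - H(X | Y)
mi_xy = entropy(p_x) - conditional_entropy(joint)
# I(Y ; X) = H(Y) - H(Y | X)
mi_yx = entropy(p_y) - conditional_entropy(transpose(joint))
print('I(X;Y)=%.3f bits, I(Y;X)=%.3f bits' % (mi_xy, mi_yx))
```

For this table, both directions come out at about 0.278 bits, which matches the symmetry of the measure.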

It measures the average reduction in uncertainty about x that results from learning the value of y; or vice versa, the average amount of information that x conveys about y.

— Page 139, Information Theory, Inference, and Learning Algorithms, 2003.

Kullback-Leibler, or KL, divergence is a measure that calculates the difference between two probability distributions.

The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable.

If the variables are not independent, we can gain some idea of whether they are ‘close’ to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals […] which is called the mutual information between the variables

— Page 57, Pattern Recognition and Machine Learning, 2006.

This can be stated formally as follows:

- I(X ; Y) = KL(p(X, Y) || p(X) * p(Y))

Mutual information is always larger than or equal to zero, where the larger the value, the greater the relationship between the two variables. If the calculated result is zero, then the variables are independent.
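A minimal sketch of the KL divergence form is listed below. The two joint tables are contrived assumptions: one where the variables are clearly related, and one built as the product of two marginals so that the variables are independent. The function name is hypothetical.

```python
# sketch: mutual information as KL(p(X, Y) || p(X) * p(Y))
from math import log2

def mutual_information(joint):
    # marginals from the joint table (rows index x, columns index y)
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            # zero-probability cells contribute nothing to the sum
            if p_xy > 0:
                mi += p_xy * log2(p_xy / (p_x[i] * p_y[j]))
    return mi

# assumed joint distribution where the variables are related
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
# joint distribution built from the marginals [0.5, 0.5] and [0.25, 0.75]
independent = [[px * py for py in [0.25, 0.75]] for px in [0.5, 0.5]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print('dependent: %.3f bits' % mi_dependent)
print('independent: %.3f bits' % mi_independent)
```

The related variables give a positive score, whereas the table constructed from the product of the marginals gives a score of zero.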

Mutual information is often used as a general form of a correlation coefficient, e.g. a measure of the dependence between random variables.

It is also used as an aspect in some machine learning algorithms. A common example is the Independent Component Analysis, or ICA for short, that provides a projection of statistically independent components of a dataset.

Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.

For example:

- Effect of Transforms to a Dataset (*decision trees*): Information Gain.
- Dependence Between Variables (*feature selection*): Mutual Information.

Notice the similarity in the way that the mutual information is calculated and the way that information gain is calculated; they are equivalent:

- I(X ; Y) = H(X) – H(X | Y)

and

- IG(S, a) = H(S) – H(S | a)

As such, mutual information is sometimes used as a synonym for information gain. Technically, they calculate the same quantity if applied to the same data.

We can understand the relationship between the two as follows: the greater the difference between the joint and marginal probability distributions (mutual information), the larger the gain in information (information gain).
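The equivalence can be checked numerically. The sketch below uses a contrived dataset of (attribute value, class label) pairs and two hypothetical helper functions: one calculates information gain the decision tree way (entropy before and after a split), the other calculates mutual information from the joint and marginal frequencies.

```python
# sketch: information gain and mutual information on the same contrived data
from math import log2
from collections import Counter

def entropy_of(labels):
    # entropy in bits of the class labels in a group
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows):
    # IG(S, a) = H(S) - H(S | a), splitting the rows by attribute value
    labels = [label for _, label in rows]
    groups = {}
    for value, label in rows:
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(rows) * entropy_of(g) for g in groups.values())
    return entropy_of(labels) - remainder

def mutual_information(rows):
    # KL divergence between joint frequencies and product of marginal frequencies
    n = len(rows)
    joint = Counter(rows)
    count_a = Counter(a for a, _ in rows)
    count_c = Counter(c for _, c in rows)
    return sum((k / n) * log2((k / n) / ((count_a[a] / n) * (count_c[c] / n)))
               for (a, c), k in joint.items())

# assumed dataset: attribute value 'x' or 'y', class label 0 or 1
rows = [('x', 0)] * 4 + [('x', 1)] + [('y', 0)] + [('y', 1)] * 4

ig = information_gain(rows)
mi = mutual_information(rows)
print('information gain: %.3f bits' % ig)
print('mutual information: %.3f bits' % mi)
```

Both calculations report the same quantity for this dataset, even though they are organized around different questions.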

This section provides more resources on the topic if you are looking to go deeper.

- Information Theory, Inference, and Learning Algorithms, 2003.
- Machine Learning: A Probabilistic Perspective, 2012.
- Pattern Recognition and Machine Learning, 2006.
- Machine Learning, 1997.
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- Entropy (information theory), Wikipedia.
- Information gain in decision trees, Wikipedia.
- ID3 algorithm, Wikipedia.
- Information gain ratio, Wikipedia.
- Mutual Information, Wikipedia.

In this post, you discovered information gain and mutual information in machine learning.

Specifically, you learned:

- Information gain is the reduction in entropy or surprise by transforming a dataset and is often used in training decision trees.
- Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
- Mutual information calculates the statistical dependence between two variables and is the name given to information gain when applied to variable selection.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Information Gain and Mutual Information for Machine Learning appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Information Entropy appeared first on Machine Learning Mastery.

Information theory is a subfield of mathematics concerned with transmitting data across a noisy channel.

A cornerstone of information theory is the idea of quantifying how much information there is in a message. More generally, this can be used to quantify the information in an event and a random variable, called entropy, and is calculated using probability.

Calculating information and entropy is a useful tool in machine learning and is used as the basis for techniques such as feature selection, building decision trees, and, more generally, fitting classification models. As such, a machine learning practitioner requires a strong understanding and intuition for information and entropy.

In this post, you will discover a gentle introduction to information entropy.

After reading this post, you will know:

- Information theory is concerned with data compression and transmission and builds upon probability and supports machine learning.
- Information provides a way to quantify the amount of surprise for an event measured in bits.
- Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Information Theory?
- Calculate the Information for an Event
- Calculate the Information for a Random Variable

Information theory is a field of study concerned with quantifying information for communication.

It is a subfield of mathematics and is concerned with topics like data compression and the limits of signal processing. The field was proposed and developed by Claude Shannon while working at the US telephone company Bell Labs.

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding).

— Page 56, Machine Learning: A Probabilistic Perspective, 2012.

A foundational concept from information theory is the quantification of the amount of information in things like events, random variables, and distributions.

Quantifying the amount of information requires the use of probabilities, hence the relationship of information theory to probability.

Measurements of information are widely used in artificial intelligence and machine learning, such as in the construction of decision trees and the optimization of classifier models.

As such, there is an important relationship between information theory and machine learning and a practitioner must be familiar with some of the basic concepts from the field.

Why unify information theory and machine learning? Because they are two sides of the same coin. […] Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

— Page v, Information Theory, Inference, and Learning Algorithms, 2003.

Quantifying information is the foundation of the field of information theory.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

- **Low Probability Event**: High Information (*surprising*).
- **High Probability Event**: Low Information (*unsurprising*).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.

— Page 73, Deep Learning, 2016.

Rare events are more uncertain or more surprising and require more information to represent them than common events.

We can calculate the amount of information there is in an event using the probability of the event. This is called “*Shannon information*,” “*self-information*,” or simply the “*information*,” and can be calculated for a discrete event *x* as follows:

- information(x) = -log( p(x) )

Where *log()* is the base-2 logarithm and *p(x)* is the probability of the event *x*.

The choice of the base-2 logarithm means that the units of the information measure are bits (binary digits). This can be directly interpreted in the information processing sense as the number of bits required to represent the event.

The calculation of information is often written as *h()*; for example:

- h(x) = -log( p(x) )

The negative sign ensures that the result is always positive or zero.

Information will be zero when the probability of an event is 1.0 or a certainty, e.g. there is no surprise.

Let’s make this concrete with some examples.

Consider a flip of a single fair coin. The probability of heads (and tails) is 0.5. We can calculate the information for flipping a head in Python using the log2() function.

```python
# calculate the information for a coin flip
from math import log2
# probability of the event
p = 0.5
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
```

Running the example prints the probability of the event as 50% and the information content for the event as 1 bit.

p(x)=0.500, information: 1.000 bits

If the same coin was flipped n times, then the information for this sequence of flips would be n bits.

If the coin was not fair and the probability of a head was instead 10% (0.1), then the event would be more rare and would require more than 3 bits of information.

p(x)=0.100, information: 3.322 bits
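The two claims above can be checked with a small variation of the coin flip example. This is a sketch: the helper function name and the choice of eight flips are assumptions for illustration.

```python
# sketch: information for a biased coin and for a sequence of fair flips
from math import log2

def information(p):
    # Shannon information in bits for an event with probability p
    return -log2(p)

# biased coin where heads has a probability of 10%
p_head = 0.1
h_head = information(p_head)
print('p(x)=%.3f, information: %.3f bits' % (p_head, h_head))

# a specific sequence of n independent fair flips has probability 0.5^n
n = 8
p_sequence = 0.5 ** n
h_sequence = information(p_sequence)
print('n=%d flips, information: %.3f bits' % (n, h_sequence))
```

The biased coin reports about 3.322 bits for a head, and the sequence of eight fair flips reports 8 bits, one bit per flip.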

We can also explore the information in a single roll of a fair six-sided die, e.g. the information in rolling a 6.

We know the probability of rolling any number is 1/6, which is a smaller number than 1/2 for a coin flip, therefore we would expect more surprise or a larger amount of information.

```python
# calculate the information for a dice roll
from math import log2
# probability of the event
p = 1.0 / 6.0
# calculate information for event
h = -log2(p)
# print the result
print('p(x)=%.3f, information: %.3f bits' % (p, h))
```

Running the example, we can see that our intuition is correct and that indeed, there is more than 2.5 bits of information in a single roll of a fair die.

p(x)=0.167, information: 2.585 bits

Other logarithms can be used instead of the base-2. For example, it is also common to use the natural logarithm that uses base-e (Euler’s number) in calculating the information, in which case the units are referred to as “*nats*.”
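As a brief sketch, we can calculate the information for the same die roll in both units. The conversion used here follows from the change of base rule for logarithms: dividing a value in nats by the natural logarithm of 2 gives the value in bits.

```python
# sketch: information for a die roll in bits and in nats
from math import log, log2

# probability of the event
p = 1.0 / 6.0
# information in bits (base-2 logarithm)
h_bits = -log2(p)
# information in nats (natural logarithm)
h_nats = -log(p)
# convert nats to bits by dividing by ln(2)
h_converted = h_nats / log(2)
print('bits: %.3f, nats: %.3f, nats converted to bits: %.3f' % (h_bits, h_nats, h_converted))
```

The quantity of information is the same in both cases; only the unit of measurement changes.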

We can also quantify how much information there is in a random variable.

For example, if we wanted to calculate the information for a random variable *X* with probability distribution *p*, this might be written as a function *H()*; for example:

- H(X)

In effect, calculating the information for a random variable is the same as calculating the information for the probability distribution of the events for the random variable.

Calculating the information for a random variable is called “*information entropy*,” “*Shannon entropy*,” or simply “*entropy*.” It is related to the idea of entropy from physics by analogy, in that both are concerned with uncertainty.

The intuition for entropy is that it is the average number of bits required to represent or transmit an event drawn from the probability distribution for the random variable.

… the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits […] needed on average to encode symbols drawn from a distribution P.

— Page 74, Deep Learning, 2016.

Entropy can be calculated for a random variable *X* with *K* discrete states as follows:

- H(X) = -sum(k=1 to K, p(k) * log(p(k)))

That is the negative of the sum of the probability of each event multiplied by the log of the probability of each event.

Like information, the *log()* function uses base-2 and the units are bits. A natural logarithm can be used instead and the units will be *nats*.

The lowest entropy is calculated for a random variable that has a single event with a probability of 1.0, a certainty. The largest entropy for a random variable will be if all events are equally likely.
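We can sketch these two extremes with three contrived distributions over four events: a certainty, a skewed distribution, and a uniform distribution. The probabilities and the helper function are assumptions for illustration.

```python
# sketch: entropy for a certain, a skewed, and a uniform distribution
from math import log2

def entropy(probs):
    # negative sum of p * log2(p), skipping events with zero probability
    return sum(-p * log2(p) for p in probs if p > 0)

# assumed distributions over four events
certain = [1.0, 0.0, 0.0, 0.0]
skewed = [0.7, 0.1, 0.1, 0.1]
uniform = [0.25, 0.25, 0.25, 0.25]

h_certain = entropy(certain)
h_skewed = entropy(skewed)
h_uniform = entropy(uniform)
print('certain: %.3f bits' % h_certain)
print('skewed: %.3f bits' % h_skewed)
print('uniform: %.3f bits' % h_uniform)
```

The certain distribution has zero entropy, the uniform distribution has the maximum of 2 bits for four events, and the skewed distribution falls between the two.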

We can consider a roll of a fair die and calculate the entropy for the variable. Each outcome has the same probability of 1/6, therefore it is a uniform probability distribution. We therefore would expect the average information to be the same as the information for a single event calculated in the previous section.

```python
# calculate the entropy for a dice roll
from math import log2
# the number of events
n = 6
# probability of one event
p = 1.0 / n
# calculate entropy
entropy = -sum([p * log2(p) for _ in range(n)])
# print the result
print('entropy: %.3f bits' % entropy)
```

Running the example calculates the entropy as more than 2.5 bits, which is the same as the information for a single outcome. This makes sense: because all outcomes are equally likely, every outcome carries the same information, so the average information equals the information for any single outcome.

entropy: 2.585 bits

If we know the probability for each event, we can use the entropy() SciPy function to calculate the entropy directly.

For example:

```python
# calculate the entropy for a dice roll
from scipy.stats import entropy
# discrete probabilities
p = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
# calculate entropy
e = entropy(p, base=2)
# print the result
print('entropy: %.3f bits' % e)
```

Running the example reports the same result that we calculated manually.

entropy: 2.585 bits

Calculating the entropy for a random variable provides the basis for other measures such as mutual information (information gain).

It also provides the basis for calculating the difference between two probability distributions with cross-entropy and the KL-divergence.

This section provides more resources on the topic if you are looking to go deeper.

- Section 2.8: Information theory, Machine Learning: A Probabilistic Perspective, 2012.
- Section 1.6: Information Theory, Pattern Recognition and Machine Learning, 2006.
- Section 3.13: Information Theory, Deep Learning, 2016.

- Entropy (information theory), Wikipedia.
- Information gain in decision trees, Wikipedia.
- Information gain ratio, Wikipedia.

In this post, you discovered a gentle introduction to information entropy.

Specifically, you learned:

- Information theory is concerned with data compression and transmission and builds upon probability and supports machine learning.
- Information provides a way to quantify the amount of surprise for an event measured in bits.
- Entropy provides a measure of the average amount of information needed to represent an event drawn from a probability distribution for a random variable.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Bayesian Belief Networks appeared first on Machine Learning Mastery.

Probabilistic models can define relationships between variables and be used to calculate probabilities.

For example, fully conditional models may require an enormous amount of data to cover all possible cases, and probabilities may be intractable to calculate in practice. Simplifying assumptions such as the conditional independence of all random variables can be effective, such as in the case of Naive Bayes, although it is a drastically simplifying step.

An alternative is to develop a model that preserves known conditional dependence between random variables and conditional independence in all other cases. Bayesian networks are a type of probabilistic graphical model that explicitly captures the known conditional dependence with directed edges in a graph model. All missing connections define the conditional independencies in the model.

As such Bayesian Networks provide a useful tool to visualize the probabilistic model for a domain, review all of the relationships between the random variables, and reason about causal probabilities for scenarios given available evidence.

In this post, you will discover a gentle introduction to Bayesian Networks.

After reading this post, you will know:

- Bayesian networks are a type of probabilistic graphical model comprised of nodes and directed edges.
- Bayesian network models capture both conditionally dependent and conditionally independent relationships between random variables.
- Models can be prepared by experts or learned from data, then used for inference to estimate the probabilities for causal or subsequent events.

Let’s get started.

This tutorial is divided into five parts; they are:

- Challenge of Probabilistic Modeling
- Bayesian Belief Network as a Probabilistic Model
- How to Develop and Use a Bayesian Network
- Example of a Bayesian Network
- Bayesian Networks in Python

Probabilistic models can be challenging to design and use.

Most often, the problem is the lack of information about the domain required to fully specify the conditional dependence between random variables. If available, calculating the full conditional probability for an event can be impractical.

A common approach to addressing this challenge is to add some simplifying assumptions, such as assuming that all random variables in the model are conditionally independent. This is a drastic assumption, although it proves useful in practice, providing the basis for the Naive Bayes classification algorithm.

An alternative approach is to develop a probabilistic model of a problem with some conditional independence assumptions. This provides an intermediate approach between a fully conditional model and a fully conditionally independent model.

Bayesian belief networks are one example of a probabilistic model where some variables are conditionally independent.

Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the global assumption of conditional independence made by the naive Bayes classifier, but more tractable than avoiding conditional independence assumptions altogether.

— Page 184, Machine Learning, 1997.

A Bayesian belief network is a type of probabilistic graphical model.

A probabilistic graphical model (PGM), or simply “*graphical model*” for short, is a way of representing a probabilistic model with a graph structure.

The nodes in the graph represent random variables and the edges that connect the nodes represent the relationships between the random variables.

A graph comprises nodes (also called vertices) connected by links (also known as edges or arcs). In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables.

— Page 360, Pattern Recognition and Machine Learning, 2006.

- **Nodes**: Random variables in a graphical model.
- **Edges**: Relationships between random variables in a graphical model.

There are many different types of graphical models, although the two most commonly described are undirected models, such as the Markov random field, and directed models, such as the Bayesian Network.

In a Markov random field, the edges of the graph are undirected, meaning the graph may contain cycles. Bayesian Networks are more restrictive: the edges of the graph are directed, meaning they can only be navigated in one direction, and directed cycles are not allowed. The structure can be more generally referred to as a directed acyclic graph (DAG). The Hidden Markov Model (HMM), which is often mentioned alongside these models, also uses directed edges and can be viewed as a special case of a Bayesian Network.

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.

— Page 360, Pattern Recognition and Machine Learning, 2006.

A Bayesian Belief Network, or simply “*Bayesian Network*,” provides a simple way of applying Bayes Theorem to complex problems.

The networks are not exactly Bayesian by definition, although given that both the probability distributions for the random variables (nodes) and the relationships between the random variables (edges) are specified subjectively, the model can be thought to capture the “*belief*” about a complex domain.

Bayesian probability is the study of subjective probabilities or belief in an outcome, compared to the frequentist approach where probabilities are based purely on the past occurrence of the event.

A Bayesian Network captures the joint probabilities of the events represented by the model.

A Bayesian belief network describes the joint probability distribution for a set of variables.

— Page 185, Machine Learning, 1997.

Central to the Bayesian network is the notion of conditional independence.

Independence refers to a random variable that is unaffected by all other variables. A dependent variable is a random variable whose probability is conditional on one or more other random variables.

Conditional independence describes the relationship among multiple random variables, where a given variable may be conditionally independent of one or more other random variables. This does not mean that the variable is independent per se; instead, it is a clear definition that the variable is independent of specific other known random variables.

A probabilistic graphical model, such as a Bayesian Network, provides a way of defining a probabilistic model for a complex problem by stating all of the conditional independence assumptions for the known variables, whilst allowing the presence of unknown (latent) variables.

As such, both the presence and the absence of edges in the graphical model are important in the interpretation of the model.

A graphical model (GM) is a way to represent a joint distribution by making [Conditional Independence] CI assumptions. In particular, the nodes in the graph represent random variables, and the (lack of) edges represent CI assumptions. (A better name for these models would in fact be “independence diagrams” …

— Page 308, Machine Learning: A Probabilistic Perspective, 2012.

Bayesian networks provide useful benefits as a probabilistic model.

For example:

- **Visualization**. The model provides a direct way to visualize the structure of the model and motivate the design of new models.
- **Relationships**. Provides insights into the presence and absence of the relationships between random variables.
- **Computations**. Provides a way to structure complex probability calculations.

Designing a Bayesian Network requires defining at least three things:

- **Random Variables**. What are the random variables in the problem?
- **Conditional Relationships**. What are the conditional relationships between the variables?
- **Probability Distributions**. What are the probability distributions for each variable?

It may be possible for an expert in the problem domain to specify some or all of these aspects in the design of the model.

In many cases, the architecture or topology of the graphical model can be specified by an expert, but the probability distributions must be estimated from data from the domain.

Both the probability distributions and the graph structure itself can be estimated from data, although it can be a challenging process. As such, it is common to use learning algorithms for this purpose; for example, assuming a Gaussian distribution for continuous random variables and using gradient ascent for estimating the distribution parameters.

Once a Bayesian Network has been prepared for a domain, it can be used for reasoning, e.g. making decisions.

Reasoning is achieved via inference with the model for a given situation. For example, the outcome for some events is known and plugged into the random variables. The model can be used to estimate the probability of causes for the events or possible further outcomes.

Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest, conditioned on this evidence.

— Page 13, Bayesian Reasoning and Machine Learning, 2012.

Practical examples of using Bayesian Networks include medicine (symptoms and diseases), bioinformatics (traits and genes), and speech recognition (utterances and time).

We can make Bayesian Networks concrete with a small example.

Consider a problem with three random variables: A, B, and C. A is dependent upon B, and C is dependent upon B.

We can state the conditional dependencies as follows:

- A is conditionally dependent upon B, e.g. P(A|B)
- C is conditionally dependent upon B, e.g. P(C|B)

We know that C and A have no effect on each other.

We can also state the conditional independencies as follows:

- A is conditionally independent from C: P(A|B, C)
- C is conditionally independent from A: P(C|B, A)

Notice that the conditional dependence is stated in the presence of the conditional independence. That is, A is conditionally independent of C, or A is conditionally dependent upon B in the presence of C.

We might also state the conditional independence of A given C as the conditional dependence of A given B, as A is unaffected by C and can be calculated from A given B alone.

- P(A|C, B) = P(A|B)

We can see that B has no parents in the model, so its distribution is not conditioned on A or C; we can simply state it as the marginal probability P(B).

We can also write the joint probability of A and C given B or conditioned on B as the product of two conditional probabilities; for example:

- P(A, C | B) = P(A|B) * P(C|B)

The model summarizes the joint probability of P(A, B, C), calculated as:

- P(A, B, C) = P(A|B) * P(C|B) * P(B)
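To see the factorization in action, the sketch below assigns contrived probabilities to three binary variables. All of the numbers and function names are assumptions for illustration. It builds the joint probability from the three factors, confirms that the joint sums to 1.0, and confirms that *P(A, C | B)* recovered from the joint matches *P(A|B) * P(C|B)*.

```python
# sketch: joint probability for the A, B, C example with contrived probabilities
from itertools import product

# assumed P(B)
p_b = {True: 0.3, False: 0.7}
# assumed P(A=True | B) and P(C=True | B), keyed by the value of B
p_a_given_b = {True: 0.9, False: 0.2}
p_c_given_b = {True: 0.6, False: 0.1}

def bernoulli(p_true, value):
    # probability of a binary value given the probability of True
    return p_true if value else 1.0 - p_true

def joint(a, b, c):
    # P(A, B, C) = P(A|B) * P(C|B) * P(B)
    return bernoulli(p_a_given_b[b], a) * bernoulli(p_c_given_b[b], c) * p_b[b]

# the joint distribution should sum to 1.0 over all eight outcomes
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))

# P(A=True, C=True | B=True) recovered from the joint by marginalizing over A and C
p_b_true = sum(joint(a, True, c) for a, c in product([True, False], repeat=2))
p_ac_given_b = joint(True, True, True) / p_b_true

print('sum of joint: %.3f' % total)
print('P(A, C | B): %.3f' % p_ac_given_b)
print('P(A|B) * P(C|B): %.3f' % (p_a_given_b[True] * p_c_given_b[True]))
```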

We can draw the graph with one node per variable and a directed edge from B to A and from B to C.

Notice that the random variables are each assigned a node, and the conditional probabilities are stated as directed connections between the nodes. Also notice that it is not possible to navigate the graph in a cycle, e.g. no loops are possible when navigating from node to node via the edges.
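The absence of cycles can also be checked in code. The sketch below stores the example graph as a list of (parent, child) edges and uses a topological ordering test (Kahn's algorithm) to confirm that it is a DAG. The function name and the graph representation are assumptions for illustration.

```python
# sketch: check that the example graph is a directed acyclic graph
from collections import deque

def is_acyclic(nodes, edges):
    # count incoming edges and record the children of each node
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # repeatedly remove nodes that have no remaining parents
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # every node is removed only if there are no directed cycles
    return visited == len(nodes)

nodes = ['A', 'B', 'C']
# edges from the example: B is the parent of A and of C
edges = [('B', 'A'), ('B', 'C')]
print('example graph is a DAG: %s' % is_acyclic(nodes, edges))
# adding an edge from A back to B would create a cycle
print('with A -> B added: %s' % is_acyclic(nodes, edges + [('A', 'B')]))
```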

Also notice that the graph is useful even at this point where we don’t know the probability distributions for the variables.

You might want to extend this example by using contrived probabilities for discrete events for each random variable and practice some simple inference for different scenarios.
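One possible starting point for that exercise is sketched below. It uses contrived probabilities for binary versions of A, B, and C (all assumed values) and performs inference by enumeration: summing the joint probability over every assignment that is consistent with the evidence. The function names are hypothetical.

```python
# sketch: inference by enumeration on the A, B, C example
from itertools import product

VARIABLES = ('A', 'B', 'C')
# assumed P(B=True)
p_b_true = 0.3
# assumed P(A=True | B) and P(C=True | B), keyed by the value of B
p_a_true_given_b = {True: 0.9, False: 0.2}
p_c_true_given_b = {True: 0.6, False: 0.1}

def joint(assignment):
    # P(A, B, C) = P(A|B) * P(C|B) * P(B)
    a, b, c = assignment['A'], assignment['B'], assignment['C']
    p_b = p_b_true if b else 1.0 - p_b_true
    p_a = p_a_true_given_b[b] if a else 1.0 - p_a_true_given_b[b]
    p_c = p_c_true_given_b[b] if c else 1.0 - p_c_true_given_b[b]
    return p_a * p_c * p_b

def query(target, evidence):
    # P(target=True | evidence) by summing over all consistent assignments
    numerator, denominator = 0.0, 0.0
    for values in product([True, False], repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(assignment)
        denominator += p
        if assignment[target]:
            numerator += p
    return numerator / denominator

# reason from an observed effect back to its cause
p_b_given_a = query('B', {'A': True})
# once B is known, also observing A does not change the belief about C
p_c_given_b = query('C', {'B': True})
p_c_given_ab = query('C', {'A': True, 'B': True})
print('P(B | A): %.3f' % p_b_given_a)
print('P(C | B): %.3f' % p_c_given_b)
print('P(C | A, B): %.3f' % p_c_given_ab)
```

Enumeration is only practical for very small models, as the number of assignments doubles with each additional binary variable.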

Bayesian Networks can be developed and used for inference in Python.

A popular library for this is called PyMC and provides a range of tools for Bayesian modeling, including graphical models like Bayesian Networks.

At the time of writing, the most recent version of the library is called PyMC3 (the third major version of PyMC), and it was developed on top of the Theano mathematical computation library that offers fast automatic differentiation.

PyMC3 is a new open source probabilistic programming framework written in Python that uses Theano to compute gradients via automatic differentiation as well as compile probabilistic programs on-the-fly to C for increased speed.

— Probabilistic programming in Python using PyMC3, 2016.

More generally, the use of probabilistic graphical models in computer software used for inference is referred to as “*probabilistic programming*.”

This type of programming is called probabilistic programming, […] it is probabilistic in the sense that we create probability models using programming variables as the model’s components. Model components are first-class primitives within the PyMC framework.

— Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, 2015.

For an excellent primer on Bayesian methods generally with PyMC, see the free book by Cameron Davidson-Pilon titled “Bayesian Methods for Hackers.”

This section provides more resources on the topic if you are looking to go deeper.

- Bayesian Reasoning and Machine Learning, 2012.
- Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, 2015.

- Chapter 6: Bayesian Learning, Machine Learning, 1997.
- Chapter 8: Graphical Models, Pattern Recognition and Machine Learning, 2006.
- Chapter 10: Directed graphical models (Bayes nets), Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 14: Probabilistic Reasoning, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

- Graphical model, Wikipedia.
- Hidden Markov model, Wikipedia.
- Bayesian network, Wikipedia.
- Conditional independence, Wikipedia.
- Probabilistic Programming & Bayesian Methods for Hackers

In this post, you discovered a gentle introduction to Bayesian Networks.

Specifically, you learned:

- Bayesian networks are a type of probabilistic graphical model comprised of nodes and directed edges.
- Bayesian network models capture both conditionally dependent and conditionally independent relationships between random variables.
- Models can be prepared by experts or learned from data, then used for inference to estimate the probabilities for causal or subsequent events.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Implement Bayesian Optimization from Scratch in Python appeared first on Machine Learning Mastery.

In this tutorial, you will discover how to implement the **Bayesian Optimization algorithm** for complex optimization problems.

Global optimization is a challenging problem of finding an input that results in the minimum or maximum cost of a given objective function.

Typically, the form of the objective function is complex and intractable to analyze and is often non-convex, nonlinear, high-dimensional, noisy, and computationally expensive to evaluate.

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.

Bayesian Optimization is often used in applied machine learning to tune the hyperparameters of a given well-performing model on a validation dataset.

After completing this tutorial, you will know:

- Global optimization is a challenging problem that involves black box and often non-convex, non-linear, noisy, and computationally expensive objective functions.
- Bayesian Optimization provides a probabilistically principled method for global optimization.
- How to implement Bayesian Optimization from scratch and how to use open-source implementations.

Let’s get started.

This tutorial is divided into four parts; they are:

- Challenge of Function Optimization
- What Is Bayesian Optimization
- How to Perform Bayesian Optimization
- Hyperparameter Tuning With Bayesian Optimization

Global function optimization, or function optimization for short, involves finding the minimum or maximum of an objective function.

Samples are drawn from the domain and evaluated by the objective function to give a score or cost.

Let’s define some common terms:

- **Sample**. One example from the domain, represented as a vector.
- **Search Space**. Extent of the domain from which samples can be drawn.
- **Objective Function**. Function that takes a sample and returns a cost.
- **Cost**. Numeric score for a sample calculated via the objective function.

Samples comprise one or more variables and are generally easy to devise or create. One sample is often defined as a vector of variables with a predefined range in an n-dimensional space. This space must be sampled and explored in order to find the specific combination of variable values that results in the best cost.

The cost often has units that are specific to a given domain. Optimization is often described in terms of minimizing cost, as a maximization problem can easily be transformed into a minimization problem by negating the calculated cost. Together, the minimum and maximum of a function are referred to as the extremum of the function (plural: extrema).
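As a small illustration of turning a maximization problem into a minimization problem, the sketch below negates a score and confirms that the same input is selected either way. The function names `score` and `cost` and the coarse grid of candidates are hypothetical and chosen only for this illustration.

```python
# hypothetical score function that we want to maximize (peak at x = 0.3)
def score(x):
    return 1.0 - (x - 0.3) ** 2

# the cost to minimize is simply the negated score
def cost(x):
    return -score(x)

# candidate inputs on a coarse grid over [0, 1]
candidates = [i / 10.0 for i in range(11)]

# the input that minimizes the cost is the input that maximizes the score
best_min = min(candidates, key=cost)
best_max = max(candidates, key=score)
print('argmin of cost: %.1f, argmax of score: %.1f' % (best_min, best_max))
```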

The objective function is often easy to specify but can be computationally challenging to calculate or result in a noisy calculation of cost over time. The form of the objective function is unknown and is often highly nonlinear and highly multi-dimensional, with the dimensionality defined by the number of input variables. The function is also probably non-convex. This means that local extrema may or may not be the global extrema (e.g. could be misleading and result in premature convergence), hence the name of the task as global rather than local optimization.

Little is known about the objective function beyond whether the minimum or the maximum cost is sought. As such, it is often referred to as a black box function and the search process as black box optimization. Further, the objective function is sometimes called an oracle given the ability to only give answers.

Function optimization is a fundamental part of machine learning. Most machine learning algorithms involve the optimization of parameters (weights, coefficients, etc.) in response to training data. Optimization also refers to the process of finding the best set of hyperparameters that configure the training of a machine learning algorithm. Taking one step higher again, the selection of training data, data preparation, and machine learning algorithms themselves is also a problem of function optimization.

Summary of optimization in machine learning:

- **Algorithm Training**. Optimization of model parameters.
- **Algorithm Tuning**. Optimization of model hyperparameters.
- **Predictive Modeling**. Optimization of data, data preparation, and algorithm selection.

Many methods exist for function optimization, such as randomly sampling the variable search space, called random search, or systematically evaluating samples in a grid across the search space, called grid search.
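The two baseline methods can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical noise-free objective (`toy_objective`) that we want to maximize over the domain [0,1]; the function names and the sample count are not from the tutorial.

```python
from random import Random

# hypothetical noise-free objective to maximize (peak at x = 0.3)
def toy_objective(x):
    return 1.0 - (x - 0.3) ** 2

# grid search: evaluate evenly spaced samples across the domain [0, 1]
def grid_search(f, n_samples):
    grid = [i / (n_samples - 1) for i in range(n_samples)]
    return max(grid, key=f)

# random search: evaluate uniformly random samples from the domain [0, 1]
def random_search(f, n_samples, seed=1):
    rng = Random(seed)
    samples = [rng.random() for _ in range(n_samples)]
    return max(samples, key=f)

x_grid = grid_search(toy_objective, 21)
x_rand = random_search(toy_objective, 21)
print('grid search best x=%.3f, score=%.3f' % (x_grid, toy_objective(x_grid)))
print('random search best x=%.3f, score=%.3f' % (x_rand, toy_objective(x_rand)))
```

Neither method uses the scores it has already seen to decide where to sample next, which is the gap that directed methods aim to fill.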

More principled methods are able to learn from sampling the space so that future samples are directed toward the parts of the search space that are most likely to contain the extrema.

A directed approach to global optimization that uses probability is called Bayesian Optimization.

Bayesian Optimization is an approach that uses Bayes Theorem to direct the search in order to find the minimum or maximum of an objective function.

It is an approach that is most useful for objective functions that are complex, noisy, and/or expensive to evaluate.

Bayesian optimization is a powerful strategy for finding the extrema of objective functions that are expensive to evaluate. […] It is particularly useful when these evaluations are costly, when one does not have access to derivatives, or when the problem at hand is non-convex.

Recall that Bayes Theorem is an approach for calculating the conditional probability of an event:

- P(A|B) = P(B|A) * P(A) / P(B)

We can simplify this calculation by removing the normalizing value of *P(B)* and describe the conditional probability as a proportional quantity. This is useful as we are not interested in calculating a specific conditional probability, but instead in optimizing a quantity.

- P(A|B) ∝ P(B|A) * P(A)
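To see why dropping the normalizing value is harmless when we only want to compare or optimize, consider the small sketch below. The three candidate values of *A* and all probabilities are made-up numbers for illustration, and the candidates are assumed to be mutually exclusive and exhaustive so that *P(B)* is the sum of the unnormalized values.

```python
# hypothetical P(A) for three mutually exclusive candidates
p_a = {'a1': 0.5, 'a2': 0.3, 'a3': 0.2}
# hypothetical P(B|A) for the same candidates
p_b_given_a = {'a1': 0.1, 'a2': 0.4, 'a3': 0.3}

# proportional quantity: P(B|A) * P(A)
unnormalized = {a: p_b_given_a[a] * p_a[a] for a in p_a}

# normalized quantity: divide by P(B), the sum over all candidates
p_b = sum(unnormalized.values())
normalized = {a: v / p_b for a, v in unnormalized.items()}

# the best candidate is the same with or without normalization
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_normalized = max(normalized, key=normalized.get)
print(best_unnormalized, best_normalized)
```

Dividing every value by the same positive constant changes the scale but not the ordering, so the candidate that maximizes the proportional quantity also maximizes the conditional probability.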

The conditional probability that we are calculating is referred to generally as the *posterior* probability; the reverse conditional probability is sometimes referred to as the likelihood, and the marginal probability is referred to as the *prior* probability; for example:

- posterior ∝ likelihood * prior

This provides a framework that can be used to quantify the beliefs about an unknown objective function given samples from the domain and their evaluation via the objective function.

We can devise specific samples (*x1, x2, …, xn*) and evaluate them using the objective function *f(xi)* that returns the cost or outcome for the sample *xi*. Samples and their outcomes are collected sequentially and define our data *D*, e.g. *D = {x1, f(x1), …, xn, f(xn)}*, which is used to define the prior. The likelihood function is defined as the probability of observing the data given the function *P(D | f)*. This likelihood function will change as more observations are collected.

- P(f|D) ∝ P(D|f) * P(f)

The posterior represents everything we know about the objective function. It is an approximation of the objective function and can be used to estimate the cost of different candidate samples that we may want to evaluate.

In this way, the posterior probability is a surrogate objective function.

The posterior captures the updated beliefs about the unknown objective function. One may also interpret this step of Bayesian optimization as estimating the objective function with a surrogate function (also called a response surface).

**Surrogate Function**: Bayesian approximation of the objective function that can be sampled efficiently.

The surrogate function gives us an estimate of the objective function, which can be used to direct future sampling. Sampling involves careful use of the posterior in a function known as the “*acquisition*” function, e.g. for acquiring more samples. We want to use our belief about the objective function to sample the area of the search space that is most likely to pay off, therefore the acquisition will optimize the conditional probability of locations in the search to generate the next sample.

**Acquisition Function**: Technique by which the posterior is used to select the next sample from the search space.

Once additional samples and their evaluation via the objective function *f()* have been collected, they are added to data *D* and the posterior is then updated.

This process is repeated until the extrema of the objective function are located, a good enough result is located, or resources are exhausted.

The Bayesian Optimization algorithm can be summarized as follows:

- 1. Select a Sample by Optimizing the Acquisition Function.
- 2. Evaluate the Sample With the Objective Function.
- 3. Update the Data and, in turn, the Surrogate Function.
- 4. Go To 1.
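The four steps can be sketched as a generic loop. This is only a skeleton of the control flow for a maximization problem: the function name `bayesian_optimization` and the callables `fit_surrogate` and `select_next` are hypothetical placeholders for the surrogate model and the acquisition step, not part of any library.

```python
# generic skeleton of the loop; all function names here are hypothetical
def bayesian_optimization(objective, fit_surrogate, select_next, X, y, n_iter):
    # copy the starting data so the caller's lists are not modified
    X, y = list(X), list(y)
    # fit the initial surrogate on the starting data
    model = fit_surrogate(X, y)
    for _ in range(n_iter):
        # 1. select a sample by optimizing the acquisition function
        x = select_next(model, X, y)
        # 2. evaluate the sample with the objective function
        actual = objective(x)
        # 3. update the data and, in turn, the surrogate function
        X.append(x)
        y.append(actual)
        model = fit_surrogate(X, y)
        # 4. go to 1 (the next loop iteration)
    # report the sample with the largest observed score
    best = max(range(len(y)), key=lambda i: y[i])
    return X[best], y[best]
```

A fixed number of iterations stands in for the stopping conditions here; a check for a good enough result or an exhausted budget could replace it.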

In this section, we will explore how Bayesian Optimization works by developing an implementation from scratch for a simple one-dimensional test function.

First, we will define the test problem, then how to model the mapping of inputs to outputs with a surrogate function. Next, we will see how the surrogate function can be searched efficiently with an acquisition function before tying all of these elements together into the Bayesian Optimization procedure.

The first step is to define a test problem.

We will use a multimodal problem with five peaks, calculated as:

- y = x^2 * sin(5 * PI * x)^6

Where *x* is a real value in the range [0,1] and *PI* is the value of pi.

We will augment this function by adding Gaussian noise with a mean of zero and a standard deviation of 0.1. This will mean that the real evaluation will have a positive or negative random value added to it, making the function challenging to optimize.

The *objective()* function below implements this.

```python
# objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x**2 * sin(5 * pi * x)**6.0) + noise
```

We can test this function by first defining a grid-based sample of inputs from 0 to 1 with a step size of 0.01 across the domain.

```python
...
# grid-based sample of the domain [0,1]
X = arange(0, 1, 0.01)
```

We can then evaluate these samples using the target function without any noise to see what the real objective function looks like.

```python
...
# sample the domain without noise
y = [objective(x, 0) for x in X]
```

We can then evaluate these same points with noise to see what the objective function will look like when we are optimizing it.

```python
...
# sample the domain with noise
ynoise = [objective(x) for x in X]
```

We can look at all of the non-noisy objective function values to find the input that resulted in the best score and report it. This will be the optima, in this case, maxima, as we are maximizing the output of the objective function.

We would not know this in practice, but for our test problem, it is good to know the real best input and output of the function to see if the Bayesian Optimization algorithm can locate it.

```python
...
# find best result
ix = argmax(y)
print('Optima: x=%.3f, y=%.3f' % (X[ix], y[ix]))
```

Finally, we can create a plot, first showing the noisy evaluation as a scatter plot with input on the x-axis and score on the y-axis, then a line plot of the scores without any noise.

```python
...
# plot the points with noise
pyplot.scatter(X, ynoise)
# plot the points without noise
pyplot.plot(X, y)
# show the plot
pyplot.show()
```

The complete example of reviewing the test function that we wish to optimize is listed below.

```python
# example of the test problem
from math import sin
from math import pi
from numpy import arange
from numpy import argmax
from numpy.random import normal
from matplotlib import pyplot

# objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x**2 * sin(5 * pi * x)**6.0) + noise

# grid-based sample of the domain [0,1]
X = arange(0, 1, 0.01)
# sample the domain without noise
y = [objective(x, 0) for x in X]
# sample the domain with noise
ynoise = [objective(x) for x in X]
# find best result
ix = argmax(y)
print('Optima: x=%.3f, y=%.3f' % (X[ix], y[ix]))
# plot the points with noise
pyplot.scatter(X, ynoise)
# plot the points without noise
pyplot.plot(X, y)
# show the plot
pyplot.show()
```

Running the example first reports the global optima as an input with the value 0.9 that gives the score 0.81.

Optima: x=0.900, y=0.810

A plot is then created showing the noisy evaluation of the samples (dots) and the non-noisy and true shape of the objective function (line).

Your specific dots will differ given the stochastic nature of the noisy objective function.

Now that we have a test problem, let’s review how to train a surrogate function.

The surrogate function is a technique used to best approximate the mapping of input examples to an output score.

Probabilistically, it summarizes the conditional probability of an objective function (*f*), given the available data (*D*) or *P(f|D)*.

A number of techniques can be used for this, although the most popular is to treat the problem as a regression predictive modeling problem with the data representing the input and the score representing the output to the model. This is often best modeled using a random forest or a Gaussian Process.

A Gaussian Process, or GP, is a model that constructs a joint probability distribution over the variables, assuming a multivariate Gaussian distribution. As such, it is capable of efficient and effective summarization of a large number of functions and smooth transition as more observations are made available to the model.

This smooth structure and smooth transition to new functions based on data are desirable properties as we sample the domain, and the multivariate Gaussian basis to the model means that an estimate from the model will be a mean of a distribution with a standard deviation; that will be helpful later in the acquisition function.

As such, using a GP regression model is often preferred.
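To make the idea of "a mean with a standard deviation" concrete, here is a minimal from-scratch sketch of a GP prediction at a single new input. It is an illustration only and not the scikit-learn implementation: it assumes a zero-mean prior, an RBF kernel with a fixed length scale of 0.2, a small noise term on the diagonal, and three made-up observations. All function names are hypothetical.

```python
from math import exp, sqrt

# RBF kernel between two scalar inputs (length_scale is an assumed value)
def rbf(a, b, length_scale=0.2):
    return exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))

# solve the linear system A x = b by Gaussian elimination with partial pivoting
def solve(A, b):
    n = len(b)
    # augmented matrix [A | b], copied so the inputs are not modified
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        total = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - total) / M[i][i]
    return x

# posterior mean and standard deviation of a zero-mean GP at one new input
def gp_predict(X_train, y_train, x_new, noise=1e-6):
    n = len(X_train)
    # kernel matrix of the training inputs with a small noise term on the diagonal
    K = [[rbf(X_train[i], X_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # kernel values between the training inputs and the new input
    k_new = [rbf(x, x_new) for x in X_train]
    # mean = k_new^T K^-1 y
    weights = solve(K, y_train)
    mean = sum(k * w for k, w in zip(k_new, weights))
    # variance = k(x_new, x_new) - k_new^T K^-1 k_new
    v = solve(K, k_new)
    variance = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_new, v))
    return mean, sqrt(max(variance, 0.0))

# hypothetical observations of an objective function
X_train = [0.1, 0.4, 0.9]
y_train = [0.0, 0.2, 0.8]
for x_new in [0.4, 0.65]:
    mean, std = gp_predict(X_train, y_train, x_new)
    print('x=%.2f, mean=%.3f, std=%.3f' % (x_new, mean, std))
```

At an input that was already observed (0.4), the predicted mean stays close to the observed score and the standard deviation is near zero; between observations (0.65), the standard deviation is larger. This uncertainty estimate is what the acquisition function uses later.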

We can fit a GP regression model using the GaussianProcessRegressor scikit-learn implementation from a sample of inputs (*X*) and noisy evaluations from the objective function (*y*).

First, the model must be defined. An important aspect in defining the GP model is the kernel. This controls the shape of the function at specific points based on distance measures between actual data observations. Many different kernel functions can be used, and some may offer better performance for specific datasets.

By default, a Radial Basis Function, or RBF, is used that can work well.

```python
...
# define the model
model = GaussianProcessRegressor()
```

Once defined, the model can be fit on the training dataset directly by calling the *fit()* function.

The defined model can be fit again at any time with updated data concatenated to the existing data by another call to *fit()*.

```python
...
# fit the model
model.fit(X, y)
```

The model will estimate the cost for one or more samples provided to it.

The model is used by calling the *predict()* function. The result for a given sample will be a mean of the distribution at that point. We can also get the standard deviation of the distribution at that point in the function by specifying the argument *return_std=True*; for example:

```python
...
# returns the mean and the standard deviation for each sample
yhat, std = model.predict(X, return_std=True)
```

This function can result in warnings if the distribution is thin at a given point we are interested in sampling.

Therefore, we can silence all of the warnings when making a prediction. The *surrogate()* function below takes the fit model and one or more samples and returns the mean and standard deviation estimated costs whilst not printing any warnings.

```python
# surrogate or approximation for the objective function
def surrogate(model, X):
    # catch any warning generated when making a prediction
    with catch_warnings():
        # ignore generated warnings
        simplefilter("ignore")
        return model.predict(X, return_std=True)
```

We can call this function any time to estimate the cost of one or more samples, such as when we want to optimize the acquisition function in the next section.

For now, it is interesting to see what the surrogate function looks like across the domain after it is trained on a random sample.

We can achieve this by first fitting the GP model on a random sample of 100 data points and their real objective function values with noise. We can then plot a scatter plot of these points. Next, we can perform a grid-based sample across the input domain and estimate the cost at each point using the surrogate function and plot the result as a line.

We would expect the surrogate function to be a crude approximation of the true non-noisy objective function.

The *plot()* function below creates this plot, given the random data sample of the real noisy objective function and the fit model.

```python
# plot real observations vs surrogate function
def plot(X, y, model):
    # scatter plot of inputs and real objective function
    pyplot.scatter(X, y)
    # line plot of surrogate function across domain
    Xsamples = asarray(arange(0, 1, 0.001))
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    # show the plot
    pyplot.show()
```

Tying this together, the complete example of fitting a Gaussian Process regression model on noisy samples and plotting the sample vs. the surrogate function is listed below.

```python
# example of a gaussian process surrogate function
from math import sin
from math import pi
from numpy import arange
from numpy import asarray
from numpy.random import normal
from numpy.random import random
from matplotlib import pyplot
from warnings import catch_warnings
from warnings import simplefilter
from sklearn.gaussian_process import GaussianProcessRegressor

# objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x**2 * sin(5 * pi * x)**6.0) + noise

# surrogate or approximation for the objective function
def surrogate(model, X):
    # catch any warning generated when making a prediction
    with catch_warnings():
        # ignore generated warnings
        simplefilter("ignore")
        return model.predict(X, return_std=True)

# plot real observations vs surrogate function
def plot(X, y, model):
    # scatter plot of inputs and real objective function
    pyplot.scatter(X, y)
    # line plot of surrogate function across domain
    Xsamples = asarray(arange(0, 1, 0.001))
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    # show the plot
    pyplot.show()

# sample the domain sparsely with noise
X = random(100)
y = asarray([objective(x) for x in X])
# reshape into rows and cols
X = X.reshape(len(X), 1)
y = y.reshape(len(y), 1)
# define the model
model = GaussianProcessRegressor()
# fit the model
model.fit(X, y)
# plot the surrogate function
plot(X, y, model)
```

Running the example first draws the random sample, evaluates it with the noisy objective function, then fits the GP model.

The data sample and a grid of points across the domain evaluated via the surrogate function are then plotted as dots and a line respectively.

Your specific results will vary given the stochastic nature of the data sample. Consider running the example a few times.

In this case, as we expected, the plot resembles a crude version of the underlying non-noisy objective function, importantly with a peak around 0.9 where we know the true maxima is located.

Next, we must define a strategy for sampling the surrogate function.

The surrogate function is used to test a range of candidate samples in the domain.

From these results, one or more candidates can be selected and evaluated with the real, and in normal practice, computationally expensive cost function.

This involves two pieces: the search strategy used to navigate the domain in response to the surrogate function and the acquisition function that is used to interpret and score the response from the surrogate function.

A simple search strategy, such as a random sample or grid-based sample, can be used, although it is more common to use a local search strategy, such as the popular BFGS algorithm. In this case, we will use a random search or random sample of the domain in order to keep the example simple.

This involves first drawing a random sample of candidate samples from the domain, evaluating them with the acquisition function, then maximizing the acquisition function or choosing the candidate sample that gives the best score. The *opt_acquisition()* function below implements this.

```python
# optimize the acquisition function
def opt_acquisition(X, y, model):
    # random search, generate random samples
    Xsamples = random(100)
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    # calculate the acquisition function for each sample
    scores = acquisition(X, Xsamples, model)
    # locate the index of the largest scores
    ix = argmax(scores)
    return Xsamples[ix, 0]
```

The acquisition function is responsible for scoring or estimating the likelihood that a given candidate sample (input) is worth evaluating with the real objective function.

We could just use the surrogate score directly. Alternately, given that we have chosen a Gaussian Process model as the surrogate function, we can use the probabilistic information from this model in the acquisition function to calculate the probability that a given sample is worth evaluating.

There are many different types of probabilistic acquisition functions that can be used, each providing a different trade-off for how exploitative (greedy) and explorative they are.

Three common examples include:

- Probability of Improvement (PI).
- Expected Improvement (EI).
- Lower Confidence Bound (LCB).

The Probability of Improvement method is the simplest, whereas the Expected Improvement method is the most commonly used.

In this case, we will use the simpler Probability of Improvement method, which is calculated as the normal cumulative probability of the normalized expected improvement, calculated as follows:

- PI = cdf((mu - best_mu) / stdev)

Where PI is the probability of improvement, *cdf()* is the normal cumulative distribution function, *mu* is the mean of the surrogate function for a given sample *x*, *stdev* is the standard deviation of the surrogate function for a given sample *x*, and *best_mu* is the mean of the surrogate function for the best sample found so far.

We can add a very small number to the standard deviation to avoid a divide by zero error.
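As a worked example of the calculation, the sketch below scores three candidates using only the Python standard library. The candidate means and standard deviations are made-up values standing in for surrogate estimates, and the function name `probability_of_improvement` is hypothetical.

```python
from statistics import NormalDist

# probability of improvement for one candidate, given surrogate estimates
def probability_of_improvement(mu, stdev, best_mu):
    # a small constant avoids division by zero when stdev is 0
    z = (mu - best_mu) / (stdev + 1e-9)
    # cumulative probability under the standard normal distribution
    return NormalDist().cdf(z)

# hypothetical mean of the surrogate function for the best sample so far
best_mu = 0.70
# hypothetical (mean, stdev) surrogate estimates for three candidates
candidates = {'a': (0.70, 0.10), 'b': (0.80, 0.10), 'c': (0.60, 0.30)}
for name, (mu, stdev) in candidates.items():
    pi = probability_of_improvement(mu, stdev, best_mu)
    print('%s: mu=%.2f, stdev=%.2f, PI=%.3f' % (name, mu, stdev, pi))
```

A candidate whose estimated mean equals the best mean scores 0.5, a candidate with a higher mean scores above 0.5, and a candidate with a lower mean scores below 0.5, although a large standard deviation pulls its score back toward 0.5.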

The *acquisition()* function below implements this given the current training dataset of input samples, an array of new candidate samples, and the fit GP model.

```python
# probability of improvement acquisition function
def acquisition(X, Xsamples, model):
    # calculate the best surrogate score found so far
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    # calculate mean and stdev via surrogate function
    mu, std = surrogate(model, Xsamples)
    # flatten the mean to 1D so it aligns with std, whether predict returns shape (n,) or (n, 1)
    mu = mu.ravel()
    # calculate the probability of improvement
    probs = norm.cdf((mu - best) / (std+1E-9))
    return probs
```
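The tutorial uses Probability of Improvement, but for comparison, the Expected Improvement method listed earlier can be sketched in a similar way. This is a minimal illustration for a maximization problem using the standard closed form under a Gaussian prediction; it is not part of the tutorial's code, and the function name `expected_improvement` is hypothetical.

```python
from statistics import NormalDist

# expected improvement for one candidate, given surrogate estimates
def expected_improvement(mu, stdev, best_mu):
    # with no uncertainty, the improvement is known exactly
    if stdev <= 0.0:
        return max(mu - best_mu, 0.0)
    standard_normal = NormalDist()
    improvement = mu - best_mu
    z = improvement / stdev
    # weigh the size of the improvement by how likely it is
    return improvement * standard_normal.cdf(z) + stdev * standard_normal.pdf(z)

# hypothetical surrogate estimates: same mean, different uncertainty
print('EI with stdev=0.1: %.4f' % expected_improvement(0.6, 0.1, 0.7))
print('EI with stdev=0.3: %.4f' % expected_improvement(0.6, 0.3, 0.7))
```

Unlike PI, this score accounts for the size of a possible improvement and not only its probability, so with the same mean a more uncertain candidate receives a higher score.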

We can tie all of this together into the Bayesian Optimization algorithm.

The main algorithm involves cycles of selecting candidate samples, evaluating them with the objective function, then updating the GP model.

```python
...
# perform the optimization process
for i in range(100):
    # select the next point to sample
    x = opt_acquisition(X, y, model)
    # sample the point
    actual = objective(x)
    # summarize the finding for our own reporting
    est, _ = surrogate(model, [[x]])
    print('>x=%.3f, f()=%3f, actual=%.3f' % (x, est, actual))
    # add the data to the dataset
    X = vstack((X, [[x]]))
    y = vstack((y, [[actual]]))
    # update the model
    model.fit(X, y)
```

The complete example is listed below.

```python
# example of bayesian optimization for a 1d function from scratch
from math import sin
from math import pi
from numpy import arange
from numpy import vstack
from numpy import argmax
from numpy import asarray
from numpy.random import normal
from numpy.random import random
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from warnings import catch_warnings
from warnings import simplefilter
from matplotlib import pyplot

# objective function
def objective(x, noise=0.1):
    noise = normal(loc=0, scale=noise)
    return (x**2 * sin(5 * pi * x)**6.0) + noise

# surrogate or approximation for the objective function
def surrogate(model, X):
    # catch any warning generated when making a prediction
    with catch_warnings():
        # ignore generated warnings
        simplefilter("ignore")
        return model.predict(X, return_std=True)

# probability of improvement acquisition function
def acquisition(X, Xsamples, model):
    # calculate the best surrogate score found so far
    yhat, _ = surrogate(model, X)
    best = max(yhat)
    # calculate mean and stdev via surrogate function
    mu, std = surrogate(model, Xsamples)
    # flatten the mean to 1D so it aligns with std, whether predict returns shape (n,) or (n, 1)
    mu = mu.ravel()
    # calculate the probability of improvement
    probs = norm.cdf((mu - best) / (std+1E-9))
    return probs

# optimize the acquisition function
def opt_acquisition(X, y, model):
    # random search, generate random samples
    Xsamples = random(100)
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    # calculate the acquisition function for each sample
    scores = acquisition(X, Xsamples, model)
    # locate the index of the largest scores
    ix = argmax(scores)
    return Xsamples[ix, 0]

# plot real observations vs surrogate function
def plot(X, y, model):
    # scatter plot of inputs and real objective function
    pyplot.scatter(X, y)
    # line plot of surrogate function across domain
    Xsamples = asarray(arange(0, 1, 0.001))
    Xsamples = Xsamples.reshape(len(Xsamples), 1)
    ysamples, _ = surrogate(model, Xsamples)
    pyplot.plot(Xsamples, ysamples)
    # show the plot
    pyplot.show()

# sample the domain sparsely with noise
X = random(100)
y = asarray([objective(x) for x in X])
# reshape into rows and cols
X = X.reshape(len(X), 1)
y = y.reshape(len(y), 1)
# define the model
model = GaussianProcessRegressor()
# fit the model
model.fit(X, y)
# plot before hand
plot(X, y, model)
# perform the optimization process
for i in range(100):
    # select the next point to sample
    x = opt_acquisition(X, y, model)
    # sample the point
    actual = objective(x)
    # summarize the finding
    est, _ = surrogate(model, [[x]])
    print('>x=%.3f, f()=%3f, actual=%.3f' % (x, est, actual))
    # add the data to the dataset
    X = vstack((X, [[x]]))
    y = vstack((y, [[actual]]))
    # update the model
    model.fit(X, y)

# plot all samples and the final surrogate function
plot(X, y, model)
# best result
ix = argmax(y)
print('Best Result: x=%.3f, y=%.3f' % (X[ix], y[ix]))
```

Running the example first creates an initial random sample of the search space and evaluation of the results. Then a GP model is fit on this data.

Your specific results will vary given the stochastic nature of the sampling of the domain. Try running the example a few times.

A plot is created showing the raw observations as dots and the surrogate function across the entire domain. In this case, the initial sample has a good spread across the domain and the surrogate function has a bias towards the part of the domain where we know the optima is located.

The algorithm then iterates for 100 cycles, selecting samples, evaluating them, and adding them to the dataset to update the surrogate function, and over again.

Each cycle reports the selected input value, the estimated score from the surrogate function, and the actual score. Ideally, these scores would get closer and closer as the algorithm converges on one area of the search space.

```
...
>x=0.922, f()=0.661501, actual=0.682
>x=0.895, f()=0.661668, actual=0.905
>x=0.928, f()=0.648008, actual=0.403
>x=0.908, f()=0.674864, actual=0.750
>x=0.436, f()=0.071377, actual=-0.115
```

Next, a final plot is created with the same form as the prior plot.

This time, all 200 samples evaluated during the optimization task are plotted. We would expect an overabundance of sampling around the known optima, and this is what we see, with many dots around 0.9. We also see that the surrogate function has a stronger representation of the underlying target domain.

Finally, the best input and its objective function score are reported.

We know the optima has an input of 0.9 and an output of 0.810 if there was no sampling noise.

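Given the sampling noise, the optimization algorithm gets close in this case, suggesting an input of 0.905. Note that the reported score of 1.150 is a single noisy observation, which is why it exceeds the noise-free maximum of 0.810.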

Best Result: x=0.905, y=1.150

It can be a useful exercise to implement Bayesian Optimization to learn how it works.

In practice, when using Bayesian Optimization on a project, it is a good idea to use a standard implementation provided in an open-source library. This is to both avoid bugs and to leverage a wider range of configuration options and speed improvements.

Two popular libraries for Bayesian Optimization include Scikit-Optimize and HyperOpt. In machine learning, these libraries are often used to tune the hyperparameters of algorithms.

Hyperparameter tuning is a good fit for Bayesian Optimization because the evaluation function is computationally expensive (e.g. training models for each set of hyperparameters) and noisy (e.g. noise in training data and stochastic learning algorithms).

In this section, we will take a brief look at how to use the Scikit-Optimize library to optimize the hyperparameters of a k-nearest neighbor classifier for a simple test classification problem. This will provide a useful template that you can use on your own projects.

The Scikit-Optimize project is designed to provide access to Bayesian Optimization for applications that use SciPy and NumPy, or applications that use scikit-learn machine learning algorithms.

First, the library must be installed, which can be achieved easily using pip; for example:

sudo pip install scikit-optimize

It is also assumed that you have scikit-learn installed for this example.

Once installed, there are two ways that scikit-optimize can be used to optimize the hyperparameters of a scikit-learn algorithm. The first is to perform the optimization directly on a search space, and the second is to use the BayesSearchCV class, a sibling of the scikit-learn native classes for random and grid searching.

In this example, we will use the simpler approach of optimizing the hyperparameters directly.

The first step is to prepare the data and define the model. We will use a simple test classification problem via the make_blobs() function with 500 examples, each with two features and three class labels. We will then use a KNeighborsClassifier algorithm.

```python
...
# generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2)
# define the model
model = KNeighborsClassifier()
```

Next, we must define the search space.

In this case, we will tune the number of neighbors (*n_neighbors*) and the power parameter of the Minkowski distance metric (*p*), where *p=1* gives Manhattan distance and *p=2* gives Euclidean distance. This requires ranges be defined for a given data type. In this case, they are Integers, defined with the min, max, and the name of the parameter to the scikit-learn model. For your algorithm, you can just as easily optimize *Real()* and *Categorical()* data types.

```python
...
# define the space of hyperparameters to search
search_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]
```

Next, we need to define a function that will be used to evaluate a given set of hyperparameters. We want to minimize this function, therefore smaller values returned must indicate a better performing model.

We can use the *use_named_args()* decorator from the scikit-optimize project on the function definition that allows the function to be called directly with a specific set of parameters from the search space.

As such, our custom function will take the hyperparameter values as arguments, which can be provided to the model directly in order to configure it. We can define these arguments generically in python using the ***params* argument to the function, then pass them to the model via the set_params(**) function.

Now that the model is configured, we can evaluate it. In this case, we will use 5-fold cross-validation on our dataset and evaluate the accuracy for each fold. We can then report the performance of the model as one minus the mean accuracy across these folds. This means that a perfect model with an accuracy of 1.0 will return a value of 0.0 (1.0 - mean accuracy).

This function is defined after we have loaded the dataset and defined the model so that both the dataset and model are in scope and can be used directly.

```python
# define the function used to evaluate a given configuration
@use_named_args(search_space)
def evaluate_model(**params):
    # configure the model with the given hyperparameters
    model.set_params(**params)
    # calculate 5-fold cross validation
    result = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='accuracy')
    # calculate the mean of the scores
    estimate = mean(result)
    return 1.0 - estimate
```
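As a quick check of the arithmetic in the function above, here is a minimal sketch with made-up fold accuracies. The values are hypothetical and are not results from the tutorial.

```python
from statistics import mean

# hypothetical accuracy scores from 5-fold cross-validation
fold_accuracy = [0.98, 0.97, 0.99, 0.96, 0.98]

# cost to minimize: one minus the mean accuracy
cost = 1.0 - mean(fold_accuracy)
print('mean accuracy=%.3f, cost=%.3f' % (mean(fold_accuracy), cost))
```

A lower cost corresponds to a higher mean accuracy, which is the direction the minimizer works in.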

Next, we can perform the optimization.

This is achieved by calling the gp_minimize() function with the name of the objective function and the defined search space.

By default, this function will use a '*gp_hedge*' acquisition function that tries to figure out the best strategy, but this can be configured via the *acq_func* argument. The optimization will also run for 100 iterations by default, but this can be controlled via the *n_calls* argument.

```python
...
# perform optimization
result = gp_minimize(evaluate_model, search_space)
```

Once run, we can access the best score via the “fun” property and the best set of hyperparameters via the “*x*” array property.

```python
...
# summarizing finding:
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))
```

Tying this all together, the complete example is listed below.

```python
# example of bayesian optimization with scikit-optimize
from numpy import mean
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from skopt.space import Integer
from skopt.utils import use_named_args
from skopt import gp_minimize

# generate 2d classification dataset
X, y = make_blobs(n_samples=500, centers=3, n_features=2)
# define the model
model = KNeighborsClassifier()
# define the space of hyperparameters to search
search_space = [Integer(1, 5, name='n_neighbors'), Integer(1, 2, name='p')]

# define the function used to evaluate a given configuration
@use_named_args(search_space)
def evaluate_model(**params):
    # configure the model with the given hyperparameters
    model.set_params(**params)
    # calculate 5-fold cross validation
    result = cross_val_score(model, X, y, cv=5, n_jobs=-1, scoring='accuracy')
    # calculate the mean of the scores
    estimate = mean(result)
    return 1.0 - estimate

# perform optimization
result = gp_minimize(evaluate_model, search_space)
# summarizing finding:
print('Best Accuracy: %.3f' % (1.0 - result.fun))
print('Best Parameters: n_neighbors=%d, p=%d' % (result.x[0], result.x[1]))
```

Running the example executes the hyperparameter tuning using Bayesian Optimization.

The code may report many warning messages, such as:

UserWarning: The objective has been evaluated at this point before.

This is to be expected and is caused by the same hyperparameter configuration being evaluated more than once.

Your specific results will vary given the stochastic nature of the test problem. Try running the example a few times.

In this case, the model achieved about 97.6% accuracy via mean 5-fold cross-validation with 3 neighbors and a *p* value of 2 (the power parameter of the Minkowski distance, where 2 corresponds to Euclidean distance).

    Best Accuracy: 0.976
    Best Parameters: n_neighbors=3, p=2

This section provides more resources on the topic if you are looking to go deeper.

- A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning, 2010.
- Practical Bayesian Optimization of Machine Learning Algorithms, 2012.
- A Tutorial on Bayesian Optimization, 2018.

- Gaussian Processes, Scikit-Learn API.
- Hyperopt: Distributed Asynchronous Hyper-parameter Optimization
- Scikit-Optimize Project.
- Tuning a scikit-learn estimator with skopt

- Global optimization, Wikipedia.
- Bayesian optimization, Wikipedia.
- Bayesian optimization, 2018.
- How does Bayesian optimization work?, Quora.

In this tutorial, you discovered Bayesian Optimization for directed search of complex optimization problems.

Specifically, you learned:

- Global optimization is a challenging problem that involves black box and often non-convex, non-linear, noisy, and computationally expensive objective functions.
- Bayesian Optimization provides a probabilistically principled method for global optimization.
- How to implement Bayesian Optimization from scratch and how to use open-source implementations.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Implement Bayesian Optimization from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post How to Develop a Naive Bayes Classifier from Scratch in Python appeared first on Machine Learning Mastery.

]]>Classification is a predictive modeling problem that involves assigning a label to a given input data sample.

The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample. Bayes Theorem provides a principled way for calculating this conditional probability, although in practice it requires an enormous number of samples (a very large dataset) and is computationally expensive.

Instead, the calculation of Bayes Theorem can be simplified by making some assumptions, such as assuming that each input variable is independent of all other input variables given the class label. Although a dramatic and unrealistic assumption, this has the effect of making the calculations of the conditional probability tractable and results in an effective classification model referred to as Naive Bayes.

In this tutorial, you will discover the Naive Bayes algorithm for classification predictive modeling.

After completing this tutorial, you will know:

- How to frame classification predictive modeling as a conditional probability model.
- How to use Bayes Theorem to solve the conditional probability model of classification.
- How to implement simplified Bayes Theorem for classification, called the Naive Bayes algorithm.

Let’s get started.

**Updated Oct/2019**: Fixed minor inconsistency issue in math notation.

This tutorial is divided into five parts; they are:

- Conditional Probability Model of Classification
- Simplified or Naive Bayes
- How to Calculate the Prior and Conditional Probabilities
- Worked Example of Naive Bayes
- 5 Tips When Using Naive Bayes

In machine learning, we are often interested in a predictive modeling problem where we want to predict a class label for a given observation. For example, classifying the species of plant based on measurements of the flower.

Problems of this type are referred to as classification predictive modeling problems, as opposed to regression problems that involve predicting a numerical value. The observation or input to the model is referred to as *X* and the class label or output of the model is referred to as *y*.

Together, X and y represent observations collected from the domain, i.e. a table or matrix (columns and rows or features and samples) of training data used to fit a model. The model must learn how to map specific examples to class labels, or *y = f(X)*, in a way that minimizes the error of misclassification.

One approach to solving this problem is to develop a probabilistic model. From a probabilistic perspective, we are interested in estimating the conditional probability of the class label, given the observation.

For example, a classification problem may have k class labels *y1, y2, …, yk* and n input variables, *X1, X2, …, Xn*. We can calculate the conditional probability for a class label with a given instance or set of input values for each column *x1, x2, …, xn* as follows:

- P(yi | x1, x2, …, xn)

The conditional probability can then be calculated for each class label in the problem and the label with the highest probability can be returned as the most likely classification.

The conditional probability can be calculated using the joint probability, although it would be intractable. Bayes Theorem provides a principled way for calculating the conditional probability.

The simple form of the calculation for Bayes Theorem is as follows:

- P(A|B) = P(B|A) * P(A) / P(B)

Where the probability that we are interested in calculating P(A|B) is called the posterior probability and the marginal probability of the event P(A) is called the prior.

We can frame classification as a conditional probability problem with Bayes Theorem as follows:

- P(yi | x1, x2, …, xn) = P(x1, x2, …, xn | yi) * P(yi) / P(x1, x2, …, xn)

The prior *P(yi)* is easy to estimate from a dataset, but the conditional probability of the observation based on the class *P(x1, x2, …, xn | yi)* is not feasible unless the number of examples is extraordinarily large, e.g. large enough to effectively estimate the probability distribution for all different possible combinations of values.

As such, the direct application of Bayes Theorem also becomes intractable, especially as the number of variables or features (*n*) increases.
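
To get a feel for why the direct approach becomes intractable, the sketch below counts how many conditional probabilities would need to be estimated if every combination of input values were modeled separately for each class. The function name `joint_table_size()` and the choice of binary-valued inputs are assumptions made for illustration only.

```python
# count the entries in the table P(x1, ..., xn | yi) when every
# combination of input values is modeled for every class
# (hypothetical helper, not part of any library)
def joint_table_size(n_features, values_per_feature, n_classes):
    # one probability per combination of feature values, per class
    return n_classes * values_per_feature ** n_features

# two classes, binary inputs, increasing number of input variables
for n in (2, 10, 30):
    print('n=%d, probabilities to estimate: %d' % (n, joint_table_size(n, 2, 2)))
```

The count grows exponentially with the number of input variables, so even a modest number of features would need far more training examples than are typically available.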

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

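Bayes Theorem itself makes no independence assumption, which means the conditional probability *P(x1, x2, …, xn | yi)* must account for each input variable potentially depending upon all other variables. This is a cause of complexity in the calculation. We can simplify this by assuming that the input variables are independent of each other given the class label.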

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

First, the denominator is removed from the calculation *P(x1, x2, …, xn)* as it is a constant used in calculating the conditional probability of each class for a given instance and has the effect of normalizing the result.

- P(yi | x1, x2, …, xn) ∝ P(x1, x2, …, xn | yi) * P(yi)

Next, the conditional probability of all variables given the class label is changed into separate conditional probabilities of each variable value given the class label. These separate conditional probabilities are then multiplied together. For example:

- P(yi | x1, x2, …, xn) ∝ P(x1|yi) * P(x2|yi) * … * P(xn|yi) * P(yi)

This calculation can be performed for each of the class labels, and the label with the largest probability can be selected as the classification for the given instance. This decision rule is referred to as the maximum a posteriori, or MAP, decision rule.
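
As a rough sketch of this decision rule, the example below scores a single instance with two categorical inputs for each of two classes and returns the label with the largest score. The priors and conditional probability tables are made-up values, and the names `naive_bayes_scores()` and `map_decision()` are hypothetical helpers rather than part of any library.

```python
# made-up priors P(yi) for two class labels
priors = {'y1': 0.4, 'y2': 0.6}
# made-up conditional probability tables P(xj | yi), one per input variable
likelihoods = {
    'y1': [{'a': 0.8, 'b': 0.2}, {'a': 0.3, 'b': 0.7}],
    'y2': [{'a': 0.1, 'b': 0.9}, {'a': 0.5, 'b': 0.5}],
}

# calculate the unnormalized score P(x1|yi) * ... * P(xn|yi) * P(yi) per class
def naive_bayes_scores(x, priors, likelihoods):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for value, table in zip(x, likelihoods[label]):
            score *= table[value]
        scores[label] = score
    return scores

# maximum a posteriori (MAP) decision: label with the largest score
def map_decision(scores):
    return max(scores, key=scores.get)

scores = naive_bayes_scores(['a', 'b'], priors, likelihoods)
print(scores)
print('Predicted: %s' % map_decision(scores))
```

Because the denominator is the same for every class, comparing the unnormalized scores gives the same winner as comparing the full posterior probabilities.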

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.

The word “naive” is French and typically has a diaeresis (umlaut) over the “i”, which is commonly left out for simplicity, and “Bayes” is capitalized as it is named for Reverend Thomas Bayes.

Now that we know what Naive Bayes is, we can take a closer look at how to calculate the elements of the equation.

The calculation of the prior P(yi) is straightforward. It can be estimated by dividing the frequency of observations in the training dataset that have the class label by the total number of examples (rows) in the training dataset. For example:

- P(yi) = examples with yi / total examples
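
A minimal sketch of this calculation is given below, assuming the class labels are available as a plain Python list. The helper name `estimate_priors()` and the small list of labels are illustrative assumptions.

```python
from collections import Counter

# estimate P(yi) as the fraction of training examples with each label
def estimate_priors(labels):
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# contrived class labels for eight training examples
y = [0, 0, 0, 1, 1, 0, 1, 0]
priors = estimate_priors(y)
print(priors)
```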

The conditional probability for a feature value given the class label can also be estimated from the data. Specifically, one data distribution is estimated per variable, using only those data examples that belong to a given class. This means that if there are *k* classes and *n* variables, then *k × n* different probability distributions must be created and maintained.

A different approach is required depending on the data type of each feature. Specifically, the data is used to estimate the parameters of one of three standard probability distributions.

In the case of categorical variables, such as counts or labels, a multinomial distribution can be used. If the variables are binary, such as yes/no or true/false, a binomial distribution can be used. If a variable is numerical, such as a measurement, often a Gaussian distribution is used.

- **Binary**: Binomial distribution.
- **Categorical**: Multinomial distribution.
- **Numeric**: Gaussian distribution.

These three distributions are so common that the Naive Bayes implementation is often named after the distribution. For example:

- **Binomial Naive Bayes**: Naive Bayes that uses a binomial distribution.
- **Multinomial Naive Bayes**: Naive Bayes that uses a multinomial distribution.
- **Gaussian Naive Bayes**: Naive Bayes that uses a Gaussian distribution.

A dataset with mixed data types for the input variables may require the selection of different types of data distributions for each variable.

Using one of the three common distributions is not mandatory; for example, if a real-valued variable is known to have a different specific distribution, such as exponential, then that specific distribution may be used instead. If a real-valued variable does not have a well-defined distribution, such as bimodal or multimodal, then a kernel density estimator can be used to estimate the probability distribution instead.
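
The sketch below shows one way the parameters for the three common cases might be estimated from the values of a single variable for a single class. The function names and toy data are assumptions for illustration, and the population standard deviation is used to match the default behavior of the NumPy *std()* function.

```python
from collections import Counter
from statistics import fmean, pstdev

# binary variable: estimate the probability of a True value
def fit_binary(values):
    return sum(1 for v in values if v) / len(values)

# categorical variable: estimate the relative frequency of each category
def fit_categorical(values):
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}

# numerical variable: estimate the mean and standard deviation of a Gaussian
def fit_gaussian(values):
    return fmean(values), pstdev(values)

print(fit_binary([True, False, True, True]))
print(fit_categorical(['red', 'blue', 'red', 'green']))
print(fit_gaussian([1.0, 2.0, 3.0, 4.0]))
```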

The Naive Bayes algorithm has proven effective and is therefore popular for text classification tasks. The words in a document may be encoded as binary (word present), count (word occurrence), or frequency (tf/idf) input vectors, with binomial, multinomial, or Gaussian probability distributions used respectively.

In this section, we will make the Naive Bayes calculation concrete with a small example on a machine learning dataset.

We can generate a small contrived binary (2 class) classification problem using the make_blobs() function from the scikit-learn API.

The example below generates 100 examples with two numerical input variables, each assigned one of two classes.

    # example of generating a small classification dataset
    from sklearn.datasets import make_blobs
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # summarize
    print(X.shape, y.shape)
    print(X[:5])
    print(y[:5])

Running the example generates the dataset and summarizes the size, confirming the dataset was generated as expected.

The “*random_state*” argument is set to 1, ensuring that the same random sample of observations is generated each time the code is run.

The input and output elements of the first five examples are also printed, showing that indeed, the two input variables are numeric and the class labels are either 0 or 1 for each example.

    (100, 2) (100,)
    [[-10.6105446 4.11045368]
     [ 9.05798365 0.99701708]
     [ 8.705727 1.36332954]
     [ -8.29324753 2.35371596]
     [ 6.5954554 2.4247682 ]]
    [0 1 1 0 1]

We will model the numerical input variables using a Gaussian probability distribution.

This can be achieved using the norm SciPy API. First, the distribution can be constructed by specifying the parameters of the distribution, e.g. the mean and standard deviation, then the probability density function can be evaluated for specific values using the norm.pdf() function.

We can estimate the parameters of the distribution from the dataset using the *mean()* and *std()* NumPy functions.

The *fit_distribution()* function below takes a sample of data for one variable and fits a data distribution.

    # fit a probability distribution to a univariate data sample
    def fit_distribution(data):
        # estimate parameters
        mu = mean(data)
        sigma = std(data)
        print(mu, sigma)
        # fit distribution
        dist = norm(mu, sigma)
        return dist

Recall that we are interested in the conditional probability of each input variable. This means we need one distribution for each of the input variables, and one set of distributions for each of the class labels, or four distributions in total.

First, we must split the data into groups of samples for each of the class labels.

    ...
    # sort data into classes
    Xy0 = X[y == 0]
    Xy1 = X[y == 1]
    print(Xy0.shape, Xy1.shape)

We can then use these groups to calculate the prior probabilities for a data sample belonging to each group.

This will be 50% exactly given that we have created the same number of examples in each of the two classes; nevertheless, we will calculate these priors for completeness.

    ...
    # calculate priors
    priory0 = len(Xy0) / len(X)
    priory1 = len(Xy1) / len(X)
    print(priory0, priory1)

Finally, we can call the *fit_distribution()* function that we defined to prepare a probability distribution for each variable, for each class label.

    ...
    # create PDFs for y==0
    distX1y0 = fit_distribution(Xy0[:, 0])
    distX2y0 = fit_distribution(Xy0[:, 1])
    # create PDFs for y==1
    distX1y1 = fit_distribution(Xy1[:, 0])
    distX2y1 = fit_distribution(Xy1[:, 1])

Tying this all together, the complete probabilistic model of the dataset is listed below.

    # summarize probability distributions of the dataset
    from sklearn.datasets import make_blobs
    from scipy.stats import norm
    from numpy import mean
    from numpy import std

    # fit a probability distribution to a univariate data sample
    def fit_distribution(data):
        # estimate parameters
        mu = mean(data)
        sigma = std(data)
        print(mu, sigma)
        # fit distribution
        dist = norm(mu, sigma)
        return dist

    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # sort data into classes
    Xy0 = X[y == 0]
    Xy1 = X[y == 1]
    print(Xy0.shape, Xy1.shape)
    # calculate priors
    priory0 = len(Xy0) / len(X)
    priory1 = len(Xy1) / len(X)
    print(priory0, priory1)
    # create PDFs for y==0
    distX1y0 = fit_distribution(Xy0[:, 0])
    distX2y0 = fit_distribution(Xy0[:, 1])
    # create PDFs for y==1
    distX1y1 = fit_distribution(Xy1[:, 0])
    distX2y1 = fit_distribution(Xy1[:, 1])

Running the example first splits the dataset into two groups for the two class labels and confirms the size of each group is even and the priors are 50%.

Probability distributions are then prepared for each variable for each class label and the mean and standard deviation parameters of each distribution are reported, confirming that the distributions differ.

    (50, 2) (50, 2)
    0.5 0.5
    -1.5632888906409914 0.787444265443213
    4.426680361487157 0.958296071258367
    -9.681177100524485 0.8943078901048118
    -3.9713794295185845 0.9308177595208521

Next, we can use the prepared probabilistic model to make a prediction.

The independent conditional probability for each class label can be calculated using the prior for the class (50%) and the conditional probability of the value for each variable.

The *probability()* function below performs this calculation for one input example (array of two values) given the prior and conditional probability distribution for each variable. The value returned is a score rather than a probability as the quantity is not normalized, a simplification often performed when implementing Naive Bayes.

    # calculate the independent conditional probability
    def probability(X, prior, dist1, dist2):
        return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

We can use this function to calculate the probability for an example belonging to each class.

First, we can select an example to be classified; in this case, the first example in the dataset.

    ...
    # classify one example
    Xsample, ysample = X[0], y[0]

Next, we can calculate the score of the example belonging to the first class, then the second class, then report the results.

    ...
    py0 = probability(Xsample, priory0, distX1y0, distX2y0)
    py1 = probability(Xsample, priory1, distX1y1, distX2y1)
    print('P(y=0 | %s) = %.3f' % (Xsample, py0*100))
    print('P(y=1 | %s) = %.3f' % (Xsample, py1*100))

The class with the largest score will be the resulting classification.

Tying this together, the complete example of fitting the Naive Bayes model and using it to make one prediction is listed below.

    # example of preparing and making a prediction with a naive bayes model
    from sklearn.datasets import make_blobs
    from scipy.stats import norm
    from numpy import mean
    from numpy import std

    # fit a probability distribution to a univariate data sample
    def fit_distribution(data):
        # estimate parameters
        mu = mean(data)
        sigma = std(data)
        print(mu, sigma)
        # fit distribution
        dist = norm(mu, sigma)
        return dist

    # calculate the independent conditional probability
    def probability(X, prior, dist1, dist2):
        return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # sort data into classes
    Xy0 = X[y == 0]
    Xy1 = X[y == 1]
    # calculate priors
    priory0 = len(Xy0) / len(X)
    priory1 = len(Xy1) / len(X)
    # create PDFs for y==0
    distX1y0 = fit_distribution(Xy0[:, 0])
    distX2y0 = fit_distribution(Xy0[:, 1])
    # create PDFs for y==1
    distX1y1 = fit_distribution(Xy1[:, 0])
    distX2y1 = fit_distribution(Xy1[:, 1])
    # classify one example
    Xsample, ysample = X[0], y[0]
    py0 = probability(Xsample, priory0, distX1y0, distX2y0)
    py1 = probability(Xsample, priory1, distX1y1, distX2y1)
    print('P(y=0 | %s) = %.3f' % (Xsample, py0*100))
    print('P(y=1 | %s) = %.3f' % (Xsample, py1*100))
    print('Truth: y=%d' % ysample)

Running the example first prepares the prior and conditional probabilities as before, then uses them to make a prediction for one example.

The printed score of the example belonging to *y=0* is about 0.35 (recall this is an unnormalized probability, and the code multiplies it by 100 before printing), whereas the score of the example belonging to *y=1* is 0.0 to three decimal places. Therefore, we would classify the example as belonging to *y=0*.

In this case, the true or actual outcome is known, *y=0*, which matches the prediction by our Naive Bayes model.

    P(y=0 | [-0.79415228 2.10495117]) = 0.348
    P(y=1 | [-0.79415228 2.10495117]) = 0.000
    Truth: y=0
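
If probabilities are wanted rather than scores, the two scores can be divided by their sum. The sketch below is one way this might look, reusing the distribution parameters and the sample values reported in the output above and using *NormalDist* from the Python standard library in place of SciPy. The variable names and the `score()` helper are illustrative assumptions.

```python
from statistics import NormalDist

# distributions built from the (mean, standard deviation) pairs reported above
dist_x1_y0 = NormalDist(-1.5632888906409914, 0.787444265443213)
dist_x2_y0 = NormalDist(4.426680361487157, 0.958296071258367)
dist_x1_y1 = NormalDist(-9.681177100524485, 0.8943078901048118)
dist_x2_y1 = NormalDist(-3.9713794295185845, 0.9308177595208521)

# unnormalized score: prior multiplied by the density of each input value
def score(x, prior, dist1, dist2):
    return prior * dist1.pdf(x[0]) * dist2.pdf(x[1])

# the example that was classified above
x = (-0.79415228, 2.10495117)
s0 = score(x, 0.5, dist_x1_y0, dist_x2_y0)
s1 = score(x, 0.5, dist_x1_y1, dist_x2_y1)
# normalize the scores so that they sum to one
total = s0 + s1
p0, p1 = s0 / total, s1 / total
print('P(y=0 | x) = %.3f' % p0)
print('P(y=1 | x) = %.3g' % p1)
```

Normalizing does not change which class has the largest value, which is why it is often skipped when only the class label is needed.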

In practice, it is a good idea to use optimized implementations of the Naive Bayes algorithm. The scikit-learn library provides three implementations, one for each of the three main probability distributions; for example, BernoulliNB, MultinomialNB, and GaussianNB for binomial, multinomial and Gaussian distributed input variables respectively.

To use a scikit-learn Naive Bayes model, first the model is defined, then it is fit on the training dataset. Once fit, probabilities can be predicted via the *predict_proba()* function and class labels can be predicted directly via the *predict()* function.

The complete example of fitting a Gaussian Naive Bayes model (GaussianNB) to the same test dataset is listed below.

    # example of gaussian naive bayes
    from sklearn.datasets import make_blobs
    from sklearn.naive_bayes import GaussianNB
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # define the model
    model = GaussianNB()
    # fit the model
    model.fit(X, y)
    # select a single sample
    Xsample, ysample = [X[0]], y[0]
    # make a probabilistic prediction
    yhat_prob = model.predict_proba(Xsample)
    print('Predicted Probabilities: ', yhat_prob)
    # make a classification prediction
    yhat_class = model.predict(Xsample)
    print('Predicted Class: ', yhat_class)
    print('Truth: y=%d' % ysample)

Running the example fits the model on the training dataset, then makes predictions for the same first example that we used in the prior example.

In this case, the probability of the example belonging to *y=0* is 1.0 or a certainty. The probability of *y=1* is a very small value close to 0.0.

Finally, the class label is predicted directly, again matching the ground truth for the example.

    Predicted Probabilities: [[1.00000000e+00 5.52387327e-30]]
    Predicted Class: [0]
    Truth: y=0

This section lists some practical tips when working with Naive Bayes models.

If the probability distribution for a variable is complex or unknown, it can be a good idea to use a kernel density estimator or KDE to approximate the distribution directly from the data samples.

A good example would be the Gaussian KDE.
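
To illustrate the idea, the sketch below builds a very simple Gaussian KDE by averaging one small Gaussian centered on each data point, and compares it with a single fitted Gaussian on contrived bimodal data. The `gaussian_kde()` helper, the fixed bandwidth of 0.3 and the data values are all assumptions for illustration. In practice a library implementation that selects the bandwidth automatically would normally be preferred.

```python
from statistics import NormalDist

# build a density function as the average of Gaussian kernels,
# one centered on each observation (bandwidth is chosen by hand here)
def gaussian_kde(data, bandwidth):
    kernels = [NormalDist(x, bandwidth) for x in data]
    def density(value):
        return sum(k.pdf(value) for k in kernels) / len(kernels)
    return density

# contrived bimodal sample with modes near -2 and 2
data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
kde = gaussian_kde(data, bandwidth=0.3)
# a single Gaussian fit to the same sample
single = NormalDist.from_samples(data)

# compare the estimated density at a mode (2.0) and between the modes (0.0)
print('KDE: %.4f at 2.0, %.4f at 0.0' % (kde(2.0), kde(0.0)))
print('Gaussian: %.4f at 2.0, %.4f at 0.0' % (single.pdf(2.0), single.pdf(0.0)))
```

The single Gaussian places its peak between the two modes, where no data was observed, whereas the KDE assigns the highest density near the observed values.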

By definition, Naive Bayes assumes the input variables are independent of each other.

This works well most of the time, even when some or most of the variables are in fact dependent. Nevertheless, the performance of the algorithm degrades the more dependent the input variables happen to be.

The calculation of the independent conditional probability for one example for one class label involves multiplying many probabilities together, one for the class and one for each input variable. As such, the multiplication of many small numbers together can become numerically unstable, especially as the number of input variables increases.

To overcome this problem, it is common to change the calculation from the product of probabilities to the sum of log probabilities. For example:

- log(P(yi | x1, x2, …, xn)) = log(P(x1|yi)) + log(P(x2|yi)) + … + log(P(xn|yi)) + log(P(yi)) (up to a constant)

Calculating the natural logarithm of probabilities has the effect of creating larger (negative) numbers and adding the numbers together will mean that larger probabilities will be closer to zero. The resulting values can still be compared and maximized to give the most likely class label.

This is often called the log-trick when multiplying probabilities.
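
The small demonstration below, using made-up probability values, shows the kind of numerical problem that the log-trick avoids.

```python
from math import log

# one hundred contrived small probabilities
probs = [1e-5] * 100

# multiplying them directly underflows to zero in floating point
product = 1.0
for p in probs:
    product *= p

# summing the log probabilities remains well behaved
log_sum = sum(log(p) for p in probs)

print('product: %s' % product)
print('sum of logs: %.3f' % log_sum)
```

Once the product has underflowed to zero for every class, the scores can no longer be compared, whereas the sums of log probabilities still can.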

As new data becomes available, it can be relatively straightforward to use this new data with the old data to update the estimates of the parameters for each variable’s probability distribution.

This allows the model to easily make use of new data or the changing distributions of data over time.
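
For a Gaussian distribution, one way to do this without storing the old data is to keep running estimates of the mean and variance. The sketch below uses Welford's online algorithm for this purpose, which is a choice made here for illustration rather than something prescribed by Naive Bayes. The class name `RunningGaussian` and the data values are hypothetical.

```python
from statistics import fmean, pstdev

# running estimate of the mean and standard deviation of one variable
class RunningGaussian:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the current mean

    # update the estimates with a single new observation (Welford's algorithm)
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    # population standard deviation, matching the NumPy std() default
    @property
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n else 0.0

old_data = [4.1, 3.9, 5.2, 4.4]
new_data = [4.8, 5.0]
estimate = RunningGaussian()
for value in old_data:
    estimate.update(value)
# later, new observations arrive and the estimates are updated in place
for value in new_data:
    estimate.update(value)
print(estimate.mean, estimate.std)
# the result agrees with recalculating from all of the data
print(fmean(old_data + new_data), pstdev(old_data + new_data))
```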

The probability distributions will summarize the conditional probability of each input variable value for each class label.

These probability distributions can be useful more generally beyond use in a classification model.

For example, the prepared probability distributions can be randomly sampled in order to create new plausible data instances. Because of the conditional independence assumption, the generated examples may be more or less plausible depending on how much actual interdependence exists between the input variables in the dataset.
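
As a sketch of this idea, the example below draws new instances for class *y=0* by sampling each input variable independently from a Gaussian with the mean and standard deviation reported in the worked example. The helper name `sample_instance()` and the random seed are assumptions for illustration.

```python
from random import Random

# (mean, standard deviation) for each input variable of class y=0,
# as reported in the worked example
class0_params = [(-1.5632888906409914, 0.787444265443213),
                 (4.426680361487157, 0.958296071258367)]

# draw one new instance by sampling each variable independently
def sample_instance(params, rng):
    return [rng.gauss(mu, sigma) for mu, sigma in params]

# seed the generator so the same instances are generated on each run
rng = Random(1)
samples = [sample_instance(class0_params, rng) for _ in range(5)]
for sample in samples:
    print(sample)
```

Each variable is sampled on its own, so any correlation between the two inputs in the original data is not reproduced in the generated instances.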

This section provides more resources on the topic if you are looking to go deeper.

- Naive Bayes Tutorial for Machine Learning
- Naive Bayes for Machine Learning
- Better Naive Bayes: 12 Tips To Get The Most From The Naive Bayes Algorithm

- Machine Learning, 1997.
- Machine Learning: A Probabilistic Perspective, 2012.
- Pattern Recognition and Machine Learning, 2006.
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- sklearn.datasets.make_blobs API.
- scipy.stats.norm API.
- Naive Bayes, scikit-learn documentation.
- sklearn.naive_bayes.GaussianNB API

- Bayes’ theorem, Wikipedia.
- Naive Bayes classifier, Wikipedia.
- Maximum a posteriori estimation, Wikipedia.

In this tutorial, you discovered the Naive Bayes algorithm for classification predictive modeling.

Specifically, you learned:

- How to frame classification predictive modeling as a conditional probability model.
- How to use Bayes Theorem to solve the conditional probability model of classification.
- How to implement simplified Bayes Theorem for classification called the Naive Bayes algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Naive Bayes Classifier from Scratch in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Bayes Theorem for Machine Learning appeared first on Machine Learning Mastery.

]]>Bayes Theorem provides a principled way for calculating a conditional probability.

It is a deceptively simple calculation, although it can be used to easily calculate the conditional probability of events where intuition often fails.

Although it is a powerful tool in the field of probability, Bayes Theorem is also widely used in the field of machine learning. This includes its use in a probability framework for fitting a model to a training dataset, referred to as maximum a posteriori or MAP for short, and in developing models for classification predictive modeling problems such as the Bayes Optimal Classifier and Naive Bayes.

In this post, you will discover Bayes Theorem for calculating conditional probabilities and how it is used in machine learning.

After reading this post, you will know:

- What Bayes Theorem is and how to work through the calculation on a real scenario.
- What the terms in the Bayes theorem calculation mean and the intuitions behind them.
- Examples of how Bayes theorem is used in classifiers, optimization and causal models.

Let’s get started.

- **Update Oct/2019**: Join the discussion about this tutorial on HackerNews.
- **Update Oct/2019**: Expanded to add more examples and uses of Bayes Theorem.

This tutorial is divided into six parts; they are:

- Bayes Theorem of Conditional Probability
- Naming the Terms in the Theorem
- Worked Example for Calculating Bayes Theorem
    - Diagnostic Test Scenario
    - Manual Calculation
    - Python Code Calculation
    - Binary Classifier Terminology
- Bayes Theorem for Modeling Hypotheses
- Bayes Theorem for Classification
    - Naive Bayes Classifier
    - Bayes Optimal Classifier
- More Uses of Bayes Theorem in Machine Learning
    - Bayesian Optimization
    - Bayesian Belief Networks

Before we dive into Bayes theorem, let’s review marginal, joint, and conditional probability.

Recall that marginal probability is the probability of an event, irrespective of other random variables. If the random variable is independent, then it is the probability of the event directly, otherwise, if the variable is dependent upon other variables, then the marginal probability is the probability of the event summed over all outcomes for the dependent variables, called the sum rule.

**Marginal Probability**: The probability of an event irrespective of the outcomes of other random variables, e.g. P(A).

The joint probability is the probability of two (or more) simultaneous events, often described in terms of events A and B from two dependent random variables, e.g. X and Y. The joint probability is often summarized as just the outcomes, e.g. A and B.

**Joint Probability**: Probability of two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

The conditional probability is the probability of one event given the occurrence of another event, often described in terms of events A and B from two dependent random variables e.g. X and Y.

**Conditional Probability**: Probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B).

The joint probability can be calculated using the conditional probability; for example:

- P(A, B) = P(A | B) * P(B)

This is called the product rule. Importantly, the joint probability is symmetrical, meaning that:

- P(A, B) = P(B, A)

The conditional probability can be calculated using the joint probability; for example:

- P(A | B) = P(A, B) / P(B)

The conditional probability is not symmetrical; for example:

- P(A | B) != P(B | A)
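
These relationships can be checked numerically. The sketch below uses a made-up table of counts for two binary events and calculates the marginal, joint and conditional probabilities from it. The counts and the helper function names are assumptions for illustration.

```python
# made-up counts of joint outcomes, keyed by (A, B)
counts = {(True, True): 30, (True, False): 10,
          (False, True): 20, (False, False): 40}
total = sum(counts.values())

# joint probability P(A, B)
def p_joint(a, b):
    return counts[(a, b)] / total

# marginal probability P(A), summing over the outcomes of B
def p_a(a):
    return sum(c for (ai, _), c in counts.items() if ai == a) / total

# marginal probability P(B), summing over the outcomes of A
def p_b(b):
    return sum(c for (_, bi), c in counts.items() if bi == b) / total

# conditional probability P(A | B) = P(A, B) / P(B)
def p_a_given_b(a, b):
    return p_joint(a, b) / p_b(b)

# conditional probability P(B | A) = P(A, B) / P(A)
def p_b_given_a(b, a):
    return p_joint(a, b) / p_a(a)

print('P(A, B) = %.3f' % p_joint(True, True))
print('P(A | B) * P(B) = %.3f' % (p_a_given_b(True, True) * p_b(True)))
print('P(A | B) = %.3f' % p_a_given_b(True, True))
print('P(B | A) = %.3f' % p_b_given_a(True, True))
```

The first two values agree, as required by the product rule, while the two conditional probabilities differ.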

We are now up to speed with marginal, joint and conditional probability. If you would like more background on these fundamentals, see the tutorial:

Now, there is another way to calculate the conditional probability.

Specifically, one conditional probability can be calculated using the other conditional probability; for example:

- P(A|B) = P(B|A) * P(A) / P(B)

The reverse is also true; for example:

- P(B|A) = P(A|B) * P(B) / P(A)

This alternate approach of calculating the conditional probability is useful either when the joint probability is challenging to calculate (which is most of the time), or when the reverse conditional probability is available or easy to calculate.

This alternate calculation of the conditional probability is referred to as Bayes Rule or Bayes Theorem, named for Reverend Thomas Bayes, who is credited with first describing it. It is grammatically correct to refer to it as Bayes’ Theorem (with the apostrophe), but it is common to omit the apostrophe for simplicity.

**Bayes Theorem**: Principled way of calculating a conditional probability without the joint probability.

It is often the case that we do not have access to the denominator directly, e.g. P(B).

We can calculate it an alternative way; for example:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

This gives a formulation of Bayes Theorem that we can use that uses the alternate calculation of P(B), described below:

- P(A|B) = P(B|A) * P(A) / P(B|A) * P(A) + P(B|not A) * P(not A)

Or with brackets around the denominator for clarity:

- P(A|B) = P(B|A) * P(A) / (P(B|A) * P(A) + P(B|not A) * P(not A))

**Note**: the denominator is simply the expansion we gave above.

As such, if we have P(A), then we can calculate P(not A) as its complement; for example:

- P(not A) = 1 – P(A)

Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for example:

- P(B|not A) = 1 – P(not B|not A)
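
Putting these pieces together, the sketch below calculates P(A|B) from P(A), P(B|A) and P(not B|not A), using the complements and the expanded denominator described above. The function name `posterior()` and the input probabilities are arbitrary values chosen for illustration.

```python
# calculate P(A|B) given P(A), P(B|A) and P(not B|not A)
def posterior(p_a, p_b_given_a, p_not_b_given_not_a):
    # complements
    p_not_a = 1.0 - p_a
    p_b_given_not_a = 1.0 - p_not_b_given_not_a
    # expanded denominator: P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    # Bayes Theorem
    return p_b_given_a * p_a / p_b

# arbitrary illustrative inputs
result = posterior(p_a=0.3, p_b_given_a=0.8, p_not_b_given_not_a=0.9)
print('P(A|B) = %.3f' % result)
```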

Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the meaning of the terms in the equation.

The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.

It can be helpful to think about the calculation from these different perspectives, as this can help you map your problem onto the equation.

Firstly, in general, the result P(A|B) is referred to as the **posterior probability** and P(A) is referred to as the **prior probability**.

- P(A|B): Posterior probability.
- P(A): Prior probability.

Sometimes P(B|A) is referred to as the **likelihood** and P(B) is referred to as the **evidence**.

- P(B|A): Likelihood.
- P(B): Evidence.

This allows Bayes Theorem to be restated as:

- Posterior = Likelihood * Prior / Evidence

We can make this clear with a smoke and fire case.

**What is the probability that there is fire given that there is smoke?**

Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the evidence:

- P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)

You can imagine the same situation with rain and clouds.

Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at a scenario where we can calculate it.

Bayes theorem is best understood with a real-life worked example with real numbers to demonstrate the calculations.

First we will define a scenario then work through a manual calculation, a calculation in Python, and a calculation using the terms that may be familiar to you from the field of binary classification.

- Diagnostic Test Scenario
- Manual Calculation
- Python Code Calculation
- Binary Classifier Terminology

Let’s go.

An excellent and widely used example of the benefit of Bayes Theorem is in the analysis of a medical diagnostic test.

**Scenario**: Consider a human population that may or may not have cancer (Cancer is True or False) and a medical test that returns positive or negative for detecting cancer (Test is Positive or Negative), e.g. like a mammogram for detecting breast cancer.

Problem: If a randomly selected patient has the test and it comes back positive, what is the probability that the patient has cancer?

Medical diagnostic tests are not perfect; they have error.

Sometimes a patient will have cancer, but the test will not detect it. This capability of the test to detect cancer is referred to as the **sensitivity**, or the true positive rate.

In this case, we will contrive a sensitivity value for the test. The test is good, but not great, with a true positive rate or sensitivity of 85%. That is, of all the people who have cancer and are tested, 85% of them will get a positive result from the test.

- P(Test=Positive | Cancer=True) = 0.85

Given this information, our intuition would suggest that there is an 85% probability that the patient has cancer.

**Our intuitions of probability are wrong.**

This type of error in interpreting probabilities is so common that it has its own name; it is referred to as the base rate fallacy.

It has this name because the error in estimating the probability of an event is caused by ignoring the base rate. That is, it ignores the probability of a randomly selected person having cancer, regardless of the results of a diagnostic test.

In this case, we can assume the probability of breast cancer is low, and use a contrived base rate value of one person in 5,000, or 0.0002 (0.02%).

- P(Cancer=True) = 0.02%.

We can correctly calculate the probability of a patient having cancer given a positive test result using Bayes Theorem.

Let’s map our scenario onto the equation:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)

We know the probability of the test being positive given that the patient has cancer is 85%, and we know the base rate or the prior probability of a given patient having cancer is 0.02%; we can plug these values in:

- P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / P(Test=Positive)

We don’t know P(Test=Positive); it’s not given directly.

Instead, we can estimate it using:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
- P(Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) + P(Test=Positive|Cancer=False) * P(Cancer=False)

Firstly, we can calculate P(Cancer=False) as the complement of P(Cancer=True), which we already know:

- P(Cancer=False) = 1 – P(Cancer=True)
- = 1 – 0.0002
- = 0.9998

We can plug in our known values as follows:

- P(Test=Positive) = 0.85 * 0.0002 + P(Test=Positive|Cancer=False) * 0.9998

We still do not know the probability of a positive test result given no cancer.

This requires additional information.

Specifically, we need to know how good the test is at correctly identifying people that do not have cancer. That is, how often the test returns a negative result (Test=Negative) when the patient does not have cancer (Cancer=False), called the true negative rate or the **specificity**.

We will use a contrived specificity value of 95%.

- P(Test=Negative | Cancer=False) = 0.95

With this final piece of information, we can calculate the false positive or false alarm rate as the complement of the true negative rate.

- P(Test=Positive|Cancer=False) = 1 – P(Test=Negative | Cancer=False)
- = 1 – 0.95
- = 0.05

We can plug this false alarm rate into our calculation of P(Test=Positive) as follows:

- P(Test=Positive) = 0.85 * 0.0002 + 0.05 * 0.9998
- P(Test=Positive) = 0.00017 + 0.04999
- P(Test=Positive) = 0.05016

Excellent, so the probability of the test returning a positive result, regardless of whether the person has cancer or not is about 5%.

We now have enough information to calculate Bayes Theorem and estimate the probability of a randomly selected person having cancer if they get a positive test result.

- P(Cancer=True | Test=Positive) = P(Test=Positive|Cancer=True) * P(Cancer=True) / P(Test=Positive)
- P(Cancer=True | Test=Positive) = 0.85 * 0.0002 / 0.05016
- P(Cancer=True | Test=Positive) = 0.00017 / 0.05016
- P(Cancer=True | Test=Positive) = 0.003389154704944

The calculation suggests that if the patient is informed they have cancer with this test, then there is only about a 0.34% chance that they have cancer.

**It is a terrible diagnostic test!**

The example also shows that the calculation of the conditional probability requires *enough* information.

For example, if we have the values used in Bayes Theorem already, we can use them directly.

This is rarely the case, and we typically have to calculate the bits we need and plug them in, as we did in this case. In our scenario we were given 3 pieces of information: the **base rate**, the **sensitivity** (or true positive rate), and the **specificity** (or true negative rate).

- **Sensitivity**: 85% of people with cancer will get a positive test result.
- **Base Rate**: 0.02% of people have cancer.
- **Specificity**: 95% of people without cancer will get a negative test result.

We did not have the P(Test=Positive), but we calculated it given what we already had available.

We might imagine that Bayes Theorem allows us to be even more precise about a given scenario. For example, if we had more information about the patient (e.g. their age) and about the domain (e.g. cancer rates for age ranges), we could in turn offer an even more accurate probability estimate.

That was a lot of work.

Let’s look at how we can calculate this exact scenario using a few lines of Python code.

To make this example concrete, we can perform the calculation in Python.

The example below performs the same calculation in vanilla Python (no libraries), allowing you to play with the parameters and test different scenarios.

    # calculate the probability of cancer patient and diagnostic test

    # calculate P(A|B) given P(A), P(B|A), P(B|not A)
    def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
        # calculate P(not A)
        not_a = 1 - p_a
        # calculate P(B)
        p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
        # calculate P(A|B)
        p_a_given_b = (p_b_given_a * p_a) / p_b
        return p_a_given_b

    # P(A)
    p_a = 0.0002
    # P(B|A)
    p_b_given_a = 0.85
    # P(B|not A)
    p_b_given_not_a = 0.05
    # calculate P(A|B)
    result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)
    # summarize
    print('P(A|B) = %.3f%%' % (result * 100))

Running the example calculates the probability that a patient has cancer given the test returns a positive result, matching our manual calculation.

P(A|B) = 0.339%

This is a helpful little script that you may want to adapt to new scenarios.
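As one possible adaptation, the sketch below reuses the same calculation to show how strongly the posterior depends on the base rate. The first base rate is the one from the scenario; the other three are hypothetical values added only for comparison.

```python
# explore how the base rate P(A) changes the posterior P(A|B)

def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
    # P(B) via the law of total probability
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    # P(A|B)
    return (p_b_given_a * p_a) / p_b

sensitivity = 0.85   # P(B|A), from the scenario
false_alarm = 0.05   # P(B|not A), from the scenario
# 0.0002 is the scenario's base rate; the remaining values are assumed
base_rates = [0.0002, 0.002, 0.02, 0.2]

posteriors = [bayes_theorem(p, sensitivity, false_alarm) for p in base_rates]
for p, post in zip(base_rates, posteriors):
    print('P(A) = %.4f -> P(A|B) = %.3f%%' % (p, post * 100))
```

The sensitivity and false alarm rate are held fixed, so any change in the posterior comes from the prior alone.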

Now, it is common to describe the calculation of Bayes Theorem for a scenario using the terms from binary classification. It provides a very intuitive way for thinking about a problem. In the next section we will review these terms and see how they map onto the probabilities in the theorem and how they relate to our scenario.

It may be helpful to think about the cancer test example in terms of the common terms from binary (two-class) classification, i.e. where notions of specificity and sensitivity come from.

Personally, I find these terms help everything to make sense.

Firstly, let’s define a confusion matrix:

| | Positive Class | Negative Class |
| --- | --- | --- |
| Positive Prediction | True Positive (TP) | False Positive (FP) |
| Negative Prediction | False Negative (FN) | True Negative (TN) |

We can then define some rates from the confusion matrix:

- True Positive Rate (TPR) = TP / (TP + FN)
- False Positive Rate (FPR) = FP / (FP + TN)
- True Negative Rate (TNR) = TN / (TN + FP)
- False Negative Rate (FNR) = FN / (FN + TP)

These terms are called rates, but they can also be interpreted as probabilities.

Also, it might help to notice:

- TPR + FNR = 1.0, or:
  - FNR = 1.0 – TPR
  - TPR = 1.0 – FNR
- TNR + FPR = 1.0, or:
  - TNR = 1.0 – FPR
  - FPR = 1.0 – TNR
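The rates and their complements can be checked with a few lines of Python. This is a minimal sketch; the counts in the confusion matrix are hypothetical and were chosen only so that the rates are easy to read, and the helper name `rates` is an assumption.

```python
# calculate rates from the four cells of a confusion matrix

def rates(tp, fp, fn, tn):
    """Return the four rates as a dictionary."""
    return {
        'TPR': tp / (tp + fn),
        'FPR': fp / (fp + tn),
        'TNR': tn / (tn + fp),
        'FNR': fn / (fn + tp),
    }

# hypothetical counts: 100 positive-class and 100 negative-class examples
r = rates(tp=85, fp=5, fn=15, tn=95)
for name, value in r.items():
    print('%s = %.2f' % (name, value))

# each pair of rates shares a denominator, so each pair sums to one
print('TPR + FNR = %.1f' % (r['TPR'] + r['FNR']))
print('TNR + FPR = %.1f' % (r['TNR'] + r['FPR']))
```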

Recall that in a previous section we calculated the false positive rate as the complement of the true negative rate, or FPR = 1.0 – TNR.

Some of these rates have special names, for example:

- Sensitivity = TPR
- Specificity = TNR

We can map these rates onto familiar terms from Bayes Theorem:

- **P(B|A)**: True Positive Rate (TPR).
- **P(not B|not A)**: True Negative Rate (TNR).
- **P(B|not A)**: False Positive Rate (FPR).
- **P(not B|A)**: False Negative Rate (FNR).

We can also map the base rates for the condition (class) and the treatment (prediction) onto familiar terms from Bayes Theorem:

- **P(A)**: Probability of a Positive Class (PC).
- **P(not A)**: Probability of a Negative Class (NC).
- **P(B)**: Probability of a Positive Prediction (PP).
- **P(not B)**: Probability of a Negative Prediction (NP).

Now, let’s consider Bayes Theorem using these terms:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(A|B) = (TPR * PC) / PP

Where we often cannot calculate P(B), so we use an alternative:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
- P(B) = TPR * PC + FPR * NC

Now, let’s look at our scenario of cancer and a cancer detection test.

The class or condition would be “*Cancer*” and the treatment or prediction would be the “*Test*”.

First, let’s review all of the rates:

- True Positive Rate (TPR): 85%
- False Positive Rate (FPR): 5%
- True Negative Rate (TNR): 95%
- False Negative Rate (FNR): 15%

Let’s also review what we know about base rates:

- Positive Class (PC): 0.02%
- Negative Class (NC): 99.98%
- Positive Prediction (PP): 5.016%
- Negative Prediction (NP): 94.984%

Plugging things in, we can calculate the probability of a positive test result (a positive prediction) as the probability of a positive test result given cancer (the true positive rate) multiplied by the base rate for having cancer (the positive class), plus the probability of a positive test result given no cancer (the false positive rate) multiplied by the probability of not having cancer (the negative class).

The calculation with these terms is as follows:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
- P(B) = TPR * PC + FPR * NC
- P(B) = 85% * 0.02% + 5% * 99.98%
- P(B) = 5.016%

We can then calculate Bayes Theorem for the scenario, namely the probability of cancer given a positive test result (the posterior) is the probability of a positive test result given cancer (the true positive rate) multiplied by the probability of having cancer (the positive class rate), divided by the probability of a positive test result (a positive prediction).

The calculation with these terms is as follows:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(A|B) = TPR * PC / PP
- P(A|B) = 85% * 0.02% / 5.016%
- P(A|B) = 0.339%

It turns out that in this case, the posterior probability that we are calculating with the Bayes theorem is equivalent to the precision, also called the Positive Predictive Value (PPV) of the confusion matrix:

- PPV = TP / (TP + FP)

Or, stated in our classifier terms:

- P(A|B) = PPV
- PPV = TPR * PC / PP
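One way to see this equivalence is to build the expected confusion matrix for an imagined population and compare the PPV from its counts with the posterior from Bayes Theorem. The sketch below assumes a hypothetical population of one million people; the rates are those of the cancer test scenario.

```python
# compare PPV from expected counts with the posterior from Bayes Theorem

population = 1000000  # hypothetical population size
pc = 0.0002           # P(A): base rate of cancer
tpr = 0.85            # P(B|A): sensitivity
tnr = 0.95            # P(not B|not A): specificity

# expected counts for each cell of the confusion matrix
with_cancer = population * pc
without_cancer = population - with_cancer
tp = with_cancer * tpr
fn = with_cancer - tp
tn = without_cancer * tnr
fp = without_cancer - tn

# precision calculated by counting
ppv = tp / (tp + fp)
# posterior calculated with Bayes Theorem
pp = tpr * pc + (1 - tnr) * (1 - pc)
posterior = tpr * pc / pp

print('PPV = %.3f%%' % (ppv * 100))
print('P(A|B) = %.3f%%' % (posterior * 100))
```

Both lines report the same value because the population size cancels out of the ratio.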

**So why do we go to all of the trouble of calculating the posterior probability?**

Because we don’t have the confusion matrix for a population of people, both with and without cancer, both tested and untested. Instead, all we have is some priors and probabilities about our population and our test.

This highlights when we might choose to use the calculation in practice.

Specifically, when we have beliefs about the events involved, but we cannot perform the calculation by counting examples in the real world.

Bayes Theorem is a useful tool in applied machine learning.

It provides a way of thinking about the relationship between data and a model.

A machine learning algorithm or model is a specific way of thinking about the structured relationships in the data. In this way, a model can be thought of as a hypothesis about the relationships in the data, such as the relationship between input (*X*) and output (*y*). The practice of applied machine learning is the testing and analysis of different hypotheses (models) on a given dataset.

If this idea of thinking of a model as a hypothesis is new to you, see this tutorial on the topic:

- What is a Hypothesis in Machine Learning?

Bayes Theorem provides a probabilistic model to describe the relationship between data (*D*) and a hypothesis (h); for example:

- P(h|D) = P(D|h) * P(h) / P(D)

Breaking this down, it says that the probability of a given hypothesis holding or being true given some observed data can be calculated as the probability of observing the data given the hypothesis multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.

— Page 156, Machine Learning, 1997.

Under this framework, each piece of the calculation has a specific name; for example:

- P(h|D): Posterior probability of the hypothesis (the thing we want to calculate).
- P(h): Prior probability of the hypothesis.

This gives a useful framework for thinking about and modeling a machine learning problem.

If we have some prior domain knowledge about the hypothesis, this is captured in the prior probability. If we don’t, then all hypotheses may have the same prior probability.

If the probability of observing the data P(D) increases, then the probability of the hypothesis holding given the data P(h|D) decreases. Conversely, if the probability of the hypothesis P(h) and the probability of observing the data given the hypothesis P(D|h) increase, the probability of the hypothesis holding given the data P(h|D) increases.

The notion of testing different models on a dataset in applied machine learning can be thought of as estimating the probability of each hypothesis (h1, h2, h3, … in H) being true given the observed data.

Seeking the hypothesis with the maximum posterior probability is called maximum a posteriori estimation, or MAP for short.

Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis. We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

— Page 157, Machine Learning, 1997.

Under this framework, the probability of the data (D) is constant as it is used in the assessment of each hypothesis. Therefore, it can be removed from the calculation to give the simplified unnormalized estimate as follows:

- argmax h in H P(h|D) = argmax h in H P(D|h) * P(h)

If we do not have any prior information about the hypotheses being tested, they can be assigned a uniform probability, and this term too will be a constant and can be removed from the calculation to give the following:

- argmax h in H P(h|D) = argmax h in H P(D|h)

That is, the goal is to locate a hypothesis that best explains the observed data.
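The two simplified forms can be sketched with a toy problem. Everything in the example is an assumption made for illustration: the hypotheses are three candidate values for the probability of a coin landing heads, the prior strongly favors a fair coin, and the data is 8 heads in 10 flips.

```python
# select a hypothesis by MAP and by maximum likelihood

# hypothetical candidate hypotheses: probability of heads
hypotheses = [0.3, 0.5, 0.7]
# hypothetical prior P(h) that strongly favors a fair coin
prior = {0.3: 0.05, 0.5: 0.90, 0.7: 0.05}
# hypothetical observed data D
heads, tails = 8, 2

def likelihood(h):
    """P(D|h) up to a constant that is the same for every hypothesis."""
    return h ** heads * (1 - h) ** tails

# MAP: maximize P(D|h) * P(h), P(D) is dropped as a constant
h_map = max(hypotheses, key=lambda h: likelihood(h) * prior[h])
# uniform prior: maximize P(D|h) alone
h_ml = max(hypotheses, key=likelihood)

print('MAP hypothesis: %.1f' % h_map)
print('Maximum likelihood hypothesis: %.1f' % h_ml)
```

With this prior, the MAP choice is the fair coin (0.5), whereas the likelihood alone prefers 0.7; the prior only stops mattering when it is uniform.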

Fitting models like linear regression for predicting a numerical value, and logistic regression for binary classification can be framed and solved under the MAP probabilistic framework. This provides an alternative to the more common maximum likelihood estimation (MLE) framework.

Classification is a predictive modeling problem that involves assigning a label to a given input data sample.

The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample, for example:

- P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data.

This calculation can be performed for each class in the problem and the class that is assigned the largest probability can be selected and assigned to the input data.

In practice, it is very challenging to calculate full Bayes Theorem for classification.

The priors for the class and the data are easy to estimate from a training dataset, if the dataset is suitably representative of the broader problem.

The conditional probability of the observation based on the class P(data|class) is not feasible to estimate unless the number of examples is extraordinarily large, e.g. large enough to effectively estimate the probability distribution for all different possible combinations of values. This is almost never the case; we will not have sufficient coverage of the domain.

As such, the direct application of Bayes Theorem also becomes intractable, especially as the number of variables or features (n) increases.

The solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

The full calculation treats each input variable as dependent upon all other variables. This is a cause of complexity in the calculation. We can remove this assumption and consider each input variable as being independent of the others, given the class.

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

This means that we calculate P(data|class) for each input variable separately and multiply the results together, for example:

- P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class) / P(data)

We can also drop the probability of observing the data as it is a constant for all calculations, for example:

- P(class | X1, X2, …, Xn) = P(X1|class) * P(X2|class) * … * P(Xn|class) * P(class)

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.
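The simplified calculation can be sketched with a small lookup table of probabilities. The class labels, input variables, and all probability values below are hypothetical and exist only to show the arithmetic; in practice they would be estimated from a training dataset.

```python
# score each class as P(X1|class) * P(X2|class) * P(class)

# hypothetical class priors P(class)
priors = {'go-out': 0.6, 'stay-home': 0.4}
# hypothetical conditional probabilities P(variable=value|class)
conditionals = {
    'go-out': {
        'weather': {'sunny': 0.8, 'rainy': 0.2},
        'car': {'working': 0.9, 'broken': 0.1},
    },
    'stay-home': {
        'weather': {'sunny': 0.3, 'rainy': 0.7},
        'car': {'working': 0.5, 'broken': 0.5},
    },
}

def score(label, sample):
    """Unnormalized P(class|data), with P(data) dropped."""
    result = priors[label]
    for variable, value in sample.items():
        result *= conditionals[label][variable][value]
    return result

sample = {'weather': 'rainy', 'car': 'working'}
scores = {label: score(label, sample) for label in priors}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print('%s: %.3f' % (label, value))
print('Prediction: %s' % prediction)
```

The scores are not probabilities because P(data) was dropped; dividing each score by the sum of the scores would recover normalized values without changing the prediction.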

The word “*naive*” is French and typically has a diaeresis (umlaut) over the “i”, which is commonly left out for simplicity, and “Bayes” is capitalized as it is named for Reverend Thomas Bayes.

For tutorials on how to implement Naive Bayes from scratch in Python see:

- How to Develop a Naive Bayes Classifier from Scratch in Python
- Naive Bayes Classifier From Scratch in Python

The Bayes optimal classifier is a probabilistic model that makes the most likely prediction for a new example, given the training dataset.

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

**Bayes Classifier**: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question:

What is the most probable classification of the new instance given the training data?

This is different from the MAP framework that seeks the most probable hypothesis (model). Instead, we are interested in making a specific prediction.

The equation below demonstrates how to calculate the conditional probability of a classification (*vj*) for a new instance given the training data (*D*), given a space of hypotheses (*H*).

- P(vj | D) = sum {hi in H} P(vj | hi) * P(hi | D)

Where *vj* is a candidate classification for the new instance, *H* is the set of hypotheses for classifying the instance, *hi* is a given hypothesis, *P(vj | hi)* is the probability of *vj* given hypothesis *hi*, and *P(hi | D)* is the posterior probability of the hypothesis *hi* given the data *D*.

Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.
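The difference from picking the single MAP hypothesis can be seen in a small sketch. The three hypotheses, their posterior probabilities, and their predictions are assumed values for illustration only.

```python
# Bayes optimal classification: P(vj|D) = sum over hi of P(vj|hi) * P(hi|D)

# hypothetical posterior probability of each hypothesis, P(hi|D)
posteriors = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}
# hypothetical prediction of each hypothesis, P(vj|hi)
predictions = {
    'h1': {'positive': 1.0, 'negative': 0.0},
    'h2': {'positive': 0.0, 'negative': 1.0},
    'h3': {'positive': 0.0, 'negative': 1.0},
}
labels = ['positive', 'negative']

# weight each hypothesis' prediction by the posterior of the hypothesis
p_label = {
    v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
    for v in labels
}
bayes_optimal = max(p_label, key=p_label.get)

# for comparison, the prediction made by the single MAP hypothesis
h_map = max(posteriors, key=posteriors.get)
map_prediction = max(predictions[h_map], key=predictions[h_map].get)

print('Bayes optimal classification: %s' % bayes_optimal)
print('MAP hypothesis %s predicts: %s' % (h_map, map_prediction))
```

Here the most probable single hypothesis predicts positive, yet the combined vote of all hypotheses favors negative with probability 0.6.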

Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique, on average.

We have to let that sink in. It is a big deal.

Because the Bayes classifier is optimal, the Bayes error is the minimum possible error that can be made.

**Bayes Error**: The minimum possible error that can be made when making predictions.

It is a theoretical model, but it is held up as an ideal that we may wish to pursue.

The Naive Bayes classifier is an example of a classifier that adds some simplifying assumptions and attempts to approximate the Bayes Optimal Classifier.

Developing classifier models may be the most common application of Bayes Theorem in machine learning.

Nevertheless, there are many other applications. Two important examples are optimization and causal models.

Global optimization is a challenging problem of finding an input that results in the minimum or maximum cost of a given objective function.

Typically, the form of the objective function is complex and intractable to analyze and is often non-convex, nonlinear, high-dimensional, noisy, and computationally expensive to evaluate.

Bayesian Optimization provides a principled technique based on Bayes Theorem to direct a search of a global optimization problem that is efficient and effective. It works by building a probabilistic model of the objective function, called the surrogate function, that is then searched efficiently with an acquisition function before candidate samples are chosen for evaluation on the real objective function.

Bayesian Optimization is often used in applied machine learning to tune the hyperparameters of a given well-performing model on a validation dataset.

For more on Bayesian Optimization including how to implement it from scratch, see the tutorial:

- How to Implement Bayesian Optimization from Scratch in Python

Probabilistic models can define relationships between variables and be used to calculate probabilities.

Fully conditional models may require an enormous amount of data to cover all possible cases, and probabilities may be intractable to calculate in practice. Simplifying assumptions such as the conditional independence of all random variables can be effective, such as in the case of Naive Bayes, although it is a drastically simplifying step.

An alternative is to develop a model that preserves known conditional dependence between random variables and conditional independence in all other cases. Bayesian networks are a probabilistic graphical model that explicitly capture the known conditional dependence with directed edges in a graph model. All missing connections define the conditional independencies in the model.

As such Bayesian Networks provide a useful tool to visualize the probabilistic model for a domain, review all of the relationships between the random variables, and reason about causal probabilities for scenarios given available evidence.

The networks are not exactly Bayesian by definition, although given that both the probability distributions for the random variables (nodes) and the relationships between the random variables (edges) are specified subjectively, the model can be thought to capture the “belief” about a complex domain.

For more on Bayesian Belief Networks, see the tutorial:

- A Gentle Introduction to Bayesian Belief Networks

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Joint, Marginal, and Conditional Probability
- What is a Hypothesis in Machine Learning?
- How to Develop a Naive Bayes Classifier from Scratch in Python
- Naive Bayes Classifier From Scratch in Python
- How to Implement Bayesian Optimization from Scratch in Python
- A Gentle Introduction to Bayesian Belief Networks

- Pattern Recognition and Machine Learning, 2006.
- Machine Learning, 1997.
- Pattern Classification, 2nd Edition, 2001.
- Machine Learning: A Probabilistic Perspective, 2012.

- Conditional probability, Wikipedia.
- Bayes’ theorem, Wikipedia.
- Maximum a posteriori estimation, Wikipedia.
- False positives and false negatives, Wikipedia.
- Base rate fallacy, Wikipedia.
- Sensitivity and specificity, Wikipedia.
- Taking the Confusion out of the Confusion Matrix, 2016.

In this post, you discovered Bayes Theorem for calculating conditional probabilities and how it is used in machine learning.

Specifically, you learned:

- What Bayes Theorem is and how to work through the calculation on a real scenario.
- What the terms in the Bayes theorem calculation mean and the intuitions behind them.
- Examples of how Bayes theorem is used in classifiers, optimization and causal models.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Bayes Theorem for Machine Learning appeared first on Machine Learning Mastery.

The post Probability for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

Probability is a field of mathematics that is universally agreed to be the bedrock for machine learning.

Although probability is a large field with many esoteric theories and findings, the nuts and bolts, tools and notations taken from the field are required for machine learning practitioners. With a solid foundation of what probability is, it is possible to focus on just the good or relevant parts.

In this crash course, you will discover how you can get started and confidently understand and implement probabilistic methods used in machine learning with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You want to learn probability to deepen your understanding and application of machine learning.

You do NOT need to be:

- A math wiz!
- A machine learning expert!

This crash course will take you from a developer that knows a little machine learning to a developer who can navigate the basics of probabilistic methods.

**Note**: This crash course assumes you have a working Python3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with probability for machine learning in Python:

- **Lesson 01**: Probability and Machine Learning
- **Lesson 02**: Three Types of Probability
- **Lesson 03**: Probability Distributions
- **Lesson 04**: Naive Bayes Classifier
- **Lesson 05**: Entropy and Cross-Entropy
- **Lesson 06**: Naive Classifiers
- **Lesson 07**: Probability Scores

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the statistical methods and the NumPy API and the best-of-breed tools in Python. (**Hint**: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

**Note**: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Probability for Machine Learning.”

In this lesson, you will discover why machine learning practitioners should study probability to improve their skills and capabilities.

Probability is a field of mathematics that quantifies uncertainty.

Machine learning is about developing predictive modeling from uncertain data. Uncertainty means working with imperfect or incomplete information.

Uncertainty is fundamental to the field of machine learning, yet it is one of the aspects that causes the most difficulty for beginners, especially those coming from a developer background.

There are three main sources of uncertainty in machine learning; they are:

- **Noise in observations**, e.g. measurement errors and random noise.
- **Incomplete coverage of the domain**, e.g. you can never observe all data.
- **Imperfect model of the problem**, e.g. all models have errors, some are useful.

Uncertainty in applied machine learning is managed using probability.

- Probability and statistics help us to understand and quantify the expected value and variability of variables in our observations from the domain.
- Probability helps to understand and quantify the expected distribution and density of observations in the domain.
- Probability helps to understand and quantify the expected capability and variance in performance of our predictive models when applied to new data.

This is the bedrock of machine learning. On top of that, we may need models to predict a probability, we may use probability to develop predictive models (e.g. Naive Bayes), and we may use probabilistic frameworks to train predictive models (e.g. maximum likelihood estimation).

For this lesson, you must list three reasons why you want to learn probability in the context of machine learning.

These may be related to some of the reasons above, or they may be your own personal motivations.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the three different types of probability and how to calculate them.

In this lesson, you will discover a gentle introduction to joint, marginal, and conditional probability between random variables.

Probability quantifies the likelihood of an event.

Specifically, it quantifies how likely a specific outcome is for a random variable, such as the flip of a coin, the roll of a die, or drawing a playing card from a deck.

We can discuss the probability of just two events: the probability of event *A* for variable *X* and event *B* for variable *Y*, which in shorthand is *X=A* and *Y=B*, and that the two variables are related or dependent in some way.

As such, there are three main types of probability we might want to consider.

We may be interested in the probability of two simultaneous events, like the outcomes of two different random variables.

For example, the joint probability of event *A* and event *B* is written formally as:

- P(A and B)

The joint probability for events *A* and *B* is calculated as the probability of event *A* given event *B* multiplied by the probability of event *B*.

This can be stated formally as follows:

- P(A and B) = P(A given B) * P(B)
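As a small worked example, consider drawing one card from a standard 52-card deck, where event *A* is drawing a king and event *B* is drawing a face card (jack, queen, or king). The sketch below is an illustration chosen for this lesson and uses exact fractions to avoid rounding.

```python
# joint probability: P(A and B) = P(A given B) * P(B)
from fractions import Fraction

deck_size = 52
face_cards = 12  # jack, queen and king in each of the 4 suits
kings = 4

# P(B): probability of drawing a face card
p_b = Fraction(face_cards, deck_size)
# P(A given B): probability of a king, given the card is a face card
p_a_given_b = Fraction(kings, face_cards)
# P(A and B): probability the card is both a king and a face card
p_a_and_b = p_a_given_b * p_b

print('P(B) =', p_b)
print('P(A given B) =', p_a_given_b)
print('P(A and B) =', p_a_and_b)
```

Every king is a face card, so the joint probability equals the probability of drawing a king, 4 in 52.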

We may be interested in the probability of an event for one random variable, irrespective of the outcome of another random variable.

There is no special notation for marginal probability; it is just the sum or union over all the probabilities of all events for the second variable for a given fixed event for the first variable.

- P(X=A) = sum P(X=A, Y=yi) for all y
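The sum can be sketched with a small joint probability table. The two variables and all four joint probabilities below are hypothetical values chosen for illustration; the helper name `marginal_x` is also an assumption.

```python
# marginal probability: sum the joint probabilities over the other variable

# hypothetical joint distribution P(X=x, Y=y) for weather (X) and temperature (Y)
joint = {
    ('sunny', 'hot'): 0.4,
    ('sunny', 'cold'): 0.2,
    ('rainy', 'hot'): 0.1,
    ('rainy', 'cold'): 0.3,
}

def marginal_x(x):
    """P(X=x) = sum of P(X=x, Y=y) over all values of y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

p_sunny = marginal_x('sunny')
p_rainy = marginal_x('rainy')
print('P(X=sunny) = %.1f' % p_sunny)
print('P(X=rainy) = %.1f' % p_rainy)
```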

We may be interested in the probability of an event given the occurrence of another event.

For example, the conditional probability of event *A* given event *B* is written formally as:

- P(A given B)

The conditional probability for events *A* given event *B* can be calculated using the joint probability of the events as follows:

- P(A given B) = P(A and B) / P(B)
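For a quick check of this formula, consider one roll of a fair six-sided die, where event *A* is rolling a 2 and event *B* is rolling an even number. The sketch below is an illustrative example that counts outcomes directly; the helper names are assumptions.

```python
# conditional probability: P(A given B) = P(A and B) / P(B)
from fractions import Fraction

outcomes = range(1, 7)  # faces of a fair six-sided die

def probability(event):
    """Fraction of equally likely outcomes for which the event holds."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

def is_two(o):
    return o == 2

def is_even(o):
    return o % 2 == 0

# P(B): probability of an even number
p_b = probability(is_even)
# P(A and B): probability of a 2 that is also even
p_a_and_b = probability(lambda o: is_two(o) and is_even(o))
# P(A given B)
p_a_given_b = p_a_and_b / p_b

print('P(A given B) =', p_a_given_b)
```

Knowing the roll is even leaves three equally likely outcomes, so the probability of a 2 rises from 1 in 6 to 1 in 3.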

For this lesson, you must practice calculating joint, marginal, and conditional probabilities.

For example, if a family has two children and the oldest is a boy, what is the probability of this family having two sons? This is called the “Boy or Girl Problem” and is one of many common toy problems for practicing probability.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover probability distributions for random variables.

In this lesson, you will discover a gentle introduction to probability distributions.

In probability, a random variable can take on one of many possible values, e.g. events from the state space. A specific value or set of values for a random variable can be assigned a probability.

There are two main classes of random variables.

- **Discrete Random Variable**. Values are drawn from a finite set of states.
- **Continuous Random Variable**. Values are drawn from a range of real-valued numerical values.

A discrete random variable has a finite set of states; for example, the colors of a car. A continuous random variable has a range of numerical values; for example, the height of humans.

A probability distribution is a summary of probabilities for the values of a random variable.

A discrete probability distribution summarizes the probabilities for a discrete random variable.

Some examples of well-known discrete probability distributions include:

- Poisson distribution.
- Bernoulli and binomial distributions.
- Multinoulli and multinomial distributions.

A continuous probability distribution summarizes the probability for a continuous random variable.

Some examples of well-known continuous probability distributions include:

- Normal or Gaussian distribution.
- Exponential distribution.
- Pareto distribution.

We can define a distribution with a mean of 50 and a standard deviation of 5 and sample random numbers from this distribution. We can achieve this using the normal() NumPy function.

The example below samples and prints 10 numbers from this distribution.

    # sample a normal distribution
    from numpy.random import normal
    # define the distribution
    mu = 50
    sigma = 5
    n = 10
    # generate the sample
    sample = normal(mu, sigma, n)
    print(sample)

Running the example prints 10 numbers randomly sampled from the defined normal distribution.
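As a rough sanity check, we can draw a much larger sample and confirm that its mean and standard deviation land close to the values we defined. The sketch below swaps NumPy for the standard library's random.gauss(); the sample size of 10,000 and the seed are arbitrary assumptions.

```python
# sketch: check that a large sample reflects the defined distribution
from random import Random
from statistics import mean, stdev

# define the distribution (same mean and standard deviation as above)
mu, sigma, n = 50, 5, 10000
# seeded generator so the run is repeatable (seed value is arbitrary)
rng = Random(1)
# generate the sample
sample = [rng.gauss(mu, sigma) for _ in range(n)]
# summarize the sample
sample_mean = mean(sample)
sample_std = stdev(sample)
print('mean=%.2f std=%.2f' % (sample_mean, sample_std))
```

The printed values should be near 50 and 5, although they will not match exactly because the sample is random.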

For this lesson, you must develop an example to sample from a different continuous or discrete probability distribution function.

For a bonus, you can plot the values on the x-axis and the probability on the y-axis for a given distribution to show the density of your chosen probability distribution function.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover the Naive Bayes classifier.

In this lesson, you will discover the Naive Bayes algorithm for classification predictive modeling.

In machine learning, we are often interested in a predictive modeling problem where we want to predict a class label for a given observation.

One approach to solving this problem is to develop a probabilistic model. From a probabilistic perspective, we are interested in estimating the conditional probability of the class label given the observation, or the probability of class *y* given input data *X*.

- P(y | X)

Bayes Theorem provides an alternate and principled way for calculating the conditional probability using the reverse of the desired conditional probability, which is often simpler to calculate.

The simple form of the calculation for Bayes Theorem is as follows:

- P(A|B) = P(B|A) * P(A) / P(B)

Where the probability that we are interested in calculating *P(A|B)* is called the posterior probability and the marginal probability of the event *P(A)* is called the prior.

The direct application of Bayes Theorem for classification becomes intractable, especially as the number of variables or features (*n*) increases. Instead, we can simplify the calculation and assume that each input variable is independent. Although dramatic, this simpler calculation often gives very good performance, even when the input variables are highly dependent.

We can implement this from scratch by assuming a probability distribution for each separate input variable and calculating the probability of each specific input value belonging to each class and multiply the results together to give a score used to select the most likely class.

- P(yi | x1, x2, …, xn) = P(x1|yi) * P(x2|yi) * … * P(xn|yi) * P(yi)
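To make the calculation concrete, here is a minimal from-scratch sketch that assumes a Gaussian distribution for each input variable. The function names and the tiny dataset are hypothetical, and this is an illustration of the scoring rule above rather than a reproduction of any library's implementation.

```python
# sketch: gaussian naive bayes scoring with the standard library
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    # probability density of x under a normal distribution
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def fit(rows, labels):
    # for each class, store the prior P(yi) and (mean, stdev) per input variable
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        stats = [(mean(col), stdev(col)) for col in zip(*members)]
        model[label] = (len(members) / len(rows), stats)
    return model

def scores(model, row):
    # P(x1|yi) * P(x2|yi) * ... * P(xn|yi) * P(yi) for each class yi
    result = {}
    for label, (prior, stats) in model.items():
        score = prior
        for x, (mu, sigma) in zip(row, stats):
            score *= gaussian_pdf(x, mu, sigma)
        result[label] = score
    return result

def predict(model, row):
    # select the class with the largest score
    s = scores(model, row)
    return max(s, key=s.get)

# tiny made-up dataset with two input variables and two classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 4.0]))
```

The scores are not normalized probabilities; they are only compared with each other to pick the most likely class.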

The scikit-learn library provides an efficient implementation of the algorithm if we assume a Gaussian distribution for each input variable.

To use a scikit-learn Naive Bayes model, first the model is defined, then it is fit on the training dataset. Once fit, probabilities can be predicted via the *predict_proba()* function and class labels can be predicted directly via the *predict()* function.

The complete example of fitting a Gaussian Naive Bayes model (GaussianNB) to a test dataset is listed below.

    # example of gaussian naive bayes
    from sklearn.datasets import make_blobs
    from sklearn.naive_bayes import GaussianNB
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
    # define the model
    model = GaussianNB()
    # fit the model
    model.fit(X, y)
    # select a single sample
    Xsample, ysample = [X[0]], y[0]
    # make a probabilistic prediction
    yhat_prob = model.predict_proba(Xsample)
    print('Predicted Probabilities: ', yhat_prob)
    # make a classification prediction
    yhat_class = model.predict(Xsample)
    print('Predicted Class: ', yhat_class)
    print('Truth: y=%d' % ysample)

Running the example fits the model on the dataset, then makes a probabilistic prediction and a class prediction for the first example in the dataset and prints the true label for comparison.

For this lesson, you must run the example and report the result.

For a bonus, try the algorithm on a real classification dataset, such as the popular toy classification problem of classifying iris flower species based on flower measurements.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover entropy and the cross-entropy scores.

In this lesson, you will discover cross-entropy for machine learning.

Information theory is a field of study concerned with quantifying information for communication.

The intuition behind quantifying information is the idea of measuring how much surprise there is in an event. Those events that are rare (low probability) are more surprising and therefore have more information than those events that are common (high probability).

- **Low Probability Event**: High Information (surprising).
- **High Probability Event**: Low Information (unsurprising).

We can calculate the amount of information there is in an event using the probability of the event.

- Information(x) = -log( p(x) )

We can also quantify how much information there is in a random variable.

This is called entropy and summarizes the amount of information required on average to represent events.

Entropy can be calculated for a random variable X with K discrete states as follows:

- Entropy(X) = -sum(i=1 to K) p(i) * log(p(i))
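Both calculations are short enough to sketch directly. The example below assumes a base-2 logarithm, so the results are in bits; the function names are illustrative only.

```python
# sketch: information for an event and entropy for a random variable
from math import log2

def information(p):
    # information in bits for an event with probability p
    return -log2(p)

def entropy(probs):
    # average information across the K states; zero-probability states add nothing
    return -sum(p * log2(p) for p in probs if p > 0)

print(information(0.5))     # 1.0
print(information(0.125))   # 3.0
print(entropy([0.5, 0.5]))  # 1.0
print(entropy([0.25] * 4))  # 2.0
```

The rarer event (probability 0.125) carries more information than the common one, and four equally likely states need two bits on average.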

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events. It is widely used as a loss function when optimizing classification models.

It builds upon the idea of entropy and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

- CrossEntropy(P, Q) = – sum x in X P(x) * log(Q(x))

We can make the calculation of cross-entropy concrete with a small example.

Consider a random variable with three events as different colors. We may have two different probability distributions for this variable. We can calculate the cross-entropy between these two distributions.

The complete example is listed below.

    # example of calculating cross entropy
    from math import log2

    # calculate cross entropy
    def cross_entropy(p, q):
        return -sum([p[i]*log2(q[i]) for i in range(len(p))])

    # define data
    p = [0.10, 0.40, 0.50]
    q = [0.80, 0.15, 0.05]
    # calculate cross entropy H(P, Q)
    ce_pq = cross_entropy(p, q)
    print('H(P, Q): %.3f bits' % ce_pq)
    # calculate cross entropy H(Q, P)
    ce_qp = cross_entropy(q, p)
    print('H(Q, P): %.3f bits' % ce_qp)

Running the example first calculates the cross-entropy of Q from P, then P from Q.
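One way to see how cross-entropy builds upon entropy is to compare a distribution with itself. The sketch below reuses the same P and Q and is only an illustration: the cross-entropy of P with itself is the entropy of P, and using a different distribution Q can only add bits.

```python
# sketch: cross-entropy of a distribution with itself equals its entropy
from math import log2

def cross_entropy(p, q):
    # average bits to encode events from p using a code built for q
    return -sum(pi * log2(qi) for pi, qi in zip(p, q))

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
# entropy of P, calculated as the cross-entropy of P with itself
h_p = cross_entropy(p, p)
# cross-entropy of P and Q
ce_pq = cross_entropy(p, q)
print('H(P): %.3f bits' % h_p)
print('H(P, Q): %.3f bits' % ce_pq)
print('Extra bits from using Q: %.3f' % (ce_pq - h_p))
```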

For this lesson, you must run the example and describe the results and what they mean. For example, is the calculation of cross-entropy symmetrical?

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to develop and evaluate a naive classifier model.

In this lesson, you will discover how to develop and evaluate naive classification strategies for machine learning.

Classification predictive modeling problems involve predicting a class label given an input to the model.

Given a classification model, how do you know if the model has skill or not?

This is a common question on every classification predictive modeling project. The answer is to compare the results of a given classifier model to a baseline or naive classifier model.

Consider a simple two-class classification problem where the number of observations is not equal for each class (e.g. it is imbalanced) with 25 examples for class-0 and 75 examples for class-1. This problem can be used to consider different naive classifier models.

For example, consider a model that randomly predicts class-0 or class-1 with equal probability. How would it perform?

We can calculate the expected performance using a simple probability model.

- P(yhat = y) = P(yhat = 0) * P(y = 0) + P(yhat = 1) * P(y = 1)

We can plug in the occurrence of each class (0.25 and 0.75) and the predicted probability for each class (0.5 and 0.5) and estimate the performance of the model.

- P(yhat = y) = 0.5 * 0.25 + 0.5 * 0.75
- P(yhat = y) = 0.5

It turns out that this classifier is pretty poor.

Now, what if we consider predicting the majority class (class-1) every time? Again, we can plug in the predicted probabilities (0.0 and 1.0) and estimate the performance of the model.

- P(yhat = y) = 0.0 * 0.25 + 1.0 * 0.75
- P(yhat = y) = 0.75

It turns out that this simple change results in a better naive classification model, and is perhaps the best naive classifier to use when classes are imbalanced.
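The two calculations above follow the same pattern, so we can wrap them in a small helper. This is a sketch of the probability model from this lesson; the function name and the dictionary layout are assumptions made for illustration.

```python
# sketch: expected accuracy of a naive classifier
def expected_accuracy(class_priors, predict_probs):
    # P(yhat = y) = sum over classes of P(yhat = c) * P(y = c)
    return sum(class_priors[c] * predict_probs[c] for c in class_priors)

# class distribution from the example: 25 class-0 and 75 class-1
priors = {0: 0.25, 1: 0.75}
# predict each class with equal probability
uniform = expected_accuracy(priors, {0: 0.5, 1: 0.5})
# always predict the majority class
majority = expected_accuracy(priors, {0: 0.0, 1: 1.0})
print('uniform guess: %.2f' % uniform)    # 0.50
print('majority class: %.2f' % majority)  # 0.75
```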

The scikit-learn machine learning library provides an implementation of the majority class naive classification algorithm called the DummyClassifier that you can use on your next classification predictive modeling project.

The complete example is listed below.

    # example of the majority class naive classifier in scikit-learn
    from numpy import asarray
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score
    # define dataset
    X = asarray([0 for _ in range(100)])
    class0 = [0 for _ in range(25)]
    class1 = [1 for _ in range(75)]
    y = asarray(class0 + class1)
    # reshape data for sklearn
    X = X.reshape((len(X), 1))
    # define model
    model = DummyClassifier(strategy='most_frequent')
    # fit model
    model.fit(X, y)
    # make predictions
    yhat = model.predict(X)
    # calculate accuracy
    accuracy = accuracy_score(y, yhat)
    print('Accuracy: %.3f' % accuracy)

Running the example prepares the dataset, then defines and fits the *DummyClassifier* on the dataset using the majority class strategy.

For this lesson, you must run the example and report the result, confirming whether the model performs as we expected from our calculation.

As a bonus, calculate the expected probability of a naive classifier model that randomly chooses a class label from the training dataset each time a prediction is made.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover metrics for scoring models that predict probabilities.

In this lesson, you will discover two scoring methods that you can use to evaluate the predicted probabilities on your classification predictive modeling problem.

Predicting probabilities instead of class labels for a classification problem can provide additional nuance and uncertainty for the predictions.

The added nuance allows more sophisticated metrics to be used to interpret and evaluate the predicted probabilities.

Let’s take a closer look at the two popular scoring methods for evaluating predicted probabilities.

Logistic loss, or log loss for short, calculates the negative log likelihood of the observed class labels under the predicted probabilities.

Although developed for training binary classification models like logistic regression, it can be used to evaluate multi-class problems and is functionally equivalent to calculating the cross-entropy derived from information theory.

A model with perfect skill has a log loss score of 0.0. The log loss can be implemented in Python using the log_loss() function in scikit-learn.

For example:

    # example of log loss
    from numpy import asarray
    from sklearn.metrics import log_loss
    # define data
    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [0.8, 0.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3]
    # define data as expected, e.g. probability for each event {0, 1}
    y_true = asarray([[v, 1-v] for v in y_true])
    y_pred = asarray([[v, 1-v] for v in y_pred])
    # calculate log loss
    loss = log_loss(y_true, y_pred)
    print(loss)
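To see what the score measures, we can also calculate binary log loss by hand. The sketch below is a hand-rolled cross-check, not the scikit-learn implementation; it assumes the natural logarithm and clips probabilities with a small eps value (an assumption) to avoid taking the log of zero.

```python
# sketch: binary log loss calculated manually
from math import log

def binary_log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # keep the probability away from exactly 0 or 1
        p = min(max(p, eps), 1 - eps)
        # negative log likelihood of the observed label
        total += -(y * log(p) + (1 - y) * log(1 - p))
    # average over all examples
    return total / len(y_true)

# same mock data as the example above
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0.8, 0.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3]
print(binary_log_loss(y_true, y_pred))
```

Confident predictions for the wrong class are penalized heavily, while a prediction of 0.5 for every example gives a loss of log(2), or about 0.693.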

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

The Brier score can be calculated in Python using the brier_score_loss() function in scikit-learn.

For example:

    # example of brier loss
    from sklearn.metrics import brier_score_loss
    # define data
    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [0.8, 0.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3]
    # calculate brier score
    score = brier_score_loss(y_true, y_pred, pos_label=1)
    print(score)
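Because the Brier score is a mean squared error, it is straightforward to calculate by hand. The following is a minimal sketch for the binary case with an illustrative function name; it is not the scikit-learn implementation.

```python
# sketch: brier score calculated manually for binary labels
def brier_score(y_true, y_pred):
    # mean squared difference between predicted probability and outcome (0 or 1)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# same mock data as the example above
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0.8, 0.9, 0.9, 0.6, 0.8, 0.1, 0.4, 0.2, 0.1, 0.3]
print(brier_score(y_true, y_pred))
```

Perfect predictions score 0.0, predicting 0.5 for every example scores 0.25, and predictions that are confidently wrong every time score 1.0.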

For this lesson, you must run each example and report the results.

As a bonus, change the mock predictions to make them better or worse and compare the resulting scores.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson.

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of probability in applied machine learning.
- The three main types of probability and how to calculate them.
- Probability distributions for random variables and how to draw random samples from them.
- How Bayes theorem can be used to calculate conditional probability and how it can be used in a classification model.
- How to calculate information, entropy, and cross-entropy scores and what they mean.
- How to develop and evaluate the expected performance for naive classification models.
- How to evaluate the skill of a model that predicts probability values for a classification problem.

Take the next step and check out my book on Probability for Machine Learning.

**How did you do with the mini-course?**

Did you enjoy this crash course?

**Do you have any questions? Were there any sticking points?**

Let me know. Leave a comment below.

The post Probability for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

The post How to Develop an Intuition for Probability With Worked Examples appeared first on Machine Learning Mastery.

Probability calculations are frustratingly unintuitive.

Our brains are too eager to take shortcuts and get the wrong answer, instead of thinking through a problem and calculating the probability correctly.

To make this issue obvious and aid in developing intuition, it can be useful to work through classical problems from applied probability. These problems, such as the birthday problem, boy or girl problem, and the Monty Hall problem trick us with the incorrect intuitive answer and require a careful application of the rules of marginal, conditional, and joint probability in order to arrive at the correct solution.

In this post, you will discover how to develop an intuition for probability by working through classical thought-provoking problems.

After reading this post, you will know:

- How to solve the birthday problem by multiplying probabilities together.
- How to solve the boy or girl problem using conditional probability.
- How to solve the Monty Hall problem using joint probability.

Let’s get started.

This tutorial is divided into three parts; they are:

- Birthday Problem
- Boy or Girl Problem
- Monty Hall Problem

A classic example of applied probability involves calculating the probability of two people having the same birthday.

It is a classic example because the result does not match our intuition. As such, it is sometimes called the birthday paradox.

The problem can be generally stated as:

Problem: How many people are required so that any two people in the group have the same birthday with at least a 50-50 chance?

There are no tricks to this problem; it involves simply calculating the marginal probability.

It is assumed that the probability of a randomly selected person having a birthday on any given day of the year (excluding leap years) is uniformly distributed across the days of the year, e.g. 1/365 or about 0.274%.

Our intuition might leap to an answer and assume that we might need at least as many people as there are days in the year, e.g. 365. Our intuition likely fails because we are thinking about ourselves and other people matching our own birthday. That is, we are thinking about how many people are needed for another person born on the same day as you. That is a different question.

Instead, to calculate the solution, we can think about comparing pairs of people within a group and the probability of a given pair being born on the same day. This unlocks the calculation required.

The number of pairwise comparisons within a group (excluding comparing each person with themselves) is calculated as follows:

- comparisons = n * (n – 1) / 2

For example, if we have a group of five people, we would be doing 10 pairwise comparisons among the group to check if they have the same birthday, which is more opportunity for a hit than we might expect. Importantly, the number of comparisons within the group grows quadratically with the size of the group, much faster than the group itself.
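The formula can be checked against brute-force counting. The sketch below enumerates every pair with itertools.combinations; the group sizes chosen are arbitrary.

```python
# sketch: count pairwise comparisons two ways
from itertools import combinations

def comparisons(n):
    # n * (n - 1) / 2 unique pairs, excluding self-comparisons
    return n * (n - 1) // 2

for n in [2, 5, 10, 23, 30]:
    # enumerate every unique pair of people and count them
    pairs = sum(1 for _ in combinations(range(n), 2))
    print('n=%d comparisons=%d' % (n, comparisons(n)))
    assert pairs == comparisons(n)
```

A group of 23 people already involves 253 comparisons, which hints at why a match becomes likely sooner than intuition suggests.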

One more step is required. It is easier to calculate the inverse of the problem. That is, the probability that two people in a group do not have the same birthday. We can then invert the final result to give the desired probability, for example:

- p(2 in n people have the same birthday) = 1 – p(2 in n people do not have the same birthday)

We can see why calculating the probability of non-matching birthdays is easy with an example with a small group, in this case, three people.

People can be added to the group one-by-one. Each time a person is added to the group, it decreases the number of available days where there is no birthday in the year, decreasing the number of available days by one. For example 365 days, 364 days, etc.

Additionally, the probability of a non-match for a given additional person added to the group must be combined with the prior calculated probabilities before it. For example P(n=2) * P(n=3), etc.

This gives the following, calculating the probability of no matching birthdays with a group size of three:

- P(n=3) = 365/365 * 364/365 * 363/365
- P(n=3) = 99.18%

Inverting this gives about a 0.820% probability of a matching birthday among a group of three people.

Stepping through this, the first person has a birthday, which reduces the number of candidate days for the rest of the group from 365 to 364 unused days (i.e. days without a birthday). For the second person, we calculate the probability of not sharing a birthday as 364 safe days out of 365 days in the year, or 364/365, about 99.72%. We now subtract the second person’s birthday from the number of available days to give 363. The probability of the third person not having a matching birthday is then 363/365, which is multiplied by the prior probability to give about 99.18%.

This calculation can get tedious for large groups, therefore we might want to automate it.

The example below calculates the probabilities for group sizes from two to 30.

    # example of the birthday problem
    # define group size
    n = 30
    # number of days in the year
    days = 365
    # calculate probability for different group sizes
    p = 1.0
    for i in range(1, n):
        av = days - i
        p *= av / days
        print('n=%d, %d/%d, p=%.3f 1-p=%.3f' % (i+1, av, days, p*100, (1-p)*100))

Running the example first prints the group size, then the available days divided by the total days in the year, then the probability (as a percentage) of no matching birthdays in the group, followed by the complement, which is the probability of at least two people in the group having the same birthday.

    n=2, 364/365, p=99.726 1-p=0.274
    n=3, 363/365, p=99.180 1-p=0.820
    n=4, 362/365, p=98.364 1-p=1.636
    n=5, 361/365, p=97.286 1-p=2.714
    n=6, 360/365, p=95.954 1-p=4.046
    n=7, 359/365, p=94.376 1-p=5.624
    n=8, 358/365, p=92.566 1-p=7.434
    n=9, 357/365, p=90.538 1-p=9.462
    n=10, 356/365, p=88.305 1-p=11.695
    n=11, 355/365, p=85.886 1-p=14.114
    n=12, 354/365, p=83.298 1-p=16.702
    n=13, 353/365, p=80.559 1-p=19.441
    n=14, 352/365, p=77.690 1-p=22.310
    n=15, 351/365, p=74.710 1-p=25.290
    n=16, 350/365, p=71.640 1-p=28.360
    n=17, 349/365, p=68.499 1-p=31.501
    n=18, 348/365, p=65.309 1-p=34.691
    n=19, 347/365, p=62.088 1-p=37.912
    n=20, 346/365, p=58.856 1-p=41.144
    n=21, 345/365, p=55.631 1-p=44.369
    n=22, 344/365, p=52.430 1-p=47.570
    n=23, 343/365, p=49.270 1-p=50.730
    n=24, 342/365, p=46.166 1-p=53.834
    n=25, 341/365, p=43.130 1-p=56.870
    n=26, 340/365, p=40.176 1-p=59.824
    n=27, 339/365, p=37.314 1-p=62.686
    n=28, 338/365, p=34.554 1-p=65.446
    n=29, 337/365, p=31.903 1-p=68.097
    n=30, 336/365, p=29.368 1-p=70.632

The result is surprising, showing that only 23 people are required to give more than a 50% chance of two people having a birthday on the same day.

More surprising is that with 30 people, this increases to a 70% probability. It’s surprising because 20 to 30 people is about the average class size in school, a number of people for which we all have an intuition (if we attended school).

If the group size is increased to around 60 people, then the probability of two people in the group having the same birthday is above 99%!
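We can also turn the question around and search for the smallest group that reaches a target probability. This is a sketch using the same multiplication of non-matching probabilities; the function names are illustrative.

```python
# sketch: smallest group size that reaches a target probability of a shared birthday
def prob_shared_birthday(n, days=365):
    # probability that nobody in a group of n shares a birthday
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (days - i) / days
    # complement: at least two people share a birthday
    return 1.0 - p_no_match

def smallest_group(target, days=365):
    # add people one at a time until the target probability is reached
    n = 1
    while prob_shared_birthday(n, days) < target:
        n += 1
    return n

print(smallest_group(0.5))   # 23
print(smallest_group(0.99))  # 57
```

This matches the table above for the 50% case and shows that the 99% mark is crossed at 57 people, just under the "around 60" figure.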

Another classic example of applied probability is the case of calculating the probability of whether a baby is a boy or girl.

The probability of whether a given baby is a boy or a girl with no additional information is 50%. This may or may not be true in reality, but let’s assume it for the case of this useful illustration of probability.

As soon as more information is included, the probability calculation changes, and this trips up even people versed in math and probability.

A popular example is called the “*two-child problem*” that involves being given information about a family with two children and estimating the sex of one child. If the problem is not stated precisely, it can lead to misunderstanding, and in turn, two different ways of calculating the probability. This is the challenge of using natural language instead of notation, and in this case is referred to as the “boy or girl paradox.”

Let’s look at two precisely stated examples.

Case 1: A woman has two children and the oldest is a boy. What is the probability of this woman having two sons?

Our intuition suggests that the probability that the other child is a boy is 0.5 or 50%. Alternately, our intuition might suggest the probability of a family with two boys is 1/4 (e.g. a probability of 0.25) for the four possible combinations of boys and girls for a two-child family.

We can explore this by enumerating all possible combinations that include the information given:

| Younger Child | Older Child | Conditional Probability |
|---|---|---|
| Girl | Boy | 1/2 |
| Boy | Boy | 1/2 (*) |
| Girl | Girl | 0 (impossible) |
| Boy | Girl | 0 (impossible) |

There would be four outcomes, but the information given reduces the domain to 2 possible outcomes (older child is a boy).

Indeed, only one of the two outcomes can be boy-boy, therefore the probability is 1/2 or (0.5) 50%.

Let’s look at a second very similar case.

Case 2: A woman has two children and one of them is a boy. What is the probability of this woman having two sons?

Our intuition leaps to the same conclusion. At least mine did.

**And this would be incorrect.**

For example, we might leap to 1/2, reasoning that the other child is equally likely to be a boy or a girl. Another leap might be 1/4 for the case of boy-boy out of all possible cases of having two children.

To find out why, again, let’s enumerate all possible combinations:

| Younger Child | Older Child | Conditional Probability |
|---|---|---|
| Girl | Boy | 1/3 |
| Boy | Boy | 1/3 (*) |
| Boy | Girl | 1/3 |
| Girl | Girl | 0 (impossible) |

There would be four outcomes, but the information given reduces the domain to three possible outcomes (one child is a boy). One of the three cases is boy-boy, therefore the probability is 1/3 or about 33.33%.

We have more information in Case 1, which allows us to narrow down the domain of possible outcomes and give a result that matches our intuition.

Case 2 looks very similar, but in fact, it includes less information. We have no idea as to whether the older or younger child is a boy, therefore the domain of possible outcomes is larger, resulting in a non-intuitive answer.

These are both problems in conditional probability and we can solve them using the conditional probability formula, rather than enumerating examples.

- P (A | B) = P(A and B) / P(B)

The trick is in how the problem is stated.

The outcomes that we are interested in are a sequence, not a single birth event. We are interested in a boy-boy outcome given some information.

First, let’s state a table of all possible sequences regardless of what information is given, e.g. the unconditional probabilities:

| Younger Child | Older Child | Unconditional Probability |
|---|---|---|
| Girl | Boy | 1/4 |
| Boy | Boy | 1/4 |
| Girl | Girl | 1/4 |
| Boy | Girl | 1/4 |

We can calculate the conditional probabilities using the table of unconditional probabilities.

In case 1, we know that the oldest child, or second part of the outcome, is a boy, therefore we can state the problem as follows:

- P(boy-boy | {boy-boy or girl-boy})

We can calculate the conditional probability as follows:

- = P(boy-boy and {boy-boy or girl-boy}) / P({boy-boy or girl-boy})
- = P(boy-boy) / P({boy-boy or girl-boy})
- = (1/4) / (2/4)
- = 0.25 / 0.5
- = 0.5

In case 2, we know one child is a boy, but not whether it is the older or younger child; therefore, we can state the problem as follows:

- P(boy-boy | {boy-boy or girl-boy or boy-girl})

We can calculate the conditional probability as follows:

- = P(boy-boy and {boy-boy or girl-boy or boy-girl}) / P({boy-boy or girl-boy or boy-girl})
- = (1/4) / (3/4)
- = 0.25 / 0.75
- = 0.333
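Both cases can be confirmed by enumerating the four equally likely outcomes in code. This is a small sketch with illustrative names; each outcome is a (younger, older) pair, matching the column order of the tables above.

```python
# sketch: solve both boy or girl cases by enumeration
from fractions import Fraction
from itertools import product

# all (younger, older) outcomes, each with unconditional probability 1/4
outcomes = list(product(['boy', 'girl'], repeat=2))

def conditional(event, given):
    # P(event | given) = count(event and given) / count(given)
    kept = [o for o in outcomes if given(o)]
    hits = [o for o in kept if event(o)]
    return Fraction(len(hits), len(kept))

def two_boys(outcome):
    return outcome == ('boy', 'boy')

# case 1: the older child is a boy
case1 = conditional(two_boys, lambda o: o[1] == 'boy')
# case 2: at least one child is a boy
case2 = conditional(two_boys, lambda o: 'boy' in o)
print(case1)  # 1/2
print(case2)  # 1/3
```

Because all outcomes are equally likely, counting outcomes gives the same result as dividing the joint probability by the probability of the given information.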

This is a useful illustration of how we might overcome our incorrect intuitions and achieve the correct answer by first enumerating the possible cases, and second by calculating the conditional probability directly.

A final classical problem in applied probability is called the game show problem, or the Monty Hall problem.

It is based on a real game show called “*Let’s Make a Deal*” and named for the host of the show.

The problem can be described generally as follows:

Problem: The contestant is given a choice of three doors. Behind one is a car, behind the other two are goats. Once a door is chosen, the host, who knows where the car is, opens another door, which has a goat, and asks the contestant if they wish to keep their choice or change to the other unopened door.

It is another classical problem because the solution is not intuitive and in the past has caused great confusion and debate.

Intuition for the problem says that there is a 1 in 3 or 33% chance of picking the car initially, and this becomes 1/2 or 50% once the host opens a door to reveal a goat.

**This is incorrect.**

We can start by enumerating all combinations and listing the unconditional probabilities. Assume there are three doors and the contestant randomly selects a door, e.g. door 1.

| Door 1 | Door 2 | Door 3 | Unconditional Probability |
|---|---|---|---|
| Goat | Goat | Car | 1/3 |
| Goat | Car | Goat | 1/3 |
| Car | Goat | Goat | 1/3 |

At this stage, there is a 1/3 probability of a car, matching our intuition so far.

Then, the host opens another door with a goat, in this case, door 2.

The opened door was not selected randomly; instead, it was selected with information about where the car is not.

Our intuition suggests we remove the second case from the table and update the probability to 1/2 for each remaining case.

**This is incorrect and is the cause of the error.**

We can summarize our intuitive conditional probabilities for this scenario as follows:

| Door 1 | Door 2 | Door 3 | Uncon. | Cond. |
|---|---|---|---|---|
| Goat | Goat | Car | 1/3 | 1/2 |
| Goat | Car | Goat | 1/3 | 0 |
| Car | Goat | Goat | 1/3 | 1/2 |

This would be correct if the contestant did not make a choice before the host opened a door, e.g. if the host opening a door was independent.

The trick comes because the contestant made a choice before the host opened a door and this is useful information. It means the host could not open the chosen door (door 1) or open a door with a car behind it. The host’s choice was dependent upon the first choice of the contestant and was therefore constrained.

Instead, we must calculate the probability of switching or not switching, regardless of which door the host opens.

Let’s look at a table of outcomes given the choice of door 1 and staying or switching.

| Door 1 | Door 2 | Door 3 | Stay | Switch |
|---|---|---|---|---|
| Goat | Goat | Car | Goat | Car |
| Goat | Car | Goat | Goat | Car |
| Car | Goat | Goat | Car | Goat |

We can see that 2/3 cases of switching result in winning a car (first two rows), and that 1/3 gives the car if we stay (final row).

The contestant has a 2/3 or 66.66% probability of winning the car if they switch.

**They should always switch.**

We have solved it by enumerating and counting.
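If the counting argument still feels wrong, a simulation can help. The sketch below plays the game many times under each strategy; the number of trials and the seed are arbitrary assumptions, and the host is assumed to pick at random when two goat doors are available.

```python
# sketch: simulate the stay and switch strategies
from random import Random

def play(switch, rng):
    doors = [0, 1, 2]
    # the car is placed and the contestant chooses, both at random
    car = rng.choice(doors)
    choice = rng.choice(doors)
    # the host opens a door that is neither the contestant's choice nor the car
    opened = rng.choice([d for d in doors if d != choice and d != car])
    if switch:
        # move to the one remaining unopened door
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

def win_rate(switch, trials=20000, seed=1):
    rng = Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print('stay:   %.3f' % stay_rate)
print('switch: %.3f' % switch_rate)
```

The estimated win rates should be close to 1/3 for staying and 2/3 for switching, with small differences due to random sampling.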

Another approach to solving this problem is to calculate the joint probability of the host opening doors to test the stay-versus-switch decision under both cases, in order to maximize the probability of the desired outcome.

For example, given that the contestant has chosen door 1, we can calculate the probability of the host opening door 3 if door 1 has the car as follows:

- P(door1=car and door3=open) = 1/3 * 1/2
- = 0.333 * 0.5
- = 0.166

We can then calculate the joint probability of door 2 having the car and the host opening door 3. This is different because if door 2 contains the car, the host can only open door 3; it has a probability of 1.0, a certainty.

- P(door2=car and door3=open) = 1/3 * 1
- = 0.333 * 1.0
- = 0.333

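Having chosen door 1 and the host opening door 3, the joint probability is higher that the car is behind door 2 (about 33%) than door 1 (about 16%). Dividing each by their sum, 1/6 + 1/3 = 1/2, which is the probability that the host opens door 3, gives conditional probabilities of 2/3 for door 2 and 1/3 for door 1.

In this case, we should switch to door 2.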

Alternately, we can model the choice of the host opening door 2, which has the same structure of probabilities:

- P(door1=car and door2=open) = 0.166
- P(door3=car and door2=open) = 0.333

Again, having chosen door 1 and the host opening door 2, the probability is higher that the car is behind door 3 (about 33%) than door 1 (about 16%). We should switch.

If we are seeking to maximize these probabilities, then the best strategy is to switch.
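The joint probabilities above can be normalized to give the conditional probability of the car being behind each door. This is a sketch with exact fractions; the function name is hypothetical, and it assumes the host chooses uniformly at random when more than one door is allowed.

```python
# sketch: conditional probability of the car's location after the host opens a door
from fractions import Fraction

def posterior_after_host_opens(chosen, opened, doors=(1, 2, 3)):
    joint = {}
    for car in doors:
        # prior: the car is equally likely to be behind any door
        prior = Fraction(1, len(doors))
        # the host may only open a door that is not chosen and does not hide the car
        allowed = [d for d in doors if d != chosen and d != car]
        likelihood = Fraction(1, len(allowed)) if opened in allowed else Fraction(0)
        # joint probability P(car location and opened door)
        joint[car] = prior * likelihood
    # divide by the probability of the host opening this door
    total = sum(joint.values())
    return {car: p / total for car, p in joint.items()}

post = posterior_after_host_opens(chosen=1, opened=3)
print(post[1], post[2], post[3])  # 1/3 2/3 0
```

The joint values of 1/6 and 1/3 become 1/3 and 2/3 once divided by their sum, which agrees with the result from enumerating the cases.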

Again, in this example, we have seen how we can overcome our faulty intuitions and solve the problem both by enumerating the cases and by using conditional probability.

This section provides more resources on the topic if you are looking to go deeper.

- Conditional probability, Wikipedia.
- Boy or Girl paradox, Wikipedia.
- Birthday problem, Wikipedia.
- Monty Hall problem, Wikipedia.
- Conditional probability involving boy girl scenario, StackExchange.

In this post, you discovered how to develop an intuition for probability by working through classical thought-provoking problems.

Specifically, you learned:

- How to solve the birthday problem by multiplying probabilities together.
- How to solve the boy or girl problem using conditional probability.
- How to solve the Monty Hall problem using joint probability.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
