The post A Gentle Introduction to Computational Learning Theory appeared first on Machine Learning Mastery.

]]>These are sub-fields of machine learning that a machine learning practitioner does not need to know in great depth in order to achieve good results on a wide range of problems. Nevertheless, it is a sub-field where having a high-level understanding of some of the more prominent methods may provide insight into the broader task of learning from data.

In this post, you will discover a gentle introduction to computational learning theory for machine learning.

After reading this post, you will know:

- Computational learning theory uses formal methods to study learning tasks and learning algorithms.
- PAC learning provides a way to quantify the computational difficulty of a machine learning task.
- VC Dimension provides a way to quantify the computational capacity of a machine learning algorithm.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Computational Learning Theory
- PAC Learning (Theory of Learning Problems)
- VC Dimension (Theory of Learning Algorithms)

Computational learning theory, or *CoLT* for short, is a field of study concerned with the use of formal mathematical methods applied to learning systems.

It seeks to use the tools of theoretical computer science to quantify learning problems. This includes characterizing the difficulty of learning specific tasks.

Computational learning theory may be thought of as an extension or sibling of statistical learning theory, or *SLT* for short, that uses formal methods to quantify learning algorithms.

**Computational Learning Theory**(*CoLT*): Formal study of learning tasks.**Statistical Learning Theory**(*SLT*): Formal study of learning algorithms.

This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there is a lot of overlap between the two fields.

One can extend statistical learning theory by taking computational complexity of the learner into account. This field is called computational learning theory or COLT.

— Page 210, Machine Learning: A Probabilistic Perspective, 2012.

They might be considered synonyms in modern usage.

… a theoretical framework known as computational learning theory, also some- times called statistical learning theory.

— Page 344, Pattern Recognition and Machine Learning, 2006.

The focus in computational learning theory is typically on supervised learning tasks. Formal analysis of real problems and real algorithms is very challenging. As such, it is common to reduce the complexity of the analysis by focusing on binary classification tasks and even simple binary rule-based systems. As such, the practical application of the theorems may be limited or challenging to interpret for real problems and algorithms.

The main unanswered question in learning is this: How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?

— Page 713, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Questions explored in computational learning theory might include:

- How do we know a model has a good approximation for the target function?
- What hypothesis space should be used?
- How do we know if we have a local or globally good solution?
- How do we avoid overfitting?
- How many data examples are needed?

As a machine learning practitioner, it can be useful to know about computational learning theory and some of the main areas of investigation. The field provides a useful grounding for what we are trying to achieve when fitting models on data, and it may provide insight into the methods.

There are many subfields of study, although perhaps two of the most widely discussed areas of study from computational learning theory are:

- PAC Learning.
- VC Dimension.

Tersely, we can say that PAC Learning is the theory of machine learning problems and VC dimension is the theory of machine learning algorithms.

You may encounter the topics as a practitioner and it is useful to have a thumbnail idea of what they are about. Let’s take a closer look at each.

If you would like to dive deeper into the field of computational learning theory, I recommend the book:

**Probably approximately correct** learning, or PAC learning, refers to a theoretical machine learning framework developed by Leslie Valiant.

PAC learning seeks to quantify the difficulty of a learning task and might be considered the premier sub-field of computational learning theory.

Consider that in supervised learning, we are trying to approximate an unknown underlying mapping function from inputs to outputs. We don’t know what this mapping function looks like, but we suspect it exists, and we have examples of data produced by the function.

PAC learning is concerned with how much computational effort is required to find a hypothesis (fit model) that is a close match for the unknown target function.

For more on the use of “*hypothesis*” in machine learning to refer to a fit model, see the tutorial:

The idea is that a bad hypothesis will be found out based on the predictions it makes on new data, e.g. based on its generalization error.

A hypothesis that gets most or a large number of predictions correct, e.g. has a small generalization error, is probably a good approximation for the target function.

The underlying principle is that any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, TELY it must be probably approximately correct.

— Page 714, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

This probabilistic language gives the theorem its name: “*probability approximately correct*.” That is, a hypothesis seeks to “*approximate*” a target function and is “*probably*” good if it has a low generalization error.

A PAC learning algorithm refers to an algorithm that returns a hypothesis that is PAC.

Using formal methods, a minimum generalization error can be specified for a supervised learning task. The theorem can then be used to estimate the expected number of samples from the problem domain that would be required to determine whether a hypothesis was PAC or not. That is, it provides a way to estimate the number of samples required to find a PAC hypothesis.

The goal of the PAC framework is to understand how large a data set needs to be in order to give good generalization. It also gives bounds for the computational cost of learning …

— Page 344, Pattern Recognition and Machine Learning, 2006.

Additionally, a hypothesis space (machine learning algorithm) is efficient under the PAC framework if an algorithm can find a PAC hypothesis (fit model) in polynomial time.

A hypothesis space is said to be efficiently PAC-learnable if there is a polynomial time algorithm that can identify a function that is PAC.

— Page 210, Machine Learning: A Probabilistic Perspective, 2012.

For more on PAC learning, refer to the seminal book on the topic titled:

- Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World, 2013.

Vapnik–Chervonenkis theory, or VC theory for short, refers to a theoretical machine learning framework developed by Vladimir Vapnik and Alexey Chervonenkis.

VC theory learning seeks to quantify the capability of a learning algorithm and might be considered the premier sub-field of statistical learning theory.

VC theory is comprised of many elements, most notably the VC dimension.

The VC dimension quantifies the complexity of a hypothesis space, e.g. the models that could be fit given a representation and learning algorithm.

One way to consider the complexity of a hypothesis space (space of models that could be fit) is based on the number of distinct hypotheses it contains and perhaps how the space might be navigated. The VC dimension is a clever approach that instead measures the number of examples from the target problem that can be discriminated by hypotheses in the space.

The VC dimension measures the complexity of the hypothesis space […] by the number of distinct instances from X that can be completely discriminated using H.

— Page 214, Machine Learning, 1997.

The VC dimension estimates the capability or capacity of a classification machine learning algorithm for a specific dataset (number and dimensionality of examples).

Formally, the VC dimension is the largest number of examples from the training dataset that the space of hypothesis from the algorithm can “*shatter*.”

The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H.

— Page 215, Machine Learning, 1997.

Shatter or a shattered set, in the case of a dataset, means points in the feature space can be selected or separated from each other using hypotheses in the space such that the labels of examples in the separate groups are correct (whatever they happen to be).

Whether a group of points can be shattered by an algorithm depends on the hypothesis space and the number of points.

For example, a line (hypothesis space) can be used to shatter three points, but not four points.

Any placement of three points on a 2d plane with class labels 0 or 1 can be “*correctly*” split by label with a line, e.g. shattered. But, there exists placements of four points on plane with binary class labels that cannot be correctly split by label with a line, e.g. cannot be shattered. Instead, another “algorithm” must be used, such as ovals.

The figure below makes this clear.

Therefore, the VC dimension of a machine learning algorithm is the largest number of data points in a dataset that a specific configuration of the algorithm (hyperparameters) or specific fit model can shatter.

A classifier that predicts the same value in all cases will have a VC dimension of 0, no points. A large VC dimension indicates that an algorithm is very flexible, although the flexibility may come at the cost of additional risk of overfitting.

The VC dimension is used as part of the PAC learning framework.

A key quantity in PAC learning is the Vapnik-Chervonenkis dimension, or VC dimension, which provides a measure of the complexity of a space of functions, and which allows the PAC framework to be extended to spaces containing an infinite number of functions.

— Page 344, Pattern Recognition and Machine Learning, 2006.

For more on PCA Learning, refer to the seminal book on the topic titled:

This section provides more resources on the topic if you are looking to go deeper.

- Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World, 2013.
- Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
- An Introduction to Computational Learning Theory, 1994.
- Machine Learning: A Probabilistic Perspective, 2012.
- The Nature of Statistical Learning Theory, 1999.
- Pattern Recognition and Machine Learning, 2006.
- Machine Learning, 1997.

- Statistical learning theory, Wikipedia.
- Computational learning theory, Wikipedia.
- Probably approximately correct learning, Wikipedia.
- Vapnik–Chervonenkis theory, Wikipedia.
- Vapnik–Chervonenkis dimension, Wikipedia.

In this post, you discovered a gentle introduction to computational learning theory for machine learning.

Specifically, you learned:

- Computational learning theory uses formal methods to study learning tasks and learning algorithms.
- PAC learning provides a way to quantify the computational difficulty of a machine learning task.
- VC Dimension provides a way to quantify the computational capacity of a machine learning algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Computational Learning Theory appeared first on Machine Learning Mastery.

]]>The post Develop an Intuition for Bayes Theorem With Worked Examples appeared first on Machine Learning Mastery.

]]>It is a deceptively simple calculation, providing a method that is easy to use for scenarios where our intuition often fails.

The best way to develop an intuition for Bayes Theorem is to think about the meaning of the terms in the equation and to apply the calculation many times in a range of different real-world scenarios. This will provide the context for what is being calculated and examples that can be used as a starting point when applying the calculation in new scenarios in the future.

In this tutorial, you will discover an intuition for calculating Bayes Theorem by working through multiple realistic scenarios.

After completing this tutorial, you will know:

- Bayes Theorem is a technique for calculating a conditional probability.
- The common and helpful names used for the terms in the Bayes Theorem equation.
- How to work through three realistic scenarios using Bayes Theorem to find a solution.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Introduction to Bayes Theorem
- Naming the Terms in the Theorem
- Example 1: Elderly Fall and Death
- Example 2: Email and Spam Detection
- Example 3: Liars and Lie Detectors

Conditional probability is the probability of one event given the occurrence of another event, often described in terms of events *A* and *B* from two dependent random variables e.g. *X* and *Y*.

**Conditional Probability**: Probability of one (or more) event given the occurrence of another event, e.g. P(A given B) or P(A | B).

The conditional probability can be calculated using the joint probability; for example:

- P(A | B) = P(A and B) / P(B)

The conditional probability is not symmetrical; for example:

- P(A | B) != P(B | A)

Nevertheless, one conditional probability can be calculated using the other conditional probability.

This is called Bayes Theorem, named for Reverend Thomas Bayes, and can be stated as follows:

- P(A|B) = P(B|A) * P(A) / P(B)

Bayes Theorem provides a principled way for calculating a conditional probability and an alternative to using the joint probability.

This alternate approach to calculating the conditional probability is useful either when the joint probability is challenging to calculate, or when the reverse conditional probability is available or easy to calculate.

**Bayes Theorem**: Principled way of calculating a conditional probability without the joint probability.

It is often the case that we do not have access to the denominator directly, e.g. P(B).

We can calculate it an alternative way; for example:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

This gives a formulation of Bayes Theorem that we can use that uses the alternate calculation of P(B), described below:

- P(A|B) = P(B|A) * P(A) / P(B|A) * P(A) + P(B|not A) * P(not A)

**Note**: the denominator is simply the expansion we gave above.

As such, if we have P(A), then we can calculate P(not A) as its complement; for example:

- P(not A) = 1 – P(A)

Additionally, if we have P(not B|not A), then we can calculate P(B|not A) as its complement; for example:

- P(B|not A) = 1 – P(not B|not A)

Now that we are familiar with the calculation of Bayes Theorem, let’s take a closer look at the meaning of the terms in the equation.

The terms in the Bayes Theorem equation are given names depending on the context where the equation is used.

It can be helpful to think about the calculation from these different perspectives and help to map your problem onto the equation.

Firstly, in general, the result P(A|B) is referred to as the **posterior probability** and P(A) is referred to as the **prior probability**.

- P(A|B): Posterior probability.
- P(A): Prior probability.

Sometimes P(B|A) is referred to as the **likelihood** and P(B) is referred to as the **evidence**.

- P(B|A): Likelihood.
- P(B): Evidence.

This allows Bayes Theorem to be restated as:

- Posterior = Likelihood * Prior / Evidence

We can make this clear with a smoke and fire case.

What is the probability that there is fire given that there is smoke?

Where P(Fire) is the Prior, P(Smoke|Fire) is the Likelihood, and P(Smoke) is the evidence:

- P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)

You can imagine the same situation with rain and clouds.

We can also think about the calculation in the terms of a binary classifier.

For example, P(B|A) may be referred to as the True Positive Rate (TPR) or the **sensitivity**, P(B|not A) may be referred to as the False Positive Rate (FPR), the complement P(not B|not A) may be referred to as the True Negative Rate (TNR) or **specificity**, and the value we are calculating P(A|B) may be referred to as the Positive Predictive Value (PPV) or **precision**.

- P(not B|not A): True Negative Rate or TNR (
**specificity**). - P(B|not A): False Positive Rate or FPR.
- P(not B|A): False Negative Rate or FNR.
- P(B|A): True Positive Rate or TPR (
**sensitivity**or recall). - P(A|B): Positive Predictive Value or PPV (
**precision**).

For example, we may re-state the calculation using these terms as follows:

- PPV = (TPR * P(A)) / (TPR * P(A) + FPR * P(not A))

This is a useful perspective on Bayes Theorem and is elaborated further in the tutorial:

Now that we are familiar with Bayes Theorem and the meaning of the terms, let’s look at some scenarios where we can calculate it.

Note that all of the following examples are contrived; they are not based on real-world probabilities.

Consider the case where an elderly person (over 80 years of age) falls; what is the probability that they will die from the fall?

Let’s assume that the base rate of someone elderly dying P(A) is 10%, and the base rate for elderly people falling P(B) is 5%, and from all elderly people, 7% of those that die had a fall P(B|A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Die|Fall) = P(Fall|Die) * P(Die) / P(Fall)

or

- P(Die|Fall) = 0.07 * 0.10 / 0.05
- P(Die|Fall) = 0.14

That is, if an elderly person falls, then there is a 14 percent probability that they will die from the fall.

To make this concrete, we can perform the calculation in Python, first defining what we know, then using Bayes Theorem to calculate the outcome.

The complete example is listed below.

# calculate P(A|B) given P(B|A), P(A) and P(B) def bayes_theorem(p_a, p_b, p_b_given_a): # calculate P(A|B) = P(B|A) * P(A) / P(B) p_a_given_b = (p_b_given_a * p_a) / p_b return p_a_given_b # P(A) p_a = 0.10 # P(B) p_b = 0.05 # P(B|A) p_b_given_a = 0.07 # calculate P(A|B) result = bayes_theorem(p_a, p_b, p_b_given_a) # summarize print('P(A|B) = %.3f%%' % (result * 100))

Running the example confirms the value we calculated manually.

P(A|B) = 14%

Consider the case where we receive an email and the spam detector puts it in the spam folder; what is the probability it was spam?

Let’s assume some details such as 2 percent of the email we receive is spam P(A). Let’s assume that the spam detector is really good and when an email is spam, it detects it P(B|A) with an accuracy of 99 percent, and when an email is not spam, it will mark it as spam with a very low rate of 0.1 percent P(B|not A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Spam|Detected) = P(Detected|Spam) * P(Spam) / P(Detected)

or

- P(Spam|Detected) = 0.99 * 0.02 / P(Detected)

We don’t know P(B), that is P(Detected), but we can calculate it using:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

Or in terms of our problem:

- P(Detected) = P(Detected|Spam) * P(Spam) + P(Detected|not Spam) * P(not Spam)

We know P(Detected|not Spam), which is 0.1 percent and we can calculate P(not Spam) as 1 – P(Spam); for example:

- P(not Spam) = 1 – P(Spam)
- P(not Spam) = 1 – 0.02
- P(not Spam) = 0.98

Therefore, we can calculate P(Detected) as:

- P(Detected) = 0.99 * 0.02 + 0.001 * 0.98
- P(Detected) = 0.0198 + 0.00098
- P(Detected) = 0.02078

That is, about 2 percent of all emails are detected as spam, regardless of whether they are spam or not.

Now we can calculate the answer as:

- P(Spam|Detected) = 0.99 * 0.02 / 0.02078
- P(Spam|Detected) = 0.0198 / 0.02078
- P(Spam|Detected) = 0.95283926852743

That is, if an email is in the spam folder, there is a 95.2 percent probability that it is, in fact, spam.

Again, let’s confirm this result by calculating it with an example in Python.

The complete example is listed below.

# calculate the probability of an email in the spam folder being spam # calculate P(A|B) given P(A), P(B|A), P(B|not A) def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a): # calculate P(not A) not_a = 1 - p_a # calculate P(B) p_b = p_b_given_a * p_a + p_b_given_not_a * not_a # calculate P(A|B) p_a_given_b = (p_b_given_a * p_a) / p_b return p_a_given_b # P(A) p_a = 0.02 # P(B|A) p_b_given_a = 0.99 # P(B|not A) p_b_given_not_a = 0.001 # calculate P(A|B) result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a) # summarize print('P(A|B) = %.3f%%' % (result * 100))

Running the example gives the same result, confirming our manual calculation.

P(A|B) = 95.284%

Consider the case where a person is tested with a lie detector and the test suggests they are lying. What is the probability that the person is indeed lying?

Let’s assume some details, such as most people that are tested are telling the truth, such as 98 percent, meaning (1 – 0.98) or 2 percent are liars P(A). Let’s also assume that when someone is lying, that the test can detect them well, but not great, such as 72 percent of the time P(B|A). Let’s also assume that when the machine says they are not lying, this is true 97 percent of the time P(not B | not A).

Let’s plug what we know into the theorem:

- P(A|B) = P(B|A) * P(A) / P(B)
- P(Lying|Positive) = P(Positive|Lying) * P(Lying) / P(Positive)

Or:

- P(Lying|Positive) = 0.72 * 0.02 / P(Positive)

Again, we don’t know P(B), or in this case how often the detector returns a positive result in general.

We can calculate this using the formula:

- P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)

Or:

- P(Positive) = P(Positive|Lying) * P(Lying) + P(Positive|not Lying) * P(not Lying)

Or, with numbers:

- P(Positive) = 0.72 * 0.02 + P(Positive|not Lying) * (1 – 0.02)
- P(Positive) = 0.72 * 0.02 + P(Positive|not Lying) * 0.98

In this case, we don’t know the probability of a positive detection result given that the person was not lying; that is we don’t know the false positive rate or the false alarm rate.

This can be calculated as follows:

- P(B|not A) = 1 – P(not B|not A)

Or:

- P(Positive|not Lying) = 1 – P(not Positive|not Lying)
- P(Positive|not Lying) = 1 – 0.97
- P(Positive|not Lying) = 0.03

Therefore, we can calculate P(B) or P(Positive) as:

- P(Positive) = 0.72 * 0.02 + 0.03 * 0.98
- P(Positive) = 0.0144 + 0.0294
- P(Positive) = 0.0438

That is, the test returns a positive result about 4 percent of the time, regardless of whether the person is lying or not.

We can now calculate Bayes Theorem for this scenario:

- P(Lying|Positive) = 0.72 * 0.02 / 0.0438
- P(Lying|Positive) = 0.0144 / 0.0438
- P(Lying|Positive) = 0.328767123287671

That is, if the lie detector test comes back with a positive result, then there is a 32.8 percent probability that they are, in fact, lying. It’s a poor test!

Finally, let’s confirm this calculation in Python.

The complete example is listed below.

# calculate the probability of a person lying given a positive lie detector result # calculate P(A|B) given P(A), P(B|A), P(not B|not A) def bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a): # calculate P(not A) not_a = 1 - p_a # calculate P(B|not A) p_b_given_not_a = 1 - p_not_b_given_not_a # calculate P(B) p_b = p_b_given_a * p_a + p_b_given_not_a * not_a # calculate P(A|B) p_a_given_b = (p_b_given_a * p_a) / p_b return p_a_given_b # P(A), base rate p_a = 0.02 # P(B|A) p_b_given_a = 0.72 # P(not B| not A) p_not_b_given_not_a = 0.97 # calculate P(A|B) result = bayes_theorem(p_a, p_b_given_a, p_not_b_given_not_a) # summarize print('P(A|B) = %.3f%%' % (result * 100))

Running the example gives the same result, confirming our manual calculation.

P(A|B) = 32.877%

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered an intuition for calculating Bayes Theorem by working through multiple realistic scenarios.

Specifically, you learned:

- Bayes Theorem is a technique for calculating a conditional probability.
- The common and helpful names used for the terms in the Bayes Theorem equation.
- How to work through three realistic scenarios using Bayes Theorem to find a solution.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Develop an Intuition for Bayes Theorem With Worked Examples appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to the Bayes Optimal Classifier appeared first on Machine Learning Mastery.

]]>It is described using the Bayes Theorem that provides a principled way for calculating a conditional probability. It is also closely related to the Maximum a Posteriori: a probabilistic framework referred to as MAP that finds the most probable hypothesis for a training dataset.

In practice, the Bayes Optimal Classifier is computationally expensive, if not intractable to calculate, and instead, simplifications such as the Gibbs algorithm and Naive Bayes can be used to approximate the outcome.

In this post, you will discover Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

After reading this post, you will know:

- Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic model that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Bayes Theorem
- Maximum a Posteriori (MAP)
- Bayes Optimal Classifier

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant of *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of B given A multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

Or, simply:

- P(A | B) = P(B | A) * P(A)

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.

For more on the topic of Bayes Theorem, see the post:

Now that we are up to speed on Bayes Theorem, let’s also take a look at the Maximum a Posteriori framework.

Machine learning involves finding a model (hypothesis) that best explains the training data.

There are two probabilistic frameworks that underlie many different machine learning algorithms.

They are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

The objective of both of these frameworks in the context of machine learning is to locate the hypothesis that is most probable given the training dataset.

Specifically, they answer the question:

What is the most probable hypothesis given the training data?

Both approaches frame the problem of fitting a model as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

MLE is a frequentist approach, and MAP provides a Bayesian alternative.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.

Given the simplification of Bayes Theorem to a proportional quantity, we can use it to estimate the proportional hypothesis and parameters (*theta*) that explain our dataset (*X*), stated as:

- P(theta | X) = P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the model of the distribution).

As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

- maximize P(X | theta) * P(theta)

For more on the topic of Maximum a Posteriori, see the post:

Now that we are familiar with the MAP framework, we can take a closer look at the related concept of the Bayes optimal classifier.

The Bayes optimal classifier is a probabilistic model that makes the most probable prediction for a new example, given the training dataset.

This model is also referred to as the Bayes optimal learner, the Bayes classifier, Bayes optimal decision boundary, or the Bayes optimal discriminant function.

**Bayes Classifier**: Probabilistic model that makes the most probable prediction for new examples.

Specifically, the Bayes optimal classifier answers the question:

What is the most probable classification of the new instance given the training data?

This is different from the MAP framework that seeks the most probable hypothesis (model). Instead, we are interested in making a specific prediction.

In general, the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

— Page 175, Machine Learning, 1997.

The equation below demonstrates how to calculate the conditional probability for a new instance (*vi*) given the training data (*D*), given a space of hypotheses (*H*).

- P(vj | D) = sum {h in H} P(vj | hi) * P(hi | D)

Where *vj* is a new instance to be classified, *H* is the set of hypotheses for classifying the instance, *hi* is a given hypothesis, *P(vj | hi)* is the posterior probability for *vi* given hypothesis *hi*, and *P(hi | D)* is the posterior probability of the hypothesis *hi* given the data *D*.

Selecting the outcome with the maximum probability is an example of a Bayes optimal classification.

- max sum {h in H} P(vj | hi) * P(hi | D)

Any model that classifies examples using this equation is a Bayes optimal classifier and no other model can outperform this technique, on average.

Any system that classifies new instances according to [the equation] is called a Bayes optimal classifier, or Bayes optimal learner. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

— Page 175, Machine Learning, 1997.

We have to let that sink in.

It is a big deal.

It means that any other algorithm that operates on the same data, the same set of hypotheses, and same prior probabilities cannot outperform this approach, on average. Hence the name “*optimal classifier*.”

Although the classifier makes optimal predictions, it is not perfect given the uncertainty in the training data and incomplete coverage of the problem domain and hypothesis space. As such, the model will make errors. These errors are often referred to as Bayes errors.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. […] The Bayes error rate is analogous to the irreducible error …

— Page 38, An Introduction to Statistical Learning with Applications in R, 2017.

Because the Bayes classifier is optimal, the Bayes error is the minimum possible error that can be made.

**Bayes Error**: The minimum possible error that can be made when making predictions.

Further, the model is often described in terms of classification, e.g. the Bayes Classifier. Nevertheless, the principle applies just as well to regression: that is, predictive modeling problems where a numerical value is predicted instead of a class label.

It is a theoretical model, but it is held up as an ideal that we may wish to pursue.

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

Because of the computational cost of this optimal strategy, we instead can work with direct simplifications of the approach.

Two of the most commonly used simplifications use a sampling algorithm for hypotheses, such as Gibbs sampling, or to use the simplifying assumptions of the Naive Bayes classifier.

**Gibbs Algorithm**. Randomly sample hypotheses biased on their posterior probability.**Naive Bayes**. Assume that variables in the input data are conditionally independent.

For more on the topic of Naive Bayes, see the post:

Nevertheless, many nonlinear machine learning algorithms are able to make predictions are that are close approximations of the Bayes classifier in practice.

Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.

— Page 39, An Introduction to Statistical Learning with Applications in R, 2017.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning
- A Gentle Introduction to Bayes Theorem for Machine Learning
- How to Develop a Naive Bayes Classifier from Scratch in Python

- Section 6.7 Bayes Optimal Classifier, Machine Learning, 1997.
- Section 2.4.2 Bayes error and noise, Foundations of Machine Learning, 2nd edition, 2018.
- Section 2.2.3 The Classification Setting, An Introduction to Statistical Learning with Applications in R, 2017.
- Information Theory, Inference and Learning Algorithms, 2003.

- The Multilayer Perceptron As An Approximation To A Bayes Optimal Discriminant Function, 1990.
- Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains, 2010.
- Restricted bayes optimal classifiers, 2000.
- Bayes Classifier And Bayes Error, 2013.

In this post, you discovered the Bayes Optimal Classifier for making the most accurate predictions for new instances of data.

Specifically, you learned:

- Bayes Theorem provides a principled way for calculating conditional probabilities, called a posterior probability.
- Maximum a Posteriori is a probabilistic framework that finds the most probable hypothesis that describes the training dataset.
- Bayes Optimal Classifier is a probabilistic framework that finds the most probable prediction using the training data and space of hypotheses to make a prediction for a new data instance.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Bayes Optimal Classifier appeared first on Machine Learning Mastery.

]]>The post How to Use an Empirical Distribution Function in Python appeared first on Machine Learning Mastery.

]]>As such, it is sometimes called the **empirical cumulative distribution function**, or ECDF for short.

In this tutorial, you will discover the empirical probability distribution function.

After completing this tutorial, you will know:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Empirical Distribution Function
- Bimodal Data Distribution
- Sampling Empirical Distribution

Typically, the distribution of observations for a data sample fits a well-known probability distribution.

For example, the heights of humans will fit the normal (Gaussian) probability distribution.

This is not always the case. Sometimes the observations in a collected data sample do not fit any known probability distribution and cannot be easily forced into an existing distribution by data transforms or parameterization of the distribution function.

Instead, an empirical probability distribution must be used.

There are two main types of probability distribution functions we may need to sample; they are:

- Probability Density Function (PDF).
- Cumulative Distribution Function (CDF).

The PDF returns the expected probability for observing a value. For discrete data, the PDF is referred to as a Probability Mass Function (PMF). The CDF returns the expected probability for observing a value less than or equal to a given value.

An empirical probability density function can be fit and used for a data sampling using a nonparametric density estimation method, such as Kernel Density Estimation (KDE).

An empirical cumulative distribution function is called the Empirical Distribution Function, or EDF for short. It is also referred to as the Empirical Cumulative Distribution Function, or ECDF.

The EDF is calculated by ordering all of the unique observations in the data sample and calculating the cumulative probability for each as the number of observations less than or equal to a given observation divided by the total number of observations.

As follows:

- EDF(x) = number of observations <= x / n

Like other cumulative distribution functions, the sum of probabilities will proceed from 0.0 to 1.0 as the observations in the domain are enumerated from smallest to largest.

To make the empirical distribution function concrete, let’s look at an example with a dataset that clearly does not fit a known probability distribution.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

We can define a dataset that clearly does not match a standard probability distribution function.

A common example is when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution).

We can construct a bimodal distribution by combining samples from two different normal distributions. Specifically, 300 examples with a mean of 20 and a standard deviation of five (the smaller peak), and 700 examples with a mean of 40 and a standard deviation of five (the larger peak).

The means were chosen close together to ensure the distributions overlap in the combined sample.

The complete example of creating this sample with a bimodal probability distribution and plotting the histogram is listed below.

# example of a bimodal data sample from matplotlib import pyplot from numpy.random import normal from numpy import hstack # generate a sample sample1 = normal(loc=20, scale=5, size=300) sample2 = normal(loc=40, scale=5, size=700) sample = hstack((sample1, sample2)) # plot the histogram pyplot.hist(sample, bins=50) pyplot.show()

Running the example creates the data sample and plots the histogram.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We have fewer samples with a mean of 20 than samples with a mean of 40, which we can see reflected in the histogram with a larger density of samples around 40 than around 20.

Data with this distribution does not nicely fit into a common probability distribution by design.

Below is a plot of the probability density function (PDF) of this data sample.

It is a good case for using an empirical distribution function.

An empirical distribution function can be fit for a data sample in Python.

The statmodels Python library provides the ECDF class for fitting an empirical cumulative distribution function and calculating the cumulative probabilities for specific observations from the domain.

The distribution is fit by calling ECDF() and passing in the raw data sample.

... # fit a cdf ecdf = ECDF(sample)

Once fit, the function can be called to calculate the cumulative probability for a given observation.

... # get cumulative probability for values print('P(x<20): %.3f' % ecdf(20)) print('P(x<40): %.3f' % ecdf(40)) print('P(x<60): %.3f' % ecdf(60))

The class also provides an ordered list of unique observations in the data (the *.x* attribute) and their associated probabilities (*.y* attribute). We can access these attributes and plot the CDF function directly.

... # plot the cdf pyplot.plot(ecdf.x, ecdf.y) pyplot.show()

Tying this together, the complete example of fitting an empirical distribution function for the bimodal data sample is below.

# fit an empirical cdf to a bimodal dataset from matplotlib import pyplot from numpy.random import normal from numpy import hstack from statsmodels.distributions.empirical_distribution import ECDF # generate a sample sample1 = normal(loc=20, scale=5, size=300) sample2 = normal(loc=40, scale=5, size=700) sample = hstack((sample1, sample2)) # fit a cdf ecdf = ECDF(sample) # get cumulative probability for values print('P(x<20): %.3f' % ecdf(20)) print('P(x<40): %.3f' % ecdf(40)) print('P(x<60): %.3f' % ecdf(60)) # plot the cdf pyplot.plot(ecdf.x, ecdf.y) pyplot.show()

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example fits the empirical CDF to the data sample, then prints the cumulative probability for observing three values.

P(x<20): 0.149 P(x<40): 0.654 P(x<60): 1.000

Then the cumulative probability for the entire domain is calculated and shown as a line plot.

Here, we can see the familiar S-shaped curve seen for most cumulative distribution functions, here with bumps around the mean of both peaks of the bimodal distribution.

This section provides more resources on the topic if you are looking to go deeper.

- Section 2.3.4 The empirical distribution, Machine Learning: A Probabilistic Perspective, 2012.
- Section 3.9.5 The Dirac Distribution and Empirical Distribution, Deep Learning, 2016.

- Empirical distribution function, Wikipedia.
- Cumulative distribution function, Wikipedia.
- Probability Density Function, Wikipedia.
- Kernel density estimation, Wikipedia.

In this tutorial, you discovered the empirical probability distribution function.

Specifically, you learned:

- Some data samples cannot be summarized using a standard distribution.
- An empirical distribution function provides a way of modeling cumulative probabilities for a data sample.
- How to use the statsmodels library to model and sample an empirical cumulative distribution function.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Use an Empirical Distribution Function in Python appeared first on Machine Learning Mastery.

]]>The post What Does Stochastic Mean in Machine Learning? appeared first on Machine Learning Mastery.

]]>Stochastic refers to a variable process where the outcome involves some randomness and has some uncertainty. It is a mathematical term and is closely related to “*randomness*” and “*probabilistic*” and can be contrasted to the idea of “*deterministic*.”

The stochastic nature of machine learning algorithms is an important foundational concept in machine learning and is required to be understand in order to effectively interpret the behavior of many predictive models.

In this post, you will discover a gentle introduction to stochasticity in machine learning.

After reading this post, you will know:

- A variable or process is stochastic if there is uncertainty or randomness involved in the outcomes.
- Stochastic is a synonym for random and probabilistic, although is different from non-deterministic.
- Many machine learning algorithms are stochastic because they explicitly use randomness during optimization or learning.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Does “
*Stochastic*” Mean? - Stochastic vs. Random, Probabilistic, and Nondeterministic
- Stochastic in Machine Learning

A variable is stochastic if the occurrence of events or outcomes involves randomness or uncertainty.

… “stochastic” means that the model has some kind of randomness in it

— Page 66, Think Bayes.

A process is stochastic if it governs one or more stochastic variables.

Games are stochastic because they include an element of randomness, such as shuffling or rolling of a dice in card games and board games.

In real life, many unpredictable external events can put us into unforeseen situations. Many games mirror this unpredictability by including a random element, such as the throwing of dice. We call these stochastic games.

— Page 177, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Stochastic is commonly used to describe mathematical processes that use or harness randomness. Common examples include Brownian motion, Markov Processes, Monte Carlo Sampling, and more.

Now that we have some definitions, let’s try and add some more context by comparing stochastic with other notions of uncertainty.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section, we’ll try to better understand the idea of a variable or process being stochastic by comparing it to the related terms of “*random*,” “*probabilistic*,” and “*non-deterministic*.”

In statistics and probability, a variable is called a “random variable” and can take on one or more outcomes or events.

It is the common name used for a thing that can be measured.

In general, stochastic is a synonym for random.

For example, a stochastic variable is a random variable. A stochastic process is a random process.

Typically, random is used to refer to a lack of dependence between observations in a sequence. For example, the rolls of a fair die are random, so are the flips of a fair coin.

Strictly speaking, a random variable or a random sequence can still be summarized using a probability distribution; it just may be a uniform distribution.

We may choose to describe something as stochastic over random if we are interested in focusing on the probabilistic nature of the variable, such as a partial dependence of the next event on the current event. We may choose random over stochastic if we wish to focus attention on the independence of the events.

In general, stochastic is a synonym for probabilistic.

For example, a stochastic variable or process is probabilistic. It can be summarized and analyzed using the tools of probability.

Most notably, the distribution of events or the next event in a sequence can be described in terms of a probability distribution.

We may choose to describe a variable or process as probabilistic over stochastic if we wish to emphasize the dependence, such as if we are using a parametric model or known probability distribution to summarize the variable or sequence.

A variable or process is deterministic if the next event in the sequence can be determined exactly from the current event.

For example, a deterministic algorithm will always give the same outcome given the same input. Conversely, a non-deterministic algorithm may give different outcomes for the same input.

A stochastic variable or process is not deterministic because there is uncertainty associated with the outcome.

Nevertheless, a stochastic variable or process is also not non-deterministic because non-determinism only describes the possibility of outcomes, rather than probability.

Describing something as stochastic is a stronger claim than describing it as non-deterministic because we can use the tools of probability in analysis, such as expected outcome and variance.

… “stochastic” generally implies that uncertainty about outcomes is quantified in terms of probabilities; a nondeterministic environment is one in which actions are characterized by their possible outcomes, but no probabilities are attached to them.

— Page 43, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Many machine learning algorithms and models are described in terms of being stochastic.

This is because many optimization and learning algorithms both must operate in stochastic domains and because some algorithms make use of randomness or probabilistic decisions.

Let’s take a closer look at the source of uncertainty and the nature of stochastic algorithms in machine learning.

Stochastic domains are those that involve uncertainty.

… machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources.

— Page 54, Deep Learning, 2016.

This uncertainty can come from a target or objective function that is subjected to statistical noise or random errors.

It can also come from the fact that the data used to fit a model is an incomplete sample from a broader population.

Finally, the models chosen are rarely able to capture all of the aspects of the domain, and instead must generalize to unseen circumstances and lose some fidelity.

Stochastic optimization refers to a field of optimization algorithms that explicitly use randomness to find the optima of an objective function, or optimize an objective function that itself has randomness (statistical noise).

Most commonly, stochastic optimization algorithms seek a balance between exploring the search space and exploiting what has already been learned about the search space in order to hone in on the optima. The choice of the next locations in the search space are chosen stochastically, that is probabilistically based on what areas have been searched recently.

Stochastic hill climbing chooses at random from among the uphill moves; the probability of selection can vary with the steepness of the uphill move.

— Page 124, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Popular examples of stochastic optimization algorithms are:

- Simulated Annealing
- Genetic Algorithm
- Particle Swarm Optimization

Particle swarm optimization (PSO) is a stochastic optimization approach, modeled on the social behavior of bird flocks.

— Page 9, Computational Intelligence: An Introduction.

Most machine learning algorithms are stochastic because they make use of randomness during learning.

Using randomness is a feature, not a bug. It allows the algorithms to avoid getting stuck and achieve results that deterministic (non-stochastic) algorithms cannot achieve.

For example, some machine learning algorithms even include “*stochastic*” in their name such as:

- Stochastic Gradient Descent (optimization algorithm).
- Stochastic Gradient Boosting (ensemble algorithm).

Stochastic gradient descent optimizes the parameters of a model, such as an artificial neural network, that involves randomly shuffling the training dataset before each iteration that causes different orders of updates to the model parameters. In addition, model weights in a neural network are often initialized to a random starting point.

Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent.

— Page 98, Deep Learning, 2016.

Stochastic gradient boosting is an ensemble of decision trees algorithms. The stochastic aspect refers to the random subset of rows chosen from the training dataset used to construct trees, specifically the split points of trees.

Because many machine learning algorithms make use of randomness, their nature (e.g. behavior and performance) is also stochastic.

The stochastic nature of machine learning algorithms is most commonly seen on complex and nonlinear methods used for classification and regression predictive modeling problems.

These algorithms make use of randomness during the process of constructing a model from the training data which has the effect of fitting a different model each time same algorithm is run on the same data. In turn, the slightly different models have different performance when evaluated on a hold out test dataset.

This stochastic behavior of nonlinear machine learning algorithms is challenging for beginners who assume that learning algorithms will be deterministic, e.g. fit the same model when the algorithm is run on the same data.

This stochastic behavior requires that the performance of the model must be summarized using summary statistics that describe the mean or expected performance of the model, rather than the performance of the model from any single training run.

For more on this topic, see the post:

This section provides more resources on the topic if you are looking to go deeper.

- How to Generate Random Numbers in Python
- Introduction to Random Number Generators for Machine Learning in Python
- Embrace Randomness in Machine Learning
- Why Initialize a Neural Network with Random Weights?
- Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

- Random variable, Wikipedia.
- Statistical randomness, Wikipedia.
- Stochastic, Wikipedia.
- Stochastic process, Wikipedia.
- Stochastic optimization.

In this post, you discovered a gentle introduction to stochasticity in machine learning.

Specifically, you learned:

- A variable or process is stochastic if there is uncertainty or randomness involved in the outcomes.
- Stochastic is a synonym for random and probabilistic, although is different from non-deterministic.
- Many machine learning algorithms are stochastic because they explicitly use randomness during optimization or learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post What Does Stochastic Mean in Machine Learning? appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning appeared first on Machine Learning Mastery.

]]>Typically, estimating the entire distribution is intractable, and instead, we are happy to have the expected value of the distribution, such as the mean or mode. Maximum a Posteriori or MAP for short is a Bayesian-based approach to estimating a distribution and model parameters that best explain an observed dataset.

This flexible probabilistic framework can be used to provide a Bayesian foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively, and unlike maximum likelihood estimation, explicitly allows prior belief about candidate models to be incorporated systematically.

In this post, you will discover a gentle introduction to Maximum a Posteriori estimation.

After reading this post, you will know:

- Maximum a Posteriori estimation is a probabilistic framework for solving the problem of density estimation.
- MAP involves calculating a conditional probability of observing the data given a model weighted by a prior probability or belief about the model.
- MAP provides an alternate probability framework to maximum likelihood estimation for machine learning.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Density Estimation
- Maximum a Posteriori (MAP)
- MAP and Machine Learning

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

For example, given a sample of observation (*X*) from a domain (*x1, x2, x3, …, xn*), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explains the joint probability distribution of the observed data (*X*).

Often estimating the density is too challenging; instead, we are happy with a point estimate from the target distribution, such as the mean.

There are many techniques for solving this problem, although two common approaches are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

Both approaches frame the problem as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

- P(X ; theta)

or

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters.

The objective of Maximum Likelihood Estimation is to find the set of parameters (*theta*) that maximize the likelihood function, e.g. result in the largest likelihood value.

- maximize P(X ; theta)

An alternative and closely related approach is to consider the optimization problem from the perspective of Bayesian probability.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant of *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of *B* given *A* multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

Or, simply:

- P(A | B) = P(B | A) * P(A)

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.

We can now relate this calculation to our desire to estimate a distribution and parameters (*theta*) that best explains our dataset (*X*), as we described in the previous section. This can be stated as:

- P(theta | X) = P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the model of the distribution). As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

- maximize P(X | theta) * P(theta)

We are typically not calculating the full posterior probability distribution, and in fact, this may not be tractable for many problems of interest.

… finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.

— Page 804, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Instead, we are calculating a point estimation such as a moment of the distribution, like the mode, the most common value, which is the same as the mean for the normal distribution.

One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation.

— Page 139, Deep Learning, 2016.

**Note**: this is very similar to Maximum Likelihood Estimation, with the addition of the prior probability over the distribution and parameters.

In fact, if we assume that all values of *theta* are equally likely because we don’t have any prior information (e.g. a uniform prior), then both calculations are equivalent.

Because of this equivalence, both MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the calculation of the MLE and MAP optimization problem differ, the MLE and MAP solution found for an algorithm may also differ.

… the maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

— Page 167, Machine Learning, 1997.

In machine learning, Maximum a Posteriori optimization provides a Bayesian probability framework for fitting model parameters to training data and an alternative and sibling to the perhaps more common Maximum Likelihood Estimation framework.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.

— Page 825, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

One framework is not better than another, and as mentioned, in many cases, both frameworks frame the same optimization problem from different perspectives.

Instead, MAP is appropriate for those problems where there is some prior information, e.g. where a meaningful prior can be set to weigh the choice of different distributions and parameters or model parameters. MLE is more appropriate where there is no such prior.

Bayesian methods can be used to determine the most probable hypothesis given the data-the maximum a posteriori (MAP) hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.

— Page 197, Machine Learning, 1997.

In fact, the addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. For example, L2 is a bias or prior that assumes that a set of coefficients or weights have a small sum squared value.

… in particular, L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.

— Page 236, Deep Learning, 2016.

We can make the relationship between MAP and machine learning clearer by re-framing the optimization problem as being performed over candidate modeling hypotheses (*h* in *H*) instead of the more abstract distribution and parameters (*theta*); for example:

- maximize P(X | h) * P(h)

Here, we can see that we want a model or hypothesis (*h*) that best explains the observed training dataset (*X*) and that the prior (*P(h)*) is our belief about how useful a hypothesis is expected to be, generally, regardless of the training data. The optimization problem involves estimating the posterior probability for each candidate hypothesis.

We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

— Page 157, Machine Learning, 1997.

Like MLE, solving the optimization problem depends on the choice of model. For simpler models, like linear regression, there are analytical solutions. For more complex models like logistic regression, numerical optimization is required that makes use of first- and second-order derivatives. For the more prickly problems, stochastic optimization algorithms may be required.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 6 Bayesian Learning, Machine Learning, 1997.
- Chapter 12 Maximum Entropy Models, Foundations of Machine Learning, 2018.
- Chapter 9 Probabilistic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Chapter 13 MAP Inference, Probabilistic Graphical Models: Principles and Techniques, 2009.

In this post, you discovered a gentle introduction to Maximum a Posteriori estimation.

Specifically, you learned:

- Maximum a Posteriori estimation is a probabilistic framework for solving the problem of density estimation.
- MAP involves calculating a conditional probability of observing the data given a model weighted by a prior probability or belief about the model.
- MAP provides an alternate probability framework to maximum likelihood estimation for machine learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Markov Chain Monte Carlo for Probability appeared first on Machine Learning Mastery.

]]>Often, directly inferring values is not tractable with probabilistic models, and instead, approximation methods must be used.

Markov Chain Monte Carlo sampling provides a class of algorithms for systematic random sampling from high-dimensional probability distributions. Unlike Monte Carlo sampling methods that are able to draw independent samples from the distribution, Markov Chain Monte Carlo methods draw samples where the next sample is dependent on the existing sample, called a Markov Chain. This allows the algorithms to narrow in on the quantity that is being approximated from the distribution, even with a large number of random variables.

In this post, you will discover a gentle introduction to Markov Chain Monte Carlo for machine learning.

After reading this post, you will know:

- Monte Carlo sampling is not effective and may be intractable for high-dimensional probabilistic models.
- Markov Chain Monte Carlo provides an alternate approach to random sampling a high-dimensional probability distribution where the next sample is dependent upon the current sample.
- Gibbs Sampling and the more general Metropolis-Hastings algorithm are the two most common approaches to Markov Chain Monte Carlo sampling.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Probabilistic Inference
- What Is Markov Chain Monte Carlo
- Markov Chain Monte Carlo Algorithms

Calculating a quantity from a probabilistic model is referred to more generally as probabilistic inference, or simply inference.

For example, we may be interested in calculating an expected probability, estimating the density, or other properties of the probability distribution. This is the goal of the probabilistic model, and the name of the inference performed often takes on the name of the probabilistic model, e.g. Bayesian Inference is performed with a Bayesian probabilistic model.

The direct calculation of the desired quantity from a model of interest is intractable for all but the most trivial probabilistic models. Instead, the expected probability or density must be approximated by other means.

For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation.

— Page 523, Pattern Recognition and Machine Learning, 2006.

The desired calculation is typically a sum of a discrete distribution of many random variables or integral of a continuous distribution of many variables and is intractable to calculate. This problem exists in both schools of probability, although is perhaps more prevalent or common with Bayesian probability and integrating over a posterior distribution for a model.

Bayesians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to make inference about model parameters or to make predictions. Bayesians need to integrate over the posterior distribution of model parameters given the data, and frequentists may need to integrate over the distribution of observables given parameter values.

— Page 1, Markov Chain Monte Carlo in Practice, 1996.

The typical solution is to draw independent samples from the probability distribution, then repeat this process many times to approximate the desired quantity. This is referred to as Monte Carlo sampling or Monte Carlo integration, named for the city in Monaco that has many casinos.

The problem with Monte Carlo sampling is that it does not work well in high-dimensions. This is firstly because of the curse of dimensionality, where the volume of the sample space increases exponentially with the number of parameters (dimensions).

Secondly, and perhaps most critically, this is because Monte Carlo sampling assumes that each random sample drawn from the target distribution is independent and can be independently drawn. This is typically not the case or intractable for inference with Bayesian structured or graphical probabilistic models.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The solution to sampling probability distributions in high-dimensions is to use Markov Chain Monte Carlo, or MCMC for short.

The most popular method for sampling from high-dimensional distributions is Markov chain Monte Carlo or MCMC

— Page 837, Machine Learning: A Probabilistic Perspective, 2012.

Like Monte Carlo methods, Markov Chain Monte Carlo was first developed around the same time as the development of the first computers and was used in calculations for particle physics required as part of the Manhattan project for developing the atomic bomb.

Monte Carlo is a technique for randomly sampling a probability distribution and approximating a desired quantity.

Monte Carlo algorithms, [….] are used in many branches of science to estimate quantities that are difficult to calculate exactly.

— Page 530, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Monte Carlo methods typically assume that we can efficiently draw samples from the target distribution. From the samples that are drawn, we can then estimate the sum or integral quantity as the mean or variance of the drawn samples.

A useful way to think about a Monte Carlo sampling process is to consider a complex two-dimensional shape, such as a spiral. We cannot easily define a function to describe the spiral, but we may be able to draw samples from the domain and determine if they are part of the spiral or not. Together, a large number of samples drawn from the domain will allow us to summarize the shape (probability density) of the spiral.

Markov chain is a systematic method for generating a sequence of random variables where the current value is probabilistically dependent on the value of the prior variable. Specifically, selecting the next variable is only dependent upon the last variable in the chain.

A Markov chain is a special type of stochastic process, which deals with characterization of sequences of random variables. Special interest is paid to the dynamic and the limiting behaviors of the sequence.

— Page 113, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.

Consider a board game that involves rolling dice, such as snakes and ladders (or chutes and ladders). The roll of a die has a uniform probability distribution across 6 stages (integers 1 to 6). You have a position on the board, but your next position on the board is only based on the current position and the random roll of the dice. Your specific positions on the board form a Markov chain.

Another example of a Markov chain is a random walk in one dimension, where the possible moves are 1, -1, chosen with equal probability, and the next point on the number line in the walk is only dependent upon the current position and the randomly chosen move.

At a high level, a Markov chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk.

— Page 507, Probabilistic Graphical Models: Principles and Techniques, 2009.

Combining these two methods, Markov Chain and Monte Carlo, allows random sampling of high-dimensional probability distributions that honors the probabilistic dependence between samples by constructing a Markov Chain that comprise the Monte Carlo sample.

MCMC is essentially Monte Carlo integration using Markov chains. […] Monte Carlo integration draws samples from the the required distribution, and then forms sample averages to approximate expectations. Markov chain Monte Carlo draws these samples by running a cleverly constructed Markov chain for a long time.

— Page 1, Markov Chain Monte Carlo in Practice, 1996.

Specifically, MCMC is for performing inference (e.g. estimating a quantity or a density) for probability distributions where independent samples from the distribution cannot be drawn, or cannot be drawn easily.

As such, Monte Carlo sampling cannot be used.

Instead, samples are drawn from the probability distribution by constructing a Markov Chain, where the next sample that is drawn from the probability distribution is dependent upon the last sample that was drawn. The idea is that the chain will settle on (find equilibrium) on the desired quantity we are inferring.

Yet, we are still sampling from the target probability distribution with the goal of approximating a desired quantity, so it is appropriate to refer to the resulting collection of samples as a Monte Carlo sample, e.g. extent of samples drawn often forms one long Markov chain.

The idea of imposing a dependency between samples may seem odd at first, but may make more sense if we consider domains like the random walk or snakes and ladders games, where such dependency between samples is required.

There are many Markov Chain Monte Carlo algorithms that mostly define different ways of constructing the Markov Chain when performing each Monte Carlo sample.

The random walk provides a good metaphor for the construction of the Markov chain of samples, yet it is very inefficient. Consider the case where we may want to calculate the expected probability; it is more efficient to zoom in on that quantity or density, rather than wander around the domain. Markov Chain Monte Carlo algorithms are attempts at carefully harnessing properties of the problem in order to construct the chain efficiently.

This sequence is constructed so that, although the first sample may be generated from the prior, successive samples are generated from distributions that provably get closer and closer to the desired posterior.

— Page 505, Probabilistic Graphical Models: Principles and Techniques, 2009.

MCMC algorithms are sensitive to their starting point, and often require a warm-up phase or burn-in phase to move in towards a fruitful part of the search space, after which prior samples can be discarded and useful samples can be collected.

Additionally, it can be challenging to know whether a chain has converged and collected a sufficient number of steps. Often a very large number of samples are required and a run is stopped given a fixed number of steps.

… it is necessary to discard some of the initial samples until the Markov chain has burned in, or entered its stationary distribution.

— Page 838, Machine Learning: A Probabilistic Perspective, 2012.

The most common general Markov Chain Monte Carlo algorithm is called Gibbs Sampling; a more general version of this sampler is called the Metropolis-Hastings algorithm.

Let’s take a closer look at both methods.

The Gibbs Sampling algorithm is an approach to constructing a Markov chain where the probability of the next sample is calculated as the conditional probability given the prior sample.

Samples are constructed by changing one random variable at a time, meaning that subsequent samples are very close in the search space, e.g. local. As such, there is some risk of the chain getting stuck.

The idea behind Gibbs sampling is that we sample each variable in turn, conditioned on the values of all the other variables in the distribution.

— Page 838, Machine Learning: A Probabilistic Perspective, 2012.

Gibbs Sampling is appropriate for those probabilistic models where this conditional probability can be calculated, e.g. the distribution is discrete rather than continuous.

… Gibbs sampling is applicable only in certain circumstances; in particular, we must be able to sample from the distribution P(Xi | x-i). Although this sampling step is easy for discrete graphical models, in continuous models, the conditional distribution may not be one that has a parametric form that allows sampling, so that Gibbs is not applicable.

— Page 515, Probabilistic Graphical Models: Principles and Techniques, 2009.

The Metropolis-Hastings Algorithm is appropriate for those probabilistic models where we cannot directly sample the so-called next state probability distribution, such as the conditional probability distribution used by Gibbs Sampling.

Unlike the Gibbs chain, the algorithm does not assume that we can generate next-state samples from a particular target distribution.

— Page 517, Probabilistic Graphical Models: Principles and Techniques, 2009.

Instead, the Metropolis-Hastings algorithm involves using a surrogate or proposal probability distribution that is sampled (sometimes called the kernel), then an acceptance criterion that decides whether the new sample is accepted into the chain or discarded.

They are based on a Markov chain whose dependence on the predecessor is split into two parts: a proposal and an acceptance of the proposal. The proposals suggest an arbitrary next step in the trajectory of the chain and the acceptance makes sure the appropriate limiting direction is maintained by rejecting unwanted moves of the chain.

— Page 6, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.

The acceptance criterion is probabilistic based on how likely the proposal distribution differs from the true next-state probability distribution.

The Metropolis-Hastings Algorithm is a more general and flexible Markov Chain Monte Carlo algorithm, subsuming many other methods.

For example, if the next-step conditional probability distribution is used as the proposal distribution, then the Metropolis-Hastings is generally equivalent to the Gibbs Sampling Algorithm. If a symmetric proposal distribution is used like a Gaussian, the algorithm is equivalent to another MCMC method called the Metropolis algorithm.

This section provides more resources on the topic if you are looking to go deeper.

- Markov Chain Monte Carlo in Practice, 1996.
- Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.
- Handbook of Markov Chain Monte Carlo, 2011.
- Probabilistic Graphical Models: Principles and Techniques, 2009.

- Chapter 24 Markov chain Monte Carlo (MCMC) inference, Machine Learning: A Probabilistic Perspective, 2012.
- Section 11.2. Markov Chain Monte Carlo, Pattern Recognition and Machine Learning, 2006.
- Section 17.3 Markov Chain Monte Carlo Methods, Deep Learning, 2016.

- Monte Carlo method, Wikipedia.
- Markov chain, Wikipedia.
- Markov chain Monte Carlo, Wikipedia.
- Gibbs sampling, Wikipedia.
- Metropolis–Hastings algorithm, Wikipedia.
- MCMC sampling for dummies, 2015.
- How would you explain Markov Chain Monte Carlo (MCMC) to a layperson?

In this post, you discovered a gentle introduction to Markov Chain Monte Carlo for machine learning.

Specifically, you learned:

- Monte Carlo sampling is not effective and may be intractable for high-dimensional probabilistic models.
- Markov Chain Monte Carlo provides an alternate approach to random sampling a high-dimensional probability distribution where the next sample is dependent upon the current sample.
- Gibbs Sampling and the more general Metropolis-Hastings algorithm are the two most common approaches to Markov Chain Monte Carlo sampling.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Markov Chain Monte Carlo for Probability appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Monte Carlo Sampling for Probability appeared first on Machine Learning Mastery.

]]>There are many problem domains where describing or estimating the probability distribution is relatively straightforward, but calculating a desired quantity is intractable. This may be due to many reasons, such as the stochastic nature of the domain or an exponential number of random variables.

Instead, a desired quantity can be approximated by using random sampling, referred to as Monte Carlo methods. These methods were initially used around the time that the first computers were created and remain pervasive through all fields of science and engineering, including artificial intelligence and machine learning.

In this post, you will discover Monte Carlo methods for sampling probability distributions.

After reading this post, you will know:

- Often, we cannot calculate a desired quantity in probability, but we can define the probability distributions for the random variables directly or indirectly.
- Monte Carlo sampling a class of methods for randomly sampling from a probability distribution.
- Monte Carlo sampling provides the foundation for many machine learning methods such as resampling, hyperparameter tuning, and ensemble learning.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Need for Sampling
- What Are Monte Carlo Methods?
- Examples of Monte Carlo Methods

There are many problems in probability, and more broadly in machine learning, where we cannot calculate an analytical solution directly.

In fact, there may be an argument that exact inference may be intractable for most practical probabilistic models.

For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation.

— Page 523, Pattern Recognition and Machine Learning, 2006.

The desired calculation is typically a sum of a discrete distribution or integral of a continuous distribution and is intractable to calculate. The calculation may be intractable for many reasons, such as the large number of random variables, the stochastic nature of the domain, noise in the observations, the lack of observations, and more.

In problems of this kind, it is often possible to define or estimate the probability distributions for the random variables involved, either directly or indirectly via a computational simulation.

Instead of calculating the quantity directly, sampling can be used.

Sampling provides a flexible way to approximate many sums and integrals at reduced cost.

— Page 590, Deep Learning, 2016.

Samples can be drawn randomly from the probability distribution and used to approximate the desired quantity.

This general class of techniques for random sampling from a probability distribution is referred to as Monte Carlo methods.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Monte Carlo methods, or MC for short, are a class of techniques for randomly sampling a probability distribution.

There are three main reasons to use Monte Carlo methods to randomly sample a probability distribution; they are:

**Estimate density**, gather samples to approximate the distribution of a target function.**Approximate a quantity**, such as the mean or variance of a distribution.**Optimize a function**, locate a sample that maximizes or minimizes the target function.

Monte Carlo methods are named for the casino in Monaco and were first developed to solve problems in particle physics at around the time of the development of the first computers and the Manhattan project for developing the first atomic bomb.

This is called a Monte Carlo approximation, named after a city in Europe known for its plush gambling casinos. Monte Carlo techniques were first developed in the area of statistical physics – in particular, during development of the atomic bomb – but are now widely used in statistics and machine learning as well.

— Page 52, Machine Learning: A Probabilistic Perspective, 2012.

Drawing a sample may be as simple as calculating the probability for a randomly selected event, or may be as complex as running a computational simulation, with the latter often referred to as a Monte Carlo simulation.

Multiple samples are collected and used to approximate the desired quantity.

Given the law of large numbers from statistics, the more random trials that are performed, the more accurate the approximated quantity will become.

… the law of large numbers states that if the samples x(i) are i.i.d., then the average converges almost surely to the expected value

— Page 591, Deep Learning, 2016.

As such, the number of samples provides control over the precision of the quantity that is being approximated, often limited by the computational complexity of drawing a sample.

By generating enough samples, we can achieve any desired level of accuracy we like. The main issue is: how do we efficiently generate samples from a probability distribution, particularly in high dimensions?

— Page 815, Machine Learning: A Probabilistic Perspective, 2012.

Additionally, given the central limit theorem, the distribution of the samples will form a Normal distribution, the mean of which can be taken as the approximated quantity and the variance used to provide a confidence interval for the quantity.

The central limit theorem tells us that the distribution of the average […], converges to a normal distribution […] This allows us to estimate confidence intervals around the estimate […], using the cumulative distribution of the normal density.

— Page 592, Deep Learning, 2016.

Monte Carlo methods are defined in terms of the way that samples are drawn or the constraints imposed on the sampling process.

Some examples of Monte Carlo sampling methods include: direct sampling, importance sampling, and rejection sampling.

**Direct Sampling**. Sampling the distribution directly without prior information.**Importance Sampling**. Sampling from a simpler approximation of the target distribution.**Rejection Sampling**. Sampling from a broader distribution and only considering samples within a region of the sampled distribution.

It’s a huge topic with many books dedicated to it. Next, let’s make the idea of Monte Carlo sampling concrete with some familiar examples.

We use Monte Carlo methods all the time without thinking about it.

For example, when we define a Bernoulli distribution for a coin flip and simulate flipping a coin by sampling from this distribution, we are performing a Monte Carlo simulation. Additionally, when we sample from a uniform distribution for the integers {1,2,3,4,5,6} to simulate the roll of a dice, we are performing a Monte Carlo simulation.

We are also using the Monte Carlo method when we gather a random sample of data from the domain and estimate the probability distribution of the data using a histogram or density estimation method.

There are many examples of the use of Monte Carlo methods across a range of scientific disciplines.

For example, Monte Carlo methods can be used for:

- Calculating the probability of a move by an opponent in a complex game.
- Calculating the probability of a weather event in the future.
- Calculating the probability of a vehicle crash under specific conditions.

The methods are used to address difficult inference in problems in applied probability, such as sampling from probabilistic graphical models.

Related is the idea of sequential Monte Carlo methods used in Bayesian models that are often referred to as particle filters.

Particle filtering (PF) is a Monte Carlo, or simulation based, algorithm for recursive Bayesian inference.

— Page 823, Machine Learning: A Probabilistic Perspective, 2012.

Monte Carlo methods are also pervasive in artificial intelligence and machine learning.

Many important technologies used to accomplish machine learning goals are based on drawing samples from some probability distribution and using these samples to form a Monte Carlo estimate of some desired quantity.

— Page 590, Deep Learning, 2016.

They provide the basis for estimating the likelihood of outcomes in artificial intelligence problems via simulation, such as robotics. More simply, Monte Carlo methods are used to solve intractable integration problems, such as firing random rays in path tracing for computer graphics when rendering a computer-generated scene.

In machine learning, Monte Carlo methods provide the basis for resampling techniques like the bootstrap method for estimating a quantity, such as the accuracy of a model on a limited dataset.

The bootstrap is a simple Monte Carlo technique to approximate the sampling distribution. This is particularly useful in cases where the estimator is a complex function of the true parameters.

— Page 192, Machine Learning: A Probabilistic Perspective, 2012.

Random sampling of model hyperparameters when tuning a model is a Monte Carlo method, as are ensemble models used to overcome challenges such as the limited size and noise in a small data sample and the stochastic variance in a learning algorithm.

- Resampling algorithms.
- Random hyperparameter tuning.
- Ensemble learning algorithms.

Monte Carlo methods also provide the basis for randomized or stochastic optimization algorithms, such as the popular Simulated Annealing optimization technique.

Monte Carlo algorithms, of which simulated annealing is an example, are used in many branches of science to estimate quantities that are difficult to calculate exactly.

— Page 530, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

- Stochastic optimization algorithms.

We can make Monte Carlo sampling concrete with a worked example.

In this case, we will have a function that defines the probability distribution of a random variable. We will use a Gaussian distribution with a mean of 50 and a standard deviation of 5 and draw random samples from this distribution.

Let’s pretend we don’t know the form of the probability distribution for this random variable and we want to sample the function to get an idea of the probability density. We can draw a sample of a given size and plot a histogram to estimate the density.

The normal() NumPy function can be used to randomly draw samples from a Gaussian distribution with the specified mean (*mu*), standard deviation (*sigma*), and sample size.

To make the example more interesting, we will repeat this experiment four times with different sized samples. We would expect that as the size of the sample is increased, the probability density will better approximate the true density of the target function, given the law of large numbers.

The complete example is listed below.

# example of effect of size on monte carlo sample from numpy.random import normal from matplotlib import pyplot # define the distribution mu = 50 sigma = 5 # generate monte carlo samples of differing size sizes = [10, 50, 100, 1000] for i in range(len(sizes)): # generate sample sample = normal(mu, sigma, sizes[i]) # plot histogram of sample pyplot.subplot(2, 2, i+1) pyplot.hist(sample, bins=20) pyplot.title('%d samples' % sizes[i]) pyplot.xticks([]) # show the plot pyplot.show()

Running the example creates four differently sized samples and plots a histogram for each.

We can see that the small sample sizes of 10 and 50 do not effectively capture the density of the target function. We can see that 100 samples is better, but it is not until 1,000 samples that we clearly see the familiar bell-shape of the Gaussian probability distribution.

This highlights the need to draw many samples, even for a simple random variable, and the benefit of increased accuracy of the approximation with the number of samples drawn.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 29 Monte Carlo Methods, Information Theory, Inference and Learning Algorithms, 2003.
- Chapter 27 Sampling, Bayesian Reasoning and Machine Learning, 2011.
- Section 14.5 Approximate Inference In Bayesian Networks, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
- Chapter 23 Monte Carlo inference, Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 11 Sampling Methods, Pattern Recognition and Machine Learning, 2006.
- Chapter 17 Monte Carlo Methods, Deep Learning, 2016.

- Sampling (statistics), Wikipedia.
- Monte Carlo method, Wikipedia.
- Monte Carlo integration, Wikipedia.
- Importance sampling, Wikipedia.
- Rejection sampling, Wikipedia.

In this post, you discovered Monte Carlo methods for sampling probability distributions.

Specifically, you learned:

- Often, we cannot calculate a desired quantity in probability, but we can define the probability distributions for the random variables directly or indirectly.
- Monte Carlo sampling a class of methods for randomly sampling from a probability distribution.
- Monte Carlo sampling provides the foundation for many machine learning methods such as resampling, hyperparameter tuning, and ensemble learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Monte Carlo Sampling for Probability appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Expectation-Maximization (EM Algorithm) appeared first on Machine Learning Mastery.

]]>It is a general and effective approach that underlies many machine learning algorithms, although it requires that the training dataset is complete, e.g. all relevant interacting random variables are present. Maximum likelihood becomes intractable if there are variables that interact with those in the dataset but were hidden or not observed, so-called latent variables.

The expectation-maximization algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values for the latent variables, then optimizing the model, then repeating these two steps until convergence. It is an effective and general approach and is most commonly used for density estimation with missing data, such as clustering algorithms like the Gaussian Mixture Model.

In this post, you will discover the expectation-maximization algorithm.

After reading this post, you will know:

- Maximum likelihood estimation is challenging on data in the presence of latent variables.
- Expectation maximization provides an iterative solution to maximum likelihood estimation with latent variables.
- Gaussian mixture models are an approach to density estimation where the parameters of the distributions are fit using the expectation-maximization algorithm.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Nov/2019**: Fixed typo in code comment (thanks Daniel)

This tutorial is divided into four parts; they are:

- Problem of Latent Variables for Maximum Likelihood
- Expectation-Maximization Algorithm
- Gaussian Mixture Model and the EM Algorithm
- Example of Gaussian Mixture Model

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data.

There are many techniques for solving this problem, although a common approach is called maximum likelihood estimation, or simply “*maximum likelihood*.”

Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample.

A limitation of maximum likelihood estimation is that it assumes that the dataset is complete, or fully observed. This does not mean that the model has access to all data; instead, it assumes that all variables that are relevant to the problem are present.

This is not always the case. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden.

More generally, these unobserved or hidden variables are referred to as latent variables.

Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning.

— Page 816, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Conventional maximum likelihood estimation does not work well in the presence of latent variables.

… if we have missing data and/or latent variables, then computing the [maximum likelihood] estimate becomes hard.

— Page 349, Machine Learning: A Probabilistic Perspective, 2012.

Instead, an alternate formulation of maximum likelihood is required for searching for the appropriate model parameters in the presence of latent variables.

The Expectation-Maximization algorithm is one such approach.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Expectation-Maximization Algorithm, or EM algorithm for short, is an approach for maximum likelihood estimation in the presence of latent variables.

A general technique for finding maximum likelihood estimators in latent variable models is the expectation-maximization (EM) algorithm.

— Page 424, Pattern Recognition and Machine Learning, 2006.

The EM algorithm is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables, called the estimation-step or E-step. The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization-step or M-step.

**E-Step**. Estimate the missing variables in the dataset.**M-Step**. Maximize the parameters of the model in the presence of the data.

The EM algorithm can be applied quite widely, although is perhaps most well known in machine learning for use in unsupervised learning problems, such as density estimation and clustering.

Perhaps the most discussed application of the EM algorithm is for clustering with a mixture model.

A mixture model is a model comprised of an unspecified combination of multiple probability distribution functions.

A statistical procedure or learning algorithm is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset.

The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each.

There are many techniques for estimating the parameters for a GMM, although a maximum likelihood estimate is perhaps the most common.

Consider the case where a dataset is comprised of many points that happen to be generated by two different processes. The points for each process have a Gaussian probability distribution, but the data is combined and the distributions are similar enough that it is not obvious to which distribution a given point may belong.

The processes used to generate the data point represents a latent variable, e.g. process 0 and process 1. It influences the data but is not observable. As such, the EM algorithm is an appropriate approach to use to estimate the parameters of the distributions.

In the EM algorithm, the estimation-step would estimate a value for the process latent variable for each data point, and the maximization step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data. The process is repeated until a good set of latent values and a maximum likelihood is achieved that fits the data.

**E-Step**. Estimate the expected value for each latent variable.**M-Step**. Optimize the parameters of the distribution using maximum likelihood.

We can imagine how this optimization procedure could be constrained to just the distribution means, or generalized to a mixture of many different Gaussian distributions.

We can make the application of the EM algorithm to a Gaussian Mixture Model concrete with a worked example.

First, let’s contrive a problem where we have a dataset where points are generated from one of two Gaussian processes. The points are one-dimensional, the mean of the first distribution is 20, the mean of the second distribution is 40, and both distributions have a standard deviation of 5.

We will draw 3,000 points from the first process and 7,000 points from the second process and mix them together.

... # generate a sample X1 = normal(loc=20, scale=5, size=3000) X2 = normal(loc=40, scale=5, size=7000) X = hstack((X1, X2))

We can then plot a histogram of the points to give an intuition for the dataset. We expect to see a bimodal distribution with a peak for each of the means of the two distributions.

The complete example is listed below.

# example of a bimodal constructed from two gaussian processes from numpy import hstack from numpy.random import normal from matplotlib import pyplot # generate a sample X1 = normal(loc=20, scale=5, size=3000) X2 = normal(loc=40, scale=5, size=7000) X = hstack((X1, X2)) # plot the histogram pyplot.hist(X, bins=50, density=True) pyplot.show()

Running the example creates the dataset and then creates a histogram plot for the data points.

The plot clearly shows the expected bimodal distribution with a peak for the first process around 20 and a peak for the second process around 40.

We can see that for many of the points in the middle of the two peaks that it is ambiguous as to which distribution they were drawn from.

We can model the problem of estimating the density of this dataset using a Gaussian Mixture Model.

The GaussianMixture scikit-learn class can be used to model this problem and estimate the parameters of the distributions using the expectation-maximization algorithm.

The class allows us to specify the suspected number of underlying processes used to generate the data via the *n_components* argument when defining the model. We will set this to 2 for the two processes or distributions.

If the number of processes was not known, a range of different numbers of components could be tested and the model with the best fit could be chosen, where models could be evaluated using scores such as Akaike or Bayesian Information Criterion (AIC or BIC).

There are also many ways we can configure the model to incorporate other information we may know about the data, such as how to estimate initial values for the distributions. In this case, we will randomly guess the initial parameters, by setting the *init_params* argument to ‘random’.

... # fit model model = GaussianMixture(n_components=2, init_params='random') model.fit(X)

Once the model is fit, we can access the learned parameters via arguments on the model, such as the means, covariances, mixing weights, and more.

More usefully, we can use the fit model to estimate the latent parameters for existing and new data points.

For example, we can estimate the latent variable for the points in the training dataset and we would expect the first 3,000 points to belong to one process (e.g. *value=1*) and the next 7,000 data points to belong to a different process (e.g. *value=0*).

... # predict latent values yhat = model.predict(X) # check latent value for first few points print(yhat[:100]) # check latent value for last few points print(yhat[-100:])

Tying all of this together, the complete example is listed below.

# example of fitting a gaussian mixture model with expectation maximization from numpy import hstack from numpy.random import normal from sklearn.mixture import GaussianMixture # generate a sample X1 = normal(loc=20, scale=5, size=3000) X2 = normal(loc=40, scale=5, size=7000) X = hstack((X1, X2)) # reshape into a table with one column X = X.reshape((len(X), 1)) # fit model model = GaussianMixture(n_components=2, init_params='random') model.fit(X) # predict latent values yhat = model.predict(X) # check latent value for first few points print(yhat[:100]) # check latent value for last few points print(yhat[-100:])

Running the example fits the Gaussian mixture model on the prepared dataset using the EM algorithm. Once fit, the model is used to predict the latent variable values for the examples in the training dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that at least for the first few and last few examples in the dataset, that the model mostly predicts the correct value for the latent variable. It’s a generally challenging problem and it is expected that the points between the peaks of the distribution will remain ambiguous and assigned to one process or another holistically.

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] [0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

This section provides more resources on the topic if you are looking to go deeper.

- Section 8.5 The EM Algorithm, The Elements of Statistical Learning, 2016.
- Chapter 9 Mixture Models and EM, Pattern Recognition and Machine Learning, 2006.
- Section 6.12 The EM Algorithm, Machine Learning, 1997.
- Chapter 11 Mixture models and the EM algorithm, Machine Learning: A Probabilistic Perspective, 2012.
- Section 9.3 Clustering And Probability Density Estimation, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Section 20.3 Learning With Hidden Variables: The EM Algorithm, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

- Maximum likelihood estimation, Wikipedia.
- Expectation-maximization algorithm, Wikipedia.
- Mixture model, Wikipedia.

In this post, you discovered the expectation-maximization algorithm.

Specifically, you learned:

- Maximum likelihood estimation is challenging on data in the presence of latent variables.
- Expectation maximization provides an iterative solution to maximum likelihood estimation with latent variables.
- Gaussian mixture models are an approach to density estimation where the parameters of the distributions are fit using the expectation-maximization algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Expectation-Maximization (EM Algorithm) appeared first on Machine Learning Mastery.

]]>The post Probabilistic Model Selection with AIC, BIC, and MDL appeared first on Machine Learning Mastery.

]]>It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation.

An alternative approach to model selection involves using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model. Examples include the Akaike and Bayesian Information Criterion and the Minimum Description Length.

The benefit of these information criterion statistics is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end-up selecting models that are too simple.

In this post, you will discover probabilistic statistics for machine learning model selection.

After reading this post, you will know:

- Model selection is the challenge of choosing one among a set of candidate models.
- Akaike and Bayesian Information Criterion are two ways of scoring a model based on its log-likelihood and complexity.
- Minimum Description Length provides another scoring method from information theory that can be shown to be equivalent to BIC.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Challenge of Model Selection
- Probabilistic Model Selection
- Akaike Information Criterion
- Bayesian Information Criterion
- Minimum Description Length

Model selection is the process of fitting multiple models on a given dataset and choosing one over all others.

Model selection: estimating the performance of different models in order to choose the best one.

— Page 222, The Elements of Statistical Learning, 2016.

This may apply in unsupervised learning, e.g. choosing a clustering model, or supervised learning, e.g. choosing a predictive model for a regression or classification task. It may also be a sub-task of modeling, such as feature selection for a given model.

There are many common approaches that may be used for model selection. For example, in the case of supervised learning, the three most common approaches are:

- Train, Validation, and Test datasets.
- Resampling Methods.
- Probabilistic Statistics.

The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on the validation dataset, and selecting a model that performs the best on the test dataset according to a chosen metric, such as accuracy or error. A problem with this approach is that it requires a lot of data.

Resampling techniques attempt to achieve the same as the train/val/test approach to model selection, although using a small dataset. An example is k-fold cross-validation where a training set is split into many train/test pairs and a model is fit and evaluated on each. This is repeated for each model and a model is selected with the best average score across the k-folds. A problem with this and the prior approach is that only model performance is assessed, regardless of model complexity.

A third approach to model selection attempts to combine the complexity of the model with the performance of the model into a score, then select the model that minimizes or maximizes the score.

We can refer to this approach as statistical or probabilistic model selection as the scoring method uses a probabilistic framework.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.

Models are scored both on their performance on the training dataset and based on the complexity of the model.

**Model Performance**. How well a candidate model has performed on the training dataset.**Model Complexity**. How complicated the trained candidate model is after training.

Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation. Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

— Page 33, Pattern Recognition and Machine Learning, 2006.

A benefit of probabilistic model selection methods is that a test dataset is not required, meaning that all of the data can be used to fit the model, and the final model that will be used for prediction in the domain can be scored directly.

A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model.

It should be noted that the AIC statistic is designed for preplanned comparisons between models (as opposed to comparisons of many models during automated searches).

— Page 493, Applied Predictive Modeling, 2013.

A further limitation of these selection methods is that they do not take the uncertainty of the model into account.

Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favour overly simple models.

— Page 33, Pattern Recognition and Machine Learning, 2006.

There are three statistical approaches to estimating how well a given model fits a dataset and how complex the model is. And each can be shown to be equivalent or proportional to each other, although each was derived from a different framing or field of study.

They are:

- Akaike Information Criterion (AIC). Derived from frequentist probability.
- Bayesian Information Criterion (BIC). Derived from Bayesian probability.
- Minimum Description Length (MDL). Derived from information theory.

Each statistic can be calculated using the log-likelihood for a model and the data. Log-likelihood comes from Maximum Likelihood Estimation, a technique for finding or optimizing the parameters of a model in response to a training dataset.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

- P(X ; theta)

Where *X* is, in fact, the joint probability distribution of all observations from the problem domain from 1 to n.

- P(x1, x2, x3, …, xn ; theta)

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the natural log conditional probability.

- sum i to n log(P(xi ; theta))

Given the frequent use of log in the likelihood function, it is commonly referred to as a log-likelihood function.

The log-likelihood function for common predictive modeling problems include the mean squared error for regression (e.g. linear regression) and log loss (binary cross-entropy) for binary classification (e.g. logistic regression).

We will take a closer look at each of the three statistics, AIC, BIC, and MDL, in the following sections.

The Akaike Information Criterion, or AIC for short, is a method for scoring and selecting a model.

It is named for the developer of the method, Hirotugu Akaike, and may be shown to have a basis in information theory and frequentist-based inference.

This is derived from a frequentist framework, and cannot be interpreted as an approximation to the marginal likelihood.

— Page 162, Machine Learning: A Probabilistic Perspective, 2012.

The AIC statistic is defined for logistic regression as follows (taken from “The Elements of Statistical Learning“):

- AIC = -2/N * LL + 2 * k/N

Where *N* is the number of examples in the training dataset, *LL* is the log-likelihood of the model on the training dataset, and *k* is the number of parameters in the model.

The score, as defined above, is minimized, e.g. the model with the lowest AIC is selected.

To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered.

— Page 231, The Elements of Statistical Learning, 2016.

Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.

We see that the penalty for AIC is less than for BIC. This causes AIC to pick more complex models.

— Page 162, Machine Learning: A Probabilistic Perspective, 2012.

The Bayesian Information Criterion, or BIC for short, is a method for scoring and selecting a model.

It is named for the field of study from which it was derived: Bayesian probability and inference. Like AIC, it is appropriate for models fit under the maximum likelihood estimation framework.

The BIC statistic is calculated for logistic regression as follows (taken from “The Elements of Statistical Learning“):

- BIC = -2 * LL + log(N) * k

Where *log()* has the base-e called the natural logarithm, *LL* is the log-likelihood of the model, *N* is the number of examples in the training dataset, and *k* is the number of parameters in the model.

The score as defined above is minimized, e.g. the model with the lowest BIC is selected.

The quantity calculated is different from AIC, although can be shown to be proportional to the AIC. Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected.

Note that, compared to AIC […], this penalizes model complexity more heavily.

— Page 217, Pattern Recognition and Machine Learning, 2006.

Importantly, the derivation of BIC under the Bayesian probability framework means that if a selection of candidate models includes a true model for the dataset, then the probability that BIC will select the true model increases with the size of the training dataset. This cannot be said for the AIC score.

… given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N -> infinity.

— Page 235, The Elements of Statistical Learning, 2016.

A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple.

The Minimum Description Length, or MDL for short, is a method for scoring and selecting a model.

It is named for the field of study from which it was derived, namely information theory.

Information theory is concerned with the representation and transmission of information on a noisy channel, and as such, measures quantities like entropy, which is the average number of bits required to represent an event from a random variable or probability distribution.

From an information theory perspective, we may want to transmit both the predictions (or more precisely, their probability distributions) and the model used to generate them. Both the predicted target variable and the model can be described in terms of the number of bits required to transmit them on a noisy channel.

The Minimum Description Length is the minimum number of bits, or the minimum of the sum of the number of bits required to represent the data and the model.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.

— Page 173, Machine Learning, 1997.

The MDL statistic is calculated as follows (taken from “Machine Learning“):

- MDL = L(h) + L(D | h)

Where *h* is the model, *D* is the predictions made by the model, *L(h)* is the number of bits required to represent the model, and *L(D | h)* is the number of bits required to represent the predictions from the model on the training dataset.

The score as defined above is minimized, e.g. the model with the lowest MDL is selected.

The number of bits required to encode (*D | h*) and the number of bits required to encode (*h*) can be calculated as the negative log-likelihood; for example (taken from “The Elements of Statistical Learning“):

- MDL = -log(P(theta)) – log(P(y | X, theta))

Or the negative log-likelihood of the model parameters (*theta*) and the negative log-likelihood of the target values (*y*) given the input values (*X*) and the model parameters (*theta*).

This desire to minimize the encoding of the model and its predictions is related to the notion of Occam’s Razor that seeks the simplest (least complex) explanation: in this context, the least complex model that predicts the target variable.

The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory …

— Page 198, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations.

Hence the BIC criterion, derived as approximation to log-posterior probability, can also be viewed as a device for (approximate) model choice by minimum description length.

— Page 236, The Elements of Statistical Learning, 2016.

We can make the calculation of AIC and BIC concrete with a worked example.

In this section, we will use a test problem and fit a linear regression model, then evaluate the model using the AIC and BIC metrics.

Importantly, the specific functional form of AIC and BIC for a linear regression model has previously been derived, making the example relatively straightforward. In adapting these examples for your own algorithms, it is important to either find an appropriate derivation of the calculation for your model and prediction problem or look into deriving the calculation yourself.

In this example, we will use a test regression problem provided by the make_regression() scikit-learn function. The problem will have two input variables and require the prediction of a target numerical value.

... # generate dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # define and fit the model on all data

We will fit a LinearRegression() model on the entire dataset directly.

... # define and fit the model on all data model = LinearRegression() model.fit(X, y)

Once fit, we can report the number of parameters in the model, which, given the definition of the problem, we would expect to be three (two coefficients and one intercept).

... # number of parameters num_params = len(model.coef_) + 1 print('Number of parameters: %d' % (num_params))

The likelihood function for a linear regression model can be shown to be identical to the least squares function; therefore, we can estimate the maximum likelihood of the model via the mean squared error metric.

First, the model can be used to estimate an outcome for each example in the training dataset, then the mean_squared_error() scikit-learn function can be used to calculate the mean squared error for the model.

... # predict the training set yhat = model.predict(X) # calculate the error mse = mean_squared_error(y, yhat) print('MSE: %.3f' % mse)

Tying this all together, the complete example of defining the dataset, fitting the model, and reporting the number of parameters and maximum likelihood estimate of the model is listed below.

# generate a test dataset and fit a linear regression model from sklearn.datasets import make_regression from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error # generate dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # define and fit the model on all data model = LinearRegression() model.fit(X, y) # number of parameters num_params = len(model.coef_) + 1 print('Number of parameters: %d' % (num_params)) # predict the training set yhat = model.predict(X) # calculate the error mse = mean_squared_error(y, yhat) print('MSE: %.3f' % mse)

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example first reports the number of parameters in the model as 3, as we expected, then reports the MSE as about 0.01.

Number of parameters: 3 MSE: 0.010

Next, we can adapt the example to calculate the AIC for the model.

Skipping the derivation, the AIC calculation for an ordinary least squares linear regression model can be calculated as follows (taken from “A New Look At The Statistical Identification Model“, 1974.):

- AIC = n * LL + 2 * k

Where *n* is the number of examples in the training dataset, *LL* is the log-likelihood for the model using the natural logarithm (e.g. the log of the MSE), and *k* is the number of parameters in the model.

The *calculate_aic()* function below implements this, taking *n*, the raw mean squared error (*mse*), and *k* as arguments.

# calculate aic for regression def calculate_aic(n, mse, num_params): aic = n * log(mse) + 2 * num_params return aic

The example can then be updated to make use of this new function and calculate the AIC for the model.

The complete example is listed below.

# calculate akaike information criterion for a linear regression model from math import log from sklearn.datasets import make_regression from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error # calculate aic for regression def calculate_aic(n, mse, num_params): aic = n * log(mse) + 2 * num_params return aic # generate dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # define and fit the model on all data model = LinearRegression() model.fit(X, y) # number of parameters num_params = len(model.coef_) + 1 print('Number of parameters: %d' % (num_params)) # predict the training set yhat = model.predict(X) # calculate the error mse = mean_squared_error(y, yhat) print('MSE: %.3f' % mse) # calculate the aic aic = calculate_aic(len(y), mse, num_params) print('AIC: %.3f' % aic)

Running the example reports the number of parameters and MSE as before and then reports the AIC.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the AIC is reported to be a value of about -451.616. This value can be minimized in order to choose better models.

Number of parameters: 3 MSE: 0.010 AIC: -451.616

We can also explore the same example with the calculation of BIC instead of AIC.

Skipping the derivation, the BIC calculation for an ordinary least squares linear regression model can be calculated as follows (taken from here):

- BIC = n * LL + k * log(n)

Where n is the number of examples in the training dataset, *LL* is the log-likelihood for the model using the natural logarithm (e.g. log of the mean squared error), and *k* is the number of parameters in the model, and *log()* is the natural logarithm.

The *calculate_bic()* function below implements this, taking *n*, the raw mean squared error (*mse*), and *k* as arguments.

# calculate bic for regression def calculate_bic(n, mse, num_params): bic = n * log(mse) + num_params * log(n) return bic

The example can then be updated to make use of this new function and calculate the BIC for the model.

The complete example is listed below.

# calculate bayesian information criterion for a linear regression model from math import log from sklearn.datasets import make_regression from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error # calculate bic for regression def calculate_bic(n, mse, num_params): bic = n * log(mse) + num_params * log(n) return bic # generate dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # define and fit the model on all data model = LinearRegression() model.fit(X, y) # number of parameters num_params = len(model.coef_) + 1 print('Number of parameters: %d' % (num_params)) # predict the training set yhat = model.predict(X) # calculate the error mse = mean_squared_error(y, yhat) print('MSE: %.3f' % mse) # calculate the bic bic = calculate_bic(len(y), mse, num_params) print('BIC: %.3f' % bic)

Running the example reports the number of parameters and MSE as before and then reports the BIC.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the BIC is reported to be a value of about -450.020, which is very close to the AIC value of -451.616. Again, this value can be minimized in order to choose better models.

Number of parameters: 3 MSE: 0.010 BIC: -450.020

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Model Assessment and Selection, The Elements of Statistical Learning, 2016.
- Section 1.3 Model Selection, Pattern Recognition and Machine Learning, 2006.
- Section 4.4.1 Model comparison and BIC, Pattern Recognition and Machine Learning, 2006.
- Section 6.6 Minimum Description Length Principle, Machine Learning, 1997.
- Section 5.3.2.4 BIC approximation to log marginal likelihood, Machine Learning: A Probabilistic Perspective, 2012.
- Applied Predictive Modeling, 2013.
- Section 28.3 Minimum description length (MDL), Information Theory, Inference and Learning Algorithms, 2003.
- Section 5.10 The MDL Principle, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- sklearn.datasets.make_regression API.
- sklearn.linear_model.LinearRegression API.
- sklearn.metrics.mean_squared_error API.

- Akaike information criterion, Wikipedia.
- Bayesian information criterion, Wikipedia.
- Minimum description length, Wikipedia.

In this post, you discovered probabilistic statistics for machine learning model selection.

Specifically, you learned:

- Model selection is the challenge of choosing one among a set of candidate models.
- Akaike and Bayesian Information Criterion are two ways of scoring a model based on its log-likelihood and complexity.
- Minimum Description Length provides another scoring method from information theory that can be shown to be equivalent to BIC.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Probabilistic Model Selection with AIC, BIC, and MDL appeared first on Machine Learning Mastery.

]]>