
Probability is a field of mathematics that quantifies uncertainty.

It is undeniably a pillar of the field of machine learning, and many recommend it as a prerequisite subject to study prior to getting started. This is misleading advice, as probability makes more sense to a practitioner once they have the context of the applied machine learning process in which to interpret it.

In this post, you will discover why machine learning practitioners should study probabilities to improve their skills and capabilities.

After reading this post, you will know:

- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Discover Bayesian optimization, naive Bayes, maximum likelihood, distributions, cross-entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

## Overview

This tutorial is divided into seven parts; they are:

- Reasons to NOT Learn Probability
- Class Membership Requires Predicting a Probability
- Some Algorithms Are Designed Using Probability
- Models Are Trained Using a Probabilistic Framework
- Models Can Be Tuned With a Probabilistic Framework
- Probabilistic Measures Are Used to Evaluate Model Skill
- One More Reason

## Reasons to NOT Learn Probability

Before we go through the reasons that you should learn probability, let’s start off by taking a small look at the reason why you should not.

I think you should not study probability if you are just getting started with applied machine learning.

- **It’s not required**. Having an appreciation for the abstract theory that underlies some machine learning algorithms is not required in order to use machine learning as a tool to solve problems.
- **It’s slow**. Taking months to years to study an entire related field before starting machine learning will delay you achieving your goals of being able to work through predictive modeling problems.
- **It’s a huge field**. Not all of probability is relevant to theoretical machine learning, let alone applied machine learning.

I recommend a breadth-first approach to getting started in applied machine learning.

I call this the results-first approach. It is where you start by learning and practicing the steps for working through a predictive modeling problem end-to-end (e.g. how to get results) with a tool (such as scikit-learn and Pandas in Python).

This process then provides the skeleton and context for progressively deepening your knowledge, such as how algorithms work and, eventually, the math that underlies them.

After you know how to work through a predictive modeling problem, let’s look at why you should deepen your understanding of probability.

## 1. Class Membership Requires Predicting a Probability

Classification predictive modeling problems are those where an example is assigned a given label.

An example that you may be familiar with is the iris flowers dataset where we have four measurements of a flower and the goal is to assign one of three different known species of iris flower to the observation.

We can model the problem as directly assigning a class label to each observation.

- **Input**: Measurements of a flower.
- **Output**: One iris species.

A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.

- **Input**: Measurements of a flower.
- **Output**: Probability of membership to each iris species.

Framing the problem as a prediction of class membership probabilities simplifies the modeling problem and makes it easier for a model to learn. It also allows the model to capture ambiguity in the data, which lets a downstream process, such as the user, interpret the probabilities in the context of the domain.

The probabilities can be transformed into a crisp class label by choosing the class with the largest probability. The probabilities can also be scaled or transformed using a probability calibration process.
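As a minimal sketch of that last step, the snippet below picks the class with the largest probability. The probability values and the helper name `to_crisp_label` are made up for illustration; in practice the probabilities would come from a fitted model.

```python
# Hypothetical predicted probabilities for a single flower (illustrative values only).
probabilities = {"setosa": 0.05, "versicolor": 0.70, "virginica": 0.25}


def to_crisp_label(probs):
    """Return the class whose predicted probability is largest."""
    return max(probs, key=probs.get)


label = to_crisp_label(probabilities)
print(label)
```

Keeping the full dictionary of probabilities around, rather than only the label, is what lets a downstream user see that the model still gave "virginica" a one-in-four chance.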

Both the choice of a class membership framing of the problem and the interpretation of the predictions made by the model require a basic understanding of probability.

## 2. Models Are Designed Using Probability

There are algorithms that are specifically designed to harness the tools and methods from probability.

One example of an individual algorithm is Naive Bayes, which is constructed using Bayes Theorem with some simplifying assumptions.
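To make the idea concrete, here is a small sketch of the Naive Bayes calculation on a toy two-class problem. The classes, words, and every probability value are hypothetical and hand-picked; a real implementation would estimate them from training data.

```python
# Hypothetical, hand-picked probabilities for a toy two-class problem.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.5},
}


def naive_bayes_posterior(observed_words):
    """Apply Bayes Theorem, treating words as independent given the class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior  # P(class)
        for word in observed_words:
            score *= likelihoods[cls][word]  # times P(word | class)
        scores[cls] = score
    # Normalize so the posterior probabilities sum to one.
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


posterior = naive_bayes_posterior(["offer"])
print(posterior)
```

The "naive" simplifying assumption is the inner loop: the likelihood of each word is multiplied in separately, as if words did not depend on each other once the class is known.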

The linear regression algorithm can be seen as a probabilistic model that minimizes the mean squared error of predictions, and the logistic regression algorithm can be seen as a probabilistic model that minimizes the negative log likelihood of predicting the positive class label.

- Linear Regression
- Logistic Regression

It also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGM for short, and designed around Bayes Theorem.

A notable graphical model is Bayesian Belief Networks or Bayes Nets, which are capable of capturing the conditional dependencies between variables.

## 3. Models Are Trained With Probabilistic Frameworks

Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework.

Some examples of general probabilistic modeling frameworks are:

- Maximum Likelihood Estimation (Frequentist).
- Maximum a Posteriori Estimation (Bayesian).

Perhaps the most common is the framework of maximum likelihood estimation, sometimes shortened to MLE. This is a framework for estimating model parameters (e.g. weights) given observed data.
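A small sketch can make both frameworks concrete for the simplest possible model, a biased coin. The data (7 heads in 10 flips) and the Beta(2, 2) prior are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def bernoulli_log_likelihood(p, successes, trials):
    """Log-likelihood of the observed flips for a coin with heads probability p."""
    failures = trials - successes
    return successes * math.log(p) + failures * math.log(1.0 - p)


def bernoulli_mle(successes, trials):
    """Closed-form maximum likelihood estimate: the observed frequency."""
    return successes / trials


def bernoulli_map(successes, trials, alpha, beta):
    """Mode of the Beta(alpha + successes, beta + failures) posterior.

    Valid when both posterior parameters are greater than 1.
    """
    return (successes + alpha - 1) / (trials + alpha + beta - 2)


successes, trials = 7, 10  # made-up data

# Brute-force check: search a grid of candidate values for the highest likelihood.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, successes, trials))

mle = bernoulli_mle(successes, trials)
map_estimate = bernoulli_map(successes, trials, alpha=2, beta=2)
print(best_on_grid, mle, round(map_estimate, 3))
```

The grid search and the closed form agree on the maximum likelihood estimate, while the maximum a posteriori estimate is pulled toward 0.5 by the assumed prior.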

This is the framework that underlies the ordinary least squares estimate of a linear regression model and the log loss estimate for logistic regression.
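The link to least squares can be checked numerically. The sketch below assumes a one-parameter model `y = slope * x` with Gaussian noise of fixed standard deviation, and uses made-up data; under those assumptions the slope that minimizes mean squared error is also the slope that minimizes the negative log-likelihood.

```python
import math

# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def sum_squared_error(slope):
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))


def mean_squared_error(slope):
    return sum_squared_error(slope) / len(xs)


def gaussian_nll(slope, sigma=1.0):
    """Negative log-likelihood assuming y ~ Normal(slope * x, sigma^2)."""
    n = len(xs)
    constant = 0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
    return constant + sum_squared_error(slope) / (2.0 * sigma ** 2)


# Candidate slopes from 1.50 to 2.50 in steps of 0.01.
candidates = [1.5 + 0.01 * i for i in range(101)]
best_by_mse = min(candidates, key=mean_squared_error)
best_by_nll = min(candidates, key=gaussian_nll)
print(round(best_by_mse, 2), round(best_by_nll, 2))
```

The negative log-likelihood is just the squared error scaled and shifted by constants that do not depend on the slope, which is why both criteria select the same candidate.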

The expectation-maximization algorithm, or EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g. fitting a mixture of k distributions to k clusters. The k-Means clustering algorithm, which estimates k means for k clusters, is closely related and can be viewed as a hard-assignment special case.

For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and predicted probability distribution. This is used in classification algorithms like logistic regression as well as deep learning neural networks.

It is common to measure this difference between probability distributions during training using cross-entropy. Entropy, cross-entropy, and the KL divergence between distributions come from the field of information theory, which builds directly upon probability theory. For example, the information of an event is calculated as the negative log of its probability, and entropy is the expected value of this quantity over a distribution.
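These three quantities are short to compute by hand. The sketch below uses base-2 logarithms (so results are in bits) and two made-up discrete distributions, `p` for the observed distribution and `q` for the predicted one.

```python
import math


def entropy(p):
    """Expected information, in bits, of events drawn from p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Average bits needed to encode events from p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra bits paid for using q in place of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up distributions over three events.
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```

Cross-entropy equals the entropy of `p` plus the KL divergence from `p` to `q`. Because the entropy of the observed distribution is fixed, minimizing cross-entropy during training is the same as minimizing the KL divergence.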

As such, tools from information theory, such as minimizing cross-entropy loss, can be seen as another probabilistic framework for model estimation.

- Minimum Cross-Entropy Loss Estimation

## 4. Models Are Tuned With a Probabilistic Framework

It is common to tune the hyperparameters of a machine learning model, such as k for kNN or the learning rate in a neural network.

Typical approaches include grid searching ranges of hyperparameters or randomly sampling hyperparameter combinations.
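As a rough sketch of how the two typical approaches differ, the snippet below enumerates a grid and draws random configurations for a hypothetical kNN search space. The parameter names, ranges, and seed are assumptions for illustration; each configuration would then be scored, for example with cross-validation.

```python
import itertools
import random

# Hypothetical search space for a kNN model.
grid = {"k": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# Grid search: every combination of the listed values.
grid_configs = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]

# Random search: sample each hyperparameter independently from its range.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
random_configs = [
    {"k": rng.randint(1, 25), "weights": rng.choice(["uniform", "distance"])}
    for _ in range(5)
]

print(len(grid_configs), len(random_configs))
```

Grid search is exhaustive over the values listed, while random search can cover a wider range for the same budget. Neither uses the results of earlier evaluations to decide what to try next.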

Bayesian optimization is a more efficient approach to hyperparameter optimization that involves a directed search of the space of possible configurations, focusing on those configurations that are most likely to result in better performance. As its name suggests, the approach was devised from and harnesses Bayes Theorem when sampling the space of possible configurations.

## 5. Models Are Evaluated With Probabilistic Measures

For those algorithms where a prediction of probabilities is made, evaluation measures are required to summarize the performance of the model.

There are many measures used to summarize the performance of a model based on predicted probabilities. Common examples include:

- Log Loss (also called cross-entropy).
- Brier Score, and the Brier Skill Score
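The measures listed above can be computed directly from their definitions. The sketch below uses made-up labels and predicted probabilities for a binary task, and takes "always predict the base rate" as the reference forecast for the Brier Skill Score, which is one common choice rather than the only one.

```python
import math

# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.8, 0.4]


def log_loss(labels, probs):
    """Average negative log-likelihood of the true labels."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs)
    )
    return -total / len(labels)


def brier_score(labels, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def brier_skill_score(labels, probs):
    """Improvement over a reference that always predicts the base rate."""
    base_rate = sum(labels) / len(labels)
    reference = brier_score(labels, [base_rate] * len(labels))
    return 1.0 - brier_score(labels, probs) / reference


print(log_loss(y_true, y_prob))
print(brier_score(y_true, y_prob))
print(brier_skill_score(y_true, y_prob))
```

Lower is better for log loss and the Brier score. For the Brier Skill Score, a value above zero means the predictions beat the reference forecast.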

For more on metrics for evaluating predicted probabilities, see the tutorial "A Gentle Introduction to Probability Scoring Methods in Python" listed under Further Reading.

For binary classification tasks where a single probability score is predicted, Receiver Operating Characteristic, or ROC, curves can be constructed to explore different cut-offs for interpreting the prediction, each of which results in a different trade-off between the true positive rate and the false positive rate. The area under the ROC curve, or ROC AUC, can also be calculated as an aggregate measure. A related method that focuses on the positive class is the Precision-Recall Curve and its area under the curve.

- ROC Curve and ROC AUC
- Precision-Recall Curve and AUC
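To show what a cut-off and ROC AUC mean in practice, here is a small sketch on made-up scores. It computes the true and false positive rates at a given threshold, and computes ROC AUC through its pairwise interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function names are hypothetical.

```python
# Made-up labels and predicted scores for four examples.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def rates_at(threshold, labels, scores):
    """Return (true positive rate, false positive rate) for one cut-off."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(rates_at(0.5, y_true, y_score))
print(rates_at(0.3, y_true, y_score))
print(roc_auc(y_true, y_score))
```

Lowering the threshold from 0.5 to 0.3 catches every positive example but also lets in a false positive. That is the trade-off the ROC curve traces out across all thresholds.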

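For more on these curves and when to use them, see the tutorial "How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python" listed under Further Reading.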

Choice and interpretation of these scoring methods require a foundational understanding of probability theory.

## One More Reason

If I could give one more reason, it would be: Because it is fun.

Seriously.

Learning probability, at least the way I teach it with practical examples and executable code, is a lot of fun. Once you can see how the operations work on real data, it is hard to avoid developing a strong intuition for a subject that is often quite unintuitive.

Do you have more reasons why it is critical for an intermediate machine learning practitioner to learn probability?

Let me know in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.
- Machine Learning, 1997.

### Posts

- A Gentle Introduction to Probability Scoring Methods in Python
- How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python
- How to Choose Loss Functions When Training Deep Learning Neural Networks

### Articles

- Graphical model, Wikipedia.
- Maximum likelihood estimation, Wikipedia.
- Expectation-maximization algorithm, Wikipedia.
- Cross entropy, Wikipedia.
- Kullback-Leibler divergence, Wikipedia.
- Bayesian optimization, Wikipedia.

## Summary

In this post, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability.

Specifically, you learned:

- Not everyone should learn probability; it depends where you are in your journey of learning machine learning.
- Many algorithms are designed using the tools and techniques from probability, such as Naive Bayes and Probabilistic Graphical Models.
- The maximum likelihood framework that underlies the training of many machine learning algorithms comes from the field of probability.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hello Jason, from Kenya here, I just want to say thank you for making me a lazy academic but a ruthless Applied Machine learning engineer, i learn tonnes from you.

if i can put in a request, could you still put up content in an app? i dont know if it will serve many more but an app with a daily update other than my email would go a long way, i can sync and read later, even set a notification on when i can read it, or sometimes need to read or confirm something from your content etc.

I do not know its feasibility, it may only be me having this feeling but could you try it out or ask your audience and see if its a good idea? thank you so much again.

Thanks!

Thanks for the suggestion.

How will reading the tutorials in the app help you exactly?

you are doing great work sir.

Thanks!

Thank you for the wonderful post, I enjoy reading your posts. I was just getting overwhelmed with the math/probability that I need to master before starting machine learning courses. Your post has really helped me to forge ahead. However, I am doing a linear algebra course before starting on machine learning, probably next month. Is it possible to write something on linear algebra and how one should go about it, the way you have done for probability? Thank you.

He has already written about Linear algebra. I am at same boat as yours. Started with LA and now thinking of doing Probability before cranking machine learning.

https://machinelearningmastery.com/linear-algebra-machine-learning/

Thanks.

Yes, the best place to start with linear algebra is right here:

https://machinelearningmastery.com/start-here/#linear_algebra

Thanks for giving the insights and the motivation to learn probability.

You’re welcome.

Yes, you can get started with linear algebra here:

https://machinelearningmastery.com/start-here/#linear_algebra

in your heading ” Probabilistic Measures Are Used to Evaluate Model Skill ”

i guess you missed CONFUSION MATRIX which is also used in Probability based classifiers performance…

I would not consider a confusion matrix as useful for evaluating probabilities.

I would instead recommend logloss, cross entropy and brier score.

Dear Dr Jason,

Thank you for your article. In section 3 you mention the “Bayesian Belief Network” (‘BBN’) . I had a look at the Wikipedia article particularly the example of the conditional conditional (yes I wrote it twice and it is the first time I saw a conditional conditional) probability of the grass getting wet either by the sprinkler and/or rain or both.

I have seen reference to ‘BBNs’ on your site. Do your books cover a practical example of ‘BBNs’ and/or do you intend to do an example of a BBN in python?

Thank you,

Anthony of Sydney

I don’t have tutorials on BBN. I will have a little more in the future, and one day I will have a book on probabilistic graphical models.

Until then, you can start here:

Probabilistic Graphical Models: Principles and Techniques https://amzn.to/324l0tT

I don’t really agree with your statement that probability isn’t necessary for ML. In fact, I didn’t really like your section on why NOT to learn probability. How can people justify that a model is going to be stable, give consistent results in the future, etc. if you can’t even define what MLE is? Or have some understanding of how you got the predicted values you did?

Just my opinion, interested to hear what you think.

Thanks for sharing your opinion Sam.

I think it is better to get started and learn the basic process of working through a problem first, then circle back to probability.

You can understand concepts like mean and variance broadly as part of that first step.

Probability is the scaffold for machine learning, like computability/discrete math for programming. You can do programming without it, but you get much better after learning about it. And if you start with it, you will give up.

Thanks for the response Jason. I appreciate that it’s always good to get going as quickly as possible, I just worry that in today’s day and age, people will create models that could have real impact on people’s decisions. If we don’t fundamentally understand how we got a given prediction/recommendation, then in certain edge cases we could have problematic results. Here lies the importance of understanding the fundamentals of what you are doing.

Perhaps.

It is no more or less dangerous than developers writing software used by thousands of people where those developers have little background as engineers.

We develop structures around the project that add guard rails, like TDD, user testing, system testing, etc.

Fair enough. I think it’s less common to write software with no experience as an engineer than it is to create models without any fundamental probability/ML understanding, but I understand your point.

Appreciate your thoughts.

By the way, I’ve been reading your posts for a while now and really enjoy them—just thought this deserved some attention.

Thanks Sam.

I don’t think we can be black and white on these topics, the industry is full of all types of curious and creative people looking to deliver value.

Probability for Machine Learning is a good book but it’s pure theory (which I’m sure is really important) and there are no examples of real-world applications on real datasets. Just some simple examples with randomly generated or arbitrary values.

Thanks for your thoughts.

It is not theory, e.g. like a textbook, it is applied like a field guide.

Do you have a suggestion of a better way to learn the concepts, Daniel?