Probabilistic models can define relationships between variables and be used to calculate probabilities.

For example, fully conditional models may require an enormous amount of data to cover all possible cases, and probabilities may be intractable to calculate in practice. Simplifying assumptions such as the conditional independence of all random variables can be effective, such as in the case of Naive Bayes, although it is a drastically simplifying step.

An alternative is to develop a model that preserves known conditional dependence between random variables and conditional independence in all other cases. Bayesian networks are a probabilistic graphical model that explicitly capture the known conditional dependence with directed edges in a graph model. All missing connections define the conditional independencies in the model.

As such Bayesian Networks provide a useful tool to visualize the probabilistic model for a domain, review all of the relationships between the random variables, and reason about causal probabilities for scenarios given available evidence.

In this post, you will discover a gentle introduction to Bayesian Networks.

After reading this post, you will know:

- Bayesian networks are a type of probabilistic graphical model comprised of nodes and directed edges.
- Bayesian network models capture both conditionally dependent and conditionally independent relationships between random variables.
- Models can be prepared by experts or learned from data, then used for inference to estimate the probabilities for causal or subsequent events.

**Kick-start your project** with my new book Probability for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

## Overview

This tutorial is divided into five parts; they are:

- Challenge of Probabilistic Modeling
- Bayesian Belief Network as a Probabilistic Model
- How to Develop and Use a Bayesian Network
- Example of a Bayesian Network
- Bayesian Networks in Python

## Challenge of Probabilistic Modeling

Probabilistic models can be challenging to design and use.

Most often, the problem is the lack of information about the domain required to fully specify the conditional dependence between random variables. If available, calculating the full conditional probability for an event can be impractical.

A common approach to addressing this challenge is to add some simplifying assumptions, such as assuming that all random variables in the model are conditionally independent. This is a drastic assumption, although it proves useful in practice, providing the basis for the Naive Bayes classification algorithm.

An alternative approach is to develop a probabilistic model of a problem with some conditional independence assumptions. This provides an intermediate approach between a fully conditional model and a fully conditionally independent model.

Bayesian belief networks are one example of a probabilistic model where some variables are conditionally independent.

Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the global assumption of conditional independence made by the naive Bayes classifier, but more tractable than avoiding conditional independence assumptions altogether.

— Page 184, Machine Learning, 1997.

### Want to Learn Probability for Machine Learning

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Bayesian Belief Network as a Probabilistic Model

A Bayesian belief network is a type of probabilistic graphical model.

### Probabilistic Graphical Models

A probabilistic graphical model (PGM), or simply “*graphical model*” for short, is a way of representing a probabilistic model with a graph structure.

The nodes in the graph represent random variables and the edges that connect the nodes represent the relationships between the random variables.

A graph comprises nodes (also called vertices) connected by links (also known as edges or arcs). In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables.

— Page 360, Pattern Recognition and Machine Learning, 2006.

**Nodes**: Random variables in a graphical model.**Edges**: Relationships between random variables in a graphical model.

There are many different types of graphical models, although the two most commonly described are the Hidden Markov Model and the Bayesian Network.

The Hidden Markov Model (HMM) is a graphical model where the edges of the graph are undirected, meaning the graph contains cycles. Bayesian Networks are more restrictive, where the edges of the graph are directed, meaning they can only be navigated in one direction. This means that cycles are not possible, and the structure can be more generally referred to as a directed acyclic graph (DAG).

Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.

— Page 360, Pattern Recognition and Machine Learning, 2006.

### Bayesian Belief Networks

A Bayesian Belief Network, or simply “*Bayesian Network*,” provides a simple way of applying Bayes Theorem to complex problems.

The networks are not exactly Bayesian by definition, although given that both the probability distributions for the random variables (nodes) and the relationships between the random variables (edges) are specified subjectively, the model can be thought to capture the “*belief*” about a complex domain.

Bayesian probability is the study of subjective probabilities or belief in an outcome, compared to the frequentist approach where probabilities are based purely on the past occurrence of the event.

A Bayesian Network captures the joint probabilities of the events represented by the model.

A Bayesian belief network describes the joint probability distribution for a set of variables.

— Page 185, Machine Learning, 1997.

Central to the Bayesian network is the notion of conditional independence.

Independence refers to a random variable that is unaffected by all other variables. A dependent variable is a random variable whose probability is conditional on one or more other random variables.

Conditional independence describes the relationship among multiple random variables, where a given variable may be conditionally independent of one or more other random variables. This does not mean that the variable is independent per se; instead, it is a clear definition that the variable is independent of specific other known random variables.

A probabilistic graphical model, such as a Bayesian Network, provides a way of defining a probabilistic model for a complex problem by stating all of the conditional independence assumptions for the known variables, whilst allowing the presence of unknown (latent) variables.

As such, both the presence and the absence of edges in the graphical model are important in the interpretation of the model.

A graphical model (GM) is a way to represent a joint distribution by making [Conditional Independence] CI assumptions. In particular, the nodes in the graph represent random variables, and the (lack of) edges represent CI assumptions. (A better name for these models would in fact be “independence diagrams” …

— Page 308, Machine Learning: A Probabilistic Perspective, 2012.

Bayesian networks provide useful benefits as a probabilistic model.

For example:

**Visualization**. The model provides a direct way to visualize the structure of the model and motivate the design of new models.**Relationships**. Provides insights into the presence and absence of the relationships between random variables.**Computations**. Provides a way to structure complex probability calculations.

## How to Develop and Use a Bayesian Network

Designing a Bayesian Network requires defining at least three things:

**Random Variables**. What are the random variables in the problem?**Conditional Relationships**. What are the conditional relationships between the variables?**Probability Distributions**. What are the probability distributions for each variable?

It may be possible for an expert in the problem domain to specify some or all of these aspects in the design of the model.

In many cases, the architecture or topology of the graphical model can be specified by an expert, but the probability distributions must be estimated from data from the domain.

Both the probability distributions and the graph structure itself can be estimated from data, although it can be a challenging process. As such, it is common to use learning algorithms for this purpose; for example, assuming a Gaussian distribution for continuous random variables gradient ascent for estimating the distribution parameters.

Once a Bayesian Network has been prepared for a domain, it can be used for reasoning, e.g. making decisions.

Reasoning is achieved via inference with the model for a given situation. For example, the outcome for some events is known and plugged into the random variables. The model can be used to estimate the probability of causes for the events or possible further outcomes.

Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest, conditioned on this evidence.

— Page 13, Bayesian Reasoning and Machine Learning, 2012.

Practical examples of using Bayesian Networks in practice include medicine (symptoms and diseases), bioinformatics (traits and genes), and speech recognition (utterances and time).

## Example of a Bayesian Network

We can make Bayesian Networks concrete with a small example.

Consider a problem with three random variables: A, B, and C. A is dependent upon B, and C is dependent upon B.

We can state the conditional dependencies as follows:

- A is conditionally dependent upon B, e.g. P(A|B)
- C is conditionally dependent upon B, e.g. P(C|B)

We know that C and A have no effect on each other.

We can also state the conditional independencies as follows:

- A is conditionally independent from C: P(A|B, C)
- C is conditionally independent from A: P(C|B, A)

Notice that the conditional dependence is stated in the presence of the conditional independence. That is, A is conditionally independent of C, or A is conditionally dependent upon B in the presence of C.

We might also state the conditional independence of A given C as the conditional dependence of A given B, as A is unaffected by C and can be calculated from A given B alone.

- P(A|C, B) = P(A|B)

We can see that B is unaffected by A and C and has no parents; we can simply state the conditional independence of B from A and C as P(B, P(A|B), P(C|B)) or P(B).

We can also write the joint probability of A and C given B or conditioned on B as the product of two conditional probabilities; for example:

- P(A, C | B) = P(A|B) * P(C|B)

The model summarizes the joint probability of P(A, B, C), calculated as:

- P(A, B, C) = P(A|B) * P(C|B) * P(B)

We can draw the graph as follows:

Notice that the random variables are each assigned a node, and the conditional probabilities are stated as directed connections between the nodes. Also notice that it is not possible to navigate the graph in a cycle, e.g. no loops are possible when navigating from node to node via the edges.

Also notice that the graph is useful even at this point where we don’t know the probability distributions for the variables.

You might want to extend this example by using contrived probabilities for discrete events for each random variable and practice some simple inference for different scenarios.

## Bayesian Networks in Python

Bayesian Networks can be developed and used for inference in Python.

A popular library for this is called PyMC and provides a range of tools for Bayesian modeling, including graphical models like Bayesian Networks.

The most recent version of the library is called PyMC3, named for Python version 3, and was developed on top of the Theano mathematical computation library that offers fast automatic differentiation.

PyMC3 is a new open source probabilistic programming framework written in Python that uses Theano to compute gradients via automatic differentiation as well as compile probabilistic programs on-the-fly to C for increased speed.

— Probabilistic programming in Python using PyMC3, 2016.

More generally, the use of probabilistic graphical models in computer software used for inference is referred to “*probabilistic programming*“.

This type of programming is called probabilistic programming, […] it is probabilistic in the sense that we create probability models using programming variables as the model’s components. Model components are first-class primitives within the PyMC framework.

— Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, 2015.

For an excellent primer on Bayesian methods generally with PyMC, see the free book by Cameron Davidson-Pilon titled “Bayesian Methods for Hackers.”

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Bayesian Reasoning and Machine Learning, 2012.
- Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, 2015.

### Book Chapters

- Chapter 6: Bayesian Learning, Machine Learning, 1997.
- Chapter 8: Graphical Models, Pattern Recognition and Machine Learning, 2006.
- Chapter 10: Directed graphical models (Bayes nets), Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 14: Probabilistic Reasoning, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

### Papers

### Code

### Articles

- Graphical model, Wikipedia.
- Hidden Markov model, Wikipedia.
- Bayesian network, Wikipedia.
- Conditional independence, Wikipedia.
- Probabilistic Programming & Bayesian Methods for Hackers

## Summary

In this post, you discovered a gentle introduction to Bayesian Networks.

Specifically, you learned:

- Bayesian networks are a type of probabilistic graphical model comprised of nodes and directed edges.
- Bayesian network models capture both conditionally dependent and conditionally independent relationships between random variables.
- Models can be prepared by experts or learned from data, then used for inference to estimate the probabilities for causal or subsequent events.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

What is the difference between Bayesian Belief Networks and Bayesian Neural Networks?

BBN are a probabilistic graphical model. I don’t know much abut Bayesian neural nets, sorry, but they are a neural net – not a graphical model.

@Leo actually there bayesian neural networks do exist, but they are trained in a different way than the usual neural networks. A standard vanilla neural network has matrices of parameters that are fixed or constant. These parameters are estimated using backpropagation, and gradient descent. A bayesian neural network actually estimates a distribution over each parameter or weight in the neural network matrix. Instead of saying that the weights are [1 2 3; 4 5 6], a bayesian NN will assign distributions to these parameters like [Normal(1.0, sigma^2) Normal(2.0, sigma^2) Normal(3.0, sigma^2 ….].

Of course you can’t train this kind of model using standard backpropagation. You would use something similar to variational inference. One interesting approach is a method called Stochastic Gradient Langevin Descent, which is again like variational inference.

It’d be great if you could give some use cases of BN applications in a ML problem statement.or links to the same

Thanks for the suggestion.

Please is it essential to use algorithm if we want to apply BN in research task??

For instance, when we use data mining concept, we have to appy some algorithms based upon its techniques.

I just confused with these two concepts.

Please let me get clear explanation .

Yes, you can use a library to implement the model, such as pymc3.

What model diagnostics are required for structural learning under your package … possibly consistency?

For instance, I am working on structural learning estimation for a set of observed binary survey variables using:

I. ML estimation,

II. BIC goodness-of-fit criteria, and

III. Tabu optimization.

This is under R’s bnlearn package by Scutari and is all performed in the background with little explanation. I am very interested in possibly reimplementing my work in Python for practice on the subject. Thank you!

Sorry, I don’t know about “structural learning”, not sure if I can give you good advice off the cuff.

In Bayesian Belief Network (BBN) two things are learnt from data, one is the structure of the network ( means which nodes are connected to which nodes with one restriction of no cycling as BBN is Directed Acyclic Graph (DAG)) another is Node Probability Table (NPT) for each node which you may call Conditional Probability Table (CPT) if it is Child node or both parent and child node. For only parent node NPT is not conditional distribution. Now if you estimate both the structure and NPT by data mining as it is common in machine learning it is called fully automated learning. If the structure is determined by subject / domain expert and NPT is learnt from data by data mining, it is called semi-automated learning. If both are specified by domain / subject expert it is called manual learning. The estimation of structure of the network is known as Structural Learning.

Can we use bayesian network for image classification?

If can, how to do it?

Thank you

Probably not. It does not sound like an appropriate use of the technique to me.

It seems to me that A is not conditionally independent of C unless all outcomes of A and C are uncorrelated. Their mutual dependence on B suggests that some combinations of A and C may well be more likely than others. Does this mean that conditional independence does not rule out correlations?

Correlation isnt the same as causality

Thank you for the write-up. Please, can we use BN in fault localization of complex systems?

Perhaps try it and see.

Hi Jason,

I would like to ask your personal opinion/suggestion to use Bayesian network for the following scenario (ps: deeply appreciate if you suggest a program):

About 150 different aquatic sites collected for aquatic invertabrates along with zooplankton and phytoplanktons while water and ecology-related variables (e.g., temperature, pH, …) are collected.

Can we apply BN?

I don’t know off hand, perhaps try it and see.

Can you give some idea on structure learning for Bayesian network in the case of time-series data (multivariate). I am facing some difficulty in finding libraries that can give good results for this problem. pgmpy has structure learning framework for gaussian variables, which raise second question, How much accuracy will be affected if we take multivariate time-series data and put it in an algorithm which treats it as iid gaussian

Sorry I cannot.

Would you be able to find probabilities of categorical data using Bayesian network in python??

Not sure what you mean.

You can estimate the probability of values for a variable using an empirical distribution. No bayes net required.

> The Hidden Markov Model (HMM) is a graphical model where the edges of the graph are undirected, meaning the graph contains cycles. Bayesian Networks are more restrictive, where the edges of the graph are directed, meaning they can only be navigated in one direction. This means that cycles are not possible, and the structure can be more generally referred to as a directed acyclic graph (DAG).

That’s not how that works. Just because a graph is undirected does not necessarily mean that it contains a cycle. Likewise, it is easy to create a directed graph that has a cycle. A cyclic graph just means that starting at any node, there is a path (in the graph theory sense) you can follow to get back to your original node. A path must consist of a subset of the graph’s vertices, so you cannot use the same node twice (this is why all undirected graphs are not cyclic — going from A to B and straight back to A doesn’t count).

Thank you for your feedback Audemar!

“The Hidden Markov Model (HMM) is a graphical model where the edges of the graph are undirected, meaning the graph contains cycles.”

I think this statement is false. An HMM is visually two sequences (x_n, y_n), where the observation y_n has a single arrow pointing to it coming from x_n. The state x_n has a directed arrow to it from the previous state and emits an arrow to the next state x_{n+1}. Thus I don’t see how the edges are undirected and the graph contains cycles. Care to explain?

Hi statstudent…Thank you for your feedback! The following resource may add clarity:

https://web.math.princeton.edu/~rvan/orf557/hmm080728.pdf