The post 14 Different Types of Learning in Machine Learning appeared first on Machine Learning Mastery.

Machine learning is a large field of study that overlaps with and inherits ideas from many related fields such as artificial intelligence.

The focus of the field is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means synthesizing useful concepts from historical data.

As such, there are many different types of learning that you may encounter as a practitioner in the field of machine learning: from whole fields of study to specific techniques.

In this post, you will discover a gentle introduction to the different types of learning that you may encounter in the field of machine learning.

After reading this post, you will know:

- Fields of study, such as supervised, unsupervised, and reinforcement learning.
- Hybrid types of learning, such as semi-supervised and self-supervised learning.
- Broad techniques, such as active, online, and transfer learning.

Let’s get started.

Given that the focus of the field of machine learning is “*learning*,” there are many types that you may encounter as a practitioner.

Some types of learning describe whole subfields of study comprised of many different types of algorithms such as “*supervised learning*.” Others describe powerful techniques that you can use on your projects, such as “*transfer learning*.”

There are perhaps 14 types of learning that you must be familiar with as a machine learning practitioner; they are:

**Learning Problems**

- 1. Supervised Learning
- 2. Unsupervised Learning
- 3. Reinforcement Learning

**Hybrid Learning Problems**

- 4. Semi-Supervised Learning
- 5. Self-Supervised Learning
- 6. Multi-Instance Learning

**Statistical Inference**

- 7. Inductive Learning
- 8. Deductive Inference
- 9. Transductive Learning

**Learning Techniques**

- 10. Multi-Task Learning
- 11. Active Learning
- 12. Online Learning
- 13. Transfer Learning
- 14. Ensemble Learning

In the following sections, we will take a closer look at each in turn.

**Did I miss an important type of learning?**

Let me know in the comments below.

First, we will take a closer look at three main types of learning problems in machine learning: supervised, unsupervised, and reinforcement learning.

Supervised learning describes a class of problem that involves using a model to learn a mapping between input examples and the target variable.

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems.

— Page 3, Pattern Recognition and Machine Learning, 2006.

Models are fit on training data comprised of inputs and outputs, then used to make predictions on test sets where only the inputs are provided. The outputs from the model are compared to the withheld target variables to estimate the skill of the model.

Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.

— Page 695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.
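The fit-then-evaluate workflow can be sketched in a few lines of plain Python. This is only a minimal illustration: the data values are made up, and `fit_line` and `mean_squared_error` are hypothetical helper names rather than functions from any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(ys_true, ys_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(ys_true, ys_pred)) / len(ys_true)


# Made-up training data: inputs paired with target outputs.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

# Made-up test data: the targets are withheld from the model during fitting.
test_x = [7.0, 8.0]
test_y = [15.0, 17.1]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
test_mse = mean_squared_error(test_y, predictions)
print(slope, intercept, test_mse)
```

The model only ever sees `train_x` and `train_y`; the error on the test set is what estimates its skill on new examples.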

There are two main types of supervised learning problems: classification, which involves predicting a class label, and regression, which involves predicting a numerical value.

- **Classification**: Supervised learning problem that involves predicting a class label.
- **Regression**: Supervised learning problem that involves predicting a numerical label.

Both classification and regression problems may have one or more input variables and input variables may be any data type, such as numerical or categorical.

An example of a classification problem would be the MNIST handwritten digits dataset where the inputs are images of handwritten digits (pixel data) and the output is a class label for what digit the image represents (numbers 0 to 9).

An example of a regression problem would be the Boston house prices dataset where the inputs are variables that describe a neighborhood and the output is a house price in dollars.

Some machine learning algorithms are described as “*supervised*” machine learning algorithms as they are designed for supervised machine learning problems. Popular examples include: decision trees, support vector machines, and many more.

Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs

— Page 28, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.

Algorithms are referred to as “*supervised*” because they learn by making predictions given examples of input data, and the models are supervised and corrected via an algorithm to better predict the expected target outputs in the training dataset.

The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do.

— Page 105, Deep Learning, 2016.

Some algorithms may be specifically designed for classification (such as logistic regression) or regression (such as linear regression) and some may be used for both types of problems with minor modifications (such as artificial neural networks).

Unsupervised learning describes a class of problems that involves using a model to describe or extract relationships in data.

Compared to supervised learning, unsupervised learning operates upon only the input data without outputs or target variables. As such, unsupervised learning does not have a teacher correcting the model, as in the case of supervised learning.

In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

— Page 105, Deep Learning, 2016.

There are many types of unsupervised learning, although two main problems are often encountered by a practitioner: clustering, which involves finding groups in the data, and density estimation, which involves summarizing the distribution of data.

- **Clustering**: Unsupervised learning problem that involves finding groups in data.
- **Density Estimation**: Unsupervised learning problem that involves summarizing the distribution of data.

An example of a clustering algorithm is k-Means where *k* refers to the number of clusters to discover in the data. An example of a density estimation algorithm is Kernel Density Estimation that involves using small groups of closely related data samples to estimate the distribution for new points in the problem space.
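To make the clustering idea concrete, here is a minimal sketch of k-Means on one-dimensional data in plain Python. The data values and the naive "first k points" initialization are assumptions chosen for illustration; practical implementations use better initialization and stopping rules.

```python
def k_means_1d(points, k, iterations=10):
    """Cluster 1-D points into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up data with two visible groups, around 1 and around 5.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centroids, clusters = k_means_1d(points, k=2)
print(centroids, clusters)
```

No labels are provided anywhere; the two groups are discovered from the input values alone.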

The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of “good traffic days” and “bad traffic days” without ever being given labeled examples of each by a teacher.

— Pages 694-695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Clustering and density estimation may be performed to learn about the patterns in the data.

Additional unsupervised methods may also be used, such as visualization, which involves graphing or plotting data in different ways, and projection methods, which involve reducing the dimensionality of the data.

- **Visualization**: Unsupervised learning problem that involves creating plots of data.
- **Projection**: Unsupervised learning problem that involves creating lower-dimensional representations of data.

An example of a visualization technique would be a scatter plot matrix that creates one scatter plot of each pair of variables in the dataset. An example of a projection method would be Principal Component Analysis that involves summarizing a dataset in terms of eigenvalues and eigenvectors, with linear dependencies removed.
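As a rough sketch of projection, the snippet below finds the first principal component of two-dimensional data and projects each point onto it, reducing two dimensions to one. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, so it only works for two input variables; the data values and the function name are assumptions for illustration.

```python
import math


def first_principal_component(points):
    """Return the unit direction of greatest variance and the 1-D projections."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)

    # Matching eigenvector; handle the already-diagonal case separately.
    if b != 0:
        vx, vy = b, largest - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Project each centered point onto the principal direction.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


# Made-up data lying close to the line y = 2x.
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
direction, projected = first_principal_component(points)
print(direction, projected)
```

Because the two variables are almost linearly dependent, a single projected value per point retains nearly all of the variation in the data.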

The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

— Page 3, Pattern Recognition and Machine Learning, 2006.

Reinforcement learning describes a class of problems where an agent operates in an environment and must *learn* to operate using feedback.

Reinforcement learning is learning what to do — how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

— Page 1, Reinforcement Learning: An Introduction, 2nd edition, 2018.

The use of an environment means that there is no fixed training dataset, rather a goal or set of goals that an agent is required to achieve, actions they may perform, and feedback about performance toward the goal.

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.

— Page 105, Deep Learning, 2016.

It is similar to supervised learning in that the model has some response from which to learn, although the feedback may be delayed and statistically noisy, making it challenging for the agent or model to connect cause and effect.

An example of a reinforcement learning problem is playing a game where the agent has the goal of getting a high score, can make moves in the game, and receives feedback in terms of punishments or rewards.

In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate and consistent evaluations of large numbers of positions, which would be needed to train an evaluation function directly from examples. Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position.

— Page 831, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Impressive recent results include the use of reinforcement learning in Google’s AlphaGo to outperform the world’s top Go player.

Some popular examples of reinforcement learning algorithms include Q-learning, temporal-difference learning, and deep reinforcement learning.
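A minimal sketch of tabular Q-learning shows the feedback loop in action. The environment here is an assumption made up for illustration: a corridor of five states where the agent starts at the left end and receives a reward of 1 only when it reaches the right end. The agent explores with purely random actions, which is enough because Q-learning learns about the greedy policy regardless of how it explores.

```python
import random


def train_q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values for a corridor; action 0 = left, 1 = right."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != goal:
            action = rng.randrange(2)  # purely random exploration
            if action == 0:
                next_state = max(0, state - 1)  # wall at the left end
            else:
                next_state = state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move toward reward plus discounted best
            # value of the next state.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train_q_learning()
# Greedy policy for every non-terminal state.
policy = [row.index(max(row)) for row in q_table[:-1]]
print(policy)
```

The agent is never told that moving right is correct. It only sees a delayed reward at the goal, and the value estimates propagate that reward backward through the earlier states.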

The line between unsupervised and supervised learning is blurry, and there are many hybrid approaches that draw from each field of study.

In this section, we will take a closer look at some of the more common hybrid fields of study: semi-supervised, self-supervised, and multi-instance learning.

Semi-supervised learning is supervised learning where the training data contains very few labeled examples and a large number of unlabeled examples.

The goal of a semi-supervised learning model is to make effective use of all of the available data, not just the labeled data as in supervised learning.

In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. Even the labels themselves may not be the oracular truths that we hope for.

— Page 695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Making effective use of unlabeled data may require the use of or inspiration from unsupervised methods such as clustering and density estimation. Once groups or patterns are discovered, supervised methods or ideas from supervised learning may be used to label the unlabeled examples or apply labels to unlabeled representations later used for prediction.

Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations.

— Page 243, Deep Learning, 2016.
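One simple way to use unlabeled data is self-training, sketched below with a nearest-centroid classifier on one-dimensional data. This is just one possible approach and not the only semi-supervised method; the data values and the helper names `fit_centroids` and `predict` are assumptions for illustration.

```python
def fit_centroids(examples):
    """Return the mean feature value for each class label."""
    sums, counts = {}, {}
    for x, label in examples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Made-up data: two labeled examples and many unlabeled ones.
labeled = [(1.0, 0), (9.0, 1)]
unlabeled = [1.5, 2.0, 2.5, 3.5, 6.5, 7.5, 8.0, 8.5]

# Step 1: fit on the few labeled examples.
model = fit_centroids(labeled)
# Step 2: use the model to assign pseudo-labels to the unlabeled examples.
pseudo_labeled = [(x, predict(model, x)) for x in unlabeled]
# Step 3: refit on labeled and pseudo-labeled examples together.
model = fit_centroids(labeled + pseudo_labeled)
print(model)
```

After the refit, the class centroids reflect where the unlabeled data actually clusters rather than just the two labeled points.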

It is common for many real-world supervised learning problems to be examples of semi-supervised learning problems given the expense or computational cost for labeling examples. For example, classifying photographs requires a dataset of photographs that have already been labeled by human operators.

Many problems from the fields of computer vision (image data), natural language processing (text data), and automatic speech recognition (audio data) fall into this category and cannot be easily addressed using standard supervised learning methods.

… in many practical applications labeled data is very scarce but unlabeled data is plentiful. “Semisupervised” learning attempts to improve the accuracy of supervised learning by exploiting information in unlabeled data. This sounds like magic, but it can work!

— Page 467, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Self-supervised learning refers to an unsupervised learning problem that is framed as a supervised learning problem in order to apply supervised learning algorithms to solve it.

Supervised learning algorithms are used to solve an alternate or pretext task, the result of which is a model or representation that can be used in the solution of the original (actual) modeling problem.

The self-supervised learning framework requires only unlabeled data in order to formulate a pretext learning task such as predicting context or image rotation, for which a target objective can be computed without supervision.

— Revisiting Self-Supervised Visual Representation Learning, 2019.

A common example of self-supervised learning comes from computer vision, where a corpus of unlabeled images is available and can be used to train a supervised model, such as by making images grayscale and having a model predict a color representation (colorization), or by removing blocks of the image and having a model predict the missing parts (inpainting).

In discriminative self-supervised learning, which is the main focus of this work, a model is trained on an auxiliary or ‘pretext’ task for which ground-truth is available for free. In most cases, the pretext task involves predicting some hidden portion of the data (for example, predicting color for gray-scale images

— Scaling and Benchmarking Self-Supervised Visual Representation Learning, 2019.
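The key step in a pretext task is manufacturing input and target pairs from unlabeled data. The sketch below does this for a sequence of tokens by hiding one element at a time, which is loosely analogous to inpainting for images. The `make_pretext_pairs` function and the `<mask>` token are assumptions made up for illustration.

```python
def make_pretext_pairs(sequence, mask_token="<mask>"):
    """Turn one unlabeled sequence into (masked input, target) training pairs."""
    pairs = []
    for i, target in enumerate(sequence):
        masked = list(sequence)
        masked[i] = mask_token  # hide one element; it becomes the target
        pairs.append((masked, target))
    return pairs


# Unlabeled data: no human-provided targets anywhere.
unlabeled = ["the", "cat", "sat"]
pairs = make_pretext_pairs(unlabeled)
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

Each pair can then be fed to any supervised learning algorithm, even though no example was ever labeled by a person.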

A general example of a self-supervised learning algorithm is the autoencoder. Autoencoders are a type of neural network used to create a compact or compressed representation of an input sample. They achieve this via a model that has an encoder and a decoder element separated by a bottleneck that represents the internal compact representation of the input.

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input.

— Page 502, Deep Learning, 2016.

These autoencoder models are trained by providing the input to the model as both input and the target output, requiring that the model reproduce the input by first encoding it to a compressed representation then decoding it back to the original. Once trained, the decoder is discarded and the encoder is used as needed to create compact representations of input.

Although autoencoders are trained using a supervised learning method, they solve an unsupervised learning problem, namely, they are a type of projection method for reducing the dimensionality of input data.

Traditionally, autoencoders were used for dimensionality reduction or feature learning.

— Page 502, Deep Learning, 2016.

Another example of self-supervised learning is generative adversarial networks, or GANs. These are generative models that are most commonly used for creating synthetic photographs using only a collection of unlabeled examples from the target domain.

GAN models are trained indirectly via a separate discriminator model that classifies examples of photos from the domain as real or fake (generated), the result of which is fed back to update the GAN model and encourage it to generate more realistic photos on the next iteration.

The generator network directly produces samples […]. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; θ(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

— Page 699, Deep Learning, 2016.

Multi-instance learning is a supervised learning problem where individual examples are unlabeled; instead, bags or groups of samples are labeled.

In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled.

— Page 106, Deep Learning, 2016.

Instances are in “*bags*” rather than sets because a given instance may be present one or more times, e.g. duplicates.

Modeling involves using the knowledge that one or some of the instances in a bag are associated with a target label in order to predict the label for new bags, given their composition of multiple unlabeled examples.

In supervised multi-instance learning, a class label is associated with each bag, and the goal of learning is to determine how the class can be inferred from the instances that make up the bag.

— Page 156, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Simple methods, such as assigning class labels to individual instances and using standard supervised learning algorithms, often work as a good first step.
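The simple first step described above can be sketched as follows: copy each bag's label onto its instances, fit an ordinary classifier on those instances, and call a new bag positive if any of its instances is predicted positive. The one-dimensional data, the midpoint threshold classifier, and the "any instance" rule are all assumptions chosen for illustration.

```python
def fit_instance_threshold(bags):
    """Fit a 1-D threshold after copying each bag label onto its instances."""
    positives = [x for instances, label in bags if label == 1 for x in instances]
    negatives = [x for instances, label in bags if label == 0 for x in instances]
    # Midpoint between the two class means acts as the decision threshold.
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_bag(threshold, instances):
    """A bag is positive if at least one instance exceeds the threshold."""
    return 1 if any(x > threshold for x in instances) else 0


# Made-up bags: (instances, bag label). Individual instances are unlabeled.
training_bags = [
    ([0.1, 0.2, 0.9], 1),
    ([0.3, 0.8], 1),
    ([0.1, 0.3], 0),
    ([0.2, 0.2, 0.3], 0),
]

threshold = fit_instance_threshold(training_bags)
print(threshold)
print(predict_bag(threshold, [0.1, 0.7]))   # a new bag with one high instance
print(predict_bag(threshold, [0.2, 0.25]))  # a new bag with only low instances
```

Copying bag labels onto instances introduces label noise, since positive bags also contain negative-looking instances, which is why this is only a first step.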

Inference refers to reaching an outcome or decision.

In machine learning, fitting a model and making a prediction are both types of inference.

There are different paradigms for inference that may be used as a framework for understanding how some machine learning algorithms work or how some learning problems may be approached.

Some examples of approaches to learning are inductive, deductive, and transductive learning and inference.

Inductive learning involves using evidence to determine the outcome.

Inductive reasoning refers to using specific cases to determine general outcomes, e.g. specific to general.

Most machine learning models learn using a type of inductive inference or inductive reasoning where general rules (the model) are learned from specific historical examples (the data).

… the problem of induction, which is the problem of how to draw general conclusions about the future from specific observations from the past.

— Page 77, Machine Learning: A Probabilistic Perspective, 2012.

Fitting a machine learning model is a process of induction. The model is a generalization of the specific examples in the training dataset.

A model or hypothesis is made about the problem using the training data, and it is believed to hold over new unseen data later when the model is used.

Lacking any further information, our assumption is that the best hypothesis regarding unseen instances is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive learning …

— Page 23, Machine Learning, 1997.

Deduction or deductive inference refers to using general rules to determine specific outcomes.

We can better understand induction by contrasting it with deduction.

Deduction is the reverse of induction. If induction is going from the specific to the general, deduction is going from the general to the specific.

… the simple observation that induction is just the inverse of deduction!

— Page 291, Machine Learning, 1997.

Deduction is a top-down type of reasoning that seeks for all premises to be met before determining the conclusion, whereas induction is a bottom-up type of reasoning that uses available data as evidence for an outcome.

In the context of machine learning, once we use induction to fit a model on a training dataset, the model can be used to make predictions. The use of the model is a type of deduction or deductive inference.

Transduction or transductive learning is used in the field of statistical learning theory to refer to predicting specific examples given specific examples from a domain.

It is different from induction, which involves learning general rules from specific examples (specific to general). Transduction instead goes from specific to specific.

Induction, deriving the function from the given data. Deduction, deriving the values of the given function for points of interest. Transduction, deriving the values of the unknown function for points of interest from the given data.

— Page 169, The Nature of Statistical Learning Theory, 1995.

Unlike induction, no generalization is required; instead, specific examples are used directly. This may, in fact, be a simpler problem than induction to solve.

The model of estimating the value of a function at a given point of interest describes a new concept of inference: moving from the particular to the particular. We call this type of inference transductive inference. Note that this concept of inference appears when one would like to get the best result from a restricted amount of information.

— Page 169, The Nature of Statistical Learning Theory, 1995.

A classical example of a transductive algorithm is the k-Nearest Neighbors algorithm that does not model the training data, but instead uses it directly each time a prediction is required.
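A minimal sketch of k-Nearest Neighbors makes the point visible: there is no fitting step at all, and the training examples are consulted directly for each prediction. The one-dimensional data and the function name `knn_predict` are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label for query by majority vote of its k nearest examples."""
    # No model is fit; the stored examples are used directly.
    neighbors = sorted(train, key=lambda example: abs(example[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Made-up training examples: (feature value, class label).
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (6.0, "b"), (6.5, "b"), (7.0, "b")]

print(knn_predict(train, 1.8))
print(knn_predict(train, 6.2))
```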

For more on the topic of transduction, see the tutorial Gentle Introduction to Transduction in Machine Learning.

We can contrast these three types of inference in the context of machine learning.

For example:

- **Induction**: Learning a general model from specific examples.
- **Deduction**: Using a model to make predictions.
- **Transduction**: Using specific examples to make predictions.

The image below summarizes these three different approaches nicely.

There are many techniques that are described as types of learning.

In this section, we will take a closer look at some of the more common methods.

This includes multi-task, active, online, transfer, and ensemble learning.

Multi-task learning is a type of supervised learning that involves fitting a model on one dataset that addresses multiple related problems.

It involves devising a model that can be trained on multiple related tasks in such a way that the performance of the model is improved by training across the tasks as compared to being trained on any single task.

Multi-task learning is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks.

— Page 244, Deep Learning, 2016.

Multi-task learning can be a useful approach to problem-solving when there is an abundance of input data labeled for one task that can be shared with another task with much less labeled data.

… we may want to learn multiple related models at the same time, which is known as multi-task learning. This will allow us to “borrow statistical strength” from tasks with lots of data and to share it with tasks with little data.

Page 231, Machine Learning: A Probabilistic Perspective, 2012.

For example, it is common for a multi-task learning problem to involve the same input patterns that may be used for multiple different outputs or supervised learning problems. In this setup, each output may be predicted by a different part of the model, allowing the core of the model to generalize across each task for the same inputs.

In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.

— Page 244, Deep Learning, 2016.

A popular example of multi-task learning is where the same word embedding is used to learn a distributed representation of words in text that is then shared across multiple different natural language processing supervised learning tasks.

Active learning is a technique where the model is able to query a human operator during the learning process in order to resolve ambiguity.

Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points.

— Page 7, Foundations of Machine Learning, 2nd edition, 2018.

Active learning is a type of supervised learning that seeks to achieve the same or better performance as so-called “*passive*” supervised learning by being more efficient about what data is collected or used by the model.

The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).

— Active Learning Literature Survey, 2009.

It is not unreasonable to view active learning as an approach to solving semi-supervised learning problems, or an alternative paradigm for the same types of problems.

… we see that active learning and semi-supervised learning attack the same problem from opposite directions. While semi-supervised methods exploit what the learner thinks it knows about the unlabeled data, active methods attempt to explore the unknown aspects. It is therefore natural to think about combining the two

— Active Learning Literature Survey, 2009.

Active learning is a useful approach when there is not much data available and new data is expensive to collect or label.

The active learning process allows the sampling of the domain to be directed in a way that minimizes the number of samples and maximizes the effectiveness of the model.

Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

— Page 7, Foundations of Machine Learning, 2nd edition, 2018.
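One common querying strategy is uncertainty sampling, where the learner asks for the label of the example it is least sure about. The sketch below applies it to a one-dimensional threshold classifier. The `oracle` function stands in for a human annotator, and its hidden boundary of 0.6, the pool values, and the function names are all assumptions made up for illustration.

```python
def oracle(x):
    """Hypothetical annotator: the true, hidden class boundary is 0.6."""
    return 1 if x >= 0.6 else 0


def current_threshold(labeled):
    """Midpoint between the largest negative and smallest positive example."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learn(pool, labeled, n_queries):
    """Repeatedly query the label of the most ambiguous unlabeled point."""
    pool = list(pool)
    labeled = list(labeled)
    queried = []
    for _ in range(n_queries):
        threshold = current_threshold(labeled)
        # Uncertainty sampling: the point closest to the decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))
        queried.append(query)
    return current_threshold(labeled), queried


# Made-up unlabeled pool and two seed labels.
pool = [0.05, 0.12, 0.2, 0.31, 0.38, 0.44, 0.53, 0.58, 0.64, 0.71, 0.83, 0.9]
seed_labels = [(0.0, 0), (1.0, 1)]

threshold, queried = active_learn(pool, seed_labels, n_queries=4)
print(threshold, queried)
```

With only four label requests out of twelve candidates, the learned threshold lands close to the hidden boundary, because every query is spent where the model is most uncertain.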

Online learning involves using the data available and updating the model directly before a prediction is required or after the last observation was made.

Online learning is appropriate for those problems where observations are provided over time and where the probability distribution of observations is expected to also change over time. Therefore, the model is expected to change just as frequently in order to capture and harness those changes.

Traditionally machine learning is performed offline, which means we have a batch of data, and we optimize an equation […] However, if we have streaming data, we need to perform online learning, so we can update our estimates as each new data point arrives rather than waiting until “the end” (which may never occur).

— Page 261, Machine Learning: A Probabilistic Perspective, 2012.

This approach is also used by algorithms where there may be more observations than can reasonably fit into memory, therefore, learning is performed incrementally over observations, such as a stream of data.

Online learning is helpful when the data may be changing rapidly over time. It is also useful for applications that involve a large collection of data that is constantly growing, even if changes are gradual.

— Page 753, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Generally, online learning seeks to minimize “*regret*,” which is how well the model performed compared to how well it might have performed if all of the information had been available as a batch.

In the theoretical machine learning community, the objective used in online learning is the regret, which is the averaged loss incurred relative to the best we could have gotten in hindsight using a single fixed parameter value

— Page 262, Machine Learning: A Probabilistic Perspective, 2012.
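Regret can be worked through on a tiny example. In the sketch below, an online learner predicts each incoming value using the running mean of what it has seen so far and suffers a squared-error loss. Its total loss is then compared with the loss of the single best fixed prediction chosen in hindsight, which for squared error is the overall mean. The learner, the loss, and the stream values are assumptions for illustration.

```python
def online_mean_regret(stream):
    """Average regret of a running-mean predictor under squared-error loss."""
    estimate = 0.0      # prediction made before any data is seen
    online_loss = 0.0
    seen = []
    for t, y in enumerate(stream, start=1):
        online_loss += (y - estimate) ** 2   # predict first, then suffer loss
        seen.append(y)
        estimate += (y - estimate) / t       # then update the running mean
    # Best single fixed prediction in hindsight: the mean of all values.
    best_fixed = sum(seen) / len(seen)
    hindsight_loss = sum((y - best_fixed) ** 2 for y in seen)
    return (online_loss - hindsight_loss) / len(seen)


regret = online_mean_regret([2.0, 4.0, 6.0])
print(regret)
```

For this stream the online learner accumulates a loss of 17, the best fixed prediction of 4.0 accumulates a loss of 8, and the averaged regret is therefore 3.0.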

One example of online learning is so-called stochastic or online gradient descent used to fit an artificial neural network.

The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data.

— Page 281, Deep Learning, 2016.
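A minimal sketch of online gradient descent for a linear model shows how the estimates are updated as each observation arrives, without storing the data. The stream here is a made-up generator of noise-free observations from y = 2x + 1, and the learning rate is an assumption chosen so that the updates are stable.

```python
def data_stream(passes=300):
    """Hypothetical stream of noise-free observations of y = 2x + 1."""
    for _ in range(passes):
        for i in range(11):
            x = i / 10
            yield x, 2.0 * x + 1.0


def online_sgd(stream, learning_rate=0.1):
    """Update a linear model y = weight * x + bias one observation at a time."""
    weight, bias = 0.0, 0.0
    for x, y in stream:
        error = (weight * x + bias) - y
        # Gradient step on the squared error of this single observation.
        weight -= learning_rate * error * x
        bias -= learning_rate * error
    return weight, bias


weight, bias = online_sgd(data_stream())
print(weight, bias)
```

The model never holds more than one observation in memory at a time, which is what makes this style of learning suitable for streams and very large datasets.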

Transfer learning is a type of learning where a model is first trained on one task, then some or all of the model is used as the starting point for a related task.

In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2.

— Page 536, Deep Learning, 2016.

It is a useful approach on problems where there is a task related to the main task of interest and the related task has a large amount of data.

It is different from multi-task learning as the tasks are learned sequentially in transfer learning, whereas multi-task learning seeks good performance on all considered tasks by a single model at the same time.

… pretrain a deep convolutional net with 8 layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are then jointly trained to perform a different set of tasks (another subset of the 1000 ImageNet object categories), with fewer training examples than for the first set of tasks.

— Page 325, Deep Learning, 2016.

An example is image classification, where a predictive model, such as an artificial neural network, can be trained on a large corpus of general images, and the weights of the model can be used as a starting point when training on a smaller more specific dataset, such as dogs and cats. The features already learned by the model on the broader task, such as extracting lines and patterns, will be helpful on the new related task.

If there is significantly more data in the first setting (sampled from P1), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from P2. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.

— Page 536, Deep Learning, 2016.

Transfer learning is particularly useful with models that are trained incrementally, such as deep learning networks, where an existing model can be used as a starting point for continued training.
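The idea of reusing a trained model as a starting point can be sketched with a small linear model in place of a neural network. A model is first fit on a source task with plenty of data, then its parameters are used to initialize training on a related target task with only three examples and very few update steps. The two tasks, the data, and the function names are assumptions made up for illustration.

```python
def sgd_fit(data, weight=0.0, bias=0.0, epochs=1, learning_rate=0.1):
    """Fit y = weight * x + bias by stochastic gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            error = (weight * x + bias) - y
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


def mse(data, weight, bias):
    """Mean squared error of the linear model on a dataset."""
    return sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)


# Source task with plenty of data: y = 2x + 1.
source = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(11)]
# Related target task with very little data: y = 2x + 1.2.
target = [(0.1, 1.4), (0.5, 2.2), (0.9, 3.0)]

# Train on the source task, then reuse the parameters as the starting point.
pretrained = sgd_fit(source, epochs=300)
transferred = sgd_fit(target, *pretrained, epochs=2)
# Baseline: the same small training budget, starting from zero.
from_scratch = sgd_fit(target, epochs=2)

print(mse(target, *transferred), mse(target, *from_scratch))
```

With the same small budget on the target task, the model that starts from the pretrained parameters ends with a much lower error than the one trained from scratch, because most of what it needs was already learned on the source task.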

For more on the topic of transfer learning, see the tutorial A Gentle Introduction to Transfer Learning for Deep Learning.

Ensemble learning is an approach where two or more models are fit on the same data and the predictions from each model are combined.

The field of ensemble learning provides many ways of combining the ensemble members’ predictions, including uniform weighting and weights chosen on a validation set.

— Page 472, Deep Learning, 2016.

The objective of ensemble learning is to achieve better performance with the ensemble of models as compared to any individual model. This involves both deciding how to create models used in the ensemble and how to best combine the predictions from the ensemble members.

Ensemble learning can be broken down into two tasks: developing a population of base learners from the training data, and then combining them to form the composite predictor.

— Page 605, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.

Ensemble learning is a useful approach for improving the predictive skill on a problem domain and for reducing the variance of stochastic learning algorithms, such as artificial neural networks.

Some examples of popular ensemble learning algorithms include: weighted average, stacked generalization (stacking), and bootstrap aggregation (bagging).

Bagging, boosting, and stacking have been developed over the last couple of decades, and their performance is often astonishingly good. Machine learning researchers have struggled to understand why.

— Page 480, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
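To make the two tasks of ensemble learning concrete (creating the members, then combining their predictions), here is a minimal sketch of bagging with a uniform average. The toy data, the one-parameter base learner, and all function names are assumptions made for illustration; they are not taken from any of the books quoted above.

```python
import random
import statistics

def fit_slope(points):
    """Base learner: least-squares slope for y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

def bagged_slopes(points, n_members, rng):
    """Fit one base learner per bootstrap resample of the training data."""
    members = []
    for _ in range(n_members):
        # Bootstrap: sample the training set with replacement.
        sample = [rng.choice(points) for _ in points]
        members.append(fit_slope(sample))
    return members

def ensemble_predict(members, x):
    """Combine member predictions with a uniform (unweighted) average."""
    return statistics.mean(w * x for w in members)

rng = random.Random(1)
# Toy data (an assumption for illustration): y = 2x plus Gaussian noise.
train = [(i / 10, 2.0 * (i / 10) + rng.gauss(0.0, 1.0)) for i in range(1, 51)]
members = bagged_slopes(train, n_members=25, rng=rng)
prediction = ensemble_predict(members, x=3.0)
print(round(prediction, 2))
```

A weighted average would replace `statistics.mean` with weights chosen on a validation set, and stacking would replace it with a second model trained on the members' predictions.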

For more on the topic of ensemble learning, see the tutorial:

- Ensemble Learning Methods for Deep Learning Neural Networks

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning, 2006.
- Deep Learning, 2016.
- Reinforcement Learning: An Introduction, 2nd edition, 2018.
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.
- Machine Learning: A Probabilistic Perspective, 2012.
- Machine Learning, 1997.
- The Nature of Statistical Learning Theory, 1995.
- Foundations of Machine Learning, 2nd edition, 2018.
- Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

- Revisiting Self-Supervised Visual Representation Learning, 2019.
- Active Learning Literature Survey, 2009.

- Supervised and Unsupervised Machine Learning Algorithms
- Why Do Machine Learning Algorithms Work on New Data?
- How Machine Learning Algorithms Work
- Gentle Introduction to Transduction in Machine Learning
- A Gentle Introduction to Transfer Learning for Deep Learning
- Ensemble Learning Methods for Deep Learning Neural Networks

- Supervised learning, Wikipedia.
- Unsupervised learning, Wikipedia.
- Reinforcement learning, Wikipedia.
- Semi-supervised learning, Wikipedia.
- Multi-task learning, Wikipedia.
- Multiple instance learning, Wikipedia.
- Inductive reasoning, Wikipedia.
- Deductive reasoning, Wikipedia.
- Transduction (machine learning), Wikipedia.
- Active learning (machine learning), Wikipedia.
- Online machine learning, Wikipedia.
- Transfer learning, Wikipedia.
- Ensemble learning, Wikipedia.

In this post, you discovered a gentle introduction to the different types of learning that you may encounter in the field of machine learning.

Specifically, you learned:

- Fields of study, such as supervised, unsupervised, and reinforcement learning.
- Hybrid types of learning, such as semi-supervised and self-supervised learning.
- Broad techniques, such as active, online, and transfer learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Maximum a Posteriori (MAP) for Machine Learning appeared first on Machine Learning Mastery.

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain.

Typically, estimating the entire distribution is intractable, and instead, we are happy to have the expected value of the distribution, such as the mean or mode. Maximum a Posteriori or MAP for short is a Bayesian-based approach to estimating a distribution and model parameters that best explain an observed dataset.

This flexible probabilistic framework can be used to provide a Bayesian foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively. Unlike maximum likelihood estimation, it explicitly allows prior belief about candidate models to be incorporated systematically.

In this post, you will discover a gentle introduction to Maximum a Posteriori estimation.

After reading this post, you will know:

- Maximum a Posteriori estimation is a probabilistic framework for solving the problem of density estimation.
- MAP involves calculating a conditional probability of observing the data given a model weighted by a prior probability or belief about the model.
- MAP provides an alternate probability framework to maximum likelihood estimation for machine learning.

Let’s get started.

This tutorial is divided into three parts; they are:

- Density Estimation
- Maximum a Posteriori (MAP)
- MAP and Machine Learning

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

For example, consider a sample of observations (*X*) from a domain (*x1, x2, x3, …, xn*), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (*X*).

Often estimating the density is too challenging; instead, we are happy with a point estimate from the target distribution, such as the mean.

There are many techniques for solving this problem, although two common approaches are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

Both approaches frame the problem as optimization and involve searching for a distribution and set of parameters for the distribution that best describes the observed data.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

- P(X ; theta)

or

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters.

The objective of Maximum Likelihood Estimation is to find the set of parameters (*theta*) that maximizes the likelihood function, i.e. that results in the largest likelihood value.

- maximize P(X ; theta)
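As a small illustration of this objective, the sketch below searches a grid of candidate *theta* values for the one that maximizes the likelihood of some coin flips. The Bernoulli model, the data, and the grid search are assumptions chosen to keep the example short; in practice the logarithm of the likelihood is maximized, which has the same maximizer and avoids multiplying many small numbers.

```python
import math

def log_likelihood(theta, data):
    """Log of P(X ; theta) for i.i.d. Bernoulli observations (1 = heads)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]      # 7 heads in 10 flips
grid = [i / 100 for i in range(1, 100)]    # candidate theta values in (0, 1)

# maximize P(X ; theta) by exhaustive search over the candidates
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_mle)
```

For this model the maximizer is simply the observed fraction of heads, so the search is only there to show the optimization framing.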

An alternative and closely related approach is to consider the optimization problem from the perspective of Bayesian probability.

A popular replacement for maximizing the likelihood is maximizing the Bayesian posterior probability density of the parameters instead.

— Page 306, Information Theory, Inference and Learning Algorithms, 2003.

Recall that the Bayes theorem provides a principled way of calculating a conditional probability.

It involves calculating the conditional probability of one outcome given another outcome, using the inverse of this relationship, stated as follows:

- P(A | B) = (P(B | A) * P(A)) / P(B)

The quantity that we are calculating is typically referred to as the posterior probability of *A* given *B* and *P(A)* is referred to as the prior probability of *A*.

The normalizing constant of *P(B)* can be removed, and the posterior can be shown to be proportional to the probability of *B* given *A* multiplied by the prior.

- P(A | B) is proportional to P(B | A) * P(A)

Or, written loosely as an equality with the normalizing constant dropped:

- P(A | B) = P(B | A) * P(A)

This is a helpful simplification as we are not interested in estimating a probability, but instead in optimizing a quantity. A proportional quantity is good enough for this purpose.
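A quick numerical check may help here. The sketch below uses made-up values for *P(A)* and *P(B | A)* over three possible outcomes of *A* and shows that the outcome with the largest unnormalized score is also the outcome with the largest true posterior. All numbers and names are hypothetical.

```python
# Hypothetical prior P(A) and likelihood P(B | A) for one observed B.
prior = {"a1": 0.5, "a2": 0.3, "a3": 0.2}
likelihood = {"a1": 0.1, "a2": 0.4, "a3": 0.7}

# Proportional quantity: P(B | A) * P(A)
unnormalized = {a: likelihood[a] * prior[a] for a in prior}

# Full Bayes theorem: divide by the normalizing constant P(B)
evidence = sum(unnormalized.values())
posterior = {a: v / evidence for a, v in unnormalized.items()}

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)
```

Dividing every score by the same positive constant cannot change which one is largest, which is why the normalizing constant can be skipped when optimizing.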

We can now relate this calculation to our desire to estimate a distribution and parameters (*theta*) that best explain our dataset (*X*), as we described in the previous section. This can be stated, again dropping the normalizing constant, as:

- P(theta | X) = P(X | theta) * P(theta)

Maximizing this quantity over a range of theta solves an optimization problem for estimating the central tendency of the posterior probability (e.g. the mode of the distribution). As such, this technique is referred to as “*maximum a posteriori estimation*,” or MAP estimation for short, and sometimes simply “*maximum posterior estimation*.”

- maximize P(X | theta) * P(theta)
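The sketch below shows this objective for a coin-flip model. It assumes a Bernoulli likelihood and a Beta(a, b) prior on *theta*, with prior settings that are purely hypothetical, and compares a grid search against the known closed-form mode of the resulting Beta posterior.

```python
import math

def log_posterior(theta, heads, tails, a, b):
    """log P(X | theta) + log P(theta) for a Beta(a, b) prior, constants dropped."""
    log_lik = heads * math.log(theta) + tails * math.log(1.0 - theta)
    log_prior = (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)
    return log_lik + log_prior

heads, tails = 7, 3
a, b = 5.0, 5.0  # hypothetical prior belief that the coin is roughly fair

# maximize P(X | theta) * P(theta) over a grid of candidate theta values
grid = [i / 1000 for i in range(1, 1000)]
theta_map = max(grid, key=lambda t: log_posterior(t, heads, tails, a, b))

# Mode of the Beta(heads + a, tails + b) posterior, for comparison.
closed_form = (heads + a - 1.0) / (heads + tails + a + b - 2.0)
print(theta_map, round(closed_form, 4))
```

The estimate lands between the observed fraction of heads (0.7) and the prior's preferred value (0.5), which is the effect of weighting the likelihood by the prior.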

We are typically not calculating the full posterior probability distribution, and in fact, this may not be tractable for many problems of interest.

… finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem.

— Page 804, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Instead, we are calculating a point estimate: the mode of the posterior distribution (its most probable value), which is the same as the mean for the normal distribution.

One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation.

— Page 139, Deep Learning, 2016.

**Note**: this is very similar to Maximum Likelihood Estimation, with the addition of the prior probability over the distribution and parameters.

In fact, if we assume that all values of *theta* are equally likely because we don’t have any prior information (e.g. a uniform prior), then both calculations are equivalent.

Because of this equivalence, MLE and MAP often converge to the same optimization problem for many machine learning algorithms. This is not always the case; if the MLE and MAP optimization problems differ, the solutions found for an algorithm may also differ.

… the maximum likelihood hypothesis might not be the MAP hypothesis, but if one assumes uniform prior probabilities over the hypotheses then it is.

— Page 167, Machine Learning, 1997.
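The equivalence under an uninformative prior can be seen numerically. This sketch assumes a Gaussian likelihood with known variance and a Gaussian prior on the mean, for which the MAP estimate has a standard closed form. A very large prior variance is used as a stand-in for a flat prior, which is an approximation rather than a true uniform prior.

```python
import statistics

def map_mean(data, sigma2, prior_mean, prior_var):
    """MAP estimate of a Gaussian mean with known variance and a Gaussian prior."""
    n = len(data)
    precision = 1.0 / prior_var + n / sigma2
    return (prior_mean / prior_var + sum(data) / sigma2) / precision

data = [4.8, 5.1, 5.6, 4.9, 5.3]
mle = statistics.mean(data)  # maximum likelihood estimate of the mean

# An informative prior centred on 0 pulls the estimate away from the MLE.
informative = map_mean(data, sigma2=1.0, prior_mean=0.0, prior_var=1.0)

# A nearly flat prior (huge variance) leaves the MLE essentially unchanged.
nearly_flat = map_mean(data, sigma2=1.0, prior_mean=0.0, prior_var=1e9)
print(round(mle, 3), round(informative, 3), round(nearly_flat, 3))
```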

In machine learning, Maximum a Posteriori optimization provides a Bayesian probability framework for fitting model parameters to training data and an alternative and sibling to the perhaps more common Maximum Likelihood Estimation framework.

Maximum a posteriori (MAP) learning selects a single most likely hypothesis given the data. The hypothesis prior is still used and the method is often more tractable than full Bayesian learning.

— Page 825, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

One framework is not better than another, and as mentioned, in many cases, both frameworks frame the same optimization problem from different perspectives.

Instead, MAP is appropriate for those problems where there is some prior information, e.g. where a meaningful prior can be set to weigh the choice of different distributions and parameters or model parameters. MLE is more appropriate where there is no such prior.

Bayesian methods can be used to determine the most probable hypothesis given the data-the maximum a posteriori (MAP) hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more likely.

— Page 197, Machine Learning, 1997.

In fact, the addition of the prior to the MLE can be thought of as a type of regularization of the MLE calculation. This insight allows other regularization methods (e.g. L2 norm in models that use a weighted sum of inputs) to be interpreted under a framework of MAP Bayesian inference. For example, L2 is a bias or prior that assumes that a set of coefficients or weights have a small sum squared value.

… in particular, L2 regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights.

— Page 236, Deep Learning, 2016.
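The correspondence can be checked on a tiny example. The sketch below assumes a one-weight linear model *y = w \* x* with Gaussian noise of variance `sigma2` and a zero-mean Gaussian prior on *w* with variance `tau2`. Under those assumptions, maximizing the log posterior is the same as minimizing squared error plus an L2 penalty of strength `sigma2 / tau2`. The data and variances are made up.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def log_posterior(w, xs, ys, sigma2, tau2):
    """Gaussian log-likelihood plus Gaussian log-prior on w, constants dropped."""
    log_lik = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * sigma2)
    log_prior = -(w ** 2) / (2.0 * tau2)
    return log_lik + log_prior

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigma2, tau2 = 1.0, 0.5      # assumed noise variance and prior variance
lam = sigma2 / tau2          # the matching L2 penalty strength

w_ridge = ridge_weight(xs, ys, lam)

# MAP by grid search over candidate weights in [0, 3]
grid = [i / 1000 for i in range(0, 3001)]
w_map = max(grid, key=lambda w: log_posterior(w, xs, ys, sigma2, tau2))
print(round(w_ridge, 4), w_map)
```

A tighter prior (smaller `tau2`) gives a larger penalty and shrinks the weight further toward zero.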

We can make the relationship between MAP and machine learning clearer by re-framing the optimization problem as being performed over candidate modeling hypotheses (*h* in *H*) instead of the more abstract distribution and parameters (*theta*); for example:

- maximize P(X | h) * P(h)

Here, we can see that we want a model or hypothesis (*h*) that best explains the observed training dataset (*X*) and that the prior (*P(h)*) is our belief about how useful a hypothesis is expected to be, generally, regardless of the training data. The optimization problem involves estimating the posterior probability for each candidate hypothesis.

We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior probability of each candidate hypothesis.

— Page 157, Machine Learning, 1997.
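With a small, discrete set of hypotheses the calculation can be written out directly. In the sketch below each hypothesis is a coin with a fixed probability of heads; the hypothesis names, priors, and observed flips are all invented for illustration.

```python
import math

def log_likelihood(bias, flips):
    """log P(X | h) for a coin hypothesis with the given probability of heads."""
    return sum(math.log(bias if f == 1 else 1.0 - bias) for f in flips)

# Candidate hypotheses h in H and a prior P(h) that strongly favours a fair coin.
hypotheses = {"fair": 0.5, "slightly_biased": 0.6, "heavily_biased": 0.9}
prior = {"fair": 0.8, "slightly_biased": 0.15, "heavily_biased": 0.05}

flips = [1, 1, 0, 1, 1, 1, 0, 1]  # observed data X: 6 heads, 2 tails

# maximize P(X | h) * P(h), computed in log space
scores = {
    h: log_likelihood(bias, flips) + math.log(prior[h])
    for h, bias in hypotheses.items()
}
h_map = max(scores, key=scores.get)

# For comparison, the maximum likelihood hypothesis ignores the prior.
h_mle = max(hypotheses, key=lambda h: log_likelihood(hypotheses[h], flips))
print(h_map, h_mle)
```

Here the data alone favour the slightly biased coin, but the strong prior keeps the fair coin as the MAP hypothesis; with more flips showing the same imbalance the likelihood would eventually outweigh the prior.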

Like MLE, solving the optimization problem depends on the choice of model. For simpler models, like linear regression, there are analytical solutions. For more complex models like logistic regression, numerical optimization is required that makes use of first- and second-order derivatives. For the more prickly problems, stochastic optimization algorithms may be required.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 6 Bayesian Learning, Machine Learning, 1997.
- Chapter 12 Maximum Entropy Models, Foundations of Machine Learning, 2018.
- Chapter 9 Probabilistic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Chapter 13 MAP Inference, Probabilistic Graphical Models: Principles and Techniques, 2009.

In this post, you discovered a gentle introduction to Maximum a Posteriori estimation.

Specifically, you learned:

- Maximum a Posteriori estimation is a probabilistic framework for solving the problem of density estimation.
- MAP involves calculating a conditional probability of observing the data given a model weighted by a prior probability or belief about the model.
- MAP provides an alternate probability framework to maximum likelihood estimation for machine learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Markov Chain Monte Carlo for Probability appeared first on Machine Learning Mastery.

Probabilistic inference involves estimating an expected value or density using a probabilistic model.

Often, directly inferring values is not tractable with probabilistic models, and instead, approximation methods must be used.

Markov Chain Monte Carlo sampling provides a class of algorithms for systematic random sampling from high-dimensional probability distributions. Unlike Monte Carlo sampling methods that are able to draw independent samples from the distribution, Markov Chain Monte Carlo methods draw samples where the next sample is dependent on the existing sample, called a Markov Chain. This allows the algorithms to narrow in on the quantity that is being approximated from the distribution, even with a large number of random variables.

In this post, you will discover a gentle introduction to Markov Chain Monte Carlo for machine learning.

After reading this post, you will know:

- Monte Carlo sampling is not effective and may be intractable for high-dimensional probabilistic models.
- Markov Chain Monte Carlo provides an alternate approach to random sampling a high-dimensional probability distribution where the next sample is dependent upon the current sample.
- Gibbs Sampling and the more general Metropolis-Hastings algorithm are the two most common approaches to Markov Chain Monte Carlo sampling.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Probabilistic Inference
- What Is Markov Chain Monte Carlo
- Markov Chain Monte Carlo Algorithms

Calculating a quantity from a probabilistic model is referred to more generally as probabilistic inference, or simply inference.

For example, we may be interested in calculating an expected probability, estimating the density, or other properties of the probability distribution. This is the goal of the probabilistic model, and the name of the inference performed often takes on the name of the probabilistic model, e.g. Bayesian Inference is performed with a Bayesian probabilistic model.

The direct calculation of the desired quantity from a model of interest is intractable for all but the most trivial probabilistic models. Instead, the expected probability or density must be approximated by other means.

For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation.

— Page 523, Pattern Recognition and Machine Learning, 2006.

The desired calculation is typically a sum over a discrete distribution of many random variables, or an integral over a continuous distribution of many variables, and is intractable to calculate. This problem exists in both schools of probability, although it is perhaps more common with Bayesian probability and integrating over a posterior distribution for a model.

Bayesians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to make inference about model parameters or to make predictions. Bayesians need to integrate over the posterior distribution of model parameters given the data, and frequentists may need to integrate over the distribution of observables given parameter values.

— Page 1, Markov Chain Monte Carlo in Practice, 1996.

The typical solution is to draw independent samples from the probability distribution, then repeat this process many times to approximate the desired quantity. This is referred to as Monte Carlo sampling or Monte Carlo integration, named for the city in Monaco that has many casinos.

The problem with Monte Carlo sampling is that it does not work well in high dimensions. This is firstly because of the curse of dimensionality, where the volume of the sample space increases exponentially with the number of parameters (dimensions).

Secondly, and perhaps most critically, this is because Monte Carlo sampling assumes that each random sample drawn from the target distribution is independent and can be independently drawn. This is typically not the case, or is intractable, for inference with Bayesian structured or graphical probabilistic models.

The solution to sampling probability distributions in high dimensions is to use Markov Chain Monte Carlo, or MCMC for short.

The most popular method for sampling from high-dimensional distributions is Markov chain Monte Carlo or MCMC

— Page 837, Machine Learning: A Probabilistic Perspective, 2012.

Like Monte Carlo methods, Markov Chain Monte Carlo was first developed around the same time as the development of the first computers and was used in calculations for particle physics required as part of the Manhattan project for developing the atomic bomb.

Monte Carlo is a technique for randomly sampling a probability distribution and approximating a desired quantity.

Monte Carlo algorithms, [….] are used in many branches of science to estimate quantities that are difficult to calculate exactly.

— Page 530, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Monte Carlo methods typically assume that we can efficiently draw samples from the target distribution. From the samples that are drawn, we can then estimate the sum or integral quantity as the mean or variance of the drawn samples.
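As a minimal sketch of this idea, the code below draws independent samples from a distribution that is easy to sample (a standard normal, chosen purely for illustration) and uses them to approximate three quantities that are each defined as an integral over the distribution.

```python
import random
import statistics

rng = random.Random(42)
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Each estimate approximates an integral over the target distribution.
mean_estimate = statistics.fmean(samples)                     # E[X], exactly 0
variance_estimate = statistics.fmean(
    (x - mean_estimate) ** 2 for x in samples
)                                                             # Var[X], exactly 1
tail_estimate = sum(x > 1.0 for x in samples) / len(samples)  # P(X > 1), about 0.159

print(round(mean_estimate, 3), round(variance_estimate, 3), round(tail_estimate, 3))
```

The whole approach relies on `rng.gauss` being able to produce independent draws cheaply, which is exactly the assumption that breaks down for many high-dimensional models.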

A useful way to think about a Monte Carlo sampling process is to consider a complex two-dimensional shape, such as a spiral. We cannot easily define a function to describe the spiral, but we may be able to draw samples from the domain and determine if they are part of the spiral or not. Together, a large number of samples drawn from the domain will allow us to summarize the shape (probability density) of the spiral.

A Markov chain is a systematic method for generating a sequence of random variables where the current value is probabilistically dependent on the value of the prior variable. Specifically, selecting the next variable is only dependent upon the last variable in the chain.

A Markov chain is a special type of stochastic process, which deals with characterization of sequences of random variables. Special interest is paid to the dynamic and the limiting behaviors of the sequence.

— Page 113, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.

Consider a board game that involves rolling dice, such as snakes and ladders (or chutes and ladders). The roll of a die has a uniform probability distribution across 6 outcomes (integers 1 to 6). You have a position on the board, but your next position on the board is only based on the current position and the random roll of the dice. Your specific positions on the board form a Markov chain.

Another example of a Markov chain is a random walk in one dimension, where the possible moves are 1, -1, chosen with equal probability, and the next point on the number line in the walk is only dependent upon the current position and the randomly chosen move.

At a high level, a Markov chain is defined in terms of a graph of states over which the sampling algorithm takes a random walk.

— Page 507, Probabilistic Graphical Models: Principles and Techniques, 2009.
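The one-dimensional random walk described above takes only a few lines to simulate. In this sketch (function and variable names are illustrative), each new position is computed from the previous position and a random move alone, which is the Markov property.

```python
import random

def random_walk(n_steps, rng, start=0):
    """1-D random walk: each position depends only on the previous position."""
    positions = [start]
    for _ in range(n_steps):
        move = rng.choice([1, -1])          # both moves equally likely
        positions.append(positions[-1] + move)
    return positions

rng = random.Random(7)
walk = random_walk(20, rng)
print(walk)
```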

Combining these two methods, Markov Chain and Monte Carlo, allows random sampling of high-dimensional probability distributions that honors the probabilistic dependence between samples by constructing a Markov Chain that comprises the Monte Carlo sample.

MCMC is essentially Monte Carlo integration using Markov chains. […] Monte Carlo integration draws samples from the required distribution, and then forms sample averages to approximate expectations. Markov chain Monte Carlo draws these samples by running a cleverly constructed Markov chain for a long time.

— Page 1, Markov Chain Monte Carlo in Practice, 1996.

Specifically, MCMC is for performing inference (e.g. estimating a quantity or a density) for probability distributions where independent samples from the distribution cannot be drawn, or cannot be drawn easily.

As such, Monte Carlo sampling cannot be used.

Instead, samples are drawn from the probability distribution by constructing a Markov Chain, where the next sample that is drawn from the probability distribution is dependent upon the last sample that was drawn. The idea is that the chain will settle on (find equilibrium at) the desired quantity we are inferring.

Yet, we are still sampling from the target probability distribution with the goal of approximating a desired quantity, so it is appropriate to refer to the resulting collection of samples as a Monte Carlo sample, even though the samples drawn often form one long Markov chain.

The idea of imposing a dependency between samples may seem odd at first, but may make more sense if we consider domains like the random walk or snakes and ladders games, where such dependency between samples is required.

There are many Markov Chain Monte Carlo algorithms that mostly define different ways of constructing the Markov Chain when performing each Monte Carlo sample.

The random walk provides a good metaphor for the construction of the Markov chain of samples, yet it is very inefficient. Consider the case where we may want to calculate the expected probability; it is more efficient to zoom in on that quantity or density, rather than wander around the domain. Markov Chain Monte Carlo algorithms are attempts at carefully harnessing properties of the problem in order to construct the chain efficiently.

This sequence is constructed so that, although the first sample may be generated from the prior, successive samples are generated from distributions that provably get closer and closer to the desired posterior.

— Page 505, Probabilistic Graphical Models: Principles and Techniques, 2009.

MCMC algorithms are sensitive to their starting point, and often require a warm-up phase or burn-in phase to move in towards a fruitful part of the search space, after which the earlier samples can be discarded and useful samples can be collected.

Additionally, it can be challenging to know whether a chain has converged and has run for a sufficient number of steps. Often a very large number of samples are required and a run is stopped after a fixed number of steps.

… it is necessary to discard some of the initial samples until the Markov chain has burned in, or entered its stationary distribution.

— Page 838, Machine Learning: A Probabilistic Perspective, 2012.

The most common general Markov Chain Monte Carlo algorithm is called Gibbs Sampling; a more general version of this sampler is called the Metropolis-Hastings algorithm.

Let’s take a closer look at both methods.

The Gibbs Sampling algorithm is an approach to constructing a Markov chain where the probability of the next sample is calculated as the conditional probability given the prior sample.

Samples are constructed by changing one random variable at a time, meaning that subsequent samples are very close in the search space, e.g. local. As such, there is some risk of the chain getting stuck.

The idea behind Gibbs sampling is that we sample each variable in turn, conditioned on the values of all the other variables in the distribution.

— Page 838, Machine Learning: A Probabilistic Perspective, 2012.

Gibbs Sampling is appropriate for those probabilistic models where this conditional probability can be calculated, e.g. the distribution is discrete rather than continuous.

… Gibbs sampling is applicable only in certain circumstances; in particular, we must be able to sample from the distribution P(Xi | x-i). Although this sampling step is easy for discrete graphical models, in continuous models, the conditional distribution may not be one that has a parametric form that allows sampling, so that Gibbs is not applicable.

— Page 515, Probabilistic Graphical Models: Principles and Techniques, 2009.
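To make the procedure concrete, here is a minimal Gibbs sampler for a discrete model with two binary variables. The joint probability table is hypothetical, and it is only used to compute the two conditional distributions that the sampler draws from. In a real problem the joint would not be available in this form, only the conditionals.

```python
import random

# Hypothetical joint distribution P(a, b) over two binary variables.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def sample_a_given_b(b, rng):
    """Draw a from the conditional P(a | b)."""
    p1 = joint[(1, b)] / (joint[(0, b)] + joint[(1, b)])
    return 1 if rng.random() < p1 else 0

def sample_b_given_a(a, rng):
    """Draw b from the conditional P(b | a)."""
    p1 = joint[(a, 1)] / (joint[(a, 0)] + joint[(a, 1)])
    return 1 if rng.random() < p1 else 0

def gibbs(n_samples, burn_in, rng):
    """Return the empirical frequency of each state visited after burn-in."""
    a, b = 0, 0  # arbitrary starting state
    counts = {state: 0 for state in joint}
    for step in range(n_samples + burn_in):
        a = sample_a_given_b(b, rng)  # update one variable at a time,
        b = sample_b_given_a(a, rng)  # conditioned on the other's current value
        if step >= burn_in:
            counts[(a, b)] += 1
    return {state: c / n_samples for state, c in counts.items()}

estimate = gibbs(n_samples=50_000, burn_in=1_000, rng=random.Random(3))
print({state: round(p, 3) for state, p in estimate.items()})
```

The visited states are not independent draws, yet their long-run frequencies approach the joint table, which is the property the sampler is designed to have.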

The Metropolis-Hastings Algorithm is appropriate for those probabilistic models where we cannot directly sample the so-called next state probability distribution, such as the conditional probability distribution used by Gibbs Sampling.

Unlike the Gibbs chain, the algorithm does not assume that we can generate next-state samples from a particular target distribution.

— Page 517, Probabilistic Graphical Models: Principles and Techniques, 2009.

Instead, the Metropolis-Hastings algorithm involves using a surrogate or proposal probability distribution that is sampled (sometimes called the kernel), then an acceptance criterion that decides whether the new sample is accepted into the chain or discarded.

They are based on a Markov chain whose dependence on the predecessor is split into two parts: a proposal and an acceptance of the proposal. The proposals suggest an arbitrary next step in the trajectory of the chain and the acceptance makes sure the appropriate limiting direction is maintained by rejecting unwanted moves of the chain.

— Page 6, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.

The acceptance criterion is probabilistic and is based on how probable the proposed sample is under the target distribution relative to the current sample, corrected for any asymmetry in the proposal distribution.

The Metropolis-Hastings Algorithm is a more general and flexible Markov Chain Monte Carlo algorithm, subsuming many other methods.

For example, if the next-step conditional probability distribution is used as the proposal distribution, then the Metropolis-Hastings is generally equivalent to the Gibbs Sampling Algorithm. If a symmetric proposal distribution is used like a Gaussian, the algorithm is equivalent to another MCMC method called the Metropolis algorithm.
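The symmetric-proposal special case is short enough to sketch in full. The code below is a minimal random-walk Metropolis sampler, not the general Metropolis-Hastings algorithm: because the Gaussian proposal is symmetric, the acceptance probability reduces to a ratio of target densities. The target (a standard normal known only up to a constant), the step size, the starting point, and the burn-in length are all assumptions for illustration.

```python
import math
import random
import statistics

def unnormalized_target(x):
    """Target density known only up to a constant (here a standard normal)."""
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, burn_in, step_size, rng):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    x = 5.0  # deliberately poor starting point, far from the bulk of the target
    chain = []
    for step in range(n_samples + burn_in):
        proposal = x + rng.gauss(0.0, step_size)
        # Symmetric proposal, so the acceptance ratio needs only the target.
        ratio = unnormalized_target(proposal) / unnormalized_target(x)
        if rng.random() < min(1.0, ratio):
            x = proposal  # accept; on rejection the chain repeats the current state
        if step >= burn_in:
            chain.append(x)  # keep samples only after the burn-in phase
    return chain

chain = metropolis(n_samples=50_000, burn_in=2_000, step_size=1.0,
                   rng=random.Random(11))
chain_mean = statistics.fmean(chain)
chain_variance = statistics.fmean((x - chain_mean) ** 2 for x in chain)
print(round(chain_mean, 2), round(chain_variance, 2))
```

Note that a rejected proposal still contributes a sample (a repeat of the current state), and that the normalizing constant of the target cancels in the ratio, so it never needs to be known.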

This section provides more resources on the topic if you are looking to go deeper.

- Markov Chain Monte Carlo in Practice, 1996.
- Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, 2006.
- Handbook of Markov Chain Monte Carlo, 2011.
- Probabilistic Graphical Models: Principles and Techniques, 2009.

- Chapter 24 Markov chain Monte Carlo (MCMC) inference, Machine Learning: A Probabilistic Perspective, 2012.
- Section 11.2. Markov Chain Monte Carlo, Pattern Recognition and Machine Learning, 2006.
- Section 17.3 Markov Chain Monte Carlo Methods, Deep Learning, 2016.

- Monte Carlo method, Wikipedia.
- Markov chain, Wikipedia.
- Markov chain Monte Carlo, Wikipedia.
- Gibbs sampling, Wikipedia.
- Metropolis–Hastings algorithm, Wikipedia.
- MCMC sampling for dummies, 2015.
- How would you explain Markov Chain Monte Carlo (MCMC) to a layperson?

In this post, you discovered a gentle introduction to Markov Chain Monte Carlo for machine learning.

Specifically, you learned:

- Monte Carlo sampling is not effective and may be intractable for high-dimensional probabilistic models.
- Markov Chain Monte Carlo provides an alternate approach to random sampling a high-dimensional probability distribution where the next sample is dependent upon the current sample.
- Gibbs Sampling and the more general Metropolis-Hastings algorithm are the two most common approaches to Markov Chain Monte Carlo sampling.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Monte Carlo Sampling for Probability appeared first on Machine Learning Mastery.

Monte Carlo methods are a class of techniques for randomly sampling a probability distribution.

There are many problem domains where describing or estimating the probability distribution is relatively straightforward, but calculating a desired quantity is intractable. This may be due to many reasons, such as the stochastic nature of the domain or an exponential number of random variables.

Instead, a desired quantity can be approximated by using random sampling, referred to as Monte Carlo methods. These methods were initially used around the time that the first computers were created and remain pervasive through all fields of science and engineering, including artificial intelligence and machine learning.

In this post, you will discover Monte Carlo methods for sampling probability distributions.

After reading this post, you will know:

- Often, we cannot calculate a desired quantity in probability, but we can define the probability distributions for the random variables directly or indirectly.
- Monte Carlo sampling is a class of methods for randomly sampling from a probability distribution.
- Monte Carlo sampling provides the foundation for many machine learning methods such as resampling, hyperparameter tuning, and ensemble learning.

Let’s get started.

This tutorial is divided into three parts; they are:

- Need for Sampling
- What Are Monte Carlo Methods?
- Examples of Monte Carlo Methods

There are many problems in probability, and more broadly in machine learning, where we cannot calculate an analytical solution directly.

In fact, there may be an argument that exact inference may be intractable for most practical probabilistic models.

For most probabilistic models of practical interest, exact inference is intractable, and so we have to resort to some form of approximation.

— Page 523, Pattern Recognition and Machine Learning, 2006.

The desired calculation is typically a sum of a discrete distribution or integral of a continuous distribution and is intractable to calculate. The calculation may be intractable for many reasons, such as the large number of random variables, the stochastic nature of the domain, noise in the observations, the lack of observations, and more.

In problems of this kind, it is often possible to define or estimate the probability distributions for the random variables involved, either directly or indirectly via a computational simulation.

Instead of calculating the quantity directly, sampling can be used.

Sampling provides a flexible way to approximate many sums and integrals at reduced cost.

— Page 590, Deep Learning, 2016.

Samples can be drawn randomly from the probability distribution and used to approximate the desired quantity.

This general class of techniques for random sampling from a probability distribution is referred to as Monte Carlo methods.


Monte Carlo methods, or MC for short, are a class of techniques for randomly sampling a probability distribution.

There are three main reasons to use Monte Carlo methods to randomly sample a probability distribution; they are:

- **Estimate density**, gather samples to approximate the distribution of a target function.
- **Approximate a quantity**, such as the mean or variance of a distribution.
- **Optimize a function**, locate a sample that maximizes or minimizes the target function.

Monte Carlo methods are named for the casino in Monaco and were first developed to solve problems in particle physics around the time of the development of the first computers and the Manhattan Project, which developed the first atomic bomb.

This is called a Monte Carlo approximation, named after a city in Europe known for its plush gambling casinos. Monte Carlo techniques were first developed in the area of statistical physics – in particular, during development of the atomic bomb – but are now widely used in statistics and machine learning as well.

— Page 52, Machine Learning: A Probabilistic Perspective, 2012.

Drawing a sample may be as simple as calculating the probability for a randomly selected event, or may be as complex as running a computational simulation, with the latter often referred to as a Monte Carlo simulation.

Multiple samples are collected and used to approximate the desired quantity.

Given the law of large numbers from statistics, the more random trials that are performed, the more accurate the approximated quantity will become.

… the law of large numbers states that if the samples x(i) are i.i.d., then the average converges almost surely to the expected value

— Page 591, Deep Learning, 2016.

As such, the number of samples provides control over the precision of the quantity that is being approximated, often limited by the computational complexity of drawing a sample.

By generating enough samples, we can achieve any desired level of accuracy we like. The main issue is: how do we efficiently generate samples from a probability distribution, particularly in high dimensions?

— Page 815, Machine Learning: A Probabilistic Perspective, 2012.

Additionally, given the central limit theorem, the distribution of the sample average will approach a Normal distribution as the number of samples grows. The mean of that distribution can be taken as the approximated quantity, and its variance can be used to provide a confidence interval for the quantity.

The central limit theorem tells us that the distribution of the average […], converges to a normal distribution […] This allows us to estimate confidence intervals around the estimate […], using the cumulative distribution of the normal density.

— Page 592, Deep Learning, 2016.
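As a rough sketch of how the law of large numbers and the central limit theorem are used together, we can estimate the mean of a distribution from samples and attach an approximate 95% confidence interval. The Gaussian with a mean of 50 and a standard deviation of 5, the sample size, and the seed are arbitrary choices made for illustration.

```python
# sketch: monte carlo estimate of a mean with an approximate confidence interval
import numpy as np

rng = np.random.default_rng(1)  # fixed seed (assumption) so the run is repeatable
mu, sigma = 50, 5               # illustrative target distribution
n = 10000

# draw samples and average them (law of large numbers)
sample = rng.normal(mu, sigma, n)
estimate = sample.mean()
# standard error of the average (central limit theorem)
std_error = sample.std(ddof=1) / np.sqrt(n)
# approximate 95% interval using the normal quantile 1.96
lower, upper = estimate - 1.96 * std_error, estimate + 1.96 * std_error
print('estimate=%.3f, 95%% interval=[%.3f, %.3f]' % (estimate, lower, upper))
```

Increasing *n* should shrink the interval in proportion to the square root of the number of samples.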

Monte Carlo methods are defined in terms of the way that samples are drawn or the constraints imposed on the sampling process.

Some examples of Monte Carlo sampling methods include: direct sampling, importance sampling, and rejection sampling.

- **Direct Sampling**. Sampling the distribution directly without prior information.
- **Importance Sampling**. Sampling from a simpler approximation of the target distribution.
- **Rejection Sampling**. Sampling from a broader distribution and only considering samples within a region of the sampled distribution.
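The three approaches can be sketched side by side on a toy problem. Here the target is a standard Gaussian and the quantity of interest is the tail probability P(X > 3), which is roughly 0.00135. The proposal distributions, the helper name *normal_pdf*, and the seed are assumptions chosen for illustration, not a recommended recipe.

```python
# sketch: direct, importance, and rejection sampling on a toy problem
import numpy as np

rng = np.random.default_rng(7)  # fixed seed (assumption)
n = 100000

def normal_pdf(x, mean=0.0, std=1.0):
    # density of a Gaussian distribution
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# direct sampling: draw from the target and count how often the event occurs
x = rng.normal(0.0, 1.0, n)
direct = np.mean(x > 3)

# importance sampling: draw from a proposal N(3, 1) that covers the tail,
# then reweight each draw by target density / proposal density
y = rng.normal(3.0, 1.0, n)
weights = normal_pdf(y) / normal_pdf(y, mean=3.0)
importance = np.mean((y > 3) * weights)

# rejection sampling: draw from a broader Uniform(-5, 5) proposal and accept
# each draw with probability target(x) / (M * proposal(x))
u = rng.uniform(-5.0, 5.0, n)
proposal_density = 1.0 / 10.0
M = normal_pdf(0.0) / proposal_density  # ensures target <= M * proposal
accept = rng.uniform(0.0, 1.0, n) < normal_pdf(u) / (M * proposal_density)
accepted = u[accept]

print('direct=%.5f importance=%.5f' % (direct, importance))
print('accepted %d of %d draws, mean=%.3f std=%.3f' % (len(accepted), n, accepted.mean(), accepted.std()))
```

For a rare event like this one, the importance sampling estimate is typically far less noisy than the direct estimate for the same number of draws, because every draw from the proposal lands near the region of interest.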

It’s a huge topic with many books dedicated to it. Next, let’s make the idea of Monte Carlo sampling concrete with some familiar examples.

We use Monte Carlo methods all the time without thinking about it.

For example, when we define a Bernoulli distribution for a coin flip and simulate flipping a coin by sampling from this distribution, we are performing a Monte Carlo simulation. Additionally, when we sample from a uniform distribution for the integers {1,2,3,4,5,6} to simulate the roll of a die, we are performing a Monte Carlo simulation.

We are also using the Monte Carlo method when we gather a random sample of data from the domain and estimate the probability distribution of the data using a histogram or density estimation method.
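A minimal sketch of these familiar simulations is listed below. It flips a fair coin and rolls two dice many times, then estimates the probability of heads and the probability that two dice sum to 7 (exactly 1/6). The number of trials and the seed are arbitrary assumptions.

```python
# sketch: simulating coin flips and dice rolls
import numpy as np

rng = np.random.default_rng(3)  # fixed seed (assumption)
n_trials = 100000

# coin flip: a Bernoulli(0.5) draw is a binomial draw with a single trial
flips = rng.binomial(1, 0.5, n_trials)
# two dice: uniform integers 1..6 (the upper bound of integers() is exclusive)
rolls = rng.integers(1, 7, size=(n_trials, 2))

p_heads = flips.mean()
p_seven = np.mean(rolls.sum(axis=1) == 7)
print('P(heads)=%.3f, P(sum of two dice is 7)=%.3f' % (p_heads, p_seven))
```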

There are many examples of the use of Monte Carlo methods across a range of scientific disciplines.

For example, Monte Carlo methods can be used for:

- Calculating the probability of a move by an opponent in a complex game.
- Calculating the probability of a weather event in the future.
- Calculating the probability of a vehicle crash under specific conditions.

The methods are used to address difficult inference problems in applied probability, such as sampling from probabilistic graphical models.

Related is the idea of sequential Monte Carlo methods used in Bayesian models that are often referred to as particle filters.

Particle filtering (PF) is a Monte Carlo, or simulation based, algorithm for recursive Bayesian inference.

— Page 823, Machine Learning: A Probabilistic Perspective, 2012.

Monte Carlo methods are also pervasive in artificial intelligence and machine learning.

Many important technologies used to accomplish machine learning goals are based on drawing samples from some probability distribution and using these samples to form a Monte Carlo estimate of some desired quantity.

— Page 590, Deep Learning, 2016.

They provide the basis for estimating the likelihood of outcomes in artificial intelligence problems via simulation, such as robotics. More simply, Monte Carlo methods are used to solve intractable integration problems, such as firing random rays in path tracing for computer graphics when rendering a computer-generated scene.
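To give a feel for Monte Carlo integration, the sketch below estimates the integral of exp(-x^2) between 0 and 1, which has no elementary closed form and is approximately 0.7468. The integrand is an arbitrary choice for illustration and has nothing to do with rendering; the idea of averaging the function at uniformly random points is the same.

```python
# sketch: monte carlo integration of exp(-x^2) on the interval [0, 1]
import numpy as np

rng = np.random.default_rng(5)  # fixed seed (assumption)
n = 200000

# sample points uniformly over the interval
x = rng.uniform(0.0, 1.0, n)
# average the integrand and scale by the interval width (1 - 0)
integral = np.mean(np.exp(-x ** 2)) * (1.0 - 0.0)
print('estimated integral: %.4f' % integral)
```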

In machine learning, Monte Carlo methods provide the basis for resampling techniques like the bootstrap method for estimating a quantity, such as the accuracy of a model on a limited dataset.

The bootstrap is a simple Monte Carlo technique to approximate the sampling distribution. This is particularly useful in cases where the estimator is a complex function of the true parameters.

— Page 192, Machine Learning: A Probabilistic Perspective, 2012.
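As a small illustration, the sketch below uses the bootstrap to approximate the sampling distribution of a mean. The dataset is a synthetic stand-in drawn from a Gaussian, and the number of bootstrap rounds is an arbitrary assumption. The same resampling loop could wrap a model fit and an accuracy calculation instead of a mean.

```python
# sketch: bootstrap estimate of the sampling distribution of a mean
import numpy as np

rng = np.random.default_rng(11)  # fixed seed (assumption)
data = rng.normal(50, 5, 100)    # stand-in for a small observed dataset
n_boot = 2000

# resample the dataset with replacement and record the statistic each time
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])
# percentile interval for the statistic
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print('mean=%.3f, 95%% bootstrap interval=[%.3f, %.3f]' % (data.mean(), lower, upper))
```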

Random sampling of model hyperparameters when tuning a model is a Monte Carlo method, as are ensemble models used to overcome challenges such as the limited size and noise in a small data sample and the stochastic variance in a learning algorithm.

- Resampling algorithms.
- Random hyperparameter tuning.
- Ensemble learning algorithms.

Monte Carlo methods also provide the basis for randomized or stochastic optimization algorithms, such as the popular Simulated Annealing optimization technique.

Monte Carlo algorithms, of which simulated annealing is an example, are used in many branches of science to estimate quantities that are difficult to calculate exactly.

— Page 530, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

- Stochastic optimization algorithms.
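A bare-bones sketch of simulated annealing on a toy one-dimensional function is listed below. The objective, step size, cooling schedule, and starting point are all hypothetical choices for illustration; practical implementations tune these carefully.

```python
# sketch: simulated annealing on a toy objective function
import math
import random

random.seed(42)  # fixed seed (assumption)

def objective(x):
    # toy function with its minimum at x = 2 (hypothetical)
    return (x - 2.0) ** 2

def simulated_annealing(objective, start, n_iter=2000, step=0.5, temp0=10.0):
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    for i in range(n_iter):
        # propose a random move near the current point
        candidate = current + random.gauss(0.0, step)
        candidate_val = objective(candidate)
        # simple cooling schedule: the temperature decays with the iteration
        temp = temp0 / (i + 1)
        diff = candidate_val - current_val
        # always accept improvements; sometimes accept worse moves
        if diff < 0 or random.random() < math.exp(-diff / temp):
            current, current_val = candidate, candidate_val
            if current_val < best_val:
                best, best_val = current, current_val
    return best, best_val

best, best_val = simulated_annealing(objective, start=-8.0)
print('best x=%.3f, objective=%.5f' % (best, best_val))
```

The random acceptance of worse moves early in the search is what makes this a Monte Carlo method: it lets the search escape local optima before the temperature drops.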

We can make Monte Carlo sampling concrete with a worked example.

In this case, we will have a function that defines the probability distribution of a random variable. We will use a Gaussian distribution with a mean of 50 and a standard deviation of 5 and draw random samples from this distribution.

Let’s pretend we don’t know the form of the probability distribution for this random variable and we want to sample the function to get an idea of the probability density. We can draw a sample of a given size and plot a histogram to estimate the density.

The normal() NumPy function can be used to randomly draw samples from a Gaussian distribution with the specified mean (*mu*), standard deviation (*sigma*), and sample size.

To make the example more interesting, we will repeat this experiment four times with different sized samples. We would expect that as the size of the sample is increased, the probability density will better approximate the true density of the target function, given the law of large numbers.

The complete example is listed below.

    # example of effect of size on monte carlo sample
    from numpy.random import normal
    from matplotlib import pyplot
    # define the distribution
    mu = 50
    sigma = 5
    # generate monte carlo samples of differing size
    sizes = [10, 50, 100, 1000]
    for i in range(len(sizes)):
        # generate sample
        sample = normal(mu, sigma, sizes[i])
        # plot histogram of sample
        pyplot.subplot(2, 2, i+1)
        pyplot.hist(sample, bins=20)
        pyplot.title('%d samples' % sizes[i])
        pyplot.xticks([])
    # show the plot
    pyplot.show()

Running the example creates four differently sized samples and plots a histogram for each.

We can see that the small sample sizes of 10 and 50 do not effectively capture the density of the target function. We can see that 100 samples is better, but it is not until 1,000 samples that we clearly see the familiar bell-shape of the Gaussian probability distribution.

This highlights the need to draw many samples, even for a simple random variable, and the benefit of increased accuracy of the approximation with the number of samples drawn.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 29 Monte Carlo Methods, Information Theory, Inference and Learning Algorithms, 2003.
- Chapter 27 Sampling, Bayesian Reasoning and Machine Learning, 2011.
- Section 14.5 Approximate Inference In Bayesian Networks, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
- Chapter 23 Monte Carlo inference, Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 11 Sampling Methods, Pattern Recognition and Machine Learning, 2006.
- Chapter 17 Monte Carlo Methods, Deep Learning, 2016.

- Sampling (statistics), Wikipedia.
- Monte Carlo method, Wikipedia.
- Monte Carlo integration, Wikipedia.
- Importance sampling, Wikipedia.
- Rejection sampling, Wikipedia.

In this post, you discovered Monte Carlo methods for sampling probability distributions.

Specifically, you learned:

- Often, we cannot calculate a desired quantity in probability, but we can define the probability distributions for the random variables directly or indirectly.
- Monte Carlo sampling is a class of methods for randomly sampling from a probability distribution.
- Monte Carlo sampling provides the foundation for many machine learning methods such as resampling, hyperparameter tuning, and ensemble learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Monte Carlo Sampling for Probability appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Expectation-Maximization (EM Algorithm) appeared first on Machine Learning Mastery.

Maximum likelihood estimation is an approach to density estimation for a dataset by searching across probability distributions and their parameters.

It is a general and effective approach that underlies many machine learning algorithms, although it requires that the training dataset is complete, e.g. all relevant interacting random variables are present. Maximum likelihood becomes intractable if there are variables that interact with those in the dataset but were hidden or not observed, so-called latent variables.

The expectation-maximization algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values for the latent variables, then optimizing the model, then repeating these two steps until convergence. It is an effective and general approach and is most commonly used for density estimation with missing data, such as clustering algorithms like the Gaussian Mixture Model.

In this post, you will discover the expectation-maximization algorithm.

After reading this post, you will know:

- Maximum likelihood estimation is challenging on data in the presence of latent variables.
- Expectation maximization provides an iterative solution to maximum likelihood estimation with latent variables.
- Gaussian mixture models are an approach to density estimation where the parameters of the distributions are fit using the expectation-maximization algorithm.

Let’s get started.

**Update Nov/2019**: Fixed typo in code comment (thanks Daniel)

This tutorial is divided into four parts; they are:

- Problem of Latent Variables for Maximum Likelihood
- Expectation-Maximization Algorithm
- Gaussian Mixture Model and the EM Algorithm
- Example of Gaussian Mixture Model

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data.

There are many techniques for solving this problem, although a common approach is called maximum likelihood estimation, or simply “*maximum likelihood*.”

Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample.

A limitation of maximum likelihood estimation is that it assumes that the dataset is complete, or fully observed. This does not mean that the model has access to all data; instead, it assumes that all variables that are relevant to the problem are present.

This is not always the case. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden.

More generally, these unobserved or hidden variables are referred to as latent variables.

Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning.

— Page 816, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

Conventional maximum likelihood estimation does not work well in the presence of latent variables.

… if we have missing data and/or latent variables, then computing the [maximum likelihood] estimate becomes hard.

— Page 349, Machine Learning: A Probabilistic Perspective, 2012.

Instead, an alternate formulation of maximum likelihood is required for searching for the appropriate model parameters in the presence of latent variables.

The Expectation-Maximization algorithm is one such approach.


The Expectation-Maximization Algorithm, or EM algorithm for short, is an approach for maximum likelihood estimation in the presence of latent variables.

A general technique for finding maximum likelihood estimators in latent variable models is the expectation-maximization (EM) algorithm.

— Page 424, Pattern Recognition and Machine Learning, 2006.

The EM algorithm is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables, called the expectation-step or E-step. The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization-step or M-step.

- **E-Step**. Estimate the missing variables in the dataset.
- **M-Step**. Maximize the parameters of the model in the presence of the data.

The EM algorithm can be applied quite widely, although it is perhaps best known in machine learning for its use in unsupervised learning problems, such as density estimation and clustering.

Perhaps the most discussed application of the EM algorithm is for clustering with a mixture model.

A mixture model is a model comprised of an unspecified combination of multiple probability distribution functions.

A statistical procedure or learning algorithm is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset.

The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each.

There are many techniques for estimating the parameters for a GMM, although a maximum likelihood estimate is perhaps the most common.

Consider the case where a dataset is comprised of many points that happen to be generated by two different processes. The points for each process have a Gaussian probability distribution, but the data is combined and the distributions are similar enough that it is not obvious to which distribution a given point may belong.

The process used to generate each data point represents a latent variable, e.g. process 0 or process 1. It influences the data but is not observable. As such, the EM algorithm is an appropriate approach to use to estimate the parameters of the distributions.

In the EM algorithm, the expectation-step would estimate a value for the process latent variable for each data point, and the maximization-step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data. The process is repeated until a good set of latent values is found and a maximum likelihood fit to the data is achieved.

- **E-Step**. Estimate the expected value for each latent variable.
- **M-Step**. Optimize the parameters of the distribution using maximum likelihood.

We can imagine how this optimization procedure could be constrained to just the distribution means, or generalized to a mixture of many different Gaussian distributions.
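To make the two steps less abstract, the loop can be sketched by hand for a one-dimensional mixture of two Gaussians. This is a simplified illustration, not how any particular library implements it: the initial guesses, the fixed number of iterations, and the helper name *gaussian_pdf* are assumptions, and a real implementation would check for convergence and guard against numerical problems.

```python
# sketch: hand-written EM loop for a two-component 1D gaussian mixture
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption)
# synthetic data from two processes with means 20 and 40
X = np.hstack((rng.normal(20, 5, 3000), rng.normal(40, 5, 7000)))

def gaussian_pdf(x, mean, std):
    # density of a Gaussian distribution
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# crude initial guesses: means at the data extremes, shared spread, equal weights
means = np.array([X.min(), X.max()])
stds = np.array([X.std(), X.std()])
weights = np.array([0.5, 0.5])

for _ in range(200):
    # E-step: probability that each point came from each component
    dens = weights * gaussian_pdf(X[:, None], means, stds)  # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters using those probabilities as weights
    nk = resp.sum(axis=0)
    weights = nk / len(X)
    means = (resp * X[:, None]).sum(axis=0) / nk
    stds = np.sqrt((resp * (X[:, None] - means) ** 2).sum(axis=0) / nk)

print('weights:', weights)
print('means:', means)
print('stds:', stds)
```

Note that the E-step here computes a soft assignment (a probability for each process) rather than a hard label, which is the usual formulation for mixture models.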

We can make the application of the EM algorithm to a Gaussian Mixture Model concrete with a worked example.

First, let’s contrive a problem where we have a dataset where points are generated from one of two Gaussian processes. The points are one-dimensional, the mean of the first distribution is 20, the mean of the second distribution is 40, and both distributions have a standard deviation of 5.

We will draw 3,000 points from the first process and 7,000 points from the second process and mix them together.

    ...
    # generate a sample
    X1 = normal(loc=20, scale=5, size=3000)
    X2 = normal(loc=40, scale=5, size=7000)
    X = hstack((X1, X2))

We can then plot a histogram of the points to give an intuition for the dataset. We expect to see a bimodal distribution with a peak for each of the means of the two distributions.

The complete example is listed below.

    # example of a bimodal constructed from two gaussian processes
    from numpy import hstack
    from numpy.random import normal
    from matplotlib import pyplot
    # generate a sample
    X1 = normal(loc=20, scale=5, size=3000)
    X2 = normal(loc=40, scale=5, size=7000)
    X = hstack((X1, X2))
    # plot the histogram
    pyplot.hist(X, bins=50, density=True)
    pyplot.show()

Running the example creates the dataset and then creates a histogram plot for the data points.

The plot clearly shows the expected bimodal distribution with a peak for the first process around 20 and a peak for the second process around 40.

We can see that for many of the points between the two peaks, it is ambiguous which distribution they were drawn from.

We can model the problem of estimating the density of this dataset using a Gaussian Mixture Model.

The GaussianMixture scikit-learn class can be used to model this problem and estimate the parameters of the distributions using the expectation-maximization algorithm.

The class allows us to specify the suspected number of underlying processes used to generate the data via the *n_components* argument when defining the model. We will set this to 2 for the two processes or distributions.

If the number of processes was not known, a range of different numbers of components could be tested and the model with the best fit could be chosen, where models could be evaluated using scores such as Akaike or Bayesian Information Criterion (AIC or BIC).
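A sketch of that search is listed below, using the *aic()* and *bic()* methods of the scikit-learn GaussianMixture class. The range of candidate component counts, the synthetic data, and the fixed *random_state* are assumptions for illustration.

```python
# sketch: choosing the number of mixture components with AIC and BIC
from numpy import hstack
from numpy.random import default_rng
from sklearn.mixture import GaussianMixture

rng = default_rng(1)  # fixed seed (assumption)
# synthetic data from two processes, as a table with one column
X = hstack((rng.normal(20, 5, 3000), rng.normal(40, 5, 7000))).reshape(-1, 1)

scores = {}
for k in range(1, 5):
    # fit a mixture with k components and record both criteria
    model = GaussianMixture(n_components=k, random_state=1).fit(X)
    scores[k] = (model.aic(X), model.bic(X))
    print('k=%d AIC=%.1f BIC=%.1f' % (k, scores[k][0], scores[k][1]))

# lower is better for both criteria; here we select on BIC
best_k = min(scores, key=lambda k: scores[k][1])
print('selected number of components: %d' % best_k)
```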

There are also many ways we can configure the model to incorporate other information we may know about the data, such as how to estimate initial values for the distributions. In this case, we will randomly guess the initial parameters, by setting the *init_params* argument to ‘random’.

    ...
    # fit model
    model = GaussianMixture(n_components=2, init_params='random')
    model.fit(X)

Once the model is fit, we can access the learned parameters via attributes on the model, such as the means, covariances, mixing weights, and more.

More usefully, we can use the fit model to estimate the latent variables for existing and new data points.

For example, we can estimate the latent variable for the points in the training dataset and we would expect the first 3,000 points to belong to one process (e.g. *value=1*) and the next 7,000 data points to belong to a different process (e.g. *value=0*).

    ...
    # predict latent values
    yhat = model.predict(X)
    # check latent value for first few points
    print(yhat[:100])
    # check latent value for last few points
    print(yhat[-100:])

Tying all of this together, the complete example is listed below.

    # example of fitting a gaussian mixture model with expectation maximization
    from numpy import hstack
    from numpy.random import normal
    from sklearn.mixture import GaussianMixture
    # generate a sample
    X1 = normal(loc=20, scale=5, size=3000)
    X2 = normal(loc=40, scale=5, size=7000)
    X = hstack((X1, X2))
    # reshape into a table with one column
    X = X.reshape((len(X), 1))
    # fit model
    model = GaussianMixture(n_components=2, init_params='random')
    model.fit(X)
    # predict latent values
    yhat = model.predict(X)
    # check latent value for first few points
    print(yhat[:100])
    # check latent value for last few points
    print(yhat[-100:])

Running the example fits the Gaussian mixture model on the prepared dataset using the EM algorithm. Once fit, the model is used to predict the latent variable values for the examples in the training dataset.

Your specific results may vary given the stochastic nature of the learning algorithm.

In this case, we can see that, at least for the first few and last few examples in the dataset, the model mostly predicts the correct value for the latent variable. It is a generally challenging problem, and it is expected that the points between the peaks of the distribution will remain ambiguous and be assigned to whichever process the model considers more likely.

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] [0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

This section provides more resources on the topic if you are looking to go deeper.

- Section 8.5 The EM Algorithm, The Elements of Statistical Learning, 2016.
- Chapter 9 Mixture Models and EM, Pattern Recognition and Machine Learning, 2006.
- Section 6.12 The EM Algorithm, Machine Learning, 1997.
- Chapter 11 Mixture models and the EM algorithm, Machine Learning: A Probabilistic Perspective, 2012.
- Section 9.3 Clustering And Probability Density Estimation, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Section 20.3 Learning With Hidden Variables: The EM Algorithm, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.

- Maximum likelihood estimation, Wikipedia.
- Expectation-maximization algorithm, Wikipedia.
- Mixture model, Wikipedia.

In this post, you discovered the expectation-maximization algorithm.

Specifically, you learned:

- Maximum likelihood estimation is challenging on data in the presence of latent variables.
- Expectation maximization provides an iterative solution to maximum likelihood estimation with latent variables.
- Gaussian mixture models are an approach to density estimation where the parameters of the distributions are fit using the expectation-maximization algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Expectation-Maximization (EM Algorithm) appeared first on Machine Learning Mastery.

The post Probabilistic Model Selection with AIC, BIC, and MDL appeared first on Machine Learning Mastery.

Model selection is the problem of choosing one from among a set of candidate models.

It is common to choose a model that performs the best on a hold-out test dataset or to estimate model performance using a resampling technique, such as k-fold cross-validation.

An alternative approach to model selection involves using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model. Examples include the Akaike and Bayesian Information Criterion and the Minimum Description Length.

The benefit of these information criterion statistics is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.

In this post, you will discover probabilistic statistics for machine learning model selection.

After reading this post, you will know:

- Model selection is the challenge of choosing one among a set of candidate models.
- Akaike and Bayesian Information Criterion are two ways of scoring a model based on its log-likelihood and complexity.
- Minimum Description Length provides another scoring method from information theory that can be shown to be equivalent to BIC.

Let’s get started.

This tutorial is divided into five parts; they are:

- The Challenge of Model Selection
- Probabilistic Model Selection
- Akaike Information Criterion
- Bayesian Information Criterion
- Minimum Description Length

Model selection is the process of fitting multiple models on a given dataset and choosing one over all others.

Model selection: estimating the performance of different models in order to choose the best one.

— Page 222, The Elements of Statistical Learning, 2016.

This may apply in unsupervised learning, e.g. choosing a clustering model, or supervised learning, e.g. choosing a predictive model for a regression or classification task. It may also be a sub-task of modeling, such as feature selection for a given model.

There are many common approaches that may be used for model selection. For example, in the case of supervised learning, the three most common approaches are:

- Train, Validation, and Test datasets.
- Resampling Methods.
- Probabilistic Statistics.

The simplest reliable method of model selection involves fitting candidate models on a training set, tuning them on the validation dataset, and selecting a model that performs the best on the test dataset according to a chosen metric, such as accuracy or error. A problem with this approach is that it requires a lot of data.

Resampling techniques attempt to achieve the same as the train/val/test approach to model selection, although using a small dataset. An example is k-fold cross-validation where a training set is split into many train/test pairs and a model is fit and evaluated on each. This is repeated for each model and a model is selected with the best average score across the k-folds. A problem with this and the prior approach is that only model performance is assessed, regardless of model complexity.

A third approach to model selection attempts to combine the complexity of the model with the performance of the model into a score, then select the model that minimizes or maximizes the score.

We can refer to this approach as statistical or probabilistic model selection as the scoring method uses a probabilistic framework.


Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.

Models are scored both on their performance on the training dataset and based on the complexity of the model.

- **Model Performance**. How well a candidate model has performed on the training dataset.
- **Model Complexity**. How complicated the trained candidate model is after training.

Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation. Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

— Page 33, Pattern Recognition and Machine Learning, 2006.

A benefit of probabilistic model selection methods is that a test dataset is not required, meaning that all of the data can be used to fit the model, and the final model that will be used for prediction in the domain can be scored directly.

A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model.

It should be noted that the AIC statistic is designed for preplanned comparisons between models (as opposed to comparisons of many models during automated searches).

— Page 493, Applied Predictive Modeling, 2013.

A further limitation of these selection methods is that they do not take the uncertainty of the model into account.

Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favour overly simple models.

— Page 33, Pattern Recognition and Machine Learning, 2006.

There are three statistical approaches to estimating how well a given model fits a dataset and how complex the model is. And each can be shown to be equivalent or proportional to each other, although each was derived from a different framing or field of study.

They are:

- Akaike Information Criterion (AIC). Derived from frequentist probability.
- Bayesian Information Criterion (BIC). Derived from Bayesian probability.
- Minimum Description Length (MDL). Derived from information theory.

Each statistic can be calculated using the log-likelihood for a model and the data. Log-likelihood comes from Maximum Likelihood Estimation, a technique for finding or optimizing the parameters of a model in response to a training dataset.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

- P(X ; theta)

Where *X* is the set of all observations from the problem domain from 1 to n, so this is, in fact, their joint probability distribution.

- P(x1, x2, x3, …, xn ; theta)

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the natural log conditional probability.

- sum for i = 1 to n of log(P(xi ; theta))

Given the frequent use of log in the likelihood function, it is commonly referred to as a log-likelihood function.
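The instability of multiplying many small probabilities is easy to demonstrate. In the sketch below, the product of 2,000 Gaussian densities underflows to zero in floating point, whereas the sum of the log densities remains a usable finite number. The distribution, sample size, and helper name *gaussian_pdf* are assumptions for illustration.

```python
# sketch: product of probabilities versus sum of log probabilities
import numpy as np

rng = np.random.default_rng(2)  # fixed seed (assumption)
x = rng.normal(50, 5, 2000)

def gaussian_pdf(x, mean, std):
    # density of a Gaussian distribution
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# density of each observation under the candidate parameters
p = gaussian_pdf(x, 50, 5)
# the direct product is too small to represent as a float
likelihood = np.prod(p)
# the sum of logs carries the same information without underflow
log_likelihood = np.sum(np.log(p))
print('product of probabilities: %s' % likelihood)
print('sum of log probabilities: %.3f' % log_likelihood)
```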

Common loss functions for predictive modeling can be derived from the log-likelihood. Minimizing the mean squared error for regression (e.g. linear regression, assuming Gaussian noise) and minimizing log loss (binary cross-entropy) for binary classification (e.g. logistic regression) both correspond to maximizing the log-likelihood.

We will take a closer look at each of the three statistics, AIC, BIC, and MDL, in the following sections.

The Akaike Information Criterion, or AIC for short, is a method for scoring and selecting a model.

It is named for the developer of the method, Hirotugu Akaike, and may be shown to have a basis in information theory and frequentist-based inference.

This is derived from a frequentist framework, and cannot be interpreted as an approximation to the marginal likelihood.

— Page 162, Machine Learning: A Probabilistic Perspective, 2012.

The AIC statistic is defined for logistic regression as follows (taken from “The Elements of Statistical Learning“):

- AIC = -2/N * LL + 2 * k/N

Where *N* is the number of examples in the training dataset, *LL* is the log-likelihood of the model on the training dataset, and *k* is the number of parameters in the model.

The score, as defined above, is minimized, e.g. the model with the lowest AIC is selected.

To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered.

— Page 231, The Elements of Statistical Learning, 2016.
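As a rough illustration of the formula above, the sketch below scores two candidate models. The function name *aic()* and the log-likelihood values are assumptions made up for the example; in practice the log-likelihood would come from your fitted model.

```python
# sketch of the AIC formula quoted above: AIC = -2/N * LL + 2 * k/N
def aic(log_likelihood, n_examples, n_params):
    return -2.0 / n_examples * log_likelihood + 2.0 * n_params / n_examples

# hypothetical log-likelihoods for two candidate models fit on the same 100 examples
simple_model = aic(log_likelihood=-60.0, n_examples=100, n_params=3)
complex_model = aic(log_likelihood=-58.5, n_examples=100, n_params=10)

print('AIC simple model: %.3f' % simple_model)
print('AIC complex model: %.3f' % complex_model)
```

With these made-up numbers, the complex model fits slightly better (higher log-likelihood) but its extra parameters cost more than the fit gains, so the simple model has the lower AIC and would be selected.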

Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.

We see that the penalty for AIC is less than for BIC. This causes AIC to pick more complex models.

— Page 162, Machine Learning: A Probabilistic Perspective, 2012.

The Bayesian Information Criterion, or BIC for short, is a method for scoring and selecting a model.

It is named for the field of study from which it was derived: Bayesian probability and inference. Like AIC, it is appropriate for models fit under the maximum likelihood estimation framework.

The BIC statistic is calculated for logistic regression as follows (taken from “The Elements of Statistical Learning“):

- BIC = -2 * LL + log(N) * k

Where *log()* is the natural logarithm (base e), *LL* is the log-likelihood of the model, *N* is the number of examples in the training dataset, and *k* is the number of parameters in the model.

The score as defined above is minimized, e.g. the model with the lowest BIC is selected.

The quantity calculated is different from AIC, although it can be shown to be proportional to the AIC. Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected.

Note that, compared to AIC […], this penalizes model complexity more heavily.

— Page 217, Pattern Recognition and Machine Learning, 2006.
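The difference in penalty can be sketched directly. The helper names below (*bic()*, *aic_penalty()*, *bic_penalty()*) are hypothetical, and the AIC penalty is shown after multiplying the AIC formula through by N so that both penalties are on the same scale.

```python
# sketch: BIC as quoted above, plus a comparison of the two complexity penalties
from math import log

def bic(log_likelihood, n_examples, n_params):
    # BIC = -2 * LL + log(N) * k
    return -2.0 * log_likelihood + log(n_examples) * n_params

def aic_penalty(n_params):
    # penalty term of AIC once the formula is multiplied through by N
    return 2.0 * n_params

def bic_penalty(n_examples, n_params):
    # penalty term of BIC grows with the natural log of the dataset size
    return log(n_examples) * n_params

k = 3  # hypothetical number of parameters
for n in [5, 8, 100, 10000]:
    print('N=%d AIC penalty=%.3f BIC penalty=%.3f' % (n, aic_penalty(k), bic_penalty(n, k)))
```

Because log(N) exceeds 2 once N is larger than e^2 (about 7.39), the BIC penalty is the larger of the two for any dataset with 8 or more examples, and the gap widens as the dataset grows.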

Importantly, the derivation of BIC under the Bayesian probability framework means that if a selection of candidate models includes a true model for the dataset, then the probability that BIC will select the true model increases with the size of the training dataset. This cannot be said for the AIC score.

… given a family of models, including the true model, the probability that BIC will select the correct model approaches one as the sample size N -> infinity.

— Page 235, The Elements of Statistical Learning, 2016.

A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple.

The Minimum Description Length, or MDL for short, is a method for scoring and selecting a model.

It is named for the field of study from which it was derived, namely information theory.

Information theory is concerned with the representation and transmission of information on a noisy channel, and as such, measures quantities like entropy, which is the average number of bits required to represent an event from a random variable or probability distribution.
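As a small illustration of entropy measured in bits, here is a sketch using the standard library. The function name *entropy()* and the example distributions are chosen for illustration only.

```python
# sketch: entropy as the average number of bits needed to represent an event
from math import log2

def entropy(probs):
    # events with zero probability contribute nothing to the average
    return -sum(p * log2(p) for p in probs if p > 0)

# a fair coin and a heavily biased coin
fair = entropy([0.5, 0.5])
biased = entropy([0.9, 0.1])

print('Fair coin: %.3f bits' % fair)
print('Biased coin: %.3f bits' % biased)
```

The fair coin needs one full bit per outcome on average, while the biased coin needs less because its outcomes are more predictable.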

From an information theory perspective, we may want to transmit both the predictions (or more precisely, their probability distributions) and the model used to generate them. Both the predicted target variable and the model can be described in terms of the number of bits required to transmit them on a noisy channel.

The Minimum Description Length is the minimum number of bits, or the minimum of the sum of the number of bits required to represent the data and the model.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.

— Page 173, Machine Learning, 1997.

The MDL statistic is calculated as follows (taken from “Machine Learning“):

- MDL = L(h) + L(D | h)

Where *h* is the model, *D* is the predictions made by the model, *L(h)* is the number of bits required to represent the model, and *L(D | h)* is the number of bits required to represent the predictions from the model on the training dataset.

The score as defined above is minimized, e.g. the model with the lowest MDL is selected.

The number of bits required to encode (*D | h*) and the number of bits required to encode (*h*) can be calculated as the negative log-likelihood; for example (taken from “The Elements of Statistical Learning“):

- MDL = -log(P(theta)) – log(P(y | X, theta))

Or the negative log-likelihood of the model parameters (*theta*) and the negative log-likelihood of the target values (*y*) given the input values (*X*) and the model parameters (*theta*).
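A minimal sketch of this sum is shown below. It assumes a base-2 logarithm so that the result is in bits, and the probabilities assigned to each hypothesis are invented purely for illustration; the function name *description_length()* is also hypothetical.

```python
# sketch: description length as the sum of two negative log probabilities
from math import log2

def description_length(p_model, p_data_given_model):
    # L(h) + L(D | h), each measured as a negative log base-2 probability (bits)
    return -log2(p_model) - log2(p_data_given_model)

# hypothetical probabilities for a simple and a complex hypothesis
simple = description_length(p_model=0.5, p_data_given_model=0.125)
complex_ = description_length(p_model=0.0625, p_data_given_model=0.25)

print('Simple hypothesis: %.1f bits' % simple)
print('Complex hypothesis: %.1f bits' % complex_)
```

In this made-up case, the complex hypothesis explains the data better (fewer bits for the data) but costs more bits to describe, so the simple hypothesis has the shorter total description.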

This desire to minimize the encoding of the model and its predictions is related to the notion of Occam’s Razor that seeks the simplest (least complex) explanation: in this context, the least complex model that predicts the target variable.

The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory …

— Page 198, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations.

Hence the BIC criterion, derived as approximation to log-posterior probability, can also be viewed as a device for (approximate) model choice by minimum description length.

— Page 236, The Elements of Statistical Learning, 2016.

We can make the calculation of AIC and BIC concrete with a worked example.

In this section, we will use a test problem and fit a linear regression model, then evaluate the model using the AIC and BIC metrics.

Importantly, the specific functional form of AIC and BIC for a linear regression model has previously been derived, making the example relatively straightforward. In adapting these examples for your own algorithms, it is important to either find an appropriate derivation of the calculation for your model and prediction problem or look into deriving the calculation yourself.

In this example, we will use a test regression problem provided by the make_regression() scikit-learn function. The problem will have two input variables and require the prediction of a target numerical value.

```python
...
# generate dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
```

We will fit a LinearRegression() model on the entire dataset directly.

```python
...
# define and fit the model on all data
model = LinearRegression()
model.fit(X, y)
```

Once fit, we can report the number of parameters in the model, which, given the definition of the problem, we would expect to be three (two coefficients and one intercept).

```python
...
# number of parameters
num_params = len(model.coef_) + 1
print('Number of parameters: %d' % (num_params))
```

The likelihood function for a linear regression model can be shown to be identical to the least squares function; therefore, we can estimate the maximum likelihood of the model via the mean squared error metric.
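The link between the likelihood and the mean squared error can be checked numerically. The sketch below assumes Gaussian noise whose variance is set to its maximum likelihood estimate (the MSE itself); the targets and predictions are hypothetical values, and *gaussian_log_likelihood()* is a helper written for this example.

```python
# sketch: Gaussian log-likelihood of a regression model depends on the data only via the MSE
from math import log, pi

def gaussian_log_likelihood(y, yhat, variance):
    # sum of log normal densities of the residuals with the given variance
    n = len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return -0.5 * n * log(2.0 * pi * variance) - sse / (2.0 * variance)

# hypothetical targets and predictions
y = [1.0, 2.1, 2.9, 4.2, 5.1]
yhat = [1.1, 2.0, 3.0, 4.0, 5.0]
n = len(y)
mse = sum((a - b) ** 2 for a, b in zip(y, yhat)) / n

# -2 * LL evaluated at the maximum likelihood variance (the MSE)
minus_two_ll = -2.0 * gaussian_log_likelihood(y, yhat, variance=mse)
# the same quantity written as n * log(MSE) plus a constant that depends only on n
shortcut = n * log(mse) + n * (log(2.0 * pi) + 1.0)

print('-2 * LL: %.6f' % minus_two_ll)
print('n * log(MSE) + constant: %.6f' % shortcut)
```

The two printed values agree, which is why a smaller MSE always means a larger likelihood under this assumption, and why the MSE is enough to score the fit.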

First, the model can be used to estimate an outcome for each example in the training dataset, then the mean_squared_error() scikit-learn function can be used to calculate the mean squared error for the model.

```python
...
# predict the training set
yhat = model.predict(X)
# calculate the error
mse = mean_squared_error(y, yhat)
print('MSE: %.3f' % mse)
```

Tying this all together, the complete example of defining the dataset, fitting the model, and reporting the number of parameters and maximum likelihood estimate of the model is listed below.

```python
# generate a test dataset and fit a linear regression model
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# generate dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# define and fit the model on all data
model = LinearRegression()
model.fit(X, y)
# number of parameters
num_params = len(model.coef_) + 1
print('Number of parameters: %d' % (num_params))
# predict the training set
yhat = model.predict(X)
# calculate the error
mse = mean_squared_error(y, yhat)
print('MSE: %.3f' % mse)
```

Running the example first reports the number of parameters in the model as 3, as we expected, then reports the MSE as about 0.01.

Your specific MSE value may vary from run to run because make_regression() draws a new random dataset each time (no random_state is set); the least squares fit itself is deterministic.

```
Number of parameters: 3
MSE: 0.010
```

Next, we can adapt the example to calculate the AIC for the model.

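Skipping the derivation, the AIC calculation for an ordinary least squares linear regression model can be calculated as follows (taken from “A New Look at the Statistical Model Identification“, 1974):

- AIC = n * log(MSE) + 2 * k

Where *n* is the number of examples in the training dataset, *MSE* is the mean squared error of the model on the training dataset, *log()* is the natural logarithm, and *k* is the number of parameters in the model. The term n * log(MSE) takes the place of -2 * LL: for a linear regression model with Gaussian errors, -2 times the maximized log-likelihood equals n * log(MSE) plus a constant that is the same for every model fit to the same data, so the constant can be dropped when comparing models.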

The *calculate_aic()* function below implements this, taking *n*, the raw mean squared error (*mse*), and *k* as arguments.

```python
# calculate aic for regression
def calculate_aic(n, mse, num_params):
    aic = n * log(mse) + 2 * num_params
    return aic
```

The example can then be updated to make use of this new function and calculate the AIC for the model.

The complete example is listed below.

```python
# calculate akaike information criterion for a linear regression model
from math import log
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# calculate aic for regression
def calculate_aic(n, mse, num_params):
    aic = n * log(mse) + 2 * num_params
    return aic

# generate dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# define and fit the model on all data
model = LinearRegression()
model.fit(X, y)
# number of parameters
num_params = len(model.coef_) + 1
print('Number of parameters: %d' % (num_params))
# predict the training set
yhat = model.predict(X)
# calculate the error
mse = mean_squared_error(y, yhat)
print('MSE: %.3f' % mse)
# calculate the aic
aic = calculate_aic(len(y), mse, num_params)
print('AIC: %.3f' % aic)
```

Running the example reports the number of parameters and MSE as before and then reports the AIC.

Your specific results may vary from run to run because make_regression() draws a new random dataset each time (no random_state is set).

In this case, the AIC is reported to be a value of about -451.616. This value can be minimized in order to choose better models.

```
Number of parameters: 3
MSE: 0.010
AIC: -451.616
```

We can also explore the same example with the calculation of BIC instead of AIC.

Skipping the derivation, the BIC calculation for an ordinary least squares linear regression model can be calculated as follows (taken from here):

- BIC = n * log(MSE) + k * log(n)

Where *n* is the number of examples in the training dataset, *MSE* is the mean squared error of the model on the training dataset, *k* is the number of parameters in the model, and *log()* is the natural logarithm. Here n * log(MSE) stands in for -2 * LL: for a linear regression model with Gaussian errors, the two differ only by a constant that does not depend on the model.

The *calculate_bic()* function below implements this, taking *n*, the raw mean squared error (*mse*), and *k* as arguments.

```python
# calculate bic for regression
def calculate_bic(n, mse, num_params):
    bic = n * log(mse) + num_params * log(n)
    return bic
```

The example can then be updated to make use of this new function and calculate the BIC for the model.

The complete example is listed below.

```python
# calculate bayesian information criterion for a linear regression model
from math import log
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# calculate bic for regression
def calculate_bic(n, mse, num_params):
    bic = n * log(mse) + num_params * log(n)
    return bic

# generate dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# define and fit the model on all data
model = LinearRegression()
model.fit(X, y)
# number of parameters
num_params = len(model.coef_) + 1
print('Number of parameters: %d' % (num_params))
# predict the training set
yhat = model.predict(X)
# calculate the error
mse = mean_squared_error(y, yhat)
print('MSE: %.3f' % mse)
# calculate the bic
bic = calculate_bic(len(y), mse, num_params)
print('BIC: %.3f' % bic)
```

Running the example reports the number of parameters and MSE as before and then reports the BIC.

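Your specific results may vary from run to run because make_regression() draws a new random dataset each time (no random_state is set).

In this case, the BIC is reported to be a value of about -450.020, which is very close to the AIC value of -451.616. Because the two examples were run on different random datasets, the two numbers are not an exact like-for-like comparison. Again, this value can be minimized in order to choose better models.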

```
Number of parameters: 3
MSE: 0.010
BIC: -450.020
```
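To compare the two scores on equal footing, we can hold the MSE fixed and evaluate both functions on the same inputs. The sketch below reuses the *calculate_aic()* and *calculate_bic()* functions from the examples above; the MSE of 0.01 is an assumed value that mirrors the rounded MSE printed above, not the result of a new model fit.

```python
# sketch: AIC and BIC for the same n, MSE, and number of parameters
from math import log

# calculate aic for regression
def calculate_aic(n, mse, num_params):
    return n * log(mse) + 2 * num_params

# calculate bic for regression
def calculate_bic(n, mse, num_params):
    return n * log(mse) + num_params * log(n)

# assumed values: 100 examples, MSE of 0.01, three parameters
n, mse, num_params = 100, 0.01, 3
aic = calculate_aic(n, mse, num_params)
bic = calculate_bic(n, mse, num_params)

print('AIC: %.3f' % aic)
print('BIC: %.3f' % bic)
print('BIC - AIC: %.3f' % (bic - aic))
```

With the fit term held constant, the scores differ only by their penalties, that is by k * (log(n) - 2), so BIC is the larger (worse) score for the same model on this dataset size.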

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 7 Model Assessment and Selection, The Elements of Statistical Learning, 2016.
- Section 1.3 Model Selection, Pattern Recognition and Machine Learning, 2006.
- Section 4.4.1 Model comparison and BIC, Pattern Recognition and Machine Learning, 2006.
- Section 6.6 Minimum Description Length Principle, Machine Learning, 1997.
- Section 5.3.2.4 BIC approximation to log marginal likelihood, Machine Learning: A Probabilistic Perspective, 2012.
- Applied Predictive Modeling, 2013.
- Section 28.3 Minimum description length (MDL), Information Theory, Inference and Learning Algorithms, 2003.
- Section 5.10 The MDL Principle, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

- sklearn.datasets.make_regression API.
- sklearn.linear_model.LinearRegression API.
- sklearn.metrics.mean_squared_error API.

- Akaike information criterion, Wikipedia.
- Bayesian information criterion, Wikipedia.
- Minimum description length, Wikipedia.

In this post, you discovered probabilistic statistics for machine learning model selection.

Specifically, you learned:

- Model selection is the challenge of choosing one among a set of candidate models.
- Akaike and Bayesian Information Criterion are two ways of scoring a model based on its log-likelihood and complexity.
- Minimum Description Length provides another scoring method from information theory that can be shown to be equivalent to BIC.

The post Probabilistic Model Selection with AIC, BIC, and MDL appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation appeared first on Machine Learning Mastery.

Logistic regression is a model for binary classification predictive modeling.

The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation. Under this framework, a probability distribution for the target variable (class label) must be assumed and then a likelihood function defined that calculates the probability of observing the outcome given the input data and the model. This function can then be optimized to find the set of parameters that results in the largest sum likelihood over the training dataset.

The maximum likelihood approach to fitting a logistic regression model both aids in better understanding the form of the logistic regression model and provides a template that can be used for fitting classification models more generally. This is particularly true as the negative of the log-likelihood function used in the procedure can be shown to be equivalent to the cross-entropy loss function.

In this post, you will discover logistic regression with maximum likelihood estimation.

After reading this post, you will know:

- Logistic regression is a linear model for binary classification predictive modeling.
- The linear part of the model predicts the log-odds of an example belonging to class 1, which is converted to a probability via the logistic function.
- The parameters of the model can be estimated by maximizing a likelihood function that predicts the mean of a Bernoulli distribution for each example.

Let’s get started.

This tutorial is divided into four parts; they are:

- Logistic Regression
- Logistic Regression and Log-Odds
- Maximum Likelihood Estimation
- Logistic Regression as Maximum Likelihood

Logistic regression is a classical linear method for binary classification.

Classification predictive modeling problems are those that require the prediction of a class label (e.g. ‘*red*‘, ‘*green*‘, ‘*blue*‘) for a given set of input variables. Binary classification refers to those classification problems that have two class labels, e.g. true/false or 0/1.

Logistic regression has a lot in common with linear regression, although linear regression is a technique for predicting a numerical value, not for classification problems. Both techniques model the target variable with a line (or hyperplane, depending on the number of dimensions of input). Linear regression fits the line to the data, which can be used to predict a new quantity, whereas logistic regression fits a line to best separate the two classes.

The input data is denoted as *X* with n examples and the output is denoted *y* with one output for each input. The prediction of the model for a given input is denoted as *yhat*.

- yhat = model(X)

The model is defined in terms of parameters called coefficients (*beta*), where there is one coefficient per input and an additional coefficient that provides the intercept or bias.

For example, a problem with inputs *X* with m variables *x1, x2, …, xm* will have coefficients *beta1, beta2, …, betam*, and *beta0*. A given input is predicted as the weighted sum of the inputs for the example and the coefficients.

- yhat = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm

The model can also be described using linear algebra, with a vector for the coefficients (*Beta*) and a matrix for the input data (*X*) and a vector for the output (*y*).

- y = X * Beta

So far, this is identical to linear regression and is insufficient as the output will be a real value instead of a class label.

Instead, the model squashes the output of this weighted sum using a nonlinear function to ensure the outputs are a value between 0 and 1.

The logistic function (also called the sigmoid) is used, which is defined as:

- f(x) = 1 / (1 + exp(-x))

Where x is the input value to the function. In the case of logistic regression, x is replaced with the weighted sum.

For example:

- yhat = 1 / (1 + exp(-(X * Beta)))

The output is interpreted as a probability from a Binomial probability distribution function for the class labeled 1, if the two classes in the problem are labeled 0 and 1.

Notice that the output, being a number between 0 and 1, can be interpreted as a probability of belonging to the class labeled 1.

— Page 726, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
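The prediction step described above can be sketched in a few lines. The coefficient values and input rows below are hypothetical, and *predict_probability()* is a helper name chosen for this example rather than part of any library.

```python
# sketch: logistic regression prediction as a squashed weighted sum
from math import exp

def logistic(x):
    # maps any real value to the range (0, 1)
    return 1.0 / (1.0 + exp(-x))

def predict_probability(row, beta):
    # beta[0] is the intercept; beta[1:] pair up with the input variables
    weighted_sum = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
    return logistic(weighted_sum)

# hypothetical coefficients: beta0, beta1, beta2
beta = [-1.0, 0.5, 2.0]
for row in [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]:
    print(row, '%.3f' % predict_probability(row, beta))
```

A weighted sum of zero maps to a probability of exactly 0.5, negative sums map below 0.5, and positive sums map above it.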

The examples in the training dataset are drawn from a broader population and as such, this sample is known to be incomplete. Additionally, there is expected to be measurement error or statistical noise in the observations.

The parameters of the model (*beta*) must be estimated from the sample of observations drawn from the domain.

There are many ways to estimate the parameters. There are two frameworks that are the most common; they are:

- Least Squares Optimization (iteratively reweighted least squares).
- Maximum Likelihood Estimation.

Both are optimization procedures that involve searching for different model parameters.

Maximum Likelihood Estimation is a frequentist probabilistic framework that seeks a set of parameters for the model that maximizes a likelihood function. We will take a closer look at this second approach in the subsequent sections.

Before we dive into how the parameters of the model are estimated from data, we need to understand what logistic regression is calculating exactly.

This might be the most confusing part of logistic regression, so we will go over it slowly.

The linear part of the model (the weighted sum of the inputs) calculates the log-odds of a successful event, specifically, the log-odds that a sample belongs to class 1.

- log-odds = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm

In effect, the model estimates the log-odds for class 1 for the input variables at each level (all observed values).

What are odds and log-odds?

Odds may be familiar from the field of gambling. Odds are often stated as wins to losses (wins : losses), e.g. a one to ten chance or ratio of winning is stated as 1 : 10.

Given the probability of success (*p*) predicted by the logistic regression model, we can convert it to odds of success as the probability of success divided by the probability of not success:

- odds of success = p / (1 – p)

The logarithm of the odds is calculated, specifically log base-e or the natural logarithm. This quantity is referred to as the log-odds and may be referred to as the logit (logistic unit), a unit of measure.

- log-odds = log(p / (1 – p))

Recall that this is what the linear part of the logistic regression is calculating:

- log-odds = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm

The log-odds of success can be converted back into an odds of success by calculating the exponential of the log-odds.

- odds = exp(log-odds)

Or

- odds = exp(beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm)

The odds of success can be converted back into a probability of success as follows:

- p = odds / (odds + 1)

And this is close to the form of our logistic regression model, except we want to convert log-odds to odds as part of the calculation.

We can do this and simplify the calculation as follows:

- p = 1 / (1 + exp(-log-odds))

This shows how we go from log-odds to odds, to a probability of class 1 with the logistic regression model, and that this final functional form matches the logistic function, ensuring that the probability is between 0 and 1.

We can make these calculations of converting between probability, odds and log-odds concrete with some small examples in Python.

First, let’s define the probability of success at 80%, or 0.8, and convert it to odds then back to a probability again.

The complete example is listed below.

```python
# example of converting between probability and odds
from math import log
from math import exp
# define our probability of success
prob = 0.8
print('Probability %.1f' % prob)
# convert probability to odds
odds = prob / (1 - prob)
print('Odds %.1f' % odds)
# convert back to probability
prob = odds / (odds + 1)
print('Probability %.1f' % prob)
```

Running the example shows that 0.8 is converted to the odds of success 4, and back to the correct probability again.

```
Probability 0.8
Odds 4.0
Probability 0.8
```

Let’s extend this example and convert the odds to log-odds and then convert the log-odds back into the original probability. This final conversion is effectively the form of the logistic regression model, or the logistic function.

The complete example is listed below.

```python
# example of converting between probability and log-odds
from math import log
from math import exp
# define our probability of success
prob = 0.8
print('Probability %.1f' % prob)
# convert probability to odds
odds = prob / (1 - prob)
print('Odds %.1f' % odds)
# convert odds to log-odds
logodds = log(odds)
print('Log-Odds %.1f' % logodds)
# convert log-odds to a probability
prob = 1 / (1 + exp(-logodds))
print('Probability %.1f' % prob)
```

Running the example, we can see that our odds are converted into the log odds of about 1.4 and then correctly converted back into the 0.8 probability of success.

```
Probability 0.8
Odds 4.0
Log-Odds 1.4
Probability 0.8
```

Now that we have a handle on the probability calculated by logistic regression, let’s look at maximum likelihood estimation.

Maximum Likelihood Estimation, or MLE for short, is a probabilistic framework for estimating the parameters of a model.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

- P(X ; theta)

Here, *X* stands for all observations from the problem domain from 1 to *n*, so P(X ; theta) is, in fact, their joint probability:

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

- L(X ; theta)

Assuming the examples are independent, the joint probability distribution can be restated as the product of the probabilities of observing each example given the distribution parameters. Multiplying many small probabilities together can be numerically unstable (the product underflows toward zero); as such, it is common to restate this problem as the sum of the log of each probability.

- sum i to n log(P(xi ; theta))

Given the frequent use of log in the likelihood function, it is referred to as a log-likelihood function. It is common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

- minimize -sum i to n log(P(xi ; theta))

The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the logistic regression model.
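To make the idea of minimizing the negative log-likelihood concrete, here is a minimal sketch that estimates the single parameter of a Bernoulli distribution. The observations are hypothetical, and a simple grid search stands in for a real optimization algorithm purely for illustration.

```python
# sketch: maximum likelihood estimation by minimizing the negative log-likelihood
from math import log

# hypothetical binary observations (six successes out of eight)
data = [1, 1, 0, 1, 0, 1, 1, 1]

def negative_log_likelihood(p, observations):
    # -sum of log(P(xi ; p)) for a Bernoulli distribution with parameter p
    return -sum(log(p) if x == 1 else log(1.0 - p) for x in observations)

# candidate parameter values 0.01, 0.02, ..., 0.99
candidates = [i / 100.0 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: negative_log_likelihood(p, data))

print('Best p: %.2f' % best_p)
print('Sample mean: %.2f' % (sum(data) / len(data)))
```

The parameter value with the lowest negative log-likelihood matches the fraction of successes in the sample, which is the known maximum likelihood estimate for a Bernoulli distribution.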

We can frame the problem of fitting a machine learning model as the problem of probability density estimation.

Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*. We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

- maximize sum i to n log(P(xi ; h))

Supervised learning can be framed as a conditional probability problem of predicting the probability of the output given the input:

- P(y | X)

As such, we can define conditional maximum likelihood estimation for supervised machine learning as follows:

- maximize sum i to n log(P(yi|xi ; h))

Now we can replace *h* with our logistic regression model.

In order to use maximum likelihood, we need to assume a probability distribution. In the case of logistic regression, a Binomial probability distribution is assumed for the data sample, where each example is one outcome of a Bernoulli trial. The Bernoulli distribution has a single parameter: the probability of a successful outcome (*p*).

- P(y=1) = p
- P(y=0) = 1 – p

The probability distribution that is most often used when there are two classes is the binomial distribution. This distribution has a single parameter, p, that is the probability of an event or a specific class.

— Page 283, Applied Predictive Modeling, 2013.

The expected value (mean) of the Bernoulli distribution can be calculated as follows:

- mean = P(y=1) * 1 + P(y=0) * 0

Or, given p:

- mean = p * 1 + (1 – p) * 0

This calculation may seem redundant, but it provides the basis for the likelihood function for a specific input, where the probability is given by the model (*yhat*) and the actual label is given from the dataset.

- likelihood = yhat * y + (1 – yhat) * (1 – y)

This function will always return a large probability when the model is close to the matching class value, and a small value when it is far away, for both *y=0* and *y=1* cases.

We can demonstrate this with a small worked example for both outcomes and small and large probabilities predicted for each.

The complete example is listed below.

```python
# test of Bernoulli likelihood function

# likelihood function for Bernoulli distribution
def likelihood(y, yhat):
    return yhat * y + (1 - yhat) * (1 - y)

# test for y=1
y, yhat = 1, 0.9
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 1, 0.1
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
# test for y=0
y, yhat = 0, 0.1
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
y, yhat = 0, 0.9
print('y=%.1f, yhat=%.1f, likelihood: %.3f' % (y, yhat, likelihood(y, yhat)))
```

Running the example prints the class labels (*y*) and predicted probabilities (*yhat*) for cases with close and far probabilities for each case.

We can see that the likelihood function is consistent in returning a probability for how well the model achieves the desired outcome.

```
y=1.0, yhat=0.9, likelihood: 0.900
y=1.0, yhat=0.1, likelihood: 0.100
y=0.0, yhat=0.1, likelihood: 0.900
y=0.0, yhat=0.9, likelihood: 0.100
```

We can update the likelihood function using the log to transform it into a log-likelihood function:

- log-likelihood = log(yhat) * y + log(1 – yhat) * (1 – y)

Finally, we can sum the log-likelihood function across all examples in the dataset to maximize the likelihood:

- maximize sum i to n log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i)

It is common practice to minimize a cost function for optimization problems; therefore, we can invert the function so that we minimize the negative log-likelihood:

- minimize sum i to n -(log(yhat_i) * y_i + log(1 – yhat_i) * (1 – y_i))

Calculating the negative of the log-likelihood function for the Bernoulli distribution is equivalent to calculating the cross-entropy function for the Bernoulli distribution, where *p()* represents the probability of class 0 or class 1, and *q()* represents the estimation of the probability distribution, in this case by our logistic regression model.

- cross entropy = -(log(q(class0)) * p(class0) + log(q(class1)) * p(class1))
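The equivalence can be checked numerically with a short sketch. The function names are chosen for this example, and the labels and predicted probabilities are the same illustrative values used in the likelihood test earlier.

```python
# sketch: negative log-likelihood of a Bernoulli outcome equals binary cross-entropy
from math import log

def log_likelihood(y, yhat):
    # log-likelihood for one example with label y and predicted probability yhat
    return log(yhat) * y + log(1.0 - yhat) * (1 - y)

def cross_entropy(p, q):
    # p and q are [P(class0), P(class1)] for the true and estimated distributions
    return -sum(p_i * log(q_i) for p_i, q_i in zip(p, q))

for y, yhat in [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]:
    nll = -log_likelihood(y, yhat)
    ce = cross_entropy([1.0 - y, float(y)], [1.0 - yhat, yhat])
    print('y=%d, yhat=%.1f, NLL=%.3f, cross-entropy=%.3f' % (y, yhat, nll, ce))
```

For every case the two quantities agree, because the true distribution puts all of its probability on the observed class, leaving only the log of the predicted probability for that class.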

Unlike linear regression, there is no analytical solution to this optimization problem. As such, an iterative optimization algorithm must be used.

Unlike linear regression, we can no longer write down the MLE in closed form. Instead, we need to use an optimization algorithm to compute it. For this, we need to derive the gradient and Hessian.

— Page 246, Machine Learning: A Probabilistic Perspective, 2012.

The function does provide some information to aid in the optimization (specifically a Hessian matrix can be calculated), meaning that efficient search procedures that exploit this information can be used, such as the BFGS algorithm (and variants).
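As a rough sketch of what an iterative search looks like, the example below fits a one-input logistic regression by plain gradient descent on the mean negative log-likelihood. Note that this swaps in simple gradient descent rather than BFGS, and the tiny dataset, learning rate, and iteration count are all assumptions chosen for illustration.

```python
# sketch: fitting logistic regression by gradient descent on the negative log-likelihood
from math import exp, log

# hypothetical dataset with one input variable; the classes overlap slightly
X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]

def predict(x, beta0, beta1):
    # logistic function applied to the weighted sum
    return 1.0 / (1.0 + exp(-(beta0 + beta1 * x)))

def mean_nll(beta0, beta1):
    # mean negative log-likelihood over the dataset
    total = 0.0
    for x_i, y_i in zip(X, y):
        yhat = predict(x_i, beta0, beta1)
        total += -(log(yhat) * y_i + log(1.0 - yhat) * (1 - y_i))
    return total / len(X)

beta0, beta1 = 0.0, 0.0
learning_rate = 0.1
initial_nll = mean_nll(beta0, beta1)
for _ in range(5000):
    # gradient of the mean NLL: average of (yhat - y) and of (yhat - y) * x
    errors = [predict(x_i, beta0, beta1) - y_i for x_i, y_i in zip(X, y)]
    grad0 = sum(errors) / len(X)
    grad1 = sum(e * x_i for e, x_i in zip(errors, X)) / len(X)
    beta0 -= learning_rate * grad0
    beta1 -= learning_rate * grad1
final_nll = mean_nll(beta0, beta1)

print('Initial NLL: %.3f' % initial_nll)
print('Final NLL: %.3f' % final_nll)
print('beta0=%.3f, beta1=%.3f' % (beta0, beta1))
```

Gradient descent uses only first derivatives, so it typically needs many more iterations than a method that also exploits curvature information, but it shows the same idea of repeatedly adjusting the coefficients to reduce the negative log-likelihood.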

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Maximum Likelihood Estimation for Machine Learning
- How To Implement Logistic Regression From Scratch in Python
- Logistic Regression Tutorial for Machine Learning
- Logistic Regression for Machine Learning

- Section 4.4.1 Fitting Logistic Regression Models, The Elements of Statistical Learning, 2016.
- Section 4.3.2 Logistic regression, Pattern Recognition and Machine Learning, 2006.
- Chapter 8 Logistic regression, Machine Learning: A Probabilistic Perspective, 2012.
- Chapter 4 Algorithms: the basic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Section 18.6.4 Linear classification with logistic regression, Artificial Intelligence: A Modern Approach, 3rd edition, 2009.
- Section 12.2 Logistic Regression, Applied Predictive Modeling, 2013.
- Section 4.3 Logistic Regression, An Introduction to Statistical Learning with Applications in R, 2017.

- Maximum likelihood estimation, Wikipedia.
- Likelihood function, Wikipedia.
- Logistic regression, Wikipedia.
- Logistic function, Wikipedia.
- Odds, Wikipedia.

In this post, you discovered logistic regression with maximum likelihood estimation.

Specifically, you learned:

- Logistic regression is a linear model for binary classification predictive modeling.
- The linear part of the model predicts the log-odds of an example belonging to class 1, which is converted to a probability via the logistic function.
- The parameters of the model can be estimated by maximizing a likelihood function that predicts the mean of a Bernoulli distribution for each example.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Linear Regression With Maximum Likelihood Estimation appeared first on Machine Learning Mastery.

Linear regression is a classical model for predicting a numerical quantity.

The parameters of a linear regression model can be estimated using a least squares procedure or by a maximum likelihood estimation procedure. Maximum likelihood estimation is a probabilistic framework for automatically finding the probability distribution and parameters that best describe the observed data. Supervised learning can be framed as a conditional probability problem, and maximum likelihood estimation can be used to fit the parameters of a model that best summarizes the conditional probability distribution, so-called conditional maximum likelihood estimation.

A linear regression model can be fit under this framework and can be shown to derive an identical solution to a least squares approach.

In this post, you will discover linear regression with maximum likelihood estimation.

After reading this post, you will know:

- Linear regression is a model for predicting a numerical quantity and maximum likelihood estimation is a probabilistic framework for estimating model parameters.
- Coefficients of a linear regression model can be estimated using a negative log-likelihood function from maximum likelihood estimation.
- The negative log-likelihood function can be used to derive the least squares solution to linear regression.

Let’s get started.

**Update Nov/2019**: Fixed typo in MLE calculation, had x instead of y (thanks Norman).

This tutorial is divided into four parts; they are:

- Linear Regression
- Maximum Likelihood Estimation
- Linear Regression as Maximum Likelihood
- Least Squares and Maximum Likelihood

Linear regression is a standard modeling method from statistics and machine learning.

Linear regression is the “work horse” of statistics and (supervised) machine learning.

— Page 217, Machine Learning: A Probabilistic Perspective, 2012.

Generally, it is a model that maps one or more numerical inputs to a numerical output. In terms of predictive modeling, it is suited to regression type problems: that is, the prediction of a real-valued quantity.

The input data is denoted as *X* with *n* examples and the output is denoted *y* with one output for each input. The prediction of the model for a given input is denoted as *yhat*.

- yhat = model(X)

The model is defined in terms of parameters called coefficients (beta), where there is one coefficient per input and an additional coefficient that provides the intercept or bias.

For example, a problem with inputs *X* with m variables *x1, x2, …, xm* will have coefficients *beta1, beta2, …, betam* and *beta0*. A given input is predicted as the weighted sum of the inputs for the example and the coefficients.

- yhat = beta0 + beta1 * x1 + beta2 * x2 + … + betam * xm
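The weighted sum can be sketched in a few lines of Python. The function name *predict()* and the coefficient values are hypothetical, chosen only to illustrate the calculation.

```python
# predict one example as beta0 + beta1 * x1 + ... + betam * xm
def predict(row, coefficients):
    yhat = coefficients[0]  # beta0, the intercept
    for x_j, beta_j in zip(row, coefficients[1:]):
        yhat += beta_j * x_j
    return yhat

# hypothetical coefficients [beta0, beta1, beta2] and one example [x1, x2]
coefficients = [0.5, 2.0, -1.0]
row = [3.0, 4.0]
print(predict(row, coefficients))
```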

The model can also be described using linear algebra, with a vector for the coefficients (*Beta*) and a matrix for the input data (*X*) and a vector for the output (*y*).

- y = X * Beta

The examples are drawn from a broader population and as such, the sample is known to be incomplete. Additionally, there is expected to be measurement error or statistical noise in the observations.

The parameters of the model (*beta*) must be estimated from the sample of observations drawn from the domain.

Linear regression has been studied for more than 100 years and there are many ways to estimate its parameters; nevertheless, two frameworks are the most common. They are:

- Least Squares Optimization.
- Maximum Likelihood Estimation.

Both are optimization procedures that involve searching for different model parameters.

Least squares optimization is an approach to estimating the parameters of a model by seeking a set of parameters that results in the smallest squared error between the predictions of the model (*yhat*) and the actual outputs (*y*), averaged over all examples in the dataset, so-called mean squared error.

Maximum Likelihood Estimation is a frequentist probabilistic framework that seeks a set of parameters for the model that maximize a likelihood function. We will take a closer look at this second approach.

Under both frameworks, different optimization algorithms may be used, such as local search methods like the BFGS algorithm (or variants), and general optimization methods like stochastic gradient descent. The linear regression model is special in that an analytical solution also exists, meaning that the coefficients can be calculated directly using linear algebra, a topic that is out of the scope of this tutorial.

Maximum Likelihood Estimation, or MLE for short, is a probabilistic framework for estimating the parameters of a model.

In Maximum Likelihood Estimation, we wish to maximize the conditional probability of observing the data (*X*) given a specific probability distribution and its parameters (*theta*), stated formally as:

- P(X ; theta)

Where *X* is, in fact, the complete set of observations from the problem domain, from 1 to *n*, so this is their joint probability:

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

- L(X ; theta)

The joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters. Multiplying many small probabilities together can be unstable; as such, it is common to restate this problem as the sum of the natural log conditional probability.

- sum i to n log(P(xi ; theta))

Given the common use of log in the likelihood function, it is referred to as a log-likelihood function. It is also common in optimization problems to prefer to minimize the cost function rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

- minimize -sum i to n log(P(xi ; theta))
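The sketch below illustrates why the log form is preferred. It assumes a Gaussian distribution with parameters *theta = (mu, sigma)* and a made-up sample. The raw product of densities underflows, whereas the negative log-likelihood remains finite and can still be used to compare candidate parameters.

```python
from math import exp, log, pi, sqrt

# Gaussian probability density with mean mu and standard deviation sigma
def gaussian_pdf(x, mu, sigma):
    return (1.0 / sqrt(2 * pi * sigma**2)) * exp(-((x - mu)**2) / (2 * sigma**2))

# negative log-likelihood: -sum i to n log(P(xi ; theta)), with theta = (mu, sigma)
def negative_log_likelihood(data, mu, sigma):
    return -sum(log(gaussian_pdf(x, mu, sigma)) for x in data)

# hypothetical sample: 2,000 observations cycling through five values
data = [4.0, 4.5, 5.0, 5.5, 6.0] * 400

# the raw product of densities underflows to zero
product = 1.0
for x in data:
    product *= gaussian_pdf(x, 5.0, 1.0)
print(product)

# the NLL stays finite and still ranks candidate values of mu
for mu in [4.0, 5.0, 6.0]:
    print(mu, negative_log_likelihood(data, mu, 1.0))
```

The candidate with the smallest negative log-likelihood is the one closest to the sample mean, which is what we would expect for a Gaussian.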

The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the linear regression model.

We can frame the problem of fitting a machine learning model as the problem of probability density estimation.

Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*. We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

- maximize sum i to n log(P(xi ; h))

Supervised learning can be framed as a conditional probability problem of predicting the probability of the output given the input:

- P(y | X)

As such, we can define conditional maximum likelihood estimation for supervised machine learning as follows:

- maximize sum i to n log(P(yi|xi ; h))

Now we can replace *h* with our linear regression model.

We can make some reasonable assumptions, such as the observations in the dataset are independent and drawn from the same probability distribution (i.i.d.), and that the target variable (*y*) has statistical noise with a Gaussian distribution, zero mean, and the same variance for all examples.

With these assumptions, we can frame the problem of estimating *y* given *X* as estimating the mean value for *y* from a Gaussian probability distribution given *X*.

The analytical form of the Gaussian function is as follows:

- f(y) = (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2) * (y – mu)^2)

Where *mu* is the mean of the distribution and *sigma^2* is the variance where the units are squared.

We can use this function as our likelihood function, where *mu* is defined as the prediction from the model with a given set of coefficients (*Beta*) and *sigma* is a fixed constant.

First, we can state the problem as the maximization of the product of the probabilities for each example in the dataset:

- maximize product i to n (1 / sqrt(2 * pi * sigma^2)) * exp(-1/(2 * sigma^2) * (yi – h(xi, Beta))^2)

Where *xi* is a given example and *Beta* refers to the coefficients of the linear regression model. We can transform this to a log-likelihood model as follows:

- maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)

The calculation can be simplified further, but we will stop there for now.
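The log-likelihood above can be written almost term for term in Python. This is a minimal sketch that assumes a model with a single input, so that *Beta = [beta0, beta1]*. The data and the two candidate coefficient sets are invented for illustration.

```python
from math import log, pi, sqrt

# h(xi, Beta): prediction of a one-input linear model, Beta = [beta0, beta1]
def h(x, beta):
    return beta[0] + beta[1] * x

# sum i to n log(1 / sqrt(2 * pi * sigma^2)) - (1/(2 * sigma^2)) * (yi - h(xi, Beta))^2
def log_likelihood(xs, ys, beta, sigma):
    total = 0.0
    for x, y in zip(xs, ys):
        total += log(1.0 / sqrt(2 * pi * sigma**2)) - (1.0 / (2 * sigma**2)) * (y - h(x, beta))**2
    return total

# hypothetical data scattered around the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

close = log_likelihood(xs, ys, [1.0, 2.0], 1.0)  # coefficients close to the data
far = log_likelihood(xs, ys, [0.0, 1.0], 1.0)    # coefficients far from the data
print(close, far)
```

Coefficients that track the data more closely receive a higher log-likelihood, which is the signal a search procedure would follow.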

It’s interesting that the prediction is the mean of a distribution. It suggests that we can very reasonably add a bound to the prediction to give a prediction interval based on the standard deviation of the distribution, which is indeed a common practice.
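A rough sketch of such an interval is given below. It assumes the noise standard deviation is estimated from the residuals on some data and uses the familiar 1.96 multiplier for an approximate 95% Gaussian interval. It ignores uncertainty in the coefficients themselves, so treat it as an illustration rather than a full treatment. The targets and predictions are made up.

```python
from math import sqrt

# hypothetical targets and model predictions for the same examples
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
yhats = [1.0, 3.0, 5.0, 7.0, 9.0]

# estimate the standard deviation of the Gaussian noise from the residuals
residuals = [y - yhat for y, yhat in zip(ys, yhats)]
sigma = sqrt(sum(r**2 for r in residuals) / len(residuals))

# approximate 95% interval around a new prediction: mean +/- 1.96 standard deviations
new_yhat = 11.0
lower, upper = new_yhat - 1.96 * sigma, new_yhat + 1.96 * sigma
print('%.2f (%.2f to %.2f)' % (new_yhat, lower, upper))
```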

Although the model assumes a Gaussian distribution in the prediction (i.e. Gaussian noise function or error function), there is no such expectation for the inputs to the model (*X*).

[the model] considers noise only in the target value of the training example and does not consider noise in the attributes describing the instances themselves.

— Page 167, Machine Learning, 1997.

We can apply a search procedure to maximize this log likelihood function, or invert it by adding a negative sign to the beginning and minimize the negative log-likelihood function (more common).

This provides a solution to the linear regression model for a given dataset.

This framework is also more general and can be used for curve fitting and provides the basis for fitting other regression models, such as artificial neural networks.

Interestingly, the maximum likelihood solution to linear regression presented in the previous section can be shown to be identical to the least squares solution.

After derivation, the least squares equation to be minimized to fit a linear regression to a dataset looks as follows:

- minimize sum i to n (yi – h(xi, Beta))^2

Where we are summing the squared errors between each target variable (*yi*) and the prediction from the model for the associated input *h(xi, Beta)*. This is often referred to as ordinary least squares. More generally, if the value is normalized by the number of examples in the dataset (averaged) rather than summed, then the quantity is referred to as the mean squared error.

- mse = 1/n * sum i to n (yi – yhat)^2
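As a small worked example, the mean squared error can be computed as follows. The function name and the values are illustrative assumptions.

```python
# mean squared error: 1/n * sum i to n (yi - yhat_i)^2
def mean_squared_error(ys, yhats):
    return sum((y - yhat)**2 for y, yhat in zip(ys, yhats)) / len(ys)

# hypothetical targets and predictions
ys = [1.0, 2.0, 3.0]
yhats = [1.0, 2.5, 2.0]
mse = mean_squared_error(ys, yhats)
print(mse)
```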

Starting with the likelihood function defined in the previous section, we can show how we can remove constant elements to give the same equation as the least squares approach to solving linear regression.

Note: this derivation is based on the example given in Chapter 6 of Machine Learning by Tom Mitchell.

- maximize sum i to n log (1 / sqrt(2 * pi * sigma^2)) – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)

Key to removing constants is to focus on what does not change when different models are evaluated, e.g. when *h(xi, Beta)* is evaluated.

The first term of the calculation is independent of the model and can be removed to give:

- maximize sum i to n – (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)

We can then remove the negative sign to minimize the positive quantity rather than maximize the negative quantity:

- minimize sum i to n (1/(2 * sigma^2) * (yi – h(xi, Beta))^2)

Finally, we can discard the remaining constant factor, 1/(2 * sigma^2), which is also independent of the model, to give:

- minimize sum i to n (yi – h(xi, Beta))^2

We can see that this is identical to the least squares solution.
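We can also check the equivalence numerically. The sketch below uses a made-up dataset and a coarse grid of candidate coefficients as a stand-in for a real optimizer, then confirms that the sum of squared errors and the Gaussian negative log-likelihood pick the same candidate. The value of *sigma* is an arbitrary assumption; any fixed positive value gives the same winner.

```python
from math import log, pi

# hypothetical data scattered around the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
sigma = 1.5  # fixed noise standard deviation (arbitrary choice)

# sum of squared errors for the line beta0 + beta1 * x
def sse(beta0, beta1):
    return sum((y - (beta0 + beta1 * x))**2 for x, y in zip(xs, ys))

# Gaussian negative log-likelihood for the same line
def nll(beta0, beta1):
    constant = len(xs) * 0.5 * log(2 * pi * sigma**2)
    return constant + sse(beta0, beta1) / (2 * sigma**2)

# candidate coefficients on a coarse grid: beta0 in 0.0..2.0, beta1 in 1.0..3.0
candidates = [(b0 / 10.0, b1 / 10.0) for b0 in range(0, 21) for b1 in range(10, 31)]
best_sse = min(candidates, key=lambda b: sse(*b))
best_nll = min(candidates, key=lambda b: nll(*b))
print(best_sse, best_nll)
```

The negative log-likelihood is just the squared error scaled by a positive constant and shifted by another constant, so the location of the minimum does not move.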

In fact, under reasonable assumptions, an algorithm that minimizes the squared error between the target variable and the model output also performs maximum likelihood estimation.

… under certain assumptions any learning algorithm that minimizes the squared error between the output hypothesis predictions and the training data will output a maximum likelihood hypothesis.

— Page 164, Machine Learning, 1997.

This section provides more resources on the topic if you are looking to go deeper.

- How to Solve Linear Regression Using Linear Algebra
- How to Implement Linear Regression From Scratch in Python
- How To Implement Simple Linear Regression From Scratch With Python
- Linear Regression Tutorial Using Gradient Descent for Machine Learning
- Simple Linear Regression Tutorial for Machine Learning
- Linear Regression for Machine Learning

- Section 15.1 Least Squares as a Maximum Likelihood Estimator, Numerical Recipes in C: The Art of Scientific Computing, Second Edition, 1992.
- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Section 2.6.3 Function Approximation, The Elements of Statistical Learning, 2016.
- Section 6.4 Maximum Likelihood and Least-Squares Error Hypotheses, Machine Learning, 1997.
- Section 3.1.1 Maximum likelihood and least squares, Pattern Recognition and Machine Learning, 2006.
- Section 7.3 Maximum likelihood estimation (least squares), Machine Learning: A Probabilistic Perspective, 2012.

- Maximum likelihood estimation, Wikipedia.
- Likelihood function, Wikipedia.
- Linear regression, Wikipedia.

In this post, you discovered linear regression with maximum likelihood estimation.

Specifically, you learned:

- Linear regression is a model for predicting a numerical quantity and maximum likelihood estimation is a probabilistic framework for estimating model parameters.
- Coefficients of a linear regression model can be estimated using a negative log-likelihood function from maximum likelihood estimation.
- The negative log-likelihood function can be used to derive the least squares solution to linear regression.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Develop k-Nearest Neighbors in Python From Scratch appeared first on Machine Learning Mastery.

In this tutorial you are going to learn about the **k-Nearest Neighbors algorithm** including how it works and how to implement it from scratch in Python (*without libraries*).

A simple but powerful approach for making predictions is to use the most similar historical examples to the new data. This is the principle behind the k-Nearest Neighbors algorithm.

After completing this tutorial you will know:

- How to code the k-Nearest Neighbors algorithm step-by-step.
- How to evaluate k-Nearest Neighbors on a real dataset.
- How to use k-Nearest Neighbors to make a prediction for new data.

Let’s get started.

- **Updated Sep/2014**: Original version of the tutorial.
- **Updated Oct/2019**: Completely rewritten from the ground up.

This section will provide a brief background on the k-Nearest Neighbors algorithm that we will implement in this tutorial and the Iris Flower Species dataset to which we will apply it.

The k-Nearest Neighbors algorithm or KNN for short is a very simple technique.

The entire training dataset is stored. When a prediction is required, the k-most similar records to a new record from the training dataset are then located. From these neighbors, a summarized prediction is made.

Similarity between records can be measured many different ways. A problem or data-specific method can be used. Generally, with tabular data, a good starting point is the Euclidean distance.

Once the neighbors are discovered, the summary prediction can be made by returning the most common outcome or taking the average. As such, KNN can be used for classification or regression problems.

There is no model to speak of other than holding the entire training dataset. Because no work is done until a prediction is required, KNN is often referred to as a lazy learning method.

In this tutorial we will use the Iris Flower Species Dataset.

The Iris Flower Dataset involves predicting the flower species given measurements of iris flowers.

It is a multiclass classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

- Sepal length in cm.
- Sepal width in cm.
- Petal length in cm.
- Petal width in cm.
- Class

A sample of the first 5 rows is listed below.

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    ...

The baseline performance on the problem is approximately 33%.

Download the dataset and save it into your current working directory with the filename “*iris.csv*”.

First we will develop each piece of the algorithm in this section, then we will tie all of the elements together into a working implementation applied to a real dataset in the next section.

This k-Nearest Neighbors tutorial is broken down into 3 parts:

- **Step 1**: Calculate Euclidean Distance.
- **Step 2**: Get Nearest Neighbors.
- **Step 3**: Make Predictions.

These steps will teach you the fundamentals of implementing and applying the k-Nearest Neighbors algorithm for classification and regression predictive modeling problems.

**Note**: This tutorial assumes that you are using Python 3.

I believe the code in this tutorial will also work with Python 2.7 without any changes.

The first step is to calculate the distance between two rows in a dataset.

Rows of data are mostly made up of numbers and an easy way to calculate the distance between two rows or vectors of numbers is to draw a straight line. This makes sense in 2D or 3D and scales nicely to higher dimensions.

We can calculate the straight line distance between two vectors using the Euclidean distance measure. It is calculated as the square root of the sum of the squared differences between the two vectors.

- Euclidean Distance = sqrt(sum i to N (x1_i – x2_i)^2)

Where *x1* is the first row of data, *x2* is the second row of data and *i* is the index to a specific column as we sum across all columns.

With Euclidean distance, the smaller the value, the more similar two records will be. A value of 0 means that there is no difference between two records.

Below is a function named *euclidean_distance()* that implements this in Python.

    from math import sqrt

    # calculate the Euclidean distance between two vectors
    def euclidean_distance(row1, row2):
        distance = 0.0
        # the last column is the output value, so it is skipped
        for i in range(len(row1)-1):
            distance += (row1[i] - row2[i])**2
        return sqrt(distance)

You can see that the function assumes that the last column in each row is an output value which is ignored from the distance calculation.

We can test this distance function with a small contrived classification dataset. We will use this dataset a few times as we construct the elements needed for the KNN algorithm.

    X1              X2              Y
    2.7810836       2.550537003     0
    1.465489372     2.362125076     0
    3.396561688     4.400293529     0
    1.38807019      1.850220317     0
    3.06407232      3.005305973     0
    7.627531214     2.759262235     1
    5.332441248     2.088626775     1
    6.922596716     1.77106367      1
    8.675418651     -0.242068655    1
    7.673756466     3.508563011     1

[Figure: scatter plot of the dataset, using different colors to show the class of each point.]

Putting this all together, we can write a small example to test our distance function by printing the distance between the first row and all other rows. We would expect the distance between the first row and itself to be 0, a good thing to look out for.

The full example is listed below.

    # Example of calculating Euclidean distance
    from math import sqrt

    # calculate the Euclidean distance between two vectors
    def euclidean_distance(row1, row2):
        distance = 0.0
        for i in range(len(row1)-1):
            distance += (row1[i] - row2[i])**2
        return sqrt(distance)

    # Test distance function
    dataset = [[2.7810836,2.550537003,0],
        [1.465489372,2.362125076,0],
        [3.396561688,4.400293529,0],
        [1.38807019,1.850220317,0],
        [3.06407232,3.005305973,0],
        [7.627531214,2.759262235,1],
        [5.332441248,2.088626775,1],
        [6.922596716,1.77106367,1],
        [8.675418651,-0.242068655,1],
        [7.673756466,3.508563011,1]]
    row0 = dataset[0]
    for row in dataset:
        distance = euclidean_distance(row0, row)
        print(distance)

Running this example prints the distances between the first row and every row in the dataset, including itself.

    0.0
    1.3290173915275787
    1.9494646655653247
    1.5591439385540549
    0.5356280721938492
    4.850940186986411
    2.592833759950511
    4.214227042632867
    6.522409988228337
    4.985585382449795

Now it is time to use the distance calculation to locate neighbors within a dataset.

Neighbors for a new piece of data in the dataset are the *k* closest instances, as defined by our distance measure.

To locate the neighbors for a new piece of data within a dataset we must first calculate the distance between each record in the dataset and the new piece of data. We can do this using our distance function prepared above.

Once distances are calculated, we must sort all of the records in the training dataset by their distance to the new data. We can then select the top *k* to return as the most similar neighbors.

We can do this by keeping track of the distance for each record in the dataset as a tuple, sorting the list of tuples by the distance (in ascending order, so the closest records come first), and then retrieving the first *k* records as the neighbors.

Below is a function named *get_neighbors()* that implements this.

    # Locate the most similar neighbors
    def get_neighbors(train, test_row, num_neighbors):
        distances = list()
        for train_row in train:
            dist = euclidean_distance(test_row, train_row)
            distances.append((train_row, dist))
        distances.sort(key=lambda tup: tup[1])
        neighbors = list()
        for i in range(num_neighbors):
            neighbors.append(distances[i][0])
        return neighbors

You can see that the *euclidean_distance()* function developed in the previous step is used to calculate the distance between each *train_row* and the new *test_row*.

The list of *train_row* and distance tuples is sorted where a custom key is used ensuring that the second item in the tuple (*tup[1]*) is used in the sorting operation.

Finally, a list of the *num_neighbors* most similar neighbors to *test_row* is returned.

We can test this function with the small contrived dataset prepared in the previous section.

The complete example is listed below.

    # Example of getting neighbors for an instance
    from math import sqrt

    # calculate the Euclidean distance between two vectors
    def euclidean_distance(row1, row2):
        distance = 0.0
        for i in range(len(row1)-1):
            distance += (row1[i] - row2[i])**2
        return sqrt(distance)

    # Locate the most similar neighbors
    def get_neighbors(train, test_row, num_neighbors):
        distances = list()
        for train_row in train:
            dist = euclidean_distance(test_row, train_row)
            distances.append((train_row, dist))
        distances.sort(key=lambda tup: tup[1])
        neighbors = list()
        for i in range(num_neighbors):
            neighbors.append(distances[i][0])
        return neighbors

    # Test the get_neighbors function
    dataset = [[2.7810836,2.550537003,0],
        [1.465489372,2.362125076,0],
        [3.396561688,4.400293529,0],
        [1.38807019,1.850220317,0],
        [3.06407232,3.005305973,0],
        [7.627531214,2.759262235,1],
        [5.332441248,2.088626775,1],
        [6.922596716,1.77106367,1],
        [8.675418651,-0.242068655,1],
        [7.673756466,3.508563011,1]]
    neighbors = get_neighbors(dataset, dataset[0], 3)
    for neighbor in neighbors:
        print(neighbor)

Running this example prints the 3 most similar records in the dataset to the first record, in order of similarity.

As expected, the first record is the most similar to itself and is at the top of the list.

    [2.7810836, 2.550537003, 0]
    [3.06407232, 3.005305973, 0]
    [1.465489372, 2.362125076, 0]

Now that we know how to get neighbors from the dataset, we can use them to make predictions.

The most similar neighbors collected from the training dataset can be used to make predictions.

In the case of classification, we can return the most represented class among the neighbors.

We can achieve this by calling the *max()* function on the set of unique class values observed in the neighbors, using the *count()* method of the list of output values as the key. For each unique class value, the key counts how many neighbors have that value, and *max()* returns the class value with the largest count.
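To see this voting idiom in isolation, here is a minimal sketch with a made-up list of neighbor class values. It also shows *collections.Counter* from the standard library as an alternative. This is not used in the tutorial code and is included only for comparison.

```python
from collections import Counter

# hypothetical class values taken from five neighbors
output_values = [1, 0, 1, 1, 0]

# the idiom used in this tutorial: the unique value with the highest count
prediction = max(set(output_values), key=output_values.count)

# an alternative from the standard library that gives the same winner here
alternative = Counter(output_values).most_common(1)[0][0]
print(prediction, alternative)
```

Note that when two classes receive the same number of votes, which one is returned depends on iteration order, so an odd value of *k* is a sensible choice for two-class problems.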

Below is the function named *predict_classification()* that implements this.

    # Make a classification prediction with neighbors
    def predict_classification(train, test_row, num_neighbors):
        neighbors = get_neighbors(train, test_row, num_neighbors)
        output_values = [row[-1] for row in neighbors]
        prediction = max(set(output_values), key=output_values.count)
        return prediction

We can test this function on the above contrived dataset.

Below is a complete example.

    # Example of making predictions
    from math import sqrt

    # calculate the Euclidean distance between two vectors
    def euclidean_distance(row1, row2):
        distance = 0.0
        for i in range(len(row1)-1):
            distance += (row1[i] - row2[i])**2
        return sqrt(distance)

    # Locate the most similar neighbors
    def get_neighbors(train, test_row, num_neighbors):
        distances = list()
        for train_row in train:
            dist = euclidean_distance(test_row, train_row)
            distances.append((train_row, dist))
        distances.sort(key=lambda tup: tup[1])
        neighbors = list()
        for i in range(num_neighbors):
            neighbors.append(distances[i][0])
        return neighbors

    # Make a classification prediction with neighbors
    def predict_classification(train, test_row, num_neighbors):
        neighbors = get_neighbors(train, test_row, num_neighbors)
        output_values = [row[-1] for row in neighbors]
        prediction = max(set(output_values), key=output_values.count)
        return prediction

    # Test the prediction function
    dataset = [[2.7810836,2.550537003,0],
        [1.465489372,2.362125076,0],
        [3.396561688,4.400293529,0],
        [1.38807019,1.850220317,0],
        [3.06407232,3.005305973,0],
        [7.627531214,2.759262235,1],
        [5.332441248,2.088626775,1],
        [6.922596716,1.77106367,1],
        [8.675418651,-0.242068655,1],
        [7.673756466,3.508563011,1]]
    prediction = predict_classification(dataset, dataset[0], 3)
    print('Expected %d, Got %d.' % (dataset[0][-1], prediction))

Running this example prints the expected classification of 0 and the actual classification predicted from the 3 most similar neighbors in the dataset.

Expected 0, Got 0.

For regression problems, we can imagine how the *predict_classification()* function could be changed to return the mean of the neighbors' output values instead of the most common class.
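One way this change might look is sketched below. The function name *predict_regression()* and the tiny one-input dataset are assumptions made for illustration. The distance and neighbor functions are repeated from earlier so that the example runs on its own.

```python
from math import sqrt

# calculate the Euclidean distance between two vectors (last column ignored)
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    return [distances[i][0] for i in range(num_neighbors)]

# Make a regression prediction as the mean of the neighbors' output values
def predict_regression(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    return sum(output_values) / float(len(output_values))

# hypothetical dataset: one input column followed by a numerical output column
dataset = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [8.0, 80.0], [9.0, 90.0]]
# the output of the new row is unknown, so it is left as None
print(predict_regression(dataset, [2.1, None], 3))
```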

We now have all of the pieces to make predictions with KNN. Let’s apply it to a real dataset.

This section applies the KNN algorithm to the Iris flowers dataset.

The first step is to load the dataset and convert the loaded data to numbers that we can use with the distance calculations. For this we will use the helper function *load_csv()* to load the file, *str_column_to_float()* to convert string numbers to floats and *str_column_to_int()* to convert the class column to integer values.

We will evaluate the algorithm using k-fold cross-validation with 5 folds. This means that 150/5=30 records will be in each fold. We will use the helper functions *evaluate_algorithm()* to evaluate the algorithm with cross-validation and *accuracy_metric()* to calculate the accuracy of predictions.

A new function named *k_nearest_neighbors()* was developed to manage the application of the KNN algorithm, using the training dataset to make a prediction for each row in a test dataset.

The complete example is listed below.

    # k-nearest neighbors on the Iris Flowers Dataset
    from random import seed
    from random import randrange
    from csv import reader
    from math import sqrt

    # Load a CSV file
    def load_csv(filename):
        dataset = list()
        with open(filename, 'r') as file:
            csv_reader = reader(file)
            for row in csv_reader:
                if not row:
                    continue
                dataset.append(row)
        return dataset

    # Convert string column to float
    def str_column_to_float(dataset, column):
        for row in dataset:
            row[column] = float(row[column].strip())

    # Convert string column to integer
    def str_column_to_int(dataset, column):
        class_values = [row[column] for row in dataset]
        unique = set(class_values)
        lookup = dict()
        for i, value in enumerate(unique):
            lookup[value] = i
        for row in dataset:
            row[column] = lookup[row[column]]
        return lookup

    # Find the min and max values for each column
    def dataset_minmax(dataset):
        minmax = list()
        for i in range(len(dataset[0])):
            col_values = [row[i] for row in dataset]
            value_min = min(col_values)
            value_max = max(col_values)
            minmax.append([value_min, value_max])
        return minmax

    # Rescale dataset columns to the range 0-1
    def normalize_dataset(dataset, minmax):
        for row in dataset:
            for i in range(len(row)):
                row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

    # Split a dataset into k folds
    def cross_validation_split(dataset, n_folds):
        dataset_split = list()
        dataset_copy = list(dataset)
        fold_size = int(len(dataset) / n_folds)
        for _ in range(n_folds):
            fold = list()
            while len(fold) < fold_size:
                index = randrange(len(dataset_copy))
                fold.append(dataset_copy.pop(index))
            dataset_split.append(fold)
        return dataset_split

    # Calculate accuracy percentage
    def accuracy_metric(actual, predicted):
        correct = 0
        for i in range(len(actual)):
            if actual[i] == predicted[i]:
                correct += 1
        return correct / float(len(actual)) * 100.0

    # Evaluate an algorithm using a cross validation split
    def evaluate_algorithm(dataset, algorithm, n_folds, *args):
        folds = cross_validation_split(dataset, n_folds)
        scores = list()
        for fold in folds:
            train_set = list(folds)
            train_set.remove(fold)
            train_set = sum(train_set, [])
            test_set = list()
            for row in fold:
                row_copy = list(row)
                test_set.append(row_copy)
                row_copy[-1] = None
            predicted = algorithm(train_set, test_set, *args)
            actual = [row[-1] for row in fold]
            accuracy = accuracy_metric(actual, predicted)
            scores.append(accuracy)
        return scores

    # Calculate the Euclidean distance between two vectors
    def euclidean_distance(row1, row2):
        distance = 0.0
        for i in range(len(row1)-1):
            distance += (row1[i] - row2[i])**2
        return sqrt(distance)

    # Locate the most similar neighbors
    def get_neighbors(train, test_row, num_neighbors):
        distances = list()
        for train_row in train:
            dist = euclidean_distance(test_row, train_row)
            distances.append((train_row, dist))
        distances.sort(key=lambda tup: tup[1])
        neighbors = list()
        for i in range(num_neighbors):
            neighbors.append(distances[i][0])
        return neighbors

    # Make a prediction with neighbors
    def predict_classification(train, test_row, num_neighbors):
        neighbors = get_neighbors(train, test_row, num_neighbors)
        output_values = [row[-1] for row in neighbors]
        prediction = max(set(output_values), key=output_values.count)
        return prediction

    # kNN Algorithm
    def k_nearest_neighbors(train, test, num_neighbors):
        predictions = list()
        for row in test:
            output = predict_classification(train, row, num_neighbors)
            predictions.append(output)
        return(predictions)

    # Test the kNN on the Iris Flowers dataset
    seed(1)
    filename = 'iris.csv'
    dataset = load_csv(filename)
    for i in range(len(dataset[0])-1):
        str_column_to_float(dataset, i)
    # convert class column to integers
    str_column_to_int(dataset, len(dataset[0])-1)
    # evaluate algorithm
    n_folds = 5
    num_neighbors = 5
    scores = evaluate_algorithm(dataset, k_nearest_neighbors, n_folds, num_neighbors)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

Running the example prints the classification accuracy on each cross-validation fold as well as the mean accuracy across all folds.

We can see that the mean accuracy of about 96.7% is dramatically better than the baseline accuracy of 33%.

```
Scores: [96.66666666666667, 96.66666666666667, 100.0, 90.0, 100.0]
Mean Accuracy: 96.667%
```

We can use the training dataset to make predictions for new observations (rows of data).

This involves making a call to the *predict_classification()* function with a row representing our new observation to predict the class label.

```python
...
# predict the label
label = predict_classification(dataset, row, num_neighbors)
```

We also might like to know the class label (string) for a prediction.

We can update the *str_column_to_int()* function to print the mapping of string class names to integers so we can interpret the prediction made by the model.

```python
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print('[%s] => %d' % (value, i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
```

Tying this together, a complete example of using KNN with the entire dataset and making a single prediction for a new observation is listed below.

```python
# Make Predictions with k-nearest neighbors on the Iris Flowers Dataset
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print('[%s] => %d' % (value, i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

# Calculate the Euclidean distance between a new row and a training row
def euclidean_distance(row1, row2):
    distance = 0.0
    # row2 is a training row whose last column is the class label, so loop
    # over its feature columns only; the new row (row1) has no label column
    for i in range(len(row2)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

# Make a prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# Make a prediction with KNN on Iris Dataset
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# define model parameter
num_neighbors = 5
# define a new record (the same values shown in the output below)
row = [4.5, 2.3, 1.3, 0.3]
# predict the label
label = predict_classification(dataset, row, num_neighbors)
print('Data=%s, Predicted: %s' % (row, label))
```

Running the example first summarizes the mapping of class labels to integers and then uses the entire dataset as the model's training data.

Then a new observation is defined (in this case I took a row from the dataset), and a predicted label is calculated.

In this case, our observation is predicted as belonging to class 1, which we know is “*Iris-setosa*”.

```
[Iris-virginica] => 0
[Iris-setosa] => 1
[Iris-versicolor] => 2
Data=[4.5, 2.3, 1.3, 0.3], Predicted: 1
```

This section lists extensions to the tutorial that you may wish to investigate.

- **Tune KNN**. Try larger and larger *k* values to see if you can improve the performance of the algorithm on the Iris dataset.
- **Regression**. Adapt the example and apply it to a regression predictive modeling problem (e.g. predict a numerical value).
- **More Distance Measures**. Implement other distance measures that you can use to find similar historical data, such as Hamming distance, Manhattan distance and Minkowski distance.
- **Data Preparation**. Distance measures are strongly affected by the scale of the input data. Experiment with standardization and other data preparation methods in order to improve results.
- **More Problems**. As always, experiment with the technique on more and different classification and regression problems.
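As a possible starting point for the *More Distance Measures* extension, here is a minimal sketch of a Minkowski distance that could be swapped in for *euclidean_distance()*. The function name `minkowski_distance` and the parameter `p` are assumptions made for illustration and are not part of the tutorial code. As in the tutorial, the last column of each row is assumed to hold the class label and is skipped.

```python
# Minkowski distance between two rows; the last column (class label) is ignored.
# p=1 gives the Manhattan distance and p=2 gives the Euclidean distance.
def minkowski_distance(row1, row2, p=2):
    total = 0.0
    for i in range(len(row1) - 1):
        total += abs(row1[i] - row2[i]) ** p
    return total ** (1.0 / p)

# Two contrived rows: two features followed by a class label
row_a = [1.0, 2.0, 0]
row_b = [4.0, 6.0, 1]
print('Manhattan: %.3f' % minkowski_distance(row_a, row_b, p=1))
print('Euclidean: %.3f' % minkowski_distance(row_a, row_b, p=2))
```

Exposing `p` as a parameter means the same function covers both the Manhattan and Euclidean cases, so it can be tuned alongside *k*.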

- Section 3.5 Comparison of Linear Regression with K-Nearest Neighbors, page 104, An Introduction to Statistical Learning, 2014.
- Section 18.8. Nonparametric Models, page 737, Artificial Intelligence: A Modern Approach, 2010.
- Section 13.5 K-Nearest Neighbors, page 350, Applied Predictive Modeling, 2013.
- Section 4.7, Instance-based learning, page 128, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, 2005.

In this tutorial you discovered how to implement the k-Nearest Neighbors algorithm from scratch with Python.

Specifically, you learned:

- How to code the k-Nearest Neighbors algorithm step-by-step.
- How to evaluate k-Nearest Neighbors on a real dataset.
- How to use k-Nearest Neighbors to make a prediction for new data.

Take action!

- Follow the tutorial and implement KNN from scratch.
- Adapt the example to another dataset.
- Follow the extensions and improve upon the implementation.

Leave a comment and share your experiences.

The post Develop k-Nearest Neighbors in Python From Scratch appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Maximum Likelihood Estimation for Machine Learning appeared first on Machine Learning Mastery.

Density estimation is the problem of estimating the probability distribution for a sample of observations from a problem domain.

There are many techniques for solving density estimation, although a common framework used throughout the field of machine learning is maximum likelihood estimation. Maximum likelihood estimation involves defining a likelihood function for calculating the conditional probability of observing the data sample given a probability distribution and distribution parameters. This approach can be used to search a space of possible distributions and parameters.

This flexible probabilistic framework also provides the foundation for many machine learning algorithms, including important methods such as linear regression and logistic regression for predicting numeric values and class labels respectively, but also more generally for deep learning artificial neural networks.

In this post, you will discover a gentle introduction to maximum likelihood estimation.

After reading this post, you will know:

- Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
- It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
- It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.

Let’s get started.

This tutorial is divided into three parts; they are:

- Problem of Probability Density Estimation
- Maximum Likelihood Estimation
- Relationship to Machine Learning

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

For example, consider a sample of observations (*X*) from a domain (*x1, x2, x3, …, xn*), where each observation is drawn independently from the domain with the same probability distribution (so-called independent and identically distributed, i.i.d., or close to it).

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data (*X*).

- How do you choose the probability distribution function?
- How do you choose the parameters for the probability distribution function?

This problem is made more challenging because the sample (*X*) drawn from the population is typically small and noisy, meaning that any evaluation of an estimated probability density function and its parameters will have some error.

There are many techniques for solving this problem, although two common approaches are:

- Maximum a Posteriori (MAP), a Bayesian method.
- Maximum Likelihood Estimation (MLE), a frequentist method.

The main difference is that MLE assumes that all solutions are equally likely beforehand, whereas MAP allows prior information about the form of the solution to be harnessed.
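To make the difference concrete, here is a minimal sketch using a coin-flip (Bernoulli) example that is not part of the original post. It assumes a Beta(a, b) prior on the probability of heads, and the function names `mle_estimate` and `map_estimate` are hypothetical. With a flat Beta(1, 1) prior, the MAP estimate reduces to the MLE estimate.

```python
# MLE vs MAP for the probability of heads in a coin-flip (Bernoulli) model.
# Assumption: the prior is Beta(a, b); the MAP formula below is the mode of the
# posterior and is valid when both posterior parameters are greater than 1.

def mle_estimate(heads, flips):
    # Maximum likelihood: the observed fraction of heads
    return heads / flips

def map_estimate(heads, flips, a, b):
    # Mode of the Beta(heads + a, flips - heads + b) posterior
    return (heads + a - 1) / (flips + a + b - 2)

heads, flips = 7, 10
print('MLE: %.3f' % mle_estimate(heads, flips))
print('MAP with flat prior Beta(1, 1): %.3f' % map_estimate(heads, flips, 1, 1))
print('MAP with prior Beta(5, 5): %.3f' % map_estimate(heads, flips, 5, 5))
```

The Beta(5, 5) prior encodes a belief that the coin is roughly fair, so it pulls the estimate from the observed fraction toward 0.5.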

In this post, we will take a closer look at the MLE method and its relationship to applied machine learning.

One solution to probability density estimation is referred to as Maximum Likelihood Estimation, or MLE for short.

Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample (*X*).

First, it involves defining a parameter called *theta* that defines both the choice of the probability density function and the parameters of that distribution. It may be a vector of numerical values whose values change smoothly and map to different probability distributions and their parameters.

In Maximum Likelihood Estimation, we wish to maximize the probability of observing the data from the joint probability distribution given a specific probability distribution and its parameters, stated formally as:

- P(X | theta)

This conditional probability is often stated using the semicolon (;) notation instead of the bar notation (|) because *theta* is not a random variable, but instead an unknown parameter. For example:

- P(X ; theta)

or

- P(x1, x2, x3, …, xn ; theta)

This resulting conditional probability is referred to as the likelihood of observing the data given the model parameters and written using the notation *L()* to denote the likelihood function. For example:

- L(X ; theta)

The objective of Maximum Likelihood Estimation is to find the set of parameters (*theta*) that maximizes the likelihood function, i.e. results in the largest likelihood value.

- maximize L(X ; theta)

We can unpack the conditional probability calculated by the likelihood function.

Given that the sample is comprised of n examples, we can frame this as the joint probability of the observed data samples *x1, x2, x3, …, xn* in *X* given the probability distribution parameters (*theta*).

- L(x1, x2, x3, …, xn ; theta)

Because the examples are assumed to be independent, the joint probability distribution can be restated as the multiplication of the conditional probability for observing each example given the distribution parameters.

- product i to n P(xi ; theta)
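As a minimal sketch of this product, assume the data are modeled with a Gaussian distribution with a known standard deviation of 1.0, so that *theta* is just the mean. The sample values, the candidate means, and the helper names `gaussian_pdf` and `likelihood` are made up for illustration.

```python
from math import exp, pi, sqrt

# Gaussian probability density for a single observation
def gaussian_pdf(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (sqrt(2 * pi) * stdev)

# Likelihood: product of the density of each observation given theta (the mean)
def likelihood(sample, mean, stdev=1.0):
    result = 1.0
    for x in sample:
        result *= gaussian_pdf(x, mean, stdev)
    return result

# Contrived sample and candidate values for theta
sample = [4.8, 5.1, 5.3, 4.9, 5.4]
candidates = [3.0, 4.0, 5.0, 6.0, 7.0]
for mean in candidates:
    print('theta=%.1f, likelihood=%.3e' % (mean, likelihood(sample, mean)))

# The candidate with the largest likelihood is the best of those tried
best = max(candidates, key=lambda mean: likelihood(sample, mean))
print('Best candidate: %.1f' % best)
```

Searching a fixed list of candidates is only for demonstration; in practice the parameters are found analytically or with an optimization algorithm.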

Multiplying many small probabilities together can be numerically unstable in practice; therefore, it is common to restate this problem as the sum of the log conditional probabilities of observing each example given the model parameters.

- sum i to n log(P(xi ; theta))

Here, the logarithm with base *e*, called the natural logarithm, is commonly used.

This product over many probabilities can be inconvenient […] it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product into a sum

— Page 132, Deep Learning, 2016.
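The underflow problem is easy to reproduce. The sketch below uses 1,000 made-up probabilities of 0.01 each; the values are an assumption chosen only to trigger the effect.

```python
from math import log

# 1,000 hypothetical per-example probabilities (contrived values)
probs = [0.01] * 1000

# Multiplying them together underflows to zero in 64-bit floating point
product = 1.0
for p in probs:
    product *= p
print('Product: %s' % product)

# Summing the logs stays well within the representable range
log_sum = sum(log(p) for p in probs)
print('Sum of logs: %.3f' % log_sum)
```

Once the product has collapsed to zero, every candidate set of parameters looks equally bad, whereas the sum of logs still allows candidates to be compared.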

Given the frequent use of log in the likelihood function, it is commonly referred to as a log-likelihood function.

It is common in optimization problems to prefer to minimize the cost function, rather than to maximize it. Therefore, the negative of the log-likelihood function is used, referred to generally as a Negative Log-Likelihood (NLL) function.

- minimize -sum i to n log(P(xi ; theta))
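Written in more conventional notation, the steps above can be summarized as follows. This assumes i.i.d. examples and uses the symbol \hat{\theta} for the estimate, which is not used elsewhere in the post.

```latex
\begin{aligned}
\hat{\theta}
  &= \arg\max_{\theta} \prod_{i=1}^{n} P(x_i ; \theta) \\
  &= \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i ; \theta) \\
  &= \arg\min_{\theta} \; -\sum_{i=1}^{n} \log P(x_i ; \theta)
\end{aligned}
```

Each line has the same solution because the logarithm is monotonically increasing and negating a function turns its maximum into a minimum.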

In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL) …

— Page 133, Deep Learning, 2016.
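As a minimal sketch, the negative log-likelihood of a Gaussian model can be computed directly. For a Gaussian, the maximum likelihood estimates are known in closed form: the sample mean and the divide-by-n sample standard deviation. The data values and the function name `gaussian_nll` are assumptions for illustration.

```python
from math import log, pi, sqrt

# Negative log-likelihood of a sample under a Gaussian with the given parameters
def gaussian_nll(sample, mean, stdev):
    nll = 0.0
    for x in sample:
        log_pdf = -0.5 * log(2 * pi * stdev ** 2) - ((x - mean) ** 2) / (2 * stdev ** 2)
        nll -= log_pdf
    return nll

# Contrived sample
sample = [4.8, 5.1, 5.3, 4.9, 5.4]
n = len(sample)

# Closed-form maximum likelihood estimates for a Gaussian
mean_mle = sum(sample) / n
stdev_mle = sqrt(sum((x - mean_mle) ** 2 for x in sample) / n)
print('MLE mean=%.3f, stdev=%.3f' % (mean_mle, stdev_mle))

# Moving away from the estimates increases the NLL
print('NLL at MLE: %.3f' % gaussian_nll(sample, mean_mle, stdev_mle))
print('NLL with shifted mean: %.3f' % gaussian_nll(sample, mean_mle + 0.5, stdev_mle))
print('NLL with doubled stdev: %.3f' % gaussian_nll(sample, mean_mle, stdev_mle * 2))
```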

This problem of density estimation is directly related to applied machine learning.

We can frame the problem of fitting a machine learning model as the problem of probability density estimation. Specifically, the choice of model and model parameters is referred to as a modeling hypothesis *h*, and the problem involves finding *h* that best explains the data *X*.

- P(X ; h)

We can, therefore, find the modeling hypothesis that maximizes the likelihood function.

- maximize L(X ; h)

Or, more fully:

- maximize sum i to n log(P(xi ; h))

This provides the basis for estimating the probability density of a dataset, typically used in unsupervised machine learning algorithms; for example:

- Clustering algorithms.

Using the expected log joint probability as a key quantity for learning in a probability model with hidden variables is better known in the context of the celebrated “expectation maximization” or EM algorithm.

— Page 365, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

The Maximum Likelihood Estimation framework is also a useful tool for supervised machine learning.

This applies to data where we have input and output variables, where the output variable may be a numerical value or a class label in the case of regression and classification predictive modeling respectively.

We can state this as the conditional probability of the output (*y*) given the input (*X*) and the modeling hypothesis (*h*).

- maximize L(y|X ; h)

Or, more fully:

- maximize sum i to n log(P(yi|xi ; h))

The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability P(y | x ; theta) in order to predict y given x. This is actually the most common situation because it forms the basis for most supervised learning.

— Page 133, Deep Learning, 2016.

This means that the same Maximum Likelihood Estimation framework that is generally used for density estimation can be used to find a supervised learning model and parameters.

This provides the basis for foundational linear modeling techniques, such as:

- Linear Regression, for predicting a numerical value.
- Logistic Regression, for binary classification.

In the case of linear regression, the model is constrained to a line and involves finding a set of coefficients for the line that best fits the observed data. Fortunately, this problem can be solved analytically (e.g. directly using linear algebra).
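As a sketch of the analytical solution for a single input variable, the coefficients can be computed directly. This assumes Gaussian noise on the output, under which maximizing the likelihood is equivalent to minimizing the sum of squared errors. The data and the function name `fit_simple_linear_regression` are made up for illustration.

```python
# Closed-form least squares fit of y = b0 + b1 * x.
# Assumption: Gaussian noise on y, so this is also the maximum likelihood fit.
def fit_simple_linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    b1 = covariance / variance
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Contrived data that lies exactly on the line y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
b0, b1 = fit_simple_linear_regression(x, y)
print('b0=%.3f, b1=%.3f' % (b0, b1))
```

With more than one input variable, the same idea is usually expressed with matrix operations rather than the scalar formulas used here.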

In the case of logistic regression, the model defines a line and involves finding a set of coefficients for the line that best separates the classes. This cannot be solved analytically and is often solved by searching the space of possible coefficient values using an efficient optimization algorithm such as the BFGS algorithm or variants.

Both methods can also be solved less efficiently using a more general optimization algorithm such as stochastic gradient descent.
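The sketch below fits a one-input logistic regression by minimizing the negative log-likelihood with plain batch gradient descent, used here as a simpler stand-in for BFGS or stochastic gradient descent. The data, learning rate, and number of iterations are assumptions chosen for illustration, not tuned values.

```python
from math import exp, log

# Logistic (sigmoid) function
def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# Negative log-likelihood of binary labels under the logistic model
def nll(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total -= yi * log(p) + (1 - yi) * log(1.0 - p)
    return total

# Batch gradient descent on the negative log-likelihood
def fit_logistic_regression(x, y, learning_rate=0.05, n_iter=2000):
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        grad_b0, grad_b1 = 0.0, 0.0
        for xi, yi in zip(x, y):
            error = sigmoid(b0 + b1 * xi) - yi
            grad_b0 += error
            grad_b1 += error * xi
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Contrived data with overlapping classes (not perfectly separable)
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_regression(x, y)
print('b0=%.3f, b1=%.3f' % (b0, b1))
print('NLL before: %.3f, after: %.3f' % (nll(x, y, 0.0, 0.0), nll(x, y, b0, b1)))
```

The classes are deliberately overlapping: with perfectly separable data the likelihood keeps improving as the coefficients grow, and the search would not settle on finite values.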

In fact, most machine learning models can be framed under the maximum likelihood estimation framework, providing a useful and consistent way to approach predictive modeling as an optimization problem.

An important benefit of the maximum likelihood estimator in machine learning is that as the size of the dataset increases, the quality of the estimator continues to improve.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 5 Machine Learning Basics, Deep Learning, 2016.
- Chapter 2 Probability Distributions, Pattern Recognition and Machine Learning, 2006.
- Chapter 8 Model Inference and Averaging, The Elements of Statistical Learning, 2016.
- Chapter 9 Probabilistic methods, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- Chapter 22 Maximum Likelihood and Clustering, Information Theory, Inference and Learning Algorithms, 2003.
- Chapter 8 Learning distributions, Bayesian Reasoning and Machine Learning, 2011.

- Maximum likelihood estimation, Wikipedia.
- Maximum Likelihood, Wolfram MathWorld.
- Likelihood function, Wikipedia.
- Some problems understanding the definition of a function in a maximum likelihood method, CrossValidated.

In this post, you discovered a gentle introduction to maximum likelihood estimation.

Specifically, you learned:

- Maximum Likelihood Estimation is a probabilistic framework for solving the problem of density estimation.
- It involves maximizing a likelihood function in order to find the probability distribution and parameters that best explain the observed data.
- It provides a framework for predictive modeling in machine learning where finding model parameters can be framed as an optimization problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
