The post A Gentle Introduction To Sigmoid Function appeared first on Machine Learning Mastery.

In this tutorial, you will discover the sigmoid function and its role in learning from examples in neural networks.

After completing this tutorial, you will know:

- The sigmoid function
- Linear vs. non-linear separability
- Why a neural network can make complex decision boundaries if a sigmoid unit is used

Let’s get started.

This tutorial is divided into three parts; they are:

- The sigmoid function and its properties
- Linear vs. non-linearly separable problems
- Using a sigmoid as an activation function in neural networks

The sigmoid function is a special form of the logistic function and is usually denoted by σ(x) or sig(x). It is given by:

σ(x) = 1/(1+exp(-x))

The graph of the sigmoid function is an S-shaped curve. Its derivative can be expressed in terms of the function itself, σ'(x) = σ(x)(1 − σ(x)), and is a bell-shaped curve that peaks at x = 0.

A few other properties include:

- Domain: (-∞, +∞)
- Range: (0, +1)
- σ(0) = 0.5
- The function is monotonically increasing.
- The function is continuous everywhere.
- The function is differentiable everywhere in its domain.
- Numerically, it is enough to compute this function’s value over a small range of numbers, e.g., [-10, +10]. For values less than -10, the function’s value is almost zero; for values greater than +10, it is almost one.
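
These properties are straightforward to check numerically. Below is a minimal sketch in Python with NumPy (an illustrative snippet; the chosen values are arbitrary):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0.0))    # 0.5, as listed above
print(sigmoid(-10.0))  # ~0.0000454, almost zero
print(sigmoid(10.0))   # ~0.9999546, almost one

# Monotonically increasing over a wide range of inputs.
xs = np.linspace(-100, 100, 1001)
print(bool(np.all(np.diff(sigmoid(xs)) >= 0)))  # True
```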

The sigmoid function is also called a squashing function: its domain is the set of all real numbers, but its range is (0, 1). Hence, whether the input is a very large negative number, a very large positive number, or anything in between, the output always lies between 0 and 1.

The sigmoid function is used as an activation function in neural networks. To review the role of an activation function: in one layer of a neural network, a weighted sum of inputs is passed through the activation function, and this output serves as an input to the next layer.

When the activation function for a neuron is a sigmoid function, the output of this unit is guaranteed to be between 0 and 1. Also, as the sigmoid is a non-linear function, the output of this unit is a non-linear function of the weighted sum of inputs. Such a neuron is termed a sigmoid unit.

Suppose we have a typical classification problem, where we have a set of points in space and each point is assigned a class label. If a straight line (or a hyperplane in an n-dimensional space) can divide the two classes, then we have a linearly separable problem. On the other hand, if a straight line is not enough to divide the two classes, then we have a non-linearly separable problem. Consider data in two-dimensional space, where each point is assigned a red or blue class label: a linearly separable problem requires only a linear boundary to distinguish between the two classes, whereas a non-linearly separable problem requires a non-linear decision boundary.

For three dimensional space, a linear decision boundary can be described via the equation of a plane. For an n-dimensional space, the linear decision boundary is described by the equation of a hyperplane.

If we use a linear activation function in a neural network, then this model can only learn linearly separable problems. However, with the addition of just one hidden layer and a sigmoid activation function in the hidden layer, the neural network can easily learn a non-linearly separable problem. Using a non-linear function produces non-linear boundaries and hence, the sigmoid function can be used in neural networks for learning complex decision functions.
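
As a concrete (and deliberately tiny) illustration, the sketch below trains a network with one hidden layer of sigmoid units on XOR, a classic non-linearly separable problem. The layer sizes, learning rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR: no straight line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 sigmoid units and a sigmoid output unit.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

lr, losses = 1.0, []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass, using the identity sigmoid'(x) = s * (1 - s).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(losses[0], losses[-1])  # mean squared error before vs after training
```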

An activation function in a neural network is conventionally chosen to be monotonically increasing, so non-monotonic functions such as sin(x) or cos(x) are poor choices as activation functions. The activation function should also be defined and continuous everywhere in the space of real numbers, and differentiable, so that gradient-based learning is possible.

Typically, the backpropagation algorithm uses gradient descent to learn the weights of a neural network. To derive this algorithm, the derivative of the activation function is required.

The fact that the sigmoid function is monotonic, continuous, and differentiable everywhere, coupled with the property that its derivative can be expressed in terms of itself, makes it easy to derive the update equations for learning the weights in a neural network with the backpropagation algorithm.
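
The identity σ'(x) = σ(x)(1 − σ(x)) is easy to verify against a finite-difference approximation; a small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # the derivative expressed in terms of the function itself

# Compare the closed form against a central finite difference.
x = np.linspace(-5.0, 5.0, 101)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
print(float(np.max(np.abs(numeric - sigmoid_derivative(x)))))  # tiny (round-off level)
```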

This section lists some ideas for extending the tutorial that you may wish to explore.

- Other non-linear activation functions, e.g., tanh function
- A Gentle Introduction to the Rectified Linear Unit (ReLU)
- Deep learning

If you explore any of these extensions, I’d love to know. Post your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Calculus in action: Neural networks
- A gentle introduction to gradient descent procedure
- Neural networks are function approximation algorithms
- How to Choose an Activation Function for Deep Learning

- Jason Brownlee’s excellent resource on Calculus Books for Machine Learning

- Pattern Recognition and Machine Learning by Christopher M. Bishop.
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
- Thomas’ Calculus, 14th edition, 2017 (based on the original works of George B. Thomas, revised by Joel Hass, Christopher Heil, and Maurice Weir).

In this tutorial, you discovered what a sigmoid function is. Specifically, you learned:

- The sigmoid function and its properties
- Linear vs. non-linear decision boundaries
- Why adding a sigmoid function at the hidden layer enables a neural network to learn complex non-linear boundaries

Ask your questions in the comments below and I will do my best to answer.

The post 14 Different Types of Learning in Machine Learning appeared first on Machine Learning Mastery.

The focus of the field of machine learning is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means synthesizing useful concepts from historical data.

As such, there are many different types of learning that you may encounter as a practitioner in the field of machine learning: from whole fields of study to specific techniques.

In this post, you will discover a gentle introduction to the different types of learning that you may encounter in the field of machine learning.

After reading this post, you will know:

- Fields of study, such as supervised, unsupervised, and reinforcement learning.
- Hybrid types of learning, such as semi-supervised and self-supervised learning.
- Broad techniques, such as active, online, and transfer learning.

Let’s get started.

Given that the focus of the field of machine learning is “*learning*,” there are many types that you may encounter as a practitioner.

Some types of learning describe whole subfields of study comprised of many different types of algorithms such as “*supervised learning*.” Others describe powerful techniques that you can use on your projects, such as “*transfer learning*.”

There are perhaps 14 types of learning that you must be familiar with as a machine learning practitioner; they are:

**Learning Problems**

- 1. Supervised Learning
- 2. Unsupervised Learning
- 3. Reinforcement Learning

**Hybrid Learning Problems**

- 4. Semi-Supervised Learning
- 5. Self-Supervised Learning
- 6. Multi-Instance Learning

**Statistical Inference**

- 7. Inductive Learning
- 8. Deductive Inference
- 9. Transductive Learning

**Learning Techniques**

- 10. Multi-Task Learning
- 11. Active Learning
- 12. Online Learning
- 13. Transfer Learning
- 14. Ensemble Learning

In the following sections, we will take a closer look at each in turn.

**Did I miss an important type of learning?**

Let me know in the comments below.

First, we will take a closer look at three main types of learning problems in machine learning: supervised, unsupervised, and reinforcement learning.

Supervised learning describes a class of problem that involves using a model to learn a mapping between input examples and the target variable.

Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems.

— Page 3, Pattern Recognition and Machine Learning, 2006.

Models are fit on training data comprised of inputs and outputs, and then used to make predictions on test sets where only the inputs are provided. The model’s predictions are compared to the withheld target variables and used to estimate the skill of the model.

Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set.

— Page 695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

There are two main types of supervised learning problems: classification, which involves predicting a class label, and regression, which involves predicting a numerical value.

- **Classification**: Supervised learning problem that involves predicting a class label.
- **Regression**: Supervised learning problem that involves predicting a numerical label.

Both classification and regression problems may have one or more input variables and input variables may be any data type, such as numerical or categorical.

An example of a classification problem would be the MNIST handwritten digits dataset where the inputs are images of handwritten digits (pixel data) and the output is a class label for what digit the image represents (numbers 0 to 9).

An example of a regression problem would be the Boston house prices dataset where the inputs are variables that describe a neighborhood and the output is a house price in dollars.
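
The two problem types can be sketched with tiny synthetic stand-ins (not the MNIST or Boston datasets themselves; the data and rules below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression: predict a numerical value. Fit y = w*x + b by least squares.
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)
w, b = np.polyfit(x, y, deg=1)   # slope and intercept
print(round(w, 1), round(b, 1))  # close to 3.0 and 2.0

# Classification: predict a class label. A nearest-centroid rule in 1-D.
class0 = rng.normal(loc=0.0, size=50)
class1 = rng.normal(loc=5.0, size=50)
threshold = (class0.mean() + class1.mean()) / 2.0

def predict(v):
    return int(v > threshold)

print(predict(0.3), predict(4.8))  # 0 1
```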

Some machine learning algorithms are described as “*supervised*” machine learning algorithms as they are designed for supervised machine learning problems. Popular examples include: decision trees, support vector machines, and many more.

Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs.

— Page 28, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.

Algorithms are referred to as “*supervised*” because they learn by making predictions given examples of input data, and the models are supervised and corrected via an algorithm to better predict the expected target outputs in the training dataset.

The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do.

— Page 105, Deep Learning, 2016.

Some algorithms may be specifically designed for classification (such as logistic regression) or regression (such as linear regression) and some may be used for both types of problems with minor modifications (such as artificial neural networks).

Unsupervised learning describes a class of problems that involves using a model to describe or extract relationships in data.

Compared to supervised learning, unsupervised learning operates upon only the input data without outputs or target variables. As such, unsupervised learning does not have a teacher correcting the model, as in the case of supervised learning.

In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

— Page 105, Deep Learning, 2016.

There are many types of unsupervised learning, although there are two main problems that are often encountered by a practitioner: clustering, which involves finding groups in the data, and density estimation, which involves summarizing the distribution of data.

- **Clustering**: Unsupervised learning problem that involves finding groups in data.
- **Density Estimation**: Unsupervised learning problem that involves summarizing the distribution of data.

An example of a clustering algorithm is k-Means where *k* refers to the number of clusters to discover in the data. An example of a density estimation algorithm is Kernel Density Estimation that involves using small groups of closely related data samples to estimate the distribution for new points in the problem space.

The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of “good traffic days” and “bad traffic days” without ever being given labeled examples of each by a teacher.

— Pages 694-695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Clustering and density estimation may be performed to learn about the patterns in the data.
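
A minimal k-Means sketch in NumPy makes the cluster-finding idea concrete (the data, k, and iteration count are illustrative assumptions, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two obvious groups of 2-D points, centered near (0, 0) and (5, 5).
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centroid ...
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # ... then move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return centroids, labels

centroids, labels = kmeans(data, k=2)
print(np.sort(centroids[:, 0]))  # one center near 0, one near 5
```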

Additional unsupervised methods may also be used, such as visualization, which involves graphing or plotting data in different ways, and projection methods, which involve reducing the dimensionality of the data.

- **Visualization**: Unsupervised learning problem that involves creating plots of data.
- **Projection**: Unsupervised learning problem that involves creating lower-dimensional representations of data.

An example of a visualization technique would be a scatter plot matrix that creates one scatter plot of each pair of variables in the dataset. An example of a projection method would be Principal Component Analysis that involves summarizing a dataset in terms of eigenvalues and eigenvectors, with linear dependencies removed.

The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

— Page 3, Pattern Recognition and Machine Learning, 2006.
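
The projection idea can be sketched with a minimal PCA via the eigenvectors of the covariance matrix (synthetic data; an illustration of the idea rather than a production implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-D data with a strong linear dependency between the two variables.
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])

# PCA via the eigen-decomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component: 2-D -> 1-D.
projected = Xc @ eigvecs[:, 0]
print(float(eigvals[0] / eigvals.sum()))   # fraction of variance explained
```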

Reinforcement learning describes a class of problems where an agent operates in an environment and must *learn* to operate using feedback.

Reinforcement learning is learning what to do — how to map situations to actions—so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

— Page 1, Reinforcement Learning: An Introduction, 2nd edition, 2018.

The use of an environment means that there is no fixed training dataset, rather a goal or set of goals that an agent is required to achieve, actions they may perform, and feedback about performance toward the goal.

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences.

— Page 105, Deep Learning, 2016.

It is similar to supervised learning in that the model has some response from which to learn, although the feedback may be delayed and statistically noisy, making it challenging for the agent or model to connect cause and effect.

An example of a reinforcement learning problem is playing a game, where the agent has the goal of getting a high score and can make moves in the game and receives feedback in terms of punishments or rewards.

In many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate and consistent evaluations of large numbers of positions, which would be needed to train an evaluation function directly from examples. Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position.

— Page 831, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Impressive recent results include the use of reinforcement learning in Google DeepMind’s AlphaGo, which out-performed the world’s top Go player.

Some popular examples of reinforcement learning algorithms include Q-learning, temporal-difference learning, and deep reinforcement learning.
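
A toy sketch of tabular Q-learning on a five-state corridor shows the feedback loop in action (the environment, hyperparameters, and episode count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 5-state corridor. Start in state 0; reaching state 4 gives reward +1.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for _ in range(500):                # episodes
    s = 0
    while s != 4:
        # Epsilon-greedy action selection (with random tie-breaking).
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        reward = 1.0 if s_next == 4 else 0.0
        # Q-learning update: nudge Q(s, a) toward r + gamma * max Q(s', .).
        Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1)[:4])  # learned policy: move right in every state
```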

The line between unsupervised and supervised learning is blurry, and there are many hybrid approaches that draw from each field of study.

In this section, we will take a closer look at some of the more common hybrid fields of study: semi-supervised, self-supervised, and multi-instance learning.

Semi-supervised learning is supervised learning where the training data contains very few labeled examples and a large number of unlabeled examples.

The goal of a semi-supervised learning model is to make effective use of all of the available data, not just the labeled data as in supervised learning.

In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. Even the labels themselves may not be the oracular truths that we hope for.

— Page 695, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Making effective use of unlabelled data may require the use of or inspiration from unsupervised methods such as clustering and density estimation. Once groups or patterns are discovered, supervised methods or ideas from supervised learning may be used to label the unlabeled examples or apply labels to unlabeled representations later used for prediction.

Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations.

— Page 243, Deep Learning, 2016.

It is common for many real-world supervised learning problems to be examples of semi-supervised learning problems given the expense or computational cost for labeling examples. For example, classifying photographs requires a dataset of photographs that have already been labeled by human operators.

Many problems from the fields of computer vision (image data), natural language processing (text data), and automatic speech recognition (audio data) fall into this category and cannot be easily addressed using standard supervised learning methods.

… in many practical applications labeled data is very scarce but unlabeled data is plentiful. “Semisupervised” learning attempts to improve the accuracy of supervised learning by exploiting information in unlabeled data. This sounds like magic, but it can work!

— Page 467, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Self-supervised learning refers to an unsupervised learning problem that is framed as a supervised learning problem in order to apply supervised learning algorithms to solve it.

Supervised learning algorithms are used to solve an alternate or pretext task, the result of which is a model or representation that can be used in the solution of the original (actual) modeling problem.

The self-supervised learning framework requires only unlabeled data in order to formulate a pretext learning task such as predicting context or image rotation, for which a target objective can be computed without supervision.

— Revisiting Self-Supervised Visual Representation Learning, 2019.

A common example of self-supervised learning is computer vision where a corpus of unlabeled images is available and can be used to train a supervised model, such as making images grayscale and having a model predict a color representation (colorization) or removing blocks of the image and have a model predict the missing parts (inpainting).

In discriminative self-supervised learning, which is the main focus of this work, a model is trained on an auxiliary or ‘pretext’ task for which ground-truth is available for free. In most cases, the pretext task involves predicting some hidden portion of the data (for example, predicting color for gray-scale images).

— Scaling and Benchmarking Self-Supervised Visual Representation Learning, 2019.

A general example of self-supervised learning algorithms are autoencoders. These are a type of neural network that is used to create a compact or compressed representation of an input sample. They achieve this via a model that has an encoder and a decoder element separated by a bottleneck that represents the internal compact representation of the input.

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input.

— Page 502, Deep Learning, 2016.

These autoencoder models are trained by providing the input to the model as both input and the target output, requiring that the model reproduce the input by first encoding it to a compressed representation then decoding it back to the original. Once trained, the decoder is discarded and the encoder is used as needed to create compact representations of input.
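
A hedged sketch of this train-then-keep-the-encoder idea, using a purely linear autoencoder on synthetic low-rank data (the sizes, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that really lives on a 2-D plane inside a 5-D space.
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5))

# Linear autoencoder: 5 -> 2 (encoder) -> 5 (decoder), bottleneck width 2.
W_enc = rng.normal(scale=0.1, size=(5, 2))
W_dec = rng.normal(scale=0.1, size=(2, 5))

lr, losses = 0.01, []
for _ in range(2000):
    code = X @ W_enc        # compressed internal representation
    recon = code @ W_dec    # reconstruction; the target is the input itself
    err = recon - X
    losses.append(float(np.mean(err ** 2)))
    # Gradient descent on the mean squared reconstruction error.
    g_dec = (code.T @ err) / len(X)
    g_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

print(round(losses[0], 3), round(losses[-1], 4))
# After training, X @ W_enc alone yields a compact 2-D representation.
```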

Although autoencoders are trained using a supervised learning method, they solve an unsupervised learning problem, namely, they are a type of projection method for reducing the dimensionality of input data.

Traditionally, autoencoders were used for dimensionality reduction or feature learning.

— Page 502, Deep Learning, 2016.

Another example of self-supervised learning is generative adversarial networks, or GANs. These are generative models that are most commonly used for creating synthetic photographs using only a collection of unlabeled examples from the target domain.

GAN models are trained indirectly via a separate discriminator model that classifies examples of photos from the domain as real or fake (generated), the result of which is fed back to update the GAN model and encourage it to generate more realistic photos on the next iteration.

The generator network directly produces samples […]. Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; θ(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

— Page 699, Deep Learning, 2016.

Multi-instance learning is a supervised learning problem where individual examples are unlabeled; instead, bags or groups of samples are labeled.

In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled.

— Page 106, Deep Learning, 2016.

Instances are in “*bags*” rather than sets because a given instance may be present one or more times, e.g. duplicates.

Modeling involves using knowledge that one or some of the instances in a bag are associated with a target label, and predicting the label for new bags in the future given their composition of multiple unlabeled examples.

In supervised multi-instance learning, a class label is associated with each bag, and the goal of learning is to determine how the class can be inferred from the instances that make up the bag.

— Page 156, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

Simple methods, such as assigning class labels to individual instances and using standard supervised learning algorithms, often work as a good first step.
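
One simple baseline in that spirit scores each bag by its most positive instance and fits a single decision threshold to the known bag labels (a toy sketch with synthetic one-dimensional bags; everything here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bags of 1-D instances. A bag is positive if it contains at least one
# instance drawn from the "positive" region (values near 5).
def make_bag(positive):
    bag = rng.normal(0, 1, size=4)
    return np.append(bag, rng.normal(5, 1)) if positive else bag

bag_labels = np.array([i % 2 for i in range(20)])
bags = [make_bag(bool(y)) for y in bag_labels]

# Score each bag by its most positive instance, then pick the decision
# threshold that best reproduces the known bag-level labels.
scores = np.array([b.max() for b in bags])

def bag_accuracy(t):
    return float(((scores > t).astype(int) == bag_labels).mean())

threshold = max(np.sort(scores), key=bag_accuracy)
preds = (scores > threshold).astype(int)
print(bag_accuracy(threshold))
```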

Inference refers to reaching an outcome or decision.

In machine learning, fitting a model and making a prediction are both types of inference.

There are different paradigms for inference that may be used as a framework for understanding how some machine learning algorithms work or how some learning problems may be approached.

Some examples of approaches to learning are inductive, deductive, and transductive learning and inference.

Inductive learning involves using evidence to determine the outcome.

Inductive reasoning refers to using specific cases to determine general outcomes, e.g. specific to general.

Most machine learning models learn using a type of inductive inference or inductive reasoning where general rules (the model) are learned from specific historical examples (the data).

… the problem of induction, which is the problem of how to draw general conclusions about the future from specific observations from the past.

— Page 77, Machine Learning: A Probabilistic Perspective, 2012.

Fitting a machine learning model is a process of induction. The model is a generalization of the specific examples in the training dataset.

A model or hypothesis is made about the problem using the training data, and it is believed to hold over new unseen data later when the model is used.

Lacking any further information, our assumption is that the best hypothesis regarding unseen instances is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive learning …

— Page 23, Machine Learning, 1997.

Deduction or deductive inference refers to using general rules to determine specific outcomes.

We can better understand induction by contrasting it with deduction.

Deduction is the reverse of induction. If induction is going from the specific to the general, deduction is going from the general to the specific.

… the simple observation that induction is just the inverse of deduction!

— Page 291, Machine Learning, 1997.

Deduction is a top-down type of reasoning that requires all premises to be met before determining the conclusion, whereas induction is a bottom-up type of reasoning that uses available data as evidence for an outcome.

In the context of machine learning, once we use induction to fit a model on a training dataset, the model can be used to make predictions. The use of the model is a type of deduction or deductive inference.

Transduction or transductive learning is used in the field of statistical learning theory to refer to predicting specific examples given specific examples from a domain.

It is different from induction, which involves learning general rules from specific examples; transduction goes directly from specific examples to specific predictions, i.e. specific to specific.

Induction, deriving the function from the given data. Deduction, deriving the values of the given function for points of interest. Transduction, deriving the values of the unknown function for points of interest from the given data.

— Page 169, The Nature of Statistical Learning Theory, 1995.

Unlike induction, no generalization is required; instead, specific examples are used directly. This may, in fact, be a simpler problem than induction to solve.

The model of estimating the value of a function at a given point of interest describes a new concept of inference: moving from the particular to the particular. We call this type of inference transductive inference. Note that this concept of inference appears when one would like to get the best result from a restricted amount of information.

— Page 169, The Nature of Statistical Learning Theory, 1995.

A classical example of a transductive algorithm is the k-Nearest Neighbors algorithm that does not model the training data, but instead uses it directly each time a prediction is required.
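
A minimal k-Nearest Neighbors sketch makes the transductive flavor clear: nothing is fit in advance, and every prediction consults the training examples directly (synthetic one-dimensional data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled examples from two well-separated 1-D classes.
X_train = np.concatenate([rng.normal(0, 1, 20), rng.normal(6, 1, 20)])
y_train = np.array([0] * 20 + [1] * 20)

def knn_predict(x, k=3):
    # No model is fit beforehand: each prediction uses the data directly.
    nearest = np.argsort(np.abs(X_train - x))[:k]
    return int(round(float(y_train[nearest].mean())))

print(knn_predict(0.5), knn_predict(5.5))  # 0 1
```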

We can contrast these three types of inference in the context of machine learning.

For example:

- **Induction**: Learning a general model from specific examples.
- **Deduction**: Using a model to make predictions.
- **Transduction**: Using specific examples to make predictions.

There are many techniques that are described as types of learning.

In this section, we will take a closer look at some of the more common methods.

This includes multi-task, active, online, transfer, and ensemble learning.

Multi-task learning is a type of supervised learning that involves fitting a model on one dataset that addresses multiple related problems.

It involves devising a model that can be trained on multiple related tasks in such a way that the performance of the model is improved by training across the tasks as compared to being trained on any single task.

Multi-task learning is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks.

— Page 244, Deep Learning, 2016.

Multi-task learning can be a useful approach to problem-solving when there is an abundance of input data labeled for one task that can be shared with another task with much less labeled data.

… we may want to learn multiple related models at the same time, which is known as multi-task learning. This will allow us to “borrow statistical strength” from tasks with lots of data and to share it with tasks with little data.

— Page 231, Machine Learning: A Probabilistic Perspective, 2012.

For example, it is common for a multi-task learning problem to involve the same input patterns that may be used for multiple different outputs or supervised learning problems. In this setup, each output may be predicted by a different part of the model, allowing the core of the model to generalize across each task for the same inputs.

In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.

— Page 244, Deep Learning, 2016.

A popular example of multi-task learning is where the same word embedding is used to learn a distributed representation of words in text that is then shared across multiple different natural language processing supervised learning tasks.
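
A toy sketch of the shared-core idea: two related regression tasks jointly train one shared weight vector, with a small head per task (all sizes and the synthetic tasks are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two related regression tasks: both targets depend on the same underlying
# feature (the sum of the inputs), at different scales.
X = rng.normal(size=(200, 5))
feature = X.sum(axis=1)
y1 = 2.0 * feature + rng.normal(scale=0.1, size=200)
y2 = -1.0 * feature + rng.normal(scale=0.1, size=200)

# One shared core (w_shared) feeds two small task-specific heads.
w_shared = rng.normal(scale=0.1, size=5)
head1, head2 = 0.1, 0.1

lr, losses = 0.01, []
for _ in range(5000):
    h = X @ w_shared                       # shared representation
    e1, e2 = head1 * h - y1, head2 * h - y2
    losses.append(float(np.mean(e1 ** 2) + np.mean(e2 ** 2)))
    # Gradients from BOTH tasks flow into the shared weights.
    g_shared = X.T @ (head1 * e1 + head2 * e2) / len(X)
    g1, g2 = float(np.mean(e1 * h)), float(np.mean(e2 * h))
    w_shared -= lr * g_shared
    head1 -= lr * g1
    head2 -= lr * g2

print(round(losses[0], 2), round(losses[-1], 4))
```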

Active learning is a technique where the model is able to query a human operator during the learning process in order to resolve ambiguity.

Active learning: The learner adaptively or interactively collects training examples, typically by querying an oracle to request labels for new points.

— Page 7, Foundations of Machine Learning, 2nd edition, 2018.

Active learning is a type of supervised learning that seeks to achieve the same or better performance than so-called “*passive*” supervised learning, while being more efficient about what data is collected or used by the model.

The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator).

— Active Learning Literature Survey, 2009.

It is not unreasonable to view active learning as an approach to solving semi-supervised learning problems, or an alternative paradigm for the same types of problems.

… we see that active learning and semi-supervised learning attack the same problem from opposite directions. While semi-supervised methods exploit what the learner thinks it knows about the unlabeled data, active methods attempt to explore the unknown aspects. It is therefore natural to think about combining the two

— Active Learning Literature Survey, 2009.

Active learning is a useful approach when there is not much data available and new data is expensive to collect or label.

The active learning process allows the sampling of the domain to be directed in a way that minimizes the number of samples and maximizes the effectiveness of the model.

Active learning is often used in applications where labels are expensive to obtain, for example computational biology applications.

— Page 7, Foundations of Machine Learning, 2nd edition, 2018.
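The query-the-oracle loop can be sketched in a few lines. This is a minimal illustration of pool-based active learning with uncertainty sampling, using synthetic data and scikit-learn; the oracle is simulated by the held-back true labels, where in practice it would be a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=500, random_state=1)

# Start with a few labeled examples from each class; the rest form the pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(500) if i not in labeled]

model = LogisticRegression()
for _ in range(20):                        # 20 rounds of querying the oracle
    model.fit(X[labeled], y[labeled])
    # Score each pool example by how uncertain the current model is about it.
    proba = model.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)
    query = pool[int(np.argmax(uncertainty))]
    labeled.append(query)                  # the "oracle" provides y[query]
    pool.remove(query)

print(len(labeled))  # 30 labeled examples after 20 queries
```

The key step is that the learner chooses which example to label next, rather than receiving labels passively.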

Online learning involves using the data available and updating the model directly before a prediction is required or after the last observation was made.

Online learning is appropriate for those problems where observations are provided over time and where the probability distribution of observations is expected to also change over time. Therefore, the model is expected to change just as frequently in order to capture and harness those changes.

Traditionally machine learning is performed offline, which means we have a batch of data, and we optimize an equation […] However, if we have streaming data, we need to perform online learning, so we can update our estimates as each new data point arrives rather than waiting until “the end” (which may never occur).

— Page 261, Machine Learning: A Probabilistic Perspective, 2012.

This approach is also used by algorithms where there may be more observations than can reasonably fit into memory, therefore, learning is performed incrementally over observations, such as a stream of data.

Online learning is helpful when the data may be changing rapidly over time. It is also useful for applications that involve a large collection of data that is constantly growing, even if changes are gradual.

— Page 753, Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

Generally, online learning seeks to minimize “*regret*,” which is how well the model performed compared to how well it might have performed if all the information had been available as a batch.

In the theoretical machine learning community, the objective used in online learning is the regret, which is the averaged loss incurred relative to the best we could have gotten in hindsight using a single fixed parameter value

— Page 262, Machine Learning: A Probabilistic Perspective, 2012.

One example of online learning is so-called stochastic or online gradient descent used to fit an artificial neural network.

The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data.

— Page 281, Deep Learning, 2016.
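The online update can be sketched directly. This is a minimal illustration of online (stochastic) gradient descent fitting a linear model one observation at a time, as if the observations arrived from a stream; the data here is synthetic for demonstration.

```python
import numpy as np

rng = np.random.RandomState(0)
true_w, true_b = 3.0, -1.0            # the unknown relationship to recover

w, b = 0.0, 0.0
lr = 0.05
for _ in range(2000):                 # each iteration simulates a new arrival
    x = rng.uniform(-1, 1)
    y = true_w * x + true_b + rng.normal(scale=0.01)
    error = (w * x + b) - y           # prediction error on this one example
    w -= lr * error * x               # update immediately; no batch needed
    b -= lr * error

print(round(w, 1), round(b, 1))       # approaches 3.0 and -1.0
```

The model is updated after every observation, so its estimates can track a distribution that changes over time.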

Transfer learning is a type of learning where a model is first trained on one task, then some or all of the model is used as the starting point for a related task.

In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2.

— Page 536, Deep Learning, 2016.

It is a useful approach on problems where there is a task related to the main task of interest and the related task has a large amount of data.

It is different from multi-task learning as the tasks are learned sequentially in transfer learning, whereas multi-task learning seeks good performance on all considered tasks with a single model, in parallel.

… pretrain a deep convolutional net with 8 layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are then jointly trained to perform a different set of tasks (another subset of the 1000 ImageNet object categories), with fewer training examples than for the first set of tasks.

— Page 325, Deep Learning, 2016.

An example is image classification, where a predictive model, such as an artificial neural network, can be trained on a large corpus of general images, and the weights of the model can be used as a starting point when training on a smaller more specific dataset, such as dogs and cats. The features already learned by the model on the broader task, such as extracting lines and patterns, will be helpful on the new related task.

If there is significantly more data in the first setting (sampled from P1), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from P2. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc.

— Page 536, Deep Learning, 2016.

As noted, transfer learning is particularly useful with models that are incrementally trained and an existing model can be used as a starting point for continued training, such as deep learning networks.
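The weight-reuse idea can be sketched without a deep learning framework. This is a minimal illustration using plain numpy arrays as stand-ins for network layers; the layer names and shapes are hypothetical, and a real example would use a library such as Keras.

```python
import numpy as np

rng = np.random.RandomState(0)

# "Pretrained" network for the broad task: weights learned on a large dataset.
pretrained = {
    "layer1": rng.normal(size=(64, 32)),   # general low-level features
    "layer2": rng.normal(size=(32, 10)),   # output layer specific to task 1
}

# New network for the related task: copy the lower layer as the starting
# point, and re-initialize only the task-specific top layer.
transferred = {
    "layer1": pretrained["layer1"].copy(),  # transferred feature weights
    "layer2": rng.normal(size=(32, 5)),     # fresh head for 5 new classes
}

# Fine-tuning would then train both layers on the (smaller) task-2 dataset.
print(transferred["layer2"].shape)  # (32, 5)
```

The transferred lower layer encodes features such as lines and patterns that remain useful on the new task, while the new head is learned from scratch.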

For more on the topic of transfer learning, see the tutorial:

Ensemble learning is an approach where two or more models are fit on the same data and the predictions from each model are combined.

The field of ensemble learning provides many ways of combining the ensemble members’ predictions, including uniform weighting and weights chosen on a validation set.

— Page 472, Deep Learning, 2016.

The objective of ensemble learning is to achieve better performance with the ensemble of models as compared to any individual model. This involves both deciding how to create models used in the ensemble and how to best combine the predictions from the ensemble members.

Ensemble learning can be broken down into two tasks: developing a population of base learners from the training data, and then combining them to form the composite predictor.

— Page 605, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.

Ensemble learning is a useful approach for improving the predictive skill on a problem domain and to reduce the variance of stochastic learning algorithms, such as artificial neural networks.

Some examples of popular ensemble learning algorithms include: weighted average, stacked generalization (stacking), and bootstrap aggregation (bagging).

Bagging, boosting, and stacking have been developed over the last couple of decades, and their performance is often astonishingly good. Machine learning researchers have struggled to understand why.

— Page 480, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
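The two tasks quoted above, developing base learners and combining them, can be sketched as a small bagging ensemble. This is a minimal illustration on synthetic data with scikit-learn decision trees; the member count and threshold are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(2)
X, y = make_classification(n_samples=300, random_state=2)

# Develop a population of base learners, each on a bootstrap sample.
members = []
for _ in range(10):
    idx = rng.randint(0, len(X), len(X))  # sample rows with replacement
    members.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Combine the members by averaging their predicted probabilities.
avg_proba = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
ensemble_pred = (avg_proba > 0.5).astype(int)
print((ensemble_pred == y).mean())  # ensemble accuracy on the training data
```

Averaging over members fit on different bootstrap samples is what reduces the variance of the individual trees.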

For more on the topic of ensemble learning, see the tutorial:

This section provides more resources on the topic if you are looking to go deeper.

- Pattern Recognition and Machine Learning, 2006.
- Deep Learning, 2016.
- Reinforcement Learning: An Introduction, 2nd edition, 2018.
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, 2016.
- Machine Learning: A Probabilistic Perspective, 2012.
- Machine Learning, 1997.
- The Nature of Statistical Learning Theory, 1995.
- Foundations of Machine Learning, 2nd edition, 2018.
- Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

- Revisiting Self-Supervised Visual Representation Learning, 2019.
- Active Learning Literature Survey, 2009.

- Supervised and Unsupervised Machine Learning Algorithms
- Why Do Machine Learning Algorithms Work on New Data?
- How Machine Learning Algorithms Work
- Gentle Introduction to Transduction in Machine Learning
- A Gentle Introduction to Transfer Learning for Deep Learning
- Ensemble Learning Methods for Deep Learning Neural Networks

- Supervised learning, Wikipedia.
- Unsupervised learning, Wikipedia.
- Reinforcement learning, Wikipedia.
- Semi-supervised learning, Wikipedia.
- Multi-task learning, Wikipedia.
- Multiple instance learning, Wikipedia.
- Inductive reasoning, Wikipedia.
- Deductive reasoning, Wikipedia.
- Transduction (machine learning), Wikipedia.
- Active learning (machine learning), Wikipedia.
- Online machine learning, Wikipedia.
- Transfer learning, Wikipedia.
- Ensemble learning, Wikipedia.

In this post, you discovered a gentle introduction to the different types of learning that you may encounter in the field of machine learning.

Specifically, you learned:

- Fields of study, such as supervised, unsupervised, and reinforcement learning.
- Hybrid types of learning, such as semi-supervised and self-supervised learning.
- Broad techniques, such as active, online, and transfer learning.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post 14 Different Types of Learning in Machine Learning appeared first on Machine Learning Mastery.

]]>The post What is a Hypothesis in Machine Learning? appeared first on Machine Learning Mastery.

]]>This description is characterized as searching through and evaluating candidate hypotheses from a hypothesis space.

The discussion of hypotheses in machine learning can be confusing for a beginner, especially when “*hypothesis*” has a distinct, but related meaning in statistics (e.g. statistical hypothesis testing) and more broadly in science (e.g. scientific hypothesis).

In this post, you will discover the difference between a hypothesis in science, in statistics, and in machine learning.

After reading this post, you will know:

- A scientific hypothesis is a provisional explanation for observations that is falsifiable.
- A statistical hypothesis is an explanation about the relationship between data populations that is interpreted probabilistically.
- A machine learning hypothesis is a candidate model that approximates a target function for mapping inputs to outputs.

Let’s get started.

This tutorial is divided into four parts; they are:

- What Is a Hypothesis?
- Hypothesis in Statistics
- Hypothesis in Machine Learning
- Review of Hypothesis

A hypothesis is an explanation for something.

It is a provisional idea, an educated guess that requires some evaluation.

A good hypothesis is testable; it can be either true or false.

In science, a hypothesis must be falsifiable, meaning that there exists a test whose outcome could mean that the hypothesis is not true. The hypothesis must also be framed before the outcome of the test is known.

… not any hypothesis will do. There is one fundamental condition that any hypothesis or system of hypotheses must satisfy if it is to be granted the status of a scientific law or theory. If it is to form part of science, an hypothesis must be falsifiable.

— Pages 61-62, What Is This Thing Called Science?, Third Edition, 1999.

A good hypothesis fits the evidence and can be used to make predictions about new observations or new situations.

The hypothesis that best fits the evidence and can be used to make predictions is called a theory, or is part of a theory.

**Hypothesis in Science**: Provisional explanation that fits the evidence and can be confirmed or disproved.

Much of statistics is concerned with the relationship between observations.

Statistical hypothesis tests are techniques used to calculate a statistic that summarizes an observed “*effect*.” The statistic can then be interpreted in order to determine how likely it would be to observe that effect if a relationship did not exist.

If the likelihood is very small, then it suggests that the effect is probably real. If the likelihood is large, then we may have observed a statistical fluctuation, and the effect is probably not real.

For example, we may be interested in evaluating the relationship between the means of two samples, e.g. whether the samples were drawn from the same distribution or not, whether there is a difference between them.

One hypothesis is that there is no difference between the population means, based on the data samples.

This is a hypothesis of no effect and is called the null hypothesis and we can use the statistical hypothesis test to either reject this hypothesis, or fail to reject (retain) it. We don’t say “accept” because the outcome is probabilistic and could still be wrong, just with a very low probability.

… we develop a hypothesis and establish a criterion that we will use when deciding whether to retain or reject our hypothesis. The primary hypothesis of interest in social science research is the null hypothesis

— Pages 64-65, Statistics In Plain English, Third Edition, 2010.

If the null hypothesis is rejected, then we assume the alternative hypothesis that there exists some difference between the means.

- **Null Hypothesis (H0)**: Suggests no effect.
- **Alternate Hypothesis (H1)**: Suggests some effect.

Statistical hypothesis tests don’t comment on the size of the effect, only the likelihood of the presence or absence of the effect in the population, based on the observed samples of data.
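The reject-or-retain decision can be sketched with a two-sample t-test. This is a minimal illustration on synthetic samples using `scipy.stats.ttest_ind`; the sample sizes, means, and significance level are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(1)
sample1 = rng.normal(loc=0.0, scale=1.0, size=100)
sample2 = rng.normal(loc=1.0, scale=1.0, size=100)  # means truly differ

stat, p = ttest_ind(sample1, sample2)
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: the means likely differ.")
else:
    print("Fail to reject (retain) the null hypothesis.")
```

Note that the test comments only on the likelihood of the effect being real, not on its size.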

**Hypothesis in Statistics**: Probabilistic explanation about the presence of a relationship between observations.

Machine learning, specifically supervised learning, can be described as the desire to use available data to learn a function that best maps inputs to outputs.

Technically, this is a problem called function approximation, where we are approximating an unknown target function (that we assume exists) that can best map inputs to outputs on all possible observations from the problem domain.

An example of a model that approximates the target function and performs mappings of inputs to outputs is called a hypothesis in machine learning.

The choice of algorithm (e.g. neural network) and the configuration of the algorithm (e.g. network topology and hyperparameters) define the space of possible hypotheses that the model may represent.

Learning for a machine learning algorithm involves navigating the chosen space of hypotheses toward the best, or a good enough, hypothesis that approximates the target function.

Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set.

— Page 695, Artificial Intelligence: A Modern Approach, Second Edition, 2009.

This framing of machine learning is common and helps to understand the choice of algorithm, the problem of learning and generalization, and even the bias-variance trade-off. For example, the training dataset is used to learn a hypothesis and the test dataset is used to evaluate it.

A common notation is used where lowercase-h (*h*) represents a given specific hypothesis and uppercase-h (*H*) represents the hypothesis space that is being searched.

- **h (hypothesis)**: A single hypothesis, e.g. an instance or specific candidate model that maps inputs to outputs and can be evaluated and used to make predictions.
- **H (hypothesis set)**: A space of possible hypotheses for mapping inputs to outputs that can be searched, often constrained by the choice of the framing of the problem, the choice of model, and the choice of model configuration.

The choice of algorithm and algorithm configuration involves choosing a hypothesis space that is believed to contain a hypothesis that is a good or best approximation for the target function. This is very challenging, and it is often more efficient to spot-check a range of different hypothesis spaces.
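Learning as search through a hypothesis space can be made concrete with a toy example. In this minimal sketch, the hypothesis space H is the set of threshold rules on a single feature, each candidate threshold is a specific hypothesis h, and "learning" enumerates H and keeps the hypothesis that best fits the training data; the data and the grid of thresholds are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=200)
y = (X > 6.5).astype(int)            # the unknown target function

best_h, best_acc = None, 0.0
for threshold in np.linspace(0, 10, 101):   # enumerate the hypothesis space H
    pred = (X > threshold).astype(int)      # one candidate hypothesis h
    acc = (pred == y).mean()
    if acc > best_acc:
        best_h, best_acc = threshold, acc

print(best_h, best_acc)
```

Here the problem is realizable because H contains the true function; real hypothesis spaces are far too large to enumerate, which is why the search must be constrained and guided.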

We say that a learning problem is realizable if the hypothesis space contains the true function. Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true function is not known.

— Page 697, Artificial Intelligence: A Modern Approach, Second Edition, 2009.

It is a hard problem and we choose to constrain the hypothesis space both in terms of size and in terms of the complexity of the hypotheses that are evaluated in order to make the search process tractable.

There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space.

— Page 697, Artificial Intelligence: A Modern Approach, Second Edition, 2009.

**Hypothesis in Machine Learning**: Candidate model that approximates a target function for mapping examples of inputs to outputs.

We can summarize the three definitions again as follows:

- **Hypothesis in Science**: Provisional explanation that fits the evidence and can be confirmed or disproved.
- **Hypothesis in Statistics**: Probabilistic explanation about the presence of a relationship between observations.
- **Hypothesis in Machine Learning**: Candidate model that approximates a target function for mapping examples of inputs to outputs.

We can see that a hypothesis in machine learning draws upon the definition of a hypothesis more broadly in science.

Just like a hypothesis in science is an explanation that covers available evidence, is falsifiable and can be used to make predictions about new situations in the future, a hypothesis in machine learning has similar properties.

A hypothesis in machine learning:

- **Covers the available evidence**: the training dataset.
- **Is falsifiable (kind-of)**: a test harness is devised beforehand and used to estimate performance and compare it to a baseline model to see if it is skillful or not.
- **Can be used in new situations**: make predictions on new data.

Did this post clear up your questions about what a hypothesis is in machine learning?

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- What Is This Thing Called Science?, Third Edition, 1999.
- Statistics In Plain English, Third Edition, 2010.
- Artificial Intelligence: A Modern Approach, Second Edition, 2009.
- Machine Learning, 1997.

- A Gentle Introduction to Applied Machine Learning as a Search Problem
- A Gentle Introduction to Statistical Hypothesis Tests
- Critical Values for Statistical Hypothesis Testing and How to Calculate Them in Python
- 15 Statistical Hypothesis Tests in Python (Cheat Sheet)

- What is hypothesis in machine learning?, Quora.
- What exactly is a hypothesis space in the context of Machine Learning?, Cross Validated.
- What is Hypothesis set in Machine Learning?, Cross Validated.

In this post, you discovered the difference between a hypothesis in science, in statistics, and in machine learning.

Specifically, you learned:

- A scientific hypothesis is a provisional explanation for observations that is falsifiable.
- A statistical hypothesis is an explanation about the relationship between data populations that is interpreted probabilistically.
- A machine learning hypothesis is a candidate model that approximates a target function for mapping inputs to outputs.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post What is a Hypothesis in Machine Learning? appeared first on Machine Learning Mastery.

]]>The post Analytical vs Numerical Solutions in Machine Learning appeared first on Machine Learning Mastery.

]]>- What data is best for my problem?
- What algorithm is best for my data?
- How do I best configure my algorithm?

Why can’t a machine learning expert just give you a straight answer to your question?

In this post, I want to help you see why no one can ever tell you what algorithm to use or how to configure it for your specific dataset.

I want to help you see that finding good data/algorithm/configuration is in fact the hard part of applied machine learning and the only part you need to focus on solving.

Let’s get started.

In mathematics, some problems can be solved analytically and numerically.

- An analytical solution involves framing the problem in a well-understood form and calculating the exact solution.
- A numerical solution means making guesses at the solution and testing whether the problem is solved well enough to stop.

An example is the square root, which can be computed either way.
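The contrast can be sketched in a few lines. The analytical route computes the square root in closed form, while the numerical route (here, Newton's method) makes a guess and refines it until the problem is solved well enough to stop.

```python
import math

# Analytical: exact closed-form calculation.
analytical = math.sqrt(25.0)

# Numerical: iterate x <- (x + n/x) / 2 until the guess is "good enough".
n, x = 25.0, 1.0                      # initial guess
while abs(x * x - n) > 1e-10:         # stopping criterion, not exactness
    x = (x + n / x) / 2.0

print(analytical, round(x, 6))  # 5.0 5.0
```

Both arrive at the same answer, but the numerical version only stops when a tolerance is met, which is the defining property of the approach.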

We prefer the analytical method in general because it is faster and because the solution is exact. Nevertheless, sometimes we must resort to a numerical method due to limitations of time or hardware capacity.

A good example is in finding the coefficients in a linear regression equation that can be calculated analytically (e.g. using linear algebra), but can be solved numerically when we cannot fit all the data into the memory of a single computer in order to perform the analytical calculation (e.g. via gradient descent).
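The linear regression example above can be sketched both ways. This is a minimal illustration on noise-free synthetic data: the normal equation yields the coefficients analytically in one step, while batch gradient descent approaches the same coefficients numerically; the learning rate and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]   # bias column + one feature
true_coef = np.array([2.0, -3.0])
y = X @ true_coef

# Analytical: solve the normal equation exactly.
analytic = np.linalg.solve(X.T @ X, X.T @ y)

# Numerical: gradient descent, iterate until close enough.
coef = np.zeros(2)
for _ in range(5000):
    grad = X.T @ (X @ coef - y) / len(y)
    coef -= 0.1 * grad

print(np.round(analytic, 3), np.round(coef, 3))  # both approach [2, -3]
```

The gradient descent version never touches the full matrix inverse, which is why it scales to data that cannot fit in memory when run over mini-batches.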

Sometimes, the analytical solution is unknown and all we have to work with is the numerical approach.

Many problems have well-defined solutions that are obvious once the problem has been defined.

A set of logical steps that we can follow to calculate an exact outcome.

For example, you know what operation to use given a specific arithmetic task such as addition or subtraction.

In linear algebra, there are a suite of methods that you can use to factorize a matrix, depending on if the properties of your matrix are square, rectangular, contain real or imaginary values, and so on.

We can stretch this more broadly to software engineering, where there are problems that turn up again and again that can be solved with a pattern of design that is known to work well, regardless of the specifics of your application. Such as the visitor pattern for performing an operation on each item in a list.

Some problems in applied machine learning are well defined and have an analytical solution.

For example, the method for transforming a categorical variable into a one hot encoding is simple, repeatable and (practically) always the same methodology regardless of the number of integer values in the set.
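The repeatable, analytical nature of the transform is easy to see in a minimal sketch (plain Python here, though libraries such as scikit-learn provide the same transform):

```python
values = ["red", "green", "blue", "green"]

categories = sorted(set(values))               # fixed, repeatable ordering
encoded = [[1 if v == c else 0 for c in categories] for v in values]

print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The same two steps apply regardless of how many categories the variable has; there is nothing to search for or tune.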

Unfortunately, most of the problems that we care about solving in machine learning do not have analytical solutions.

There are many problems that we are interested in that do not have exact solutions.

Or at least, no analytical solutions that we have figured out yet.

We have to make guesses at solutions and test them to see how good the solution is. This involves framing the problem and using trial and error across a set of candidate solutions.

In essence, the process of finding a numerical solution can be described as a search.

These types of solutions have some interesting properties:

- We can often easily tell a good solution from a bad solution.
- We often don’t objectively know what a “*good*” solution looks like; we can only compare the goodness between candidate solutions that we have tested.
- We are often satisfied with an approximate or “*good enough*” solution rather than the single best solution.

This last point is key, because often the problems that we are trying to solve with numerical solutions are challenging (as we have no easy way to solve them), where any “*good enough*” solution would be useful. It also highlights that there are many solutions to a given problem and even that many of them may be good enough to be usable.

Most of the problems that we are interested in solving in applied machine learning require a numerical solution.

It’s worse than this.

The numerical solutions to each sub-problem along the way influence the space of possible solutions for subsequent sub-problems.

Applied machine learning is a numerical discipline.

The core of a given machine learning model is an optimization problem, which is really a search for a set of terms with unknown values needed to fill an equation. Each algorithm has a different “*equation*” and “*terms*“, using this terminology loosely.

The equation is easy to calculate in order to make a prediction for a given set of terms, but we don’t know the terms to use in order to get a “*good*” or even “*best*” set of predictions on a given set of data.

This is the numerical optimization problem that we always seek to solve.

It’s numerical, because we are trying to solve the optimization problem with noisy, incomplete, and error-prone limited samples of observations from our domain. The model is trying hard to interpret the data and create a map between the inputs and the outputs of these observations.

The numerical optimization problem at the core of a chosen machine learning algorithm is nested in a broader problem.

The specific optimization problem is influenced by many factors, all of which greatly contribute to the “*goodness*” of the ultimate solution, and all of which do not have analytical solutions.

For example:

- What data to use.
- How much data to use.
- How to treat the data prior to modeling.
- What modeling algorithm or algorithms to use.
- How to configure the algorithms.
- How to evaluate machine learning algorithms.

Objectively, these are all part of the open problem that your specific predictive modeling machine learning problem represents.

There is no analytical solution; you must discover what combination of these elements works best for your specific problem.

It is one big search problem where combinations of elements are trialed and evaluated.

Where you only really know what a good score is relative to the scores of other candidate solutions that you have tried.

Where there is no objective path through this maze other than trial and error and perhaps borrowing ideas from other related problems that do have known “*good enough*” solutions.

This great empirical approach to applied machine learning is often referred to as “*machine learning as search*” and is described further in the post:

This is also covered in the post:

We bring this back to the specific question you have.

The question of what data, algorithm, or configuration will work best for your specific predictive modeling problem.

No one can look at your data or a description of your problem and tell you how to solve it best, or even well.

Experience may inform an expert on areas to start looking, and some of those early guesses may pay off, but more often than not, early guesses are too complicated or plain wrong.

A predictive modeling problem must be worked in order to find a good-enough solution and it is your job as the machine learning practitioner to work it.

This is the hard work of applied machine learning and it is the area to practice and get good at to be considered competent in the field.

This section provides more resources on the topic if you are looking to go deeper.

- A Data-Driven Approach to Choosing Machine Learning Algorithms
- A Gentle Introduction to Applied Machine Learning as a Search Problem
- Why Applied Machine Learning Is Hard
- What’s the difference between analytical and numerical approaches to problems?

In this post, you discovered the difference between analytical and numerical solutions and the empirical nature of applied machine learning.

Specifically, you learned:

- Analytical solutions are logical procedures that yield an exact solution.
- Numerical solutions are trial-and-error procedures that are slower and result in approximate solutions.
- Applied machine learning has a numerical solution at its core, and requires an empirical mindset to choose data, algorithms, and configurations for a specific predictive modeling problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Analytical vs Numerical Solutions in Machine Learning appeared first on Machine Learning Mastery.

]]>The post Machine Learning Development Environment appeared first on Machine Learning Mastery.

]]>A few times a week, I get a question such as:

What is your development environment for machine learning?

In this post, you will discover the development environment that I use and recommend for applied machine learning for developers.

After reading this post, you will know:

- The important distinctions between the role of workstation and server hardware in machine learning.
- How to ensure that your machine learning dependencies are installed and updated in a repeatable manner.
- How to develop machine learning code and run it in a safe way that does not introduce new issues.

Let’s get started.

What does your machine learning development environment look like?

Let me know in the comments below.

Whether you are learning machine learning or are developing large models for operations, your workstation hardware does not matter that much.

Here’s why:

I do not recommend that you fit large models on your workstation.

Machine learning development involves lots of small tests to figure out preliminary answers to questions such as:

- What data to use.
- How to prepare data.
- What models to use.
- What configuration to use.

Ultimately, your goal on your workstation is to figure out what experiments to run. I call this preliminary experiments. For your preliminary experiments, use less data: a small sample that will fit within your hardware capabilities.

Larger experiments take minutes, hours, or even days to complete. They should be run on large hardware other than your workstation.

This may be a server environment, perhaps with GPU hardware if you are using deep learning methods. This hardware may be provided by your employer or you can rent it cheaply in the cloud, such as AWS.

It is true that the faster your workstation’s CPU is and the more RAM it has, the more (and larger) preliminary experiments you can run, and the more you can get out of your larger experiments. So get the best hardware you can, but in general, work with what you have got.

I myself like large Linux boxes with lots of RAM and lots of cores for serious R&D. For everyday work, I like an iMac, again with as many cores and as much RAM as I can get.

In summary:

- **Workstation**. Work with a small sample of your data and figure out what large experiments to run.
- **Server(s)**. Run large experiments that take hours or days and help you figure out what model to use in operations.

You must install the library dependencies you have for machine learning development.

This is mainly the libraries you are using.

In Python, this may be Pandas, scikit-learn, Keras, and more. In R, this is all the packages and perhaps caret.

More than just installing the dependencies, you should have a repeatable process so that you can set up the development environment again in seconds, such as on new workstations and on new servers.

I recommend using a package manager and a script, such as a shell script to install everything.

On my iMac, I use MacPorts to manage installed packages. I have two scripts: one to install all the packages I require on a new Mac (such as after an upgrade of my workstation or laptop) and another script specifically to update the installed packages.

Libraries are always being updated with bug fixes, so this second script to update the specifically installed libraries (and their dependencies) is key.

These are shell scripts that I can run at any time and that I keep updated as I need to install new libraries.

If you need help setting up your environment, one of these tutorials may help:

- How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda
- How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning
- How to Create a Linux Virtual Machine For Machine Learning Development With Python 3

You may wish to take things to the next level in terms of having a repeatable environment, such as using a container such as Docker or maintaining your own virtualized instance.

In summary:

- **Install Script**. Maintain a script that you can use to reinstall everything needed for your development environment.
- **Update Script**. Maintain a script to update all key dependencies for machine learning development and run it periodically.

I recommend a very simple editing environment.

The hard work with machine learning development is not writing code; it is instead dealing with the unknowns already mentioned. Unknowns such as:

- What data to use.
- How to prepare the data.
- What algorithm/s to use.
- What configurations to use.

Writing code is the easy part, especially because you are very likely to use an existing algorithm implementation from a modern machine learning library.

For this reason, you do not need a fancy IDE; it will not help you get answers to these questions.

Instead, I recommend using a very simple text editor that offers basic code highlighting.

Personally, I use and recommend Sublime Text, but any similar text editor will work just as well.

Some developers like to use notebooks, such as Jupyter. I do not use or recommend them, as I have found these environments challenging for development; they can hide errors and introduce dependency strangeness.

For studying machine learning and for machine learning development, I recommend writing scripts or code that can be run directly from the command line or from a shell script.

For example, R scripts and Python scripts can be run directly using the respective interpreter.

For more advice on how to run experiments from the command line, see the post:

Once you have a finalized model (or set of predictions), you can integrate it into your application using your standard development tools for your project.

This section provides more resources on the topic if you are looking to go deeper.

- Computer Hardware for Machine Learning
- How To Develop and Evaluate Large Deep Learning Models with Keras on Amazon Web Services
- How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda
- How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning
- How to Create a Linux Virtual Machine For Machine Learning Development With Python 3
- How to Run Deep Learning Experiments on a Linux Server

In this post, you discovered the hardware, dependencies, and editor to use for machine learning development.

Specifically, you learned:

- The important distinctions between the role of workstation and server hardware in machine learning.
- How to ensure that your machine learning dependencies are installed and updated in a repeatable manner.
- How to develop machine learning code and run it in a safe way that does not introduce new issues.

What does your machine learning development environment look like?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Machine Learning Development Environment appeared first on Machine Learning Mastery.

The post How to Think About Machine Learning appeared first on Machine Learning Mastery.

You can achieve impressive results with machine learning and find solutions to very challenging problems. But this is only a small corner of the broader field of machine learning, a corner often called predictive modeling or predictive analytics.

In this post, you will discover how to change the way you think about machine learning in order to best serve you as a machine learning practitioner.

After reading this post, you will know:

- What machine learning is and how it relates to artificial intelligence and statistics.
- The corner of machine learning that you should focus on.
- How to think about your problem and the machine learning solution to your problem.

Let’s get started.

This post is divided into 3 parts; they are:

- You’re Confused
- What is Machine Learning?
- Your Machine Learning

You have a machine learning problem to solve, but you’re confused about what exactly machine learning is.

There’s good reason to be confused. It is confusing to beginners.

Machine learning is a large field of study, and much of it is not going to be relevant to you if you're focused on solving a specific problem.

In this post, I hope to clear things up for you.

We will start off by describing machine learning in the broadest terms and how it relates to other fields of study like statistics and artificial intelligence.

After that, we will zoom in on the aspects of machine learning that you really need to know about for practical engineering and problem solving.

Machine learning is a field of computer science concerned with programs that learn.

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.

— Machine Learning, 1997.

That is super broad.

There are many types of learning, many types of feedback to learn from, and many things that can be learned.

This could encompass diverse types of learning, such as:

- Developing code to investigate how populations of organisms “learn” to adapt to their environment over evolutionary time.
- Developing code to investigate how one neuron in the brain “learns” in response to stimulus from other neurons.
- Developing code to investigate how ants “learn” the optimal path from their home to their food source.

I give these esoteric examples on purpose to help you really nail down that machine learning is a broad and far-reaching program of research.

Another case that you may be more familiar with is:

- Developing code to investigate how to “learn” patterns in historical data.

This is less glamorous, but is the basis of the small corner of machine learning in which we as practitioners are deeply interested.

This corner is not distinct from the other examples; there can be a lot of overlap in methods for learning, fundamental tasks, ways of evaluating learning, and so on.

Machine learning is a subfield of artificial intelligence, and the two fields overlap.

Artificial intelligence is also an area of computer science, but it is concerned with developing programs that are intelligent, or can do intelligent things.

Intelligence involves learning, e.g. machine learning, but may involve other concerns such as reasoning, planning, memory, and much more.

This could encompass diverse types of learning such as:

- Developing code to investigate how to optimally plan logistics.
- Developing code to investigate how to reason about a paragraph of text.
- Developing code to investigate how to perceive the contents of a photograph.

Artificial intelligence is often framed in the context of an agent in an environment with the intent to address some problem, but this does not have to be the case.

Machine learning could just as easily be named artificial learning to remain consistent with artificial intelligence and help out beginners.

The lines are blurry. Machine learning problems are also artificial intelligence problems.

Statistics, or applied statistics with computers, is a sub-field of mathematics that is concerned with describing and understanding the relationships in data.

This could encompass diverse types of learning such as:

- Developing models to summarize the distribution of a variable.
- Developing models to best characterize the relationship between two variables.
- Developing models to test the similarity between two populations of observations.

It also overlaps with the corner of machine learning interested in learning patterns in data.

Many methods used for understanding data in statistics can be used in machine learning to learn patterns in data. These tasks could be called machine learning or applied statistics.

Machine learning is a large field of study, and it can help you solve specific problems.

But you don’t need to know about all of it.

- You’re not an academic investigating an esoteric type of learning as in machine learning.
- You’re not trying to make an intelligent agent as in artificial intelligence.
- You’re not interested in learning more about why variables relate to each other in data as in statistics.

In fact, when it comes to learning relationships in data:

- You’re not investigating the capabilities of an algorithm.
- You’re not developing an entirely new theory or algorithm.
- You’re not extending an existing machine learning algorithm to new cases.

These may be activities in the corner of machine learning that interests us, but they are activities for academics, not practitioners like you.

**So what parts of machine learning do you need to focus on?**

I think there are two ways to think about machine learning:

- In terms of the problem you are trying to solve.
- In terms of the solution you require.

Your problem can best be described as the following:

Find a model or procedure that makes best use of historical data comprised of inputs and outputs in order to skillfully predict outputs given new and unseen inputs in the future.

This is super specific.

First, it discards entire sub-fields of machine learning, such as unsupervised learning, to focus on one type of learning called supervised learning and all the algorithms that fit into that bucket.

That does not mean that you cannot leverage unsupervised methods; it just means that you do not focus your attention there, at least not to begin with.

Second, it gives you a clear objective that dominates all others: model skill, at the expense of other concerns such as model complexity, model interpretability, and so on.

Again, this does not mean that these are not important, just that they are considered after or in conjunction with model skill.

Third, framing your problem this way fits neatly into another field of study called predictive modeling. That is a field of study that borrows methods from machine learning with the objective of developing models that make skillful predictions.

In some areas of business, this area may also be called predictive analytics and encompasses more than just the modeling component to include related activities of gathering and preparing data and deploying and maintaining the model.

More recently, this activity can also be called data science, although that phrase also has connotations of inventing or discovering the problem in addition to working it through to a solution.

I don’t think it matters what you call this activity. But I do think it is important to deeply understand that your interest in and use of machine learning is highly specific and different from some other uses by academics.

It allows you to filter the material you read and the tools you choose in order to stay focused on the problem you’re trying to solve.

The solution you require is best described as the following:

A model or procedure that automatically creates the most likely approximation of the unknown underlying relationship between inputs and associated outputs in historical data.

Again, this is super specific.

You need an automatic method that produces a program or model that you can use to make predictions.

You cannot sit down and write code to solve your problem. It is entirely data-specific and you have a lot of data.

In fact, problems of this type resist top-down hand-coded solutions. If you could sit down and write some if-statements to solve your problem, you would not need a machine learning solution. It would be a programming problem.

The type of machine learning methods that you need will learn the relationship between the inputs and outputs in your historical data.

This framing allows you to think about what that real underlying yet unknown mapping function might look like and how noise, corruption, and sampling of your historical data may impact approximations of this mapping made by different modeling methods.

Without this framing, you will wonder things like:

- Why there isn’t just one super algorithm or set of parameters.
- Why the experts can’t just tell you what algorithm to use.
- Why you can’t achieve a zero error rate with predictions from your model.

It helps you see the ill-defined nature of the predictive modeling problem you’re trying to solve and sets reasonable expectations.

Now that you know how to think about machine learning, the next step is to change the way you think about the process of solving a problem with a machine learning solution.

For a hint, see the post:

This section provides more resources on the topic if you are looking to go deeper.

- Gentle Introduction to Predictive Modeling
- How Machine Learning Algorithms Work
- What is Machine Learning?
- A Gentle Introduction to Applied Machine Learning as a Search Problem
- Where Does Machine Learning Fit In?

In this post, you discovered how to change the way you think about machine learning in order to best serve you as a machine learning practitioner.

Specifically, you learned:

- What machine learning is and how it relates to artificial intelligence and statistics.
- The corner of machine learning that you should focus on.
- How to think about your problem and the machine learning solution to your problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Why Machine Learning Does Not Have to Be So Hard appeared first on Machine Learning Mastery.

The conventional, bottom-up approach to teaching involves laying out the topics in an area of study in a logical way with a natural progression in complexity and capability.

The problem is, humans are not robots executing a learning program. We require motivation, excitement, and most importantly, a connection of the topic to tangible results.

Useful skills we use every day like reading, driving, and programming were not learned this way and were in fact learned using an inverted top-down approach. This **top-down approach** can be used to learn technical subjects directly such as **machine learning**, which can make you a lot more productive a lot sooner, and be a lot of fun.

In this post, you will discover the concrete difference between the top-down and bottom-up approaches to learning technical material and why this is the approach that practitioners should use to learn machine learning and even related mathematics.

After reading this post, you will know:

- The bottom-up approach used in universities to teach technical subjects and the problems with it.
- How people learn to read, drive, and program in a top-down manner and how the top-down approach works.
- The frame of machine learning and even mathematics using the top-down approach to learning and how to start to make rapid progress as a practitioner.

Let’s get started.

This is an important blog post, because I think it can really help to shake you out of the bottom-up, university-style way of learning machine learning.

This post is divided into seven parts; they are:

- Bottom-Up Learning
- Learning to Read
- Learning to Drive
- Learning to Code
- Top-Down Learning
- Learn Machine Learning
- Learning Mathematics

Take a field of study, such as mathematics.

There is a logical way to lay out the topics in mathematics that build on each other and lead through a natural progression in skills, capability, and understanding.

The problem is, this logical progression might only make sense to those who are already on the other side and can intuit the relationships between the topics.

Most of school is built around this bottom-up natural progression through material. A host of technical and scientific fields of study are taught this way.

Think back to high school or undergraduate studies and the fundamental fields you may have worked through, such as:

- Mathematics, as mentioned.
- Biology.
- Chemistry.
- Physics.
- Computer Science.

Think about how the material was laid out, week-by-week, semester-by-semester, year-by-year. Bottom-up, logical progression.

The problem is, the logical progression through the material may not be the best way to learn the material in order to be productive.

We are not robots executing a learning program. We are emotional humans that need motivation, interest, attention, encouragement, and results.

You can learn technical subjects from the bottom-up, and a small percentage of people do prefer things this way, but it is not the only way.

Now, if you have completed a technical subject, think back to how you actually learned it. I bet it was not bottom-up.

Think back; how did you learn to read?

My son is starting to read. Without thinking too much, here are the general techniques he’s using (really the school and us as parents):

- Start by being read to in order to generate interest and show benefits.
- Get the alphabet down and make the right sounds.
- Memorize the most frequent words, their sounds, and how to spell them.
- Learn the “spell-out-the-word” heuristic to deal with unknown words.
- Read through books with supervision.
- Read through books without supervision.

It is important that he continually knows why reading is important, connected to very tangible things he wants to do, like:

- Read captions on TV shows.
- Read stories on topics he loves, like Star Wars.
- Read signs and menus when we are out and about.
- So on…

It is also important that he gets results that he can track and in which he can see improvement.

- Larger vocabulary.
- Smoother reading style.
- Books of increasing complexity.

Here’s how he did not learn to read:

- Definitions of word types (verbs, nouns, adverbs, etc.)
- Rules of grammar.
- Rules of punctuation.
- Theory of human languages.

Do you drive?

It’s cool if you don’t, but most adults do out of necessity. Society and city design is built around personal mobility.

How did you learn to drive?

I remember some written tests and maybe a test on a computer. I have no memory of studying for them, though I very likely did. Here’s what I do remember.

I remember hiring a driving instructor and doing driving lessons. Every single lesson was practical, in the car, practicing the skill I was required to master, driving the vehicle in traffic.

Here’s what I did not study or discuss with my driving instructor:

- The history of the automobile.
- The theory of combustion engines.
- The common mechanical faults in cars.
- The electrical system of the car.
- The theory of traffic flows.

To this day, I still manage to drive safely without any knowledge of these topics.

In fact, I never expect to learn these topics. I have zero need or interest and they will not help me realize the thing I want and need, which is safe and easy personal mobility.

If the car breaks, I’ll call an expert.

I started programming without any idea of what coding or software engineering meant.

At home, I messed around with commands in BASIC. I messed around with commands in Excel. I modified computer games. And so on. It was fun.

When I started to learn programming and software engineering, it was in university and it was bottom up.

We started with:

- Language theory
- Data types
- Control flow structures
- Data structures
- etc.

When we did get to write code, it was on the command line and plagued with compiler problems, path problems, and a whole host of problems unrelated to actually learning programming.

**I hated programming.**

Flash-forward a few years. Somehow, I eventually started working as a professional software engineer on some complex systems that were valued by their users. I was really good at it and I loved it.

Eventually, I did a course that showed how to create graphical user interfaces. And another that showed how to get computers to talk to each other using socket programming. And another on how to get multiple things to run at the same time using threads.

I connected the boring stuff with the thing I really liked: making software that could solve problems, that others could use. I connected it to something that mattered. It was no longer abstract and esoteric.

At least for me, and many developers like me, they taught it wrong. They really did. And it wasted years of time, effort, and outcomes that enthusiastic students with free time, like me, could have dedicated to something they are truly passionate about.

The bottom-up approach is not just a common way for teaching technical topics; it looks like the only way.

At least until you think about how you actually learn.

The designers of university courses, masters of their subject area, are trying to help. They are laying everything out to give you the logical progression through the material that they think will get you to the skills and capabilities that you require (hopefully).

And as I mentioned, it can work for some people.

It does not work for me, and I expect it does not work for you. In fact, very few programmers I've met who are really good at their craft came through computer science programs, or if they did, they learned at home, alone, hacking on side projects.

An alternative is the top-down approach.

**Flip the conventional approach on its head.**

Don’t start with definitions and theory. Instead, start by connecting the subject with the results you want and show how to get results immediately.

Lay out a program that focuses on practicing this process of getting results, going deeper into some areas as needed, but always in the context of the result they require.

It is not the traditional path.

Be careful not to use traditional ways of thinking or comparison if you take this path.

The onus is on you. There is no system to blame. You only fail when you stop.

- **It is iterative**. Topics are revisited many times with deeper understanding.
- **It is imperfect**. Results may be poor in the beginning, but improve with practice.
- **It requires discovery**. The learner must be open to continual learning and discovery.
- **It requires ownership**. The learner is responsible for improvement.
- **It requires curiosity**. The learner must pay attention to what interests them and follow it.

Seriously, I’ve heard “*experts*” say this many times, saying things like:

You have to know the theory first before you can use this technique, otherwise you cannot use it properly.

I agree that results will be imperfect in the beginning, but improvement and even expertise does not only have to come from theory and fundamentals.

If you believe that a beginner programmer should not be pushing changes to production and deploying them, then surely you must believe that a beginner machine learning practitioner would suffer the same constraints.

Skill must be demonstrated.

Trust must be earned.

This is true regardless of how a skill is acquired.

Really!?

This is another “*criticism*” I’ve seen leveled at this approach to learning.

Exactly. We want to be technicians, using the tools in practice to help people, not researchers.

You do not need to cover all of the same ground because you have a different learning objective. You can always circle back and learn anything you like later, once you have a context in which to integrate the abstract knowledge.

Developers in industry are not computer scientists; they are engineers. They are proud technicians of the craft.

The benefits vastly outweigh the challenge of learning this way:

- You go straight to the thing you want and start practicing it.
- You have a context for connecting deeper knowledge and even theory.
- You can efficiently sift and filter topics based on your goals in the subject.

It’s faster.

It’s more fun.

And, I bet it makes you much better.

How could you be better?

Because the subject is connected to you emotionally. You have connected it to an outcome or result that matters to you. You are invested. You have demonstrable competence. We all love things we are good at (even if we are a little color blind to how good we are), which drives motivation, enthusiasm, and passion.

An enthusiastic learner will blow straight past the fundamentalist.

So, how have you approached the subject of machine learning?

Seriously, tell me your approach in the comments below.

- Are you taking a bottom-up university course?
- Are you modeling your learning on such a course?

Or worse:

Are you following a top-down type approach but are riddled with guilt, math envy, and insecurities?

You are not alone; I see this every single day in helping beginners on this website.

To connect the dots for you, I strongly encourage you to study machine learning using the top-down approach.

- Don’t start with precursor math.
- Don’t start with machine learning theory.
- Don’t code every algorithm from scratch.

This can all come later to refine and deepen your understanding once you have connections for this abstract knowledge.

- Start by learning how to work through very simple predictive modeling problems using a fixed framework with free and easy-to-use open source tools.
- Practice on many small projects and slowly increase their complexity.
- Show your work by building a public portfolio.

I have written about this approach many times; see the “*Further Reading*” section at the end of the post for some solid posts on how to get started with the top-down approach to machine learning.

“Experts” entrenched in universities will say it’s dangerous. Ignore them.

World-class practitioners will tell you it’s the way they learned and continue to learn. Model them.

Remember:

- You learned to read by practicing reading, not by studying language theory.
- You learned to drive by practicing driving, not by studying combustion engines.
- You learned to code by practicing coding, not by studying computability theory.

You can learn machine learning by practicing predictive modeling, not by studying math and theory.

Not only is this the way I learned and continue to practice machine learning, but it has helped tens of thousands of my students (and the many millions of readers of this blog).

Don’t stop there.

A time may come when you want or need to pull back the curtain on the mathematical pillars of machine learning such as linear algebra, calculus, statistics, probability, and so on.

You can use the exact same top-down approach.

Pick a goal or result that matters to you, and use that as a lens, filter, or sift on the topics to study and learn to the depth you need to get that result.

For example, let’s say you pick linear algebra.

A goal might be to grok SVD or PCA. These are methods used in machine learning for data projection, data reduction, and feature selection type tasks.

A top-down approach might be to:

- Implement the method in a high-level library such as scikit-learn and get a result.
- Implement the method in a lower-level library such as NumPy/SciPy and reproduce the result.
- Implement the method directly using matrices and matrix operations in NumPy or Octave.
- Study and explore the matrix arithmetic operations involved.
- Study and explore the matrix decomposition operations involved.
- Study methods for approximating the eigendecomposition of a matrix.
- And so on…
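To make the middle steps concrete, here is a minimal sketch of PCA implemented directly from matrix quantities in plain Python, on a tiny made-up 2-D dataset. In two dimensions the eigendecomposition of the covariance matrix reduces to the quadratic formula, so no library is needed; the data values are invented for illustration.

```python
import math

# Tiny made-up 2-D dataset: points lying close to the line y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0), (5.0, 9.9)]

# Step 1: centre the data by subtracting the per-dimension mean.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centred = [(x - mean_x, y - mean_y) for x, y in data]

# Step 2: build the 2x2 sample covariance matrix [[cxx, cxy], [cxy, cyy]].
cxx = sum(x * x for x, _ in centred) / (n - 1)
cyy = sum(y * y for _, y in centred) / (n - 1)
cxy = sum(x * y for x, y in centred) / (n - 1)

# Step 3: eigenvalues of a symmetric 2x2 matrix via the quadratic formula.
trace = cxx + cyy
det = cxx * cyy - cxy * cxy
disc = math.sqrt(trace * trace / 4.0 - det)
lam1 = trace / 2.0 + disc  # variance captured along the first component
lam2 = trace / 2.0 - disc

# Step 4: unit eigenvector for lam1 = the first principal component.
vx, vy = cxy, lam1 - cxx
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

explained = lam1 / (lam1 + lam2)
print("first principal component:", pc1)
print("fraction of variance explained:", explained)
```

Reproducing the same component with scikit-learn's `PCA` or NumPy's `svd` (the first two steps in the list above) then becomes a one-line sanity check.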

The goal provides the context and you can let your curiosity define the depth of study.

Painted this way, studying math is no different to studying any other topic in programming, machine learning, or other technical subjects.

It’s highly productive, and it’s a lot of fun!

This section provides more resources on the topic if you are looking to go deeper.

In this post, you discovered the concrete difference between the top-down and bottom-up approaches to learning technical material and why this is the approach that practitioners should and do use to learn machine learning and even related mathematics.

Specifically, you learned:

- The bottom-up approach used in universities to teach technical subjects and the problems with it.
- How people learn to read, drive, and program in a top-down manner and how the top-down approach works.
- The frame of machine learning and even mathematics using the top-down approach to learning and how to start to make rapid progress as a practitioner.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post Why Do Machine Learning Algorithms Work on New Data? appeared first on Machine Learning Mastery.

I recently got the question:

“How can a machine learning model make accurate predictions on data that it has not seen before?”

The answer is generalization, and this is the capability that we seek when we apply machine learning to challenging problems.

In this post, you will discover generalization, the superpower of machine learning.

After reading this post, you will know:

- That machine learning algorithms all seek to learn a mapping from inputs to outputs.
- That simpler skillful machine learning models are easier to understand and more robust.
- That machine learning is only suitable when the problem requires generalization.

Let’s get started.

When we fit a machine learning algorithm, we require a training dataset.

This training dataset includes a set of input patterns and the corresponding output patterns. The goal of the machine learning algorithm is to learn a reasonable approximation of the mapping from input patterns to output patterns.

Here are some examples to make this concrete:

- Mapping from emails to whether they are spam or not for email spam classification.
- Mapping from house details to house sale price for house sale price regression.
- Mapping from photograph to text to describe the photo in photo caption generation.

The list could go on.

We can summarize this mapping that machine learning algorithms learn as a function (f) that predicts the output (y) given the input (X), or restated:

y = f(X)

Our goal in fitting the machine learning algorithms is to get the best possible f() for our purposes.

We are training the model to make predictions in the future given inputs for cases where we do not have the outputs. Where the outputs are unknown. This requires that the algorithm learn in general how to take observations from the domain and make a prediction, not just the specifics of the training data.

This is called generalization.

A machine learning algorithm must generalize from training data to the entire domain of all unseen observations in the domain so that it can make accurate predictions when you use the model.

This is really hard.

This approach of generalization requires that the data that we use to train the model (X) is a good and reliable sample of the observations in the mapping we want the algorithm to learn. The higher the quality and the more representative, the easier it will be for the model to learn the unknown and underlying “true” mapping that exists from inputs to outputs.

To generalize means to go from something specific to something broad.

It is the way we humans learn.

- We don’t memorize specific roads when we learn to drive; we learn to drive in general so that we can drive on any road or set of conditions.
- We don’t memorize specific computer programs when learning to code; we learn general ways to solve problems with code for any business case that might come up.
- We don’t memorize the specific word order in natural language; we learn general meanings for words and put them together in new sequences as needed.

The list could go on.

Machine learning algorithms are procedures to automatically generalize from historical observations. And they can generalize on more data than a human could consider, faster than a human could consider it.

It is the speed and scale at which these automated generalization machines operate that is so exciting in the field of machine learning.

The machine learning model is the result of the automated generalization procedure called the machine learning algorithm.

The model could be said to be a generalization of the mapping from training inputs to training outputs.

There may be many ways to map inputs to outputs for a specific problem and we can navigate these ways by testing different algorithms, different algorithm configurations, different training data, and so on.

We cannot know which approach will result in the most skillful model beforehand, therefore we must test a suite of approaches, configurations, and framings of the problem to discover what works and what the limits of learning are on the problem before selecting a final model to use.

The skill of the model at making predictions determines the quality of the generalization and can help as a guide during the model selection process.

Out of the millions of possible mappings, we prefer simpler mappings over complex mappings. Put another way, we prefer the simplest possible hypothesis that explains the data. This is one way to choose models and comes from Occam’s razor.

The simpler model is often (but not always) easier to understand and maintain and is more robust. In practice, you may want to choose the best performing simplest model.

The ability to automatically learn by generalization is powerful, but is not suitable for all problems.

- Some problems require a precise solution, such as arithmetic on a bank account balance.
- Some problems can be solved by generalization, but simpler solutions exist, such as calculating the square root of positive numbers.
- Some problems look like they could be solved by generalization but there exists no structured underlying relationship to generalize from the data, or such a function is too complex, such as predicting security prices.

Key to the effective use of machine learning is learning where it can and cannot (or should not) be used.

Sometimes this is obvious, but often it is not. Again, you must use experience and experimentation to help tease out whether a problem is a good fit for being solved by generalization.

This section provides more resources on the topic if you are looking to go deeper.

- How Machine Learning Algorithms Work
- What is Machine Learning?
- Generalization (learning) on Wikipedia
- Ockham’s razor on Wikipedia

In this post, you discovered generalization, the key capabilities that underlie all supervised machine learning algorithms.

Specifically, you learned:

- That machine learning algorithms all seek to learn a mapping from inputs to outputs.
- That simpler skillful machine learning models are easier to understand and more robust.
- That machine learning is only suitable when the problem requires generalization.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Why Do Machine Learning Algorithms Work on New Data? appeared first on Machine Learning Mastery.


Fundamentally, classification is about predicting a label and regression is about predicting a quantity.

I often see questions such as:

How do I calculate accuracy for my regression problem?

Questions like this are a symptom of not truly understanding the difference between classification and regression and what accuracy is trying to measure.

In this tutorial, you will discover the differences between classification and regression.

After completing this tutorial, you will know:

- That predictive modeling is about the problem of learning a mapping function from inputs to outputs called function approximation.
- That classification is the problem of predicting a discrete class label output for an example.
- That regression is the problem of predicting a continuous quantity output for an example.

Let’s get started.

This tutorial is divided into 5 parts; they are:

- Function Approximation
- Classification
- Regression
- Classification vs Regression
- Converting Between Classification and Regression Problems

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

For more on approximating functions in applied machine learning, see the post:

Generally, we can divide all function approximation tasks into classification tasks and regression tasks.

Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y).

The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.

For example, an email of text can be classified as belonging to one of two classes: “spam” and “not spam”.

- A classification problem requires that examples be classified into one of two or more classes.
- A classification problem can have real-valued or discrete input variables.
- A problem with two classes is often called a two-class or binary classification problem.
- A problem with more than two classes is often called a multi-class classification problem.
- A problem where an example is assigned multiple classes is called a multi-label classification problem.

It is common for classification models to predict a continuous value as the probability of a given example belonging to each output class. The probabilities can be interpreted as the likelihood or confidence of a given example belonging to each class. A predicted probability can be converted into a class value by selecting the class label that has the highest probability.

For example, a specific email of text may be assigned the probabilities of 0.1 as being “spam” and 0.9 as being “not spam”. We can convert these probabilities to a class label by selecting the “not spam” label as it has the highest predicted likelihood.
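As a minimal sketch of this conversion (the function and label names are just for illustration), selecting the class with the highest predicted probability is an argmax over the probabilities:

```python
# Convert predicted class probabilities into a single class label
# by selecting the label with the highest probability (argmax).
def probabilities_to_label(probabilities, labels):
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best]

# The email example above: 0.1 "spam", 0.9 "not spam".
print(probabilities_to_label([0.1, 0.9], ["spam", "not spam"]))  # not spam
```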

There are many ways to estimate the skill of a classification predictive model, but perhaps the most common is to calculate the classification accuracy.

The classification accuracy is the percentage of correctly classified examples out of all predictions made.

For example, if a classification predictive model made 5 predictions and 3 of them were correct and 2 of them were incorrect, then the classification accuracy of the model based on just these predictions would be:

accuracy = correct predictions / total predictions * 100
accuracy = 3 / 5 * 100
accuracy = 60%
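The worked example can be checked with a few lines of Python (a minimal sketch; the function name is just for illustration):

```python
# Classification accuracy: percentage of predictions that match the expected labels.
def classification_accuracy(expected, predicted):
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected) * 100

# 3 correct out of 5 predictions, as in the worked example above.
expected = [0, 1, 1, 0, 1]
predicted = [0, 1, 1, 1, 0]
print(classification_accuracy(expected, predicted))  # 60.0
```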

An algorithm that is capable of learning a classification predictive model is called a classification algorithm.

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

- A regression problem requires the prediction of a quantity.
- A regression problem can have real-valued or discrete input variables.
- A problem with multiple input variables is often called a multivariate regression problem.
- A regression problem where input variables are ordered by time is called a time series forecasting problem.

Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.

There are many ways to estimate the skill of a regression predictive model, but perhaps the most common is to calculate the root mean squared error, abbreviated by the acronym RMSE.

For example, if a regression predictive model made 2 predictions, one of 1.5 where the expected value is 1.0 and another of 3.3 where the expected value is 3.0, then the RMSE would be:

RMSE = sqrt(average(error^2))
RMSE = sqrt(((1.0 - 1.5)^2 + (3.0 - 3.3)^2) / 2)
RMSE = sqrt((0.25 + 0.09) / 2)
RMSE = sqrt(0.17)
RMSE = 0.412

A benefit of RMSE is that the units of the error score are in the same units as the predicted value.
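The worked RMSE calculation can likewise be sketched in a few lines of Python (the function name is illustrative):

```python
from math import sqrt

# Root mean squared error between expected and predicted quantities.
def rmse(expected, predicted):
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# The two predictions from the worked example above.
print(round(rmse([1.0, 3.0], [1.5, 3.3]), 3))  # 0.412
```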

An algorithm that is capable of learning a regression predictive model is called a regression algorithm.

Some algorithms have the word “regression” in their name, such as linear regression and logistic regression, which can make things confusing because linear regression is a regression algorithm whereas logistic regression is a classification algorithm.

Classification predictive modeling problems are different from regression predictive modeling problems.

- Classification is the task of predicting a discrete class label.
- Regression is the task of predicting a continuous quantity.

There is some overlap between the algorithms for classification and regression; for example:

- A classification algorithm may predict a continuous value, but the continuous value is in the form of a probability for a class label.
- A regression algorithm may predict a discrete value, but the discrete value is in the form of an integer quantity.

Some algorithms can be used for both classification and regression with small modifications, such as decision trees and artificial neural networks. Some algorithms cannot, or cannot easily be used for both problem types, such as linear regression for regression predictive modeling and logistic regression for classification predictive modeling.

Importantly, the way that we evaluate classification and regression predictions varies and does not overlap, for example:

- Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.
- Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.

In some cases, it is possible to convert a regression problem to a classification problem. For example, the quantity to be predicted could be converted into discrete buckets.

For instance, amounts in a continuous range between $0 and $100 could be converted into 2 buckets:

- Class 0: $0 to $49
- Class 1: $50 to $100

This is often called discretization and the resulting output variable is a classification where the labels have an ordered relationship (called ordinal).
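As a sketch of this discretization (thresholds taken from the buckets above; the function name is hypothetical), a continuous amount maps to a class label like this:

```python
# Discretize a dollar amount in [0, 100] into the two buckets above:
# Class 0 for $0 to $49, Class 1 for $50 to $100.
def amount_to_class(amount):
    return 0 if amount < 50 else 1

print(amount_to_class(25.0))  # 0
print(amount_to_class(75.0))  # 1
```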

In some cases, a classification problem can be converted to a regression problem. For example, a label can be converted into a continuous range.

Some algorithms do this already by predicting a probability for each class that in turn could be scaled to a specific range:

quantity = min + probability * range
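A minimal sketch of this scaling (hypothetical function name), where the range is the maximum minus the minimum of the target interval:

```python
# Scale a predicted probability in [0, 1] to a quantity in [min_value, max_value],
# following quantity = min + probability * range.
def probability_to_quantity(probability, min_value, max_value):
    return min_value + probability * (max_value - min_value)

print(probability_to_quantity(0.5, 0.0, 100.0))  # 50.0
```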

Alternately, class values can be ordered and mapped to a continuous range:

- $0 to $49 for Class 0
- $50 to $100 for Class 1

If the class labels in the classification problem do not have a natural ordinal relationship, the conversion from classification to regression may result in surprising or poor performance as the model may learn a false or non-existent mapping from inputs to the continuous output range.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the difference between classification and regression problems.

Specifically, you learned:

- That predictive modeling is about the problem of learning a mapping function from inputs to outputs called function approximation.
- That classification is the problem of predicting a discrete class label output for an example.
- That regression is the problem of predicting a continuous quantity output for an example.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Difference Between Classification and Regression in Machine Learning appeared first on Machine Learning Mastery.


In this post, you will hear Álvaro Lemos’ story and his transition from student to machine learning intern, including:

- How an interest in genetic algorithms led to the discovery of neural networks and the broader field of machine learning.
- How tutorial-based blog posts and books helped pass a test for a machine learning internship on a data science team.

Let’s get started.

**Update Feb/2017**: Corrections made regarding Álvaro’s internship.

I’m from Salvador, Bahia (Brazil), but currently, I live in Belo Horizonte, Minas Gerais (also in Brazil).

I am studying Electrical Engineering at the Federal University of Minas Gerais, and since the beginning of my undergraduate course, I’ve been involved with software development in some way.

In my first week as a freshman, I joined a research group called LabCOM to help a colleague on his master’s degree project. He wanted to build a self-managed traffic engineering system in which network operation and maintenance are performed efficiently and without human intervention. It was built on top of a network simulator, and I was assigned to deliver a module to measure some network parameters.

After that, I kept doing things related to software development: I maintained a Linux server at my university, took a bunch of web development courses on sites like Code School, Codecademy, and Coursera, and a year ago I got my first internship at a big software company.

It was an amazing experience, because I could work with state-of-the-art technologies and very experienced developers, from whom I learned a lot of good practices and procedures.

When I was about to complete a year there, I received a proposal to work at another company, on a Data Science team that was being formed, so I decided to accept it.

Good question…

I first heard of it during one meeting of the research group that I mentioned.

To make a long story short, we were using a genetic algorithm to get some results and, although they were reasonably good, they were taking longer to process than we could afford.

To overcome this, a colleague suggested training a neural network on these results, because once we had a trained model, it would output results really fast.

I was one of the people assigned to implement this solution, but I didn’t know anything about it, so I googled it.

When I realized that an algorithm could provide the expected output without being explicitly programmed to do so, and that it did so by mimicking the human brain, I was like “wooooow, that’s magic!”

When I decided that I wanted to learn machine learning, my first goal was to start the Johns Hopkins Data Science specialization on Coursera.

After completing two (out of ten!) courses, I let it go. I didn’t really need to apply that knowledge at that moment; I just wanted to learn machine learning, and attending ten courses just to get that knowledge felt overwhelming. I got distracted with other things and forgot about it.

One year after that, I decided to give my “learning machine learning” quest a second shot. I registered for Andrew Ng’s famous machine learning course on Coursera. It was just one course (instead of ten!), so I thought it would be okay. I really liked his lessons; he knows how to explain complex stuff in an easy way.

I was making progress quite fast, but after finishing 60%, my first internship started and I began using my spare time to learn the technologies I was working with there. Then my classes at university started, and yes, I never went back to Coursera to finish that course.

The next semester, I attended an “Artificial Neural Network” class at my university. It was a good experience and reminded me of Andrew Ng’s approach, but I left that class with the same feeling: that I still didn’t know machine learning well enough, or that I wasn’t allowed to say that I knew it.

Nobody told me, but I started thinking that in order to say that you can apply machine learning, you have to do some master’s degree program, because I saw a lot of students doing that.

Oh, another thing that I tried was to learn from articles (research papers). Please, do not do that. **That’s by far the worst approach I have ever tried**.

Maybe I was naive, but well, some teachers encourage you to learn that way. I think papers are good for finding techniques and/or algorithms that do what you want, but after making a short list, leave them and start googling for YouTube videos, blog posts, and books.

It helped me a lot.

I was doing fine at my previous job when I heard of a machine learning internship opportunity. It was with a company that I had already heard good things about, so I decided to give it a shot.

They gave me three machine learning challenges to do within a week, but since I was working and studying, I just had a weekend to do so.

- The first problem asked us to train a logistic regression model to predict a target variable from a dataset with four features. I was supposed to do an exploratory data analysis, order the most relevant features, estimate the error and do some prediction on a test dataset. For this one I was able to use the knowledge that I already had, just had to learn the Scikit Learn API.
- The second one was quite similar, but the dataset was heavily imbalanced and I had no clue how to deal with that, so I started googling and found your blog. It really helped because I discovered that I could use other metrics instead of the default accuracy, do cross validation and stratified cross validation, undersample and oversample the dataset, compare algorithms, etc. With all this new information, I created a Python module that would do that automatically for me and rank the models based on their F1-score.
- The third one was the most challenging. I was supposed to find the most relevant features in a classification dataset with 128 features. Your blog posts also helped me with that.

I couldn’t simply send them the results, I also had to write a detailed report, so your blog posts were fundamental as they helped to fill my knowledge gap very fast.

Now, in my new job, your books are helping me a lot; our manager bought the Super Bundle for us.

Thank you!

The company is called Radix and I just joined the Data Science team.

My first project was already wrapping up when I got there, but it was very interesting. It’s a system called Oil X!pert, which receives oil samples from trucks, loaders, and other equipment as input and outputs both the criticality level of the part and a diagnosis text, as shown in the figure below:

Now we’re using data-driven approaches in other projects to reach better solutions.

Specifically, the project I’m currently working on aims to find the root cause of fouling deposition on heat exchangers.

- GitHub: https://github.com/alvarolemos
- LinkedIn: https://www.linkedin.com/in/alvarolemos

The post How Álvaro Lemos got a Machine Learning Internship on a Data Science Team appeared first on Machine Learning Mastery.
