The post Difference Between Algorithm and Model in Machine Learning appeared first on Machine Learning Mastery.

For beginners, this is very confusing, as “*machine learning algorithm*” is often used interchangeably with “*machine learning model*.” Are they the same thing or something different?

As a developer, your intuition with “*algorithms*” like sort algorithms and search algorithms will help to clear up this confusion.

In this post, you will discover the difference between machine learning “*algorithms*” and “*models*.”

After reading this post, you will know:

- Machine learning algorithms are procedures that are implemented in code and are run on data.
- Machine learning models are output by algorithms and are comprised of model data and a prediction algorithm.
- Machine learning algorithms provide a type of automatic programming where machine learning models represent the program.

Let’s get started.

This tutorial is divided into four parts; they are:

- What Is an Algorithm in Machine Learning
- What Is a Model in Machine Learning
- Algorithm vs. Model Framework
- Machine Learning Is Automatic Programming

An “*algorithm*” in machine learning is a procedure that is run on data to create a machine learning “*model*.”

Machine learning algorithms perform “*pattern recognition*.” Algorithms “*learn*” from data, or are “*fit*” on a dataset.

There are many machine learning algorithms.

For example, we have algorithms for classification, such as k-nearest neighbors. We have algorithms for regression, such as linear regression, and we have algorithms for clustering, such as k-means.

Examples of machine learning algorithms:

- Linear Regression
- Logistic Regression
- Decision Tree
- Artificial Neural Network
- k-Nearest Neighbors
- k-Means

You can think of a machine learning algorithm like any other algorithm in computer science.

For example, some other types of algorithms you might be familiar with include bubble sort for sorting data and best-first for searching.

As such, machine learning algorithms have a number of properties:

- Machine learning algorithms can be described using math and pseudocode.
- The efficiency of machine learning algorithms can be analyzed and described.
- Machine learning algorithms can be implemented with any one of a range of modern programming languages.

For example, you may see machine learning algorithms described with pseudocode or linear algebra in research papers and textbooks. You may see the computational efficiency of a specific machine learning algorithm compared to another specific algorithm.

Academics can devise entirely new machine learning algorithms and machine learning practitioners can use standard machine learning algorithms on their projects. This is just like other areas of computer science where academics can devise entirely new sorting algorithms, and programmers can use the standard sorting algorithms in their applications.

You are also likely to see multiple machine learning algorithms implemented together and provided in a library with a standard application programming interface (API). A popular example is the scikit-learn library that provides implementations of many classification, regression, and clustering machine learning algorithms in Python.

A “*model*” in machine learning is the output of a machine learning algorithm run on data.

A model represents what was learned by a machine learning algorithm.

The model is the “*thing*” that is saved after running a machine learning algorithm on training data and represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.

Some examples might make this clearer:

- The linear regression algorithm results in a model comprised of a vector of coefficients with specific values.
- The decision tree algorithm results in a model comprised of a tree of if-then statements with specific values.
- The neural network / backpropagation / gradient descent algorithms together result in a model comprised of a graph structure with vectors or matrices of weights with specific values.

A machine learning model is more challenging for a beginner because there is not a clear analogy with other algorithms in computer science.

For example, the sorted list output of a sorting algorithm is not really a model.

**The best analogy is to think of the machine learning model as a “program.”**

The machine learning model “*program*” is comprised of both data and a procedure for using the data to make a prediction.

For example, consider the linear regression algorithm and resulting model. The model is comprised of a vector of coefficients (data) that are multiplied and summed with a row of new data taken as input in order to make a prediction (prediction procedure).

We save the data for the machine learning model for later use.

We often use the prediction procedure for the machine learning model provided by a machine learning library. Sometimes we may implement the prediction procedure ourselves as part of our application. This is often straightforward to do given that most prediction procedures are quite simple.
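As a minimal sketch of this split between saved data and prediction procedure, assume a linear regression model whose coefficients have already been learned. The coefficient values and the names `model_data` and `predict` are made up for illustration and do not come from any library.

```python
import json

# Hypothetical model data: an intercept plus one coefficient per input column.
model_data = {"intercept": 0.5, "coefficients": [2.0, -1.0]}

# Save the model data for later use (here to a string; a file works the same way).
saved = json.dumps(model_data)


def predict(model, row):
    """Prediction procedure: multiply each input by its coefficient and sum."""
    total = model["intercept"]
    for coefficient, value in zip(model["coefficients"], row):
        total += coefficient * value
    return total


# Later, possibly in a different program: load the data and make a prediction.
loaded = json.loads(saved)
yhat = predict(loaded, [3.0, 4.0])
print(yhat)
```

The saved string holds only numbers. The prediction procedure is a few lines of ordinary code that could live in any application.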

So now we are familiar with a machine learning “*algorithm*” vs. a machine learning “*model*.”

Specifically, an algorithm is run on data to create a model.

- Machine Learning Algorithm => Machine Learning Model

We also understand that a model is comprised of both data and a procedure for how to use the data to make a prediction on new data. You can think of the procedure as a prediction algorithm if you like.

- Machine Learning Model == Model Data + Prediction Algorithm

This division is very helpful in understanding a wide range of algorithms.

For example, most algorithms have all of their work in the “*algorithm*” and the “*prediction algorithm*” does very little.

Typically, the algorithm is some sort of optimization procedure that minimizes the error of the model (data + prediction algorithm) on the training dataset. The linear regression algorithm is a good example. It performs an optimization process (or is solved analytically using linear algebra) to find a set of coefficients that minimizes the sum of squared errors on the training dataset.

**Linear Regression:**

- **Algorithm**: Find the set of coefficients that minimizes error on the training dataset.
- **Model**:
    - **Model Data**: Vector of coefficients.
    - **Prediction Algorithm**: Multiply and sum coefficients with the input row.
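The linear regression breakdown can be sketched in a few lines of plain Python. This is an illustrative version for a single input variable that uses the closed-form least squares solution. The function names are assumptions for this sketch, not a library API.

```python
def fit_linear_regression(xs, ys):
    """Algorithm: find the coefficients that minimize squared error (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares solution for a single input variable.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept, slope]  # model data: a vector of coefficients


def predict(coefficients, x):
    """Prediction algorithm: multiply and sum coefficients with the input."""
    intercept, slope = coefficients
    return intercept + slope * x


# Toy training data that follows y = 1 + 2x exactly.
coefficients = fit_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(coefficients, predict(coefficients, 4.0))
```

Almost all of the work happens in `fit_linear_regression`. The prediction step is a single multiply and add.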

Some algorithms are trivial or even do nothing, and all of the work is in the model or prediction algorithm.

The k-nearest neighbor algorithm has no “*algorithm*” other than saving the entire training dataset. The model data, therefore, is the entire training dataset and all of the work is in the prediction algorithm, i.e. how a new row of data interacts with the saved training dataset to make a prediction.

**k-Nearest Neighbors**

- **Algorithm**: Save training data.
- **Model**:
    - **Model Data**: Entire training dataset.
    - **Prediction Algorithm**: Find k most similar rows and average their target variable.
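The same breakdown for k-nearest neighbors can be sketched as follows. This is a simplified version for regression that assumes Euclidean distance as the similarity measure. The names `fit_knn` and `predict_knn` are hypothetical.

```python
import math


def fit_knn(rows, targets):
    """Algorithm: do nothing except save the training data."""
    return {"rows": list(rows), "targets": list(targets)}


def predict_knn(model, new_row, k=3):
    """Prediction algorithm: average the targets of the k most similar rows."""
    distances = [
        (math.dist(row, new_row), target)
        for row, target in zip(model["rows"], model["targets"])
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


model = fit_knn([[1.0], [2.0], [3.0], [10.0]], [1.0, 2.0, 3.0, 10.0])
print(predict_knn(model, [2.1], k=3))
```

Here the "algorithm" is trivial and the model data is the whole dataset, so every prediction has to scan all saved rows.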

You can use this breakdown as a framework to understand any machine learning algorithm.

**What is your favorite algorithm?**

Can you describe it using this framework in the comments below?

Do you know an algorithm that does not fit neatly into this breakdown?

We really just want a machine learning “*model*” and the “*algorithm*” is just the path we follow to get the model.

Machine learning techniques are used for problems that cannot be solved efficiently or effectively in other ways.

For example, if we need to classify emails as spam or not spam, we need a software program to do this.

We could sit down, manually review a ton of email, and write if-statements to perform this task. People have tried. It turns out that this approach is slow, fragile, and not very effective.

Instead, we can use machine learning techniques to solve this problem. Specifically, an algorithm like Naive Bayes can learn how to classify email messages as spam and not spam from a large dataset of historical examples of email.

We don’t want “*Naive Bayes*.” We want the model that Naive Bayes gives us that we can use to classify email (the vectors of probabilities and the prediction algorithm for using them). We want the model, not the algorithm used to create the model.
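To make the "vectors of probabilities plus prediction algorithm" idea concrete, here is a small sketch of a word-count Naive Bayes classifier written from scratch. The four training messages are invented. Add-one smoothing and skipping unknown words are simplifying assumptions of this sketch.

```python
import math
from collections import Counter


def fit_naive_bayes(messages, labels):
    """Algorithm: count words per class and turn the counts into log-probabilities."""
    vocabulary = {word for message in messages for word in message.split()}
    model = {"priors": {}, "likelihoods": {}}
    for label in set(labels):
        docs = [m for m, l in zip(messages, labels) if l == label]
        counts = Counter(word for doc in docs for word in doc.split())
        total = sum(counts.values())
        model["priors"][label] = math.log(len(docs) / len(messages))
        # Add-one smoothing so a word unseen in one class does not zero it out.
        model["likelihoods"][label] = {
            word: math.log((counts[word] + 1) / (total + len(vocabulary)))
            for word in vocabulary
        }
    return model


def predict(model, message):
    """Prediction algorithm: pick the class with the highest log-probability."""
    scores = {}
    for label, prior in model["priors"].items():
        likelihood = model["likelihoods"][label]
        # Words that never appeared in training are skipped.
        scores[label] = prior + sum(
            likelihood[word] for word in message.split() if word in likelihood
        )
    return max(scores, key=scores.get)


# Invented toy dataset of historical email examples.
messages = ["win money now", "free money offer", "meeting at noon", "lunch at noon tomorrow"]
labels = ["spam", "spam", "not spam", "not spam"]
model = fit_naive_bayes(messages, labels)
print(predict(model, "free money"))
print(predict(model, "see you at noon"))
```

Once `model` exists, `fit_naive_bayes` is no longer needed. Only the dictionary of log-probabilities and the short `predict` function would ship with an application.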

In this sense, the machine learning model is a program automatically written or created or learned by the machine learning algorithm to solve our problem.

As developers, we are less interested in the “*learning*” performed by machine learning algorithms in the artificial intelligence sense. We don’t care about simulating learning processes. Some people may be, and it is interesting, but this is not why we are using machine learning algorithms.

Instead, we are more interested in the automatic programming capability offered by machine learning algorithms. We want an effective model created efficiently that we can incorporate into our software project.

Machine learning algorithms perform automatic programming and machine learning models are the programs created for us.

In this post, you discovered the difference between machine learning “*algorithms*” and “*models*.”

Specifically, you learned:

- Machine learning algorithms are procedures that are implemented in code and are run on data.
- Machine learning models are output by algorithms and are comprised of model data and a prediction algorithm.
- Machine learning algorithms provide a type of automatic programming where machine learning models represent the program.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post How to Handle Big-p, Little-n (p >> n) in Machine Learning appeared first on Machine Learning Mastery.

Machine learning datasets are often structured or tabular data comprised of rows and columns.

The columns that are fed as input to a model are called predictors, and their number is denoted “*p*.” The rows are samples, and their number is denoted “*n*.” Most machine learning algorithms assume that there are many more samples than there are predictors, denoted as p << n.

Sometimes, this is not the case, and there are many more predictors than samples in the dataset, referred to as “**big-p, little-n**” and denoted as **p >> n**. These problems often require specialized data preparation and modeling algorithms to address them correctly.

In this tutorial, you will discover the challenge of big-p, little-n or p >> n machine learning problems.

After completing this tutorial, you will know:

- Most machine learning problems have many more samples than predictors and most machine learning algorithms make this assumption during the training process.
- Some modeling problems have many more predictors than samples, referred to as p >> n.
- Algorithms to explore when modeling machine learning datasets with more predictors than samples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Predictors (p) and Samples (n)
- Machine Learning Assumes p << n
- How to Handle p >> n

Consider a predictive modeling problem, such as classification or regression.

The dataset is structured data or tabular data, like what you might see in an Excel spreadsheet.

There are columns and rows. Most of the columns would be used as inputs to a model and one column would represent the output or variable to be predicted.

The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable—in this case, sales—is often called the response or dependent variable, and is typically denoted using the symbol Y.

— Page 15, An Introduction to Statistical Learning with Applications in R, 2017.

Each column represents a variable or one aspect of a sample. The columns that represent the inputs to the model are called predictors.

Each row represents one sample with values across each of the columns or features.

- **Predictors**: Input columns of a dataset, also called input variables or features.
- **Samples**: Rows of a dataset, also called observations, examples, or instances.

It is common to describe a training dataset in machine learning in terms of the predictors and samples.

The number of predictors in a dataset is described using the term “*p*” and the number of samples in a dataset is described using the term “*n*” or sometimes “*N*.”

- **p**: The number of predictors in a dataset.
- **n**: The number of samples in a dataset.

To make this concrete, let’s take a look at the iris flowers classification problem.

Below is a sample of the first five rows of this dataset.

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    ...

This dataset has five columns and 150 rows.

The first four columns are inputs and the fifth column is the output, meaning that there are four predictors.

We would describe the iris flowers dataset as:

- p=4, n=150.
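As a small sketch, p and n can be read directly off the shape of the data. The snippet below uses only the five sample rows listed earlier, so it reports n=5 rather than the 150 rows of the full dataset. It assumes the last column is the output.

```python
import csv
import io

# The five sample rows of the iris dataset (the full dataset has 150 rows).
sample = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa"""

rows = list(csv.reader(io.StringIO(sample)))

n = len(rows)         # number of samples (rows)
p = len(rows[0]) - 1  # number of predictors (all columns except the output)
print(f"p={p}, n={n}")
```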

It is almost always the case that the number of predictors (*p*) will be smaller than the number of samples (*n*).

Often much smaller.

We can summarize this expectation as p << n, where “<<” is a mathematical inequality that means “*much less than*.”

**p << n**: Typically we have fewer predictors than samples.

To demonstrate this, let’s look at a few more standard machine learning datasets:

- Pima Indians Diabetes: p=8, n=768
- Glass Identification: p=9, n=214
- Boston Housing: p=13, n=506

Most machine learning algorithms operate based on the assumption that there are many more samples than predictors.

One way to think about predictors and samples is to take a geometrical perspective.

Consider a hypercube where the number of predictors (*p*) defines the number of dimensions of the hypercube. The volume of this hypercube is the scope of possible samples that could be drawn from the domain. The *n* samples are the actual observations drawn from the domain, and they are all you have to model your predictive modeling problem.

This is a rationale for the axiom “get as much data as possible” in applied machine learning. It is a desire to gather a sufficiently representative sample of the *p*-dimensional problem domain.

As the number of dimensions (*p*) increases, the volume of the domain increases exponentially. This, in turn, requires more samples (*n*) from the domain to provide effective coverage of the domain for a learning algorithm. We don’t need full coverage of the domain, just what is likely to be observable.

This challenge of effectively sampling high-dimensional spaces is generally referred to as the curse of dimensionality.
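A rough way to see the exponential growth is to split each dimension into a fixed number of bins and count the resulting cells. The choice of 10 bins per dimension and 1,000 samples is arbitrary and only for illustration.

```python
def cells_in_grid(p, bins_per_dimension=10):
    """Number of cells when each of p dimensions is split into equal bins."""
    return bins_per_dimension ** p


n = 1000  # a fixed number of samples
for p in (1, 2, 3, 10):
    cells = cells_in_grid(p)
    print(f"p={p}: {cells} cells, {n / cells:g} samples per cell on average")
```

With the sample count held fixed, the average number of samples per cell drops by a factor of 10 for every added dimension.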

Machine learning algorithms overcome the curse of dimensionality by making assumptions about the data and structure of the mapping function from inputs to outputs. They add a bias.

The fundamental reason for the curse of dimensionality is that high-dimensional functions have the potential to be much more complicated than low-dimensional ones, and that those complications are harder to discern. The only way to beat the curse is to incorporate knowledge about the data that is correct.

— Page 15, Pattern Classification, 2000.

Many machine learning algorithms that use distance measures and other local models (in feature space) often degrade in performance as the number of predictors is increased.

When the number of features p is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when p is large.

— Page 168, An Introduction to Statistical Learning with Applications in R, 2017.
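One way to get a feel for why local methods struggle is a small simulation. It measures how much farther the farthest point is than the nearest point from a random query. The uniform random data, the "relative contrast" measure, and the function name are assumptions of this sketch, and the exact numbers depend on the seed.

```python
import math
import random


def relative_contrast(p, n=200, seed=1):
    """(farthest - nearest) / nearest distance from a random query to n random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(p)]
    distances = [
        math.dist(query, [rng.random() for _ in range(p)]) for _ in range(n)
    ]
    return (max(distances) - min(distances)) / min(distances)


for p in (2, 10, 100, 1000):
    print(f"p={p}: relative contrast {relative_contrast(p):.2f}")
```

As p grows, the nearest and farthest neighbors end up at similar distances, so "nearest" carries less information.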

It is not always the case that the number of predictors is less than the number of samples.

Some predictive modeling problems have more predictors than samples by definition.

Often many more predictors than samples.

This is often described as “*big-p, little-n*,” “*large-p, small-n*,” or more compactly as “p >> n”, where the “>>” is a mathematical inequality operator that means “*much more than*.”

… prediction problems in which the number of features p is much larger than the number of observations N, often written p >> N.

— Page 649, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

Consider this from a geometrical perspective.

Now, instead of having a domain with tens of dimensions (or fewer), the domain has many thousands of dimensions and only a few tens of samples from this space. We cannot expect to have anything like a representative sample of the domain.

Many examples of p >> n problems come from the field of medicine, where there is a small patient population and a large number of descriptive characteristics.

At the same time, applications have emerged in which the number of experimental units is comparatively small but the underlying dimension is massive; illustrative examples might include image analysis, microarray analysis, document classification, astronomy and atmospheric science.

— Statistical challenges of high-dimensional data, 2009.

A common example of a p >> n problem is gene expression arrays, where there may be thousands of genes (predictors) and only tens of samples.

Gene expression arrays typically have 50 to 100 samples and 5,000 to 20,000 variables (genes).

— Expression Arrays and the p >> n Problem, 2003.

Given that most machine learning algorithms assume many more samples than predictors, this introduces a challenge when modeling.

Specifically, the assumptions made by standard machine learning models may cause the models to behave unexpectedly, provide misleading results, or fail completely.

… models cannot be used “out of the box”, since the standard fitting algorithms all require p<n; in fact the usual rule of thumb is that there be five or ten times as many samples as variables.

— Expression Arrays and the p >> n Problem, 2003.

A major problem with p >> n problems when using machine learning models is overfitting the training dataset.

Given the lack of samples, most models are unable to generalize and instead learn the statistical noise in the training data. This makes the model perform well on the training dataset but perform poorly on new examples from the problem domain.

This is also a hard problem to diagnose, as the lack of samples does not allow for a test or validation dataset by which model overfitting can be evaluated. As such, it is common to use leave-one-out style cross-validation (LOOCV) when evaluating models on p >> n problems.
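The leave-one-out procedure can be sketched as below. Each sample is held out once while a model is fit on the rest. The mean-predicting baseline and all function names are hypothetical and only there to make the example runnable.

```python
def leave_one_out_splits(n):
    """Yield (train_indices, test_index) pairs, holding out one sample each time."""
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        yield train, held_out


def loocv_error(xs, ys, fit, predict):
    """Mean absolute error over all leave-one-out splits."""
    errors = []
    for train, test in leave_one_out_splits(len(xs)):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(abs(predict(model, xs[test]) - ys[test]))
    return sum(errors) / len(errors)


# Hypothetical baseline "model": always predict the mean of the training targets.
def fit_mean(train_xs, train_ys):
    return sum(train_ys) / len(train_ys)


def predict_mean(model, x):
    return model


print(loocv_error([1, 2, 3, 4], [2.0, 4.0, 6.0, 8.0], fit_mean, predict_mean))
```

With n samples this fits n models, which is affordable precisely because n is small in p >> n problems.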

There are many ways to approach a p >> n type classification or regression problem.

Some examples include ignoring p and n, feature selection, projection methods, and regularized algorithms.

One approach is to ignore the p and n relationship and evaluate standard machine learning models.

This might be considered the baseline method by which any other more specialized interventions may be compared.

Feature selection involves selecting a subset of predictors to use as input to predictive models.

Common techniques include filter methods that select features based on their statistical relationship to the target variable (e.g. correlation), and wrapper methods that select features based on their contribution to a model when predicting the target variable (e.g. RFE).

A suite of feature selection methods could be evaluated and compared, perhaps applied in an aggressive manner to dramatically reduce the number of input features to those determined to be most critical.
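A filter method of this kind can be sketched with a plain Pearson correlation score. The toy data is invented, and the sketch assumes numeric columns with non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_top_k(rows, target, k):
    """Filter method: keep the k columns most correlated (in magnitude) with the target."""
    columns = list(zip(*rows))
    scores = [abs(pearson(column, target)) for column in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: column 0 tracks the target, column 1 is uncorrelated, column 2 tracks it inversely.
rows = [[1, 5, 9], [2, 3, 7], [3, 3, 6], [4, 5, 2]]
target = [10, 20, 30, 40]
print(select_top_k(rows, target, k=2))
```

The function returns column indices, so the same selection can be applied to new rows before they are passed to a model.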

Feature selection is an important scientific requirement for a classifier when p is large.

— Page 658, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.


Projection methods create a lower-dimensionality representation of samples that preserves the relationships observed in the data.

They are often used for visualization, although the dimensionality reduction nature of the techniques may also make them useful as a data transform to reduce the number of predictors.

This might include techniques from linear algebra, such as SVD and PCA.

When p > N, the computations can be carried out in an N-dimensional space, rather than p, via the singular value decomposition …

— Page 659, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.

It might also include manifold learning algorithms often used for visualization such as t-SNE.
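As a rough sketch of the projection idea, the code below finds the first principal component of a small dataset by power iteration on its covariance matrix and projects rows onto it. This is one simple way to approximate PCA, not how a numerical library would do it. It builds a p-by-p matrix, so it only suits small p. It also assumes the starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iterations=100):
    """Leading eigenvector of the covariance matrix, found by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    vector = [1.0] * p
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return means, vector


def project(row, means, component):
    """Reduce a row to a single number: its coordinate along the component."""
    return sum((value - mean) * c for value, mean, c in zip(row, means, component))


# Toy data lying exactly on the line y = 2x, so one dimension captures everything.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, component = first_principal_component(data)
print(component, [project(row, means, component) for row in data])
```

Two predictors are replaced by one projected value per row. The same transform could feed a downstream model.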

Standard machine learning algorithms may be adapted to use regularization during the training process.

This will penalize models based on the number of features used or weighting of features, encouraging the model to perform well and minimize the number of predictors used in the model.

This can act as a type of automatic feature selection during training and may involve augmenting existing models (e.g. regularized linear regression and regularized logistic regression) or the use of specialized methods such as LARS and LASSO.
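The effect of a penalty on the weighting of a feature can be shown with the simplest possible case. This is a one-feature, no-intercept least squares fit with an L2 (ridge) penalty, which has a closed-form solution. The function name and toy data are assumptions of this sketch.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Least squares for y ~ w * x with an L2 penalty on w (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for penalty in (0.0, 1.0, 10.0):
    print(f"penalty={penalty}: w={fit_ridge_1d(xs, ys, penalty):.3f}")
```

A larger penalty shrinks the coefficient toward zero. An L2 penalty shrinks coefficients without removing them, whereas an L1 penalty as used by LASSO can drive some coefficients exactly to zero, which is what makes it act as feature selection.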

There is no best method and it is recommended to use controlled experiments to test a suite of different methods.

This section provides more resources on the topic if you are looking to go deeper.

- Expression Arrays and the p >> n Problem, 2003.
- Statistical challenges of high-dimensional data, 2009.

- Pattern Classification, 2000.
- Chapter 18, High-Dimensional Problems: p >> N, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2016.
- An Introduction to Statistical Learning with Applications in R, 2017.

In this tutorial, you discovered the challenge of big-p, little-n or p >> n machine learning problems.

Specifically, you learned:

- Machine learning datasets can be described in terms of the number of predictors (p) and the number of samples (n).
- Most machine learning problems have many more samples than predictors and most machine learning algorithms make this assumption during the training process.
- Some modeling problems have many more predictors than samples, such as problems from medicine, referred to as p >> n, and may require the use of specialized algorithms.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post A Tour of Machine Learning Algorithms appeared first on Machine Learning Mastery.

It is useful to tour the main algorithms in the field to get a feeling of what methods are available.

There are so many algorithms that it can feel overwhelming when algorithm names are thrown around and you are expected to just know what they are and where they fit.

I want to give you two ways to think about and categorize the algorithms you may come across in the field.

- The first is a grouping of algorithms by their **learning style**.
- The second is a grouping of algorithms by their **similarity** in form or function (like grouping similar animals together).

Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a tour of a variety of different algorithm types.

After reading this post, you will have a much better understanding of the most popular machine learning algorithms for supervised learning and how they are related.


Let’s get started.

There are different ways an algorithm can model a problem based on its interaction with the experience or environment or whatever we want to call the input data.

It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles that an algorithm can adopt.

There are only a few main learning styles or learning models that an algorithm can have and we’ll go through them here with a few examples of algorithms and problem types that they suit.

This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think about the roles of the input data and the model preparation process and select one that is the most appropriate for your problem in order to get the best result.

In supervised learning, input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
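The "predict, then correct when wrong" training process can be sketched with a perceptron, used here only because it is one of the simplest error-corrected learners. The logical AND dataset, the learning rate, and the epoch limit are arbitrary choices for illustration.

```python
def predict(weights, row):
    """Threshold the weighted sum of the inputs (weights[0] is the bias)."""
    activation = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return 1 if activation >= 0 else 0


def train_perceptron(rows, labels, epochs=20, learning_rate=1.0):
    """Predict each row in turn and correct the weights whenever a prediction is wrong."""
    weights = [0.0] * (len(rows[0]) + 1)  # bias followed by one weight per input
    for _ in range(epochs):
        mistakes = 0
        for row, label in zip(rows, labels):
            error = label - predict(weights, row)
            if error != 0:
                mistakes += 1
                weights[0] += learning_rate * error
                for i, value in enumerate(row):
                    weights[i + 1] += learning_rate * error * value
        if mistakes == 0:  # no errors left on the training data, so stop
            break
    return weights


# Labeled training data for logical AND.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
weights = train_perceptron(rows, labels)
print(weights, [predict(weights, row) for row in rows])
```

The known labels are what make the correction step possible. Without them there would be no error to measure.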

In unsupervised learning, input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and K-Means.
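Organizing unlabeled data by similarity can be sketched with a bare-bones k-means on one-dimensional values. The starting centroids are supplied by hand here, which is a simplification. The values themselves are made up.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


centroids, clusters = k_means_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are involved at any point. The groups emerge only from how close the values are to each other.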

In semi-supervised learning, input data is a mixture of labeled and unlabeled examples.

There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.

When crunching data to model business decisions, you are most typically using supervised and unsupervised learning methods.

A hot topic at the moment is semi-supervised learning methods in areas such as image classification where there are large datasets with very few labeled examples.

Algorithms are often grouped by similarity in terms of their function (how they work), for example, tree-based methods and neural network inspired methods.

I think this is the most useful way to group algorithms and it is the approach we will use here.

This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit into multiple categories, like Learning Vector Quantization, which is both a neural network inspired method and an instance-based method. There are also categories whose names describe both the problem and the class of algorithm, such as Regression and Clustering.

We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the “*best*” fit. I like this latter approach of not duplicating algorithms to keep things simple.

In this section, we list many of the popular machine learning algorithms grouped the way we think is the most intuitive. The list is not exhaustive in either the groups or the algorithms, but I think it is representative and will be useful to you to get an idea of the lay of the land.

**Please Note**: There is a strong bias towards algorithms used for classification and regression, the two most prevalent supervised machine learning problems you will encounter.

If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us. Let’s dive in.

Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to the class of problem and the class of algorithm. Really, regression is a process.

The most popular regression algorithms are:

- Ordinary Least Squares Regression (OLSR)
- Linear Regression
- Logistic Regression
- Stepwise Regression
- Multivariate Adaptive Regression Splines (MARS)
- Locally Estimated Scatterplot Smoothing (LOESS)

An instance-based learning model treats prediction as a decision problem that is answered using instances, or examples, of training data that are deemed important or required by the model.

Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and similarity measures used between instances.

The most popular instance-based algorithms are:

- k-Nearest Neighbor (kNN)
- Learning Vector Quantization (LVQ)
- Self-Organizing Map (SOM)
- Locally Weighted Learning (LWL)
- Support Vector Machines (SVM)

Regularization algorithms are an extension made to another method (typically regression methods) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.

I have listed regularization algorithms separately here because they are popular, powerful and generally simple modifications made to other methods.

The most popular regularization algorithms are:

- Ridge Regression
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Elastic Net
- Least-Angle Regression (LARS)

Decision tree methods construct a model of decisions made based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

- Classification and Regression Tree (CART)
- Iterative Dichotomiser 3 (ID3)
- C4.5 and C5.0 (different versions of a powerful approach)
- Chi-squared Automatic Interaction Detection (CHAID)
- Decision Stump
- M5
- Conditional Decision Trees
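The way decisions fork until a prediction is reached can be sketched with a tree stored as nested dictionaries. The attributes, thresholds, and class names below are invented for illustration and were not learned from any data.

```python
# Hypothetical fitted tree: internal nodes test one attribute against a threshold,
# leaves hold the prediction.
tree = {
    "attribute": "petal_length",
    "threshold": 2.5,
    "left": {"prediction": "setosa"},
    "right": {
        "attribute": "petal_width",
        "threshold": 1.7,
        "left": {"prediction": "versicolor"},
        "right": {"prediction": "virginica"},
    },
}


def predict(node, record):
    """Follow the forks until a leaf (a prediction) is reached."""
    while "prediction" not in node:
        branch = "left" if record[node["attribute"]] < node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 5.0, "petal_width": 2.0}))
```

A tree-learning algorithm such as CART would choose the attributes and thresholds from training data. Only the traversal is shown here.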

Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as classification and regression.

The most popular Bayesian algorithms are:

- Naive Bayes
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Averaged One-Dependence Estimators (AODE)
- Bayesian Belief Network (BBN)
- Bayesian Network (BN)

Clustering, like regression, describes the class of problem and the class of methods.

Clustering methods are typically organized by modeling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.

The most popular clustering algorithms are:

- k-Means
- k-Medians
- Expectation Maximisation (EM)
- Hierarchical Clustering

Association rule learning methods extract rules that best explain observed relationships between variables in data.

These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organization.

The most popular association rule learning algorithms are:

- Apriori algorithm
- Eclat algorithm
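Association rules are commonly scored with two standard measures, support and confidence. The sketch below computes both for a rule on a tiny set of invented shopping transactions. It only scores a given rule and does not search for rules the way Apriori or Eclat do.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions that contain the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


transactions = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["milk"],
]
print(support(transactions, ["bread", "butter"]))
print(confidence(transactions, ["bread"], ["butter"]))
```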

Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks.

They are a class of pattern-matching methods that are commonly used for regression and classification problems but are really an enormous subfield comprised of hundreds of algorithms and variations for all manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods.

The most popular artificial neural network algorithms are:

- Perceptron
- Multilayer Perceptrons (MLP)
- Back-Propagation
- Stochastic Gradient Descent
- Hopfield Network
- Radial Basis Function Network (RBFN)

Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation.

They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with very large datasets of labelled analog data, such as images, text, audio, and video.

The most popular deep learning algorithms are:

- Convolutional Neural Network (CNN)
- Recurrent Neural Networks (RNNs)
- Long Short-Term Memory Networks (LSTMs)
- Stacked Auto-Encoders
- Deep Boltzmann Machine (DBM)
- Deep Belief Networks (DBN)

Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the data, but in this case in an unsupervised manner, in order to summarize or describe data using less information.

This can be useful to visualize high-dimensional data or to simplify data, which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

- Principal Component Analysis (PCA)
- Principal Component Regression (PCR)
- Partial Least Squares Regression (PLSR)
- Sammon Mapping
- Multidimensional Scaling (MDS)
- Projection Pursuit
- Linear Discriminant Analysis (LDA)
- Mixture Discriminant Analysis (MDA)
- Quadratic Discriminant Analysis (QDA)
- Flexible Discriminant Analysis (FDA)

Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular.

- Boosting
- Bootstrapped Aggregation (Bagging)
- AdaBoost
- Weighted Average (Blending)
- Stacked Generalization (Stacking)
- Gradient Boosting Machines (GBM)
- Gradient Boosted Regression Trees (GBRT)
- Random Forest

Many algorithms were not covered.

I did not cover algorithms from specialty tasks in the process of machine learning, such as:

- Feature selection algorithms
- Algorithm accuracy evaluation
- Performance measures
- Optimization algorithms

I also did not cover algorithms from specialty subfields of machine learning, such as:

- Computational intelligence (evolutionary algorithms, etc.)
- Computer Vision (CV)
- Natural Language Processing (NLP)
- Recommender Systems
- Reinforcement Learning
- Graphical Models
- And more…

These may feature in future posts.

This tour of machine learning algorithms was intended to give you an overview of what is out there and some ideas on how to relate algorithms to each other.

I’ve collected together some resources for you to continue your reading on algorithms. If you have a specific question, please leave a comment.

There are other great lists of algorithms out there if you’re interested. Below are a few hand-selected examples.

- List of Machine Learning Algorithms: On Wikipedia. Although extensive, I do not find this list or the organization of the algorithms particularly useful.
- Machine Learning Algorithms Category: Also on Wikipedia, slightly more useful than Wikipedia’s great list above. It organizes algorithms alphabetically.
- CRAN Task View: Machine Learning & Statistical Learning: A list of all the packages and all the algorithms supported by each machine learning package in R. Gives you a grounded feeling of what’s out there and what people are using for analysis day-to-day.
- Top 10 Algorithms in Data Mining: Published article and now a book (Affiliate Link) on the most popular algorithms for data mining. Another grounded and less overwhelming take on methods that you could go off and learn deeply.

Algorithms are a big part of machine learning. It’s a topic I am passionate about and write about a lot on this blog. Below are a few hand-selected posts that might interest you for further reading.

- How to Learn Any Machine Learning Algorithm: A systematic approach that you can use to study and understand any machine learning algorithm using “algorithm description templates” (I used this approach to write my first book).
- How to Create Targeted Lists of Machine Learning Algorithms: How you can create your own systematic lists of machine learning algorithms to jump start work on your next machine learning problem.
- How to Research a Machine Learning Algorithm: A systematic approach that you can use to research machine learning algorithms (works great in collaboration with the template approach listed above).
- How to Investigate Machine Learning Algorithm Behavior: A methodology you can use to understand how machine learning algorithms work by creating and executing very small studies into their behavior. Research is not just for academics!
- How to Implement a Machine Learning Algorithm: A process and tips and tricks for implementing machine learning algorithms from scratch.

Sometimes you just want to dive into code. Below are some links you can use to run machine learning algorithms, code them up using standard libraries or implement them from scratch.

- How To Get Started With Machine Learning Algorithms in R: Links to a large number of code examples on this site demonstrating machine learning algorithms in R.
- Machine Learning Algorithm Recipes in scikit-learn: A collection of Python code examples demonstrating how to create predictive models using scikit-learn.
- How to Run Your First Classifier in Weka: A tutorial for running your very first classifier in Weka (**no code required!**).

I hope you have found this tour useful.

Please, leave a comment if you have any questions or ideas on how to improve the algorithm tour.

**Update**: Continue the discussion on HackerNews and reddit.

The post A Tour of Machine Learning Algorithms appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Concept Drift in Machine Learning appeared first on Machine Learning Mastery.

]]>This problem of the changing underlying relationships in the data is called concept drift in the field of machine learning.

In this post, you will discover the problem of concept drift and ways you may be able to address it in your own predictive modeling problems.

After completing this post, you will know:

- The problem of data changing over time.
- What is concept drift and how it is defined.
- How to handle concept drift in your own predictive modeling problems.

Let’s get started.

This post is divided into 3 parts; they are:

- Changes to Data Over Time
- What is Concept Drift?
- How to Address Concept Drift

Predictive modeling is the problem of learning a model from historical data and using the model to make predictions on new data where we do not know the answer.

Technically, predictive modeling is the problem of approximating a mapping function (f) given input data (X) to predict an output value (y).

y = f(X)

Often, this mapping is assumed to be static, meaning that the mapping learned from historical data is just as valid in the future on new data and that the relationships between input and output data do not change.

This is true for many problems, but not all problems.

In some cases, the relationships between input and output data can change over time, meaning that in turn there are changes to the unknown underlying mapping function.

The changes may be consequential, such that the predictions made by a model trained on older historical data are no longer correct, or not as correct as they could be if the model were trained on more recent historical data.

These changes, in turn, may be able to be detected, and if detected, it may be possible to update the learned model to reflect these changes.

… many data mining methods assume that discovered patterns are static. However, in practice patterns in the database evolve over time. This poses two important challenges. The first challenge is to detect when concept drift occurs. The second challenge is to keep the patterns up-to-date without inducing the patterns from scratch.

— Page 10, Data Mining and Knowledge Discovery Handbook, 2010.

Concept drift in machine learning and data mining refers to the change in the relationships between input and output data in the underlying problem over time.

In other domains, this change may be called “*covariate shift*,” “*dataset shift*,” or “*nonstationarity*.”

In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Patterns and relations in such data often evolve over time, thus, models built for analyzing such data quickly become obsolete over time. In machine learning and data mining this phenomenon is referred to as concept drift.

— An overview of concept drift applications, 2016.

A concept in “*concept drift*” refers to the unknown and hidden relationship between inputs and output variables.

For example, one concept in weather data may be the season that is not explicitly specified in temperature data, but may influence temperature data. Another example may be customer purchasing behavior over time that may be influenced by the strength of the economy, where the strength of the economy is not explicitly specified in the data. These elements are also called a “hidden context”.

A difficult problem with learning in many real-world domains is that the concept of interest may depend on some hidden context, not given explicitly in the form of predictive features. A typical example is weather prediction rules that may vary radically with the season. […] Often the cause of change is hidden, not known a priori, making the learning task more complicated.

— The problem of concept drift: definitions and related work, 2004.

The change to the data could take any form. It is conceptually easier to consider the case where there is some temporal consistency to the change such that data collected within a specific time period show the same relationship and that this relationship changes smoothly over time.

Note that this is not always the case and this assumption should be challenged. Some other types of changes may include:

- A gradual change over time.
- A recurring or cyclical change.
- A sudden or abrupt change.

Different concept drift detection and handling schemes may be required for each situation. Often, recurring change and long-term trends are considered systematic and can be explicitly identified and handled.

Concept drift may be present on supervised learning problems where predictions are made and data is collected over time. These are traditionally called online learning problems, given the change expected in the data over time.

There are domains where predictions are ordered by time, such as time series forecasting and predictions on streaming data where the problem of concept drift is more likely and should be explicitly tested for and addressed.

A common challenge when mining data streams is that the data streams are not always strictly stationary, i.e., the concept of data (underlying distribution of incoming data) unpredictably drifts over time. This has encouraged the need to detect these concept drifts in the data streams in a timely manner

— Concept Drift Detection for Streaming Data, 2015.

Indre Zliobaite in the 2010 paper titled “Learning under Concept Drift: An Overview” provides a framework for thinking about concept drift and the decisions required by the machine learning practitioner, as follows:

- **Future assumption**: a designer needs to make an assumption about the future data source.
- **Change type**: a designer needs to identify possible change patterns.
- **Learner adaptivity**: based on the change type and the future assumption, a designer chooses the mechanisms which make the learner adaptive.
- **Model selection**: a designer needs a criterion to choose a particular parametrization of the selected learner at every time step (e.g. the weights for ensemble members, the window size for variable window method).

This framework may help in thinking about the decision points available to you when addressing concept drift on your own predictive modeling problems.

There are many ways to address concept drift; let’s take a look at a few.

The most common way is to not handle it at all and assume that the data does not change.

This allows you to develop a single “best” model once and use it on all future data.

This should be your starting point and baseline for comparison to other methods. If you believe your dataset may suffer concept drift, you can use a static model in two ways:

- **Concept Drift Detection**. Monitor skill of the static model over time and if skill drops, perhaps concept drift is occurring and some intervention is required.
- **Baseline Performance**. Use the skill of the static model as a baseline to compare to any intervention you make.
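Monitoring the skill of a static model can be as simple as tracking accuracy over a rolling window of recent predictions and raising a flag when it falls below a threshold. The sketch below shows one way this might look; the window size, the threshold and the simulated outcomes are all assumptions for the illustration.

```python
from collections import deque

def monitor_accuracy(outcomes, window=50, threshold=0.8):
    """Return the indices at which rolling accuracy is below the threshold."""
    recent = deque(maxlen=window)  # keeps only the last `window` outcomes
    alerts = []
    for i, correct in enumerate(outcomes):
        recent.append(1 if correct else 0)
        if len(recent) == window:  # wait until the window is full
            accuracy = sum(recent) / window
            if accuracy < threshold:
                alerts.append(i)
    return alerts

# simulated outcomes: the model is always right for 100 predictions,
# then right only half the time (a hypothetical drift)
outcomes = [True] * 100 + [False, True] * 50
alerts = monitor_accuracy(outcomes)
first_alert = alerts[0] if alerts else None
```

In practice the threshold would come from the baseline skill you measured when the model was first developed.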

A good first-level intervention is to periodically update your static model with more recent historical data.

For example, perhaps you can update the model each month or each year with the data collected from the prior period.

This may also involve back-testing the model in order to select a suitable amount of historical data to include when re-fitting the static model.

In some cases, it may be appropriate to only include a small portion of the most recent historical data to best capture the new relationships between inputs and outputs (e.g. the use of a sliding window).
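As a rough sketch of the sliding window idea, the example below fits a simple linear regression on the full history and again on only the most recent observations. The toy data, in which the relationship changes part way through, and the helper names are assumptions for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def refit_on_window(xs, ys, window):
    """Re-fit the model using only the most recent `window` observations."""
    return fit_line(xs[-window:], ys[-window:])

# hypothetical data: y = 2x for the first 100 points, then y = -x + 300
xs = list(range(200))
ys = [2 * x if x < 100 else -x + 300 for x in xs]

full_fit = fit_line(xs, ys)                          # blends old and new relationships
window_fit = refit_on_window(xs, ys, window=50)      # reflects only the new one
```

The window size is a hyperparameter; back-testing on historical data is one way to choose it.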

Some machine learning models can be updated.

This is an efficiency over the previous approach (periodically re-fit) where instead of discarding the static model completely, the existing state is used as the starting point for a fit process that updates the model fit using a sample of the most recent historical data.

For example, this approach is suitable for most machine learning algorithms that use weights or coefficients such as regression algorithms and neural networks.
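For a model defined by coefficients, updating can mean continuing gradient descent from the existing coefficients rather than from scratch. The sketch below does this for a one-input linear model; the starting weights, the recent data and the learning rate are assumptions for the illustration.

```python
def sgd_update(weights, xs, ys, learning_rate=0.1, n_epochs=500):
    """Continue fitting y = w0 + w1 * x with SGD, starting from existing weights."""
    w0, w1 = weights
    for _ in range(n_epochs):
        for x, y in zip(xs, ys):
            error = (w0 + w1 * x) - y
            w0 -= learning_rate * error       # gradient of squared error w.r.t. w0
            w1 -= learning_rate * error * x   # gradient of squared error w.r.t. w1
    return w0, w1

# existing model, learned on older data (hypothetical): y = 2x
old_weights = (0.0, 2.0)

# recent data now follows y = 1 + 3x
recent_x = [0.0, 0.5, 1.0, 1.5, 2.0]
recent_y = [1.0 + 3.0 * x for x in recent_x]

new_weights = sgd_update(old_weights, recent_x, recent_y)
```

With fewer epochs the updated model would sit somewhere between the old and new relationships, which can be a useful way to blend them.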

Some algorithms allow you to weigh the importance of input data.

In this case, you can use a weighting that is inversely proportional to the age of the data such that more attention is paid to the most recent data (higher weight) and less attention is paid to the least recent data (smaller weight).
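Here is a minimal sketch of age-based weighting using weighted least squares. To keep it short it fits a no-intercept model y = b * x, and it uses the weight 1 / (age + 1); both choices, and the toy data, are assumptions for the illustration.

```python
def weighted_slope(xs, ys, weights):
    """Weighted least squares for the no-intercept model y = b * x."""
    numerator = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    denominator = sum(w * x * x for w, x in zip(weights, xs))
    return numerator / denominator

n = 100
xs = [float(i % 10) for i in range(n)]
# hypothetical drift: slope 1 for the older half, slope 3 for the newer half
ys = [(1.0 if i < n // 2 else 3.0) * x for i, x in enumerate(xs)]

ages = [n - 1 - i for i in range(n)]            # 0 = most recent observation
weights = [1.0 / (age + 1) for age in ages]     # inversely proportional to age

uniform_slope = weighted_slope(xs, ys, [1.0] * n)   # treats all data equally
recent_slope = weighted_slope(xs, ys, weights)      # favors recent data
```

The unweighted fit averages the two relationships, while the weighted fit is pulled toward the more recent one.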

An ensemble approach can be used where the static model is left untouched, but a new model learns to correct the predictions from the static model based on the relationships in more recent data.

This may be thought of as a boosting type ensemble (in spirit only) where subsequent models correct the predictions from prior models. The key difference here is that subsequent models are fit on different and more recent data, as opposed to a weighted form of the same dataset, as in the case of AdaBoost and gradient boosting.
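One way to sketch this is to fit a second model on the residual errors that the static model makes on recent data, then add its output to the static prediction. The static coefficients and the recent data below are hypothetical values chosen for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

# static model, fit earlier and left untouched (hypothetical): y = 2x
static_b0, static_b1 = 0.0, 2.0

# recent data, where the relationship has shifted upward by 5
recent_x = [0.0, 1.0, 2.0, 3.0, 4.0]
recent_y = [2.0 * x + 5.0 for x in recent_x]

# the second model learns what the static model gets wrong on recent data
residuals = [y - (static_b0 + static_b1 * x) for x, y in zip(recent_x, recent_y)]
corr_b0, corr_b1 = fit_line(recent_x, residuals)

def predict(x):
    """Static prediction plus the learned correction."""
    return (static_b0 + static_b1 * x) + (corr_b0 + corr_b1 * x)
```

The correcting model can be re-fit as often as needed without ever touching the original.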

For some problem domains it may be possible to design systems to detect changes and choose a specific and different model to make predictions.

This may be appropriate for domains that expect abrupt changes that may have occurred in the past and can be checked for in the future. It also assumes that it is possible to develop skillful models to handle each of the detectable changes to the data.

For example, the abrupt change may be a specific observation or observations in a range, or the change in the distribution of one or more input variables.
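A very small sketch of this idea is a check on the recent input distribution that selects which pre-built model to use. The regime names, the threshold and the two stand-in models below are all hypothetical.

```python
from statistics import mean

def choose_model(recent_inputs, models, threshold):
    """Pick a model based on a simple check of the recent input distribution."""
    regime = "high" if mean(recent_inputs) > threshold else "normal"
    return regime, models[regime]

# stand-ins for models developed separately for each known regime
models = {
    "normal": lambda x: 2.0 * x,
    "high": lambda x: 0.5 * x + 30.0,
}

regime, model = choose_model([21.0, 25.0, 23.0], models, threshold=20.0)
prediction = model(10.0)
```

A real system would likely use a more robust test than a mean against a threshold, but the structure of detect, then dispatch, would be the same.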

In some domains, such as time series problems, the data may be expected to change over time.

In these types of problems, it is common to prepare the data in such a way as to remove the systematic changes to the data over time, such as trends and seasonality by differencing.

This is so common that it is built into classical linear methods like the ARIMA model.
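Differencing itself is only a line of code. The sketch below removes a linear trend with a lag of 1 and a repeating seasonal pattern with a lag equal to the season length; the two toy series are assumptions for the illustration.

```python
def difference(series, interval=1):
    """Subtract the value `interval` steps back from each observation."""
    return [series[i] - series[i - interval] for i in range(interval, len(series))]

trend = [10 + 2 * t for t in range(6)]   # steadily increasing series
seasonal = [1, 5, 3] * 3                 # pattern that repeats every 3 steps

detrended = difference(trend)                       # lag-1 differencing
deseasonalized = difference(seasonal, interval=3)   # seasonal differencing
```

Predictions made on differenced data need the differencing inverted (adding back the earlier observation) to return to the original scale.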

Typically, we do not consider systematic change to the data as a problem of concept drift because it can be dealt with directly. Rather, these examples may be a useful way of thinking about your problem and may help you anticipate change and prepare data in a specific way using standardization, scaling, projections, and more to mitigate or at least reduce the effects of change to input variables in the future.

This section provides more resources on the topic if you are looking to go deeper.

- Learning in the Presence of Concept Drift and Hidden Contexts, 1996.
- The problem of concept drift: definitions and related work, 2004.
- Concept Drift Detection for Streaming Data, 2015.
- Learning under Concept Drift: an Overview, 2010.
- An overview of concept drift applications, 2016.
- What Is Concept Drift and How to Measure It?, 2010.
- Understanding Concept Drift, 2017.

- Concept drift on Wikipedia
- Handling Concept Drift: Importance, Challenges and Solutions, 2011 (slides).

In this post, you discovered the problem of concept drift in changing data for applied machine learning.

Specifically, you learned:

- The problem of data changing over time.
- What is concept drift and how it is defined.
- How to handle concept drift in your own predictive modeling problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Concept Drift in Machine Learning appeared first on Machine Learning Mastery.

]]>The post Stop Coding Machine Learning Algorithms From Scratch appeared first on Machine Learning Mastery.

]]>…

Stop.

Are you implementing a machine learning algorithm at the moment?

Why?

Implementing algorithms from scratch is one of the biggest mistakes I see beginners make.

In this post you will discover:

- The algorithm implementation trap that beginners fall into.
- The very real difficulty of engineering world-class implementations of machine learning algorithms.
- Why you should be using off-the-shelf implementations.

Let’s get started.

Here’s a snippet of an email I received:

… I am really struggling. Why do I have to implement algorithms from scratch?

It seems that a lot of developers get caught in this challenge.

They are told or infer that:

**Algorithms must be implemented before being used.**

Or that:

**You can only learn machine learning by implementing algorithms.**

Here are some similar questions I stumbled across:

- *Why is there a need to manually implement machine learning algorithms when there are many advanced APIs like tensorflow available?* (on Quora)
- *Is there any value implementing machine learning algorithms by yourself or should you use libraries?* (on Quora)
- *Is it useful to implement machine learning algorithms?* (on Quora)
- *Which programming language should I use to implement Machine Learning algorithms?* (on Quora)
- *Why do you and other people sometimes implement machine learning algorithms from scratch?* (on GitHub)

You don’t have to implement machine learning algorithms from scratch.

This is a part of the bottom-up approach traditionally used to teach machine learning.

- Learn Math.
- Learn Theory.
- Implement Algorithm From Scratch.
- *??? (magic happens here)*.
- Apply Machine Learning.

It is a lot easier to apply machine learning algorithms to a problem and get a result than it is to implement them from scratch.

**A Lot Easier!**

Learning how to use an algorithm rather than implement an algorithm is not only easier, it is a more valuable skill. A skill that you can start using to make a real impact very quickly.

There’s a lot of low-hanging fruit that you can pick with applied machine learning.

…is Really Hard!

Algorithms that you use to solve business problems need to be **fast** and **correct**.

The more sophisticated nonlinear methods require a lot more data than their linear counterparts.

This means they need to do a lot of work, which may take a long time.

Algorithms need to be fast to process all of this data, especially at scale.

This may require a re-interpretation of the linear algebra that underlies the method in such a way that best suits a specific matrix operation in an underlying library.

It may require specialized knowledge of caching to make the most of your hardware.

These are not ad hoc tricks that come together after you get a “*hello world*” implementation working. These are engineering challenges that encompass the algorithm implementation project.

Machine learning algorithms will give you a result, even when their implementation is crippled.

You get a number. An output. A prediction.

Sometimes the prediction is correct and sometimes it is not.

Machine learning algorithms use randomness. They are stochastic algorithms.

This is not just a matter of unit tests, it is a matter of having a deep understanding of the technique and devising cases to prove the implementation is as expected and edge cases are handled.

You may be an excellent engineer.

But your “*hello world*” implementation of an algorithm will probably not cut it when compared to an off-the-shelf implementation.

Your implementation will probably be based on a textbook description, meaning it will be naive and slow. And you may or may not have the expertise to devise tests to ensure the correctness of your implementation.

Off-the-shelf implementations in open source libraries are built for speed and/or robustness.

**How could you not use a standard machine learning library?**

They may be tailored to a very narrow problem type intended to be as fast as possible. They may also be intended for general purpose use, ensuring they operate correctly on a wide range of problems, beyond those you have considered.

Not all algorithm implementations you download off the Internet are created equal.

The code snippet from GitHub may be a grad student’s “*hello world*” implementation, or it may be the highly optimized implementation contributed to by the entire research team at a large organization.

You need to evaluate the source of the code you are using. Some sources are better or more reliable than others.

General-purpose libraries are often more robust at the cost of some speed.

Lightning-fast implementations by hacker-engineers often suffer from poor documentation and are highly pedantic when it comes to their expectations.

Consider this when you pick your implementation.

When asked, I typically recommend one of three platforms:

- **Weka**. A graphical user interface that does not require any code. Perfect if you want to focus on the machine learning first and learning how to work through problems.
- **Python**. The ecosystem including pandas and scikit-learn. Excellent for stitching together a solution to a machine learning problem in development that is robust enough to also be deployed into operations.
- **R**. The more advanced platform that, although it has an esoteric language and sometimes buggy packages, offers access to state-of-the-art methods written directly by academics. Great for one-off projects and R&D.

These are just my recommendations, there are many more machine learning platforms to choose from.

You do not have to implement machine learning algorithms when getting started in machine learning.

But you can.

And there can be very good reasons for doing so.

For example, here are 3 big reasons:

- You want to implement to learn how the algorithm works.
- There is no available implementation of the algorithm you need.
- There is no suitable (fast enough, etc.) implementation of the algorithm you need.

The first is my favorite. It’s the one that may have confused you.

You can implement machine learning algorithms to learn how they work. I recommend it. It’s very efficient for developers to learn this way.

But.

You do not have to **start** by implementing machine learning algorithms. You will build your confidence and skill in machine learning a lot faster by learning how to use machine learning algorithms before implementing them.

The implementation and any research required to complete the implementation would then be an improvement on your understanding. An addition that would help you to get better results the next time you used that algorithm.

In this post, you discovered that beginners fall into the trap of implementing machine learning algorithms from scratch.

**They are told that it’s the only way.**

You discovered that engineering fast and robust implementations of machine learning algorithms is a tough challenge.

You learned that it is much easier and more desirable to learn how to use machine learning algorithms before implementing them. You also learned that implementing algorithms is a great way to learn more about how they work and get more from them, but only after you know how to use them.

**Have you been caught in this trap?**

*Share your experiences in the comments.*

- 5 Mistakes Programmers Make when Starting in Machine Learning
- Understand Machine Learning Algorithms By Implementing Them From Scratch
- Benefits of Implementing Machine Learning Algorithms From Scratch

The post Stop Coding Machine Learning Algorithms From Scratch appeared first on Machine Learning Mastery.

]]>The post Embrace Randomness in Machine Learning appeared first on Machine Learning Mastery.

]]>Applied machine learning is a tapestry of breakthroughs and mindset shifts.

Understanding the role of randomness in machine learning algorithms is one of those breakthroughs.

Once you get it, you will see things differently. In a whole new light. Things like choosing between one algorithm and another, hyperparameter tuning and reporting results.

You will also start to see the abuses everywhere. The criminally unsupported performance claims.

In this post, I want to gently open your eyes to the role of random numbers in machine learning. I want to give you the tools to embrace this uncertainty. To give you a breakthrough.

Let’s dive in.

(*special thanks to Xu Zhang and Nil Fero who promoted this post*)

A lot of people ask this question or variants of this question.

**You are not alone!**

I get an email along these lines once per week.

Here are some similar questions posted to Q&A sites:

- Why do I get different results each time I run my algorithm?
- Cross-Validation gives different result on the same data
- Randomness in Artificial Intelligence & Machine Learning
- Why are the weights different in each running after convergence?
- Does the same neural network with the same learning data and same test data in two computers give different results?

Machine learning algorithms make use of randomness.

Trained with different data, machine learning algorithms will construct different models. It depends on the algorithm. How different a model is with different data is called the model variance (as in the bias-variance trade off).

So, the data itself is a source of randomness. Randomness in the collection of the data.

The order that the observations are exposed to the model affects internal decisions.

Some algorithms are especially susceptible to this, like neural networks.

It is good practice to randomly shuffle the training data before each training iteration. Even if your algorithm is not susceptible. It’s a best practice.

Algorithms harness randomness.

An algorithm may be initialized to a random state. Such as the initial weights in an artificial neural network.

Votes that end in a draw (and other internal decisions) during training in a deterministic method may rely on randomness to resolve.

We may have too much data to reasonably work with.

In which case, we may work with a random subsample to train the model.

We sample when we evaluate an algorithm.

We use techniques like splitting the data into a random training and test set or use k-fold cross validation that makes k random splits of the data.
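To show where the randomness enters, here is a minimal sketch of k-fold cross validation splits built from a shuffled list of row indices. The function name and the use of a seeded `random.Random` instance are assumptions for the illustration.

```python
import random

def k_fold_splits(n_rows, k, seed):
    """Return k (train_indices, test_indices) pairs from a random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)                         # the random part of the split
    folds = [indices[i::k] for i in range(k)]    # deal indices into k folds
    splits = []
    for i, test in enumerate(folds):
        # train on every fold except the one held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_splits(n_rows=20, k=5, seed=1)
```

A different seed gives different folds, and therefore a slightly different estimate of model performance.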

The result is an estimate of the performance of the model (and process used to create it) on unseen data.

There’s no doubt, randomness plays a big part in applied machine learning.

**The randomness that we can control, should be controlled.**

Run an algorithm on a dataset and get a model.

Can you get the same model again given the same data?

You should be able to. It should be a requirement that is high on the list for your modeling project.

We achieve reproducibility in applied machine learning by using the exact same **code**, **data** and **sequence of random numbers**.

Random numbers are generated in software using a pseudorandom number generator. It’s a simple math function that generates a sequence of numbers that are random enough for most applications.

This math function is deterministic. If it uses the same starting point called a seed number, it will give the same sequence of random numbers.

**Problem solved.**

**Mostly.**

We can get reproducible results by fixing the random number generator’s seed before each model we construct.
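A small sketch makes the point. The `noisy_fit` function below is a hypothetical stand-in for any learning algorithm that uses random numbers internally; here the "model" is just a list of randomly initialized weights.

```python
import random

def noisy_fit(seed):
    """Stand-in for a stochastic learning algorithm seeded before it runs."""
    rng = random.Random(seed)  # fix the seed before constructing the model
    return [rng.gauss(0.0, 1.0) for _ in range(3)]

model_a = noisy_fit(seed=7)
model_b = noisy_fit(seed=7)   # same seed: same sequence, same model
model_c = noisy_fit(seed=8)   # different seed: different model
```

Using a dedicated generator object per run, as here, avoids surprises from other code that draws from a shared global generator.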

In fact, this is a best practice.

We should be doing this if not already.

In fact, we should be giving the same sequence of random numbers to each algorithm we compare and each technique we try.

It should be a default part of each experiment we run.

If a machine learning algorithm gives a different model with a different sequence of random numbers, then which model do we pick?

Ouch. There’s the rub.

I get asked this question from time to time and I love it.

It’s a sign that someone really gets to the meat of all this applied machine learning stuff – or is about to.

- Different runs of an algorithm with…
- Different random numbers give…
- Different models with…
- Different performance characteristics…

But the differences are within a range.

A fancy name for this difference or random behavior within a range is stochastic.

Machine learning algorithms are stochastic in practice.

- Expect them to be stochastic.
- Expect there to be a range of models to choose from and not a single model.
- Expect the performance to be a range and not a single value.

**These are very real expectations that you MUST address in practice.**

What tactics can you think of to address these expectations?

Thankfully, academics have been struggling with this challenge for a long time.

There are 2 simple strategies that you can use:

- Reduce the Uncertainty.
- Report the Uncertainty.

If we get different models essentially every time we run an algorithm, what can we do?

How about we try running the algorithm many times and gather a population of performance measures.

We already do this if we use *k*-fold cross validation. We build *k* different models.

We can increase *k* and build even more models, as long as the data within each fold remains representative of the problem.

We can also repeat our evaluation process *n* times to get even more numbers in our population of performance measures.

**This tactic is called random repeats or random restarts.**

It is more prevalent with stochastic optimization and neural networks, but is just as relevant generally. Try it.
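The sketch below shows the shape of a repeated evaluation. The `evaluate_once` function is a hypothetical stand-in for "fit and score a model with this seed"; it simply returns a noisy accuracy so that the example stays self-contained.

```python
import random

def evaluate_once(seed):
    """Hypothetical stand-in for fitting and scoring a model with a given seed."""
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)  # a noisy accuracy around 0.8

def repeated_evaluation(n_repeats):
    """Repeat the evaluation with a different seed each time."""
    return [evaluate_once(seed) for seed in range(n_repeats)]

scores = repeated_evaluation(30)
```

The result is a population of scores rather than a single number.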

Never report the performance of your machine learning algorithm with a single number.

If you do, you’ve most likely made an error.

You have gathered a population of performance measures. Use statistics on this population.

**This tactic is called report summary statistics.**

The distribution of results is most likely a Gaussian, so a great start would be to report the mean and standard deviation of performance. Include the highest and lowest performance observed.
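Summarizing a population of scores takes only a few lines with the standard library. The scores below are made-up numbers for the illustration.

```python
from statistics import mean, stdev

# hypothetical accuracy scores from 10 repeated evaluations
scores = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.80, 0.81, 0.79, 0.82]

summary = {
    "mean": mean(scores),
    "std": stdev(scores),   # sample standard deviation
    "min": min(scores),
    "max": max(scores),
}

print(
    f"accuracy: {summary['mean']:.3f} (+/- {summary['std']:.3f}), "
    f"min {summary['min']:.2f}, max {summary['max']:.2f}"
)
```

If the scores do not look Gaussian, the median and percentiles may be a better summary than the mean and standard deviation.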

In fact, this is a best practice.

You can then compare populations of result measures when you’re performing model selection. Such as:

- Choosing between algorithms.
- Choosing between configurations for one algorithm.

You can see that this has important implications on the processes you follow. Such as: to select which algorithm to use on your problem and for tuning and choosing algorithm hyperparameters.

Lean on statistical significance tests. Statistical tests can determine whether one population of result measures is significantly different from a second population of results.

Report the significance as well.

This too is a best practice, one that sadly is not widely adopted.

The final model is the one prepared on the entire training dataset, once we have chosen an algorithm and configuration.

It’s the model we intend to use to make predictions or deploy into operations.

We also get a different final model with different sequences of random numbers.

I’ve had some students ask:

Should I create many final models and select the one with the best accuracy on a hold out validation dataset?

“*No*” I replied.

This would be a fragile process, highly dependent on the quality of the held out validation dataset. You are selecting random numbers that optimize for a small sample of data.

**Sounds like a recipe for overfitting.**

In general, I would rely on the confidence gained from the above tactics on reducing and reporting uncertainty. Often I just take the first model; it’s just as good as any other.

Sometimes your application domain makes you care more.

In this situation, I would tell you to build an ensemble of models, each trained with a different random number seed.

Use a simple voting ensemble. Each model makes a prediction and the mean of all predictions is reported as the final prediction.

Make the ensemble as big as you need to. I think 10, 30 or 100 are nice round numbers.

Maybe keep adding new models until the predictions become stable. For example, continue until the variance of the predictions tightens up on some holdout set.
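Here is a small sketch of the averaging idea. The `make_model` function is a hypothetical stand-in for training a final model with a given seed; each "model" is just a function with a small seed-dependent offset, which mimics the variation between training runs.

```python
import random

def make_model(seed):
    """Hypothetical stand-in for training one final model with a given seed.

    Each "model" predicts y = 2x plus a small seed-dependent offset.
    """
    offset = random.Random(seed).gauss(0.0, 0.1)
    return lambda x: 2.0 * x + offset

def ensemble_predict(models, x):
    """Report the mean of all member predictions as the final prediction."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

# Build an ensemble of 30 models, each with a different random seed.
models = [make_model(seed) for seed in range(30)]
print(ensemble_predict(models, 1.5))
```

The averaged prediction varies less from run to run than the prediction of any single member, which is the point of building the ensemble.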

In this post, you discovered why random numbers are integral to applied machine learning. You can’t really escape them.

You learned about tactics that you can use to ensure that your results are reproducible.

You learned about techniques that you can use to embrace the stochastic nature of machine learning algorithms when selecting models and reporting results.

For more information on the importance of reproducible results in machine learning and techniques that you can use, see the post:

Do you have any questions about random numbers in machine learning or about this post?

Ask your question in the comments and I will do my best to answer.

The post Embrace Randomness in Machine Learning appeared first on Machine Learning Mastery.

]]>The post Machine Learning Algorithms Mini-Course appeared first on Machine Learning Mastery.

]]>You have to understand how they work to make any progress in the field.

In this post you will discover a 14-part machine learning algorithms mini course that you can follow to finally understand machine learning algorithms.

We are going to cover a lot of ground in this course and you are going to have a great time.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

- This course is for beginners curious about machine learning algorithms.
- This course does not assume you know how to write code.
- This course does not assume a background in mathematics.
- This course does not assume a background in machine learning theory.

This mini-course will take you on a guided tour of machine learning algorithms from foundations and through 10 top techniques.

We will visit each algorithm to give you a sense of how it works, but not go into too much depth to keep things moving.

Let’s take a look at what we’re going to cover over the next 14 lessons.

You may need to come back to this post again and again, so you may want to bookmark it.

This mini-course is broken down into four parts: Algorithm Foundations, Linear Algorithms, Nonlinear Algorithms and Ensemble Algorithms.

- **Lesson 1**: How To Talk About Data in Machine Learning
- **Lesson 2**: Principle That Underpins All Algorithms
- **Lesson 3**: Parametric and Nonparametric Algorithms
- **Lesson 4**: Bias, Variance and the Trade-off

- **Lesson 5**: Linear Regression
- **Lesson 6**: Logistic Regression
- **Lesson 7**: Linear Discriminant Analysis

- **Lesson 8**: Classification and Regression Trees
- **Lesson 9**: Naive Bayes
- **Lesson 10**: k-Nearest Neighbors
- **Lesson 11**: Learning Vector Quantization
- **Lesson 12**: Support Vector Machines

- **Lesson 13**: Bagging and Random Forest
- **Lesson 14**: Boosting and AdaBoost

I've created a handy mind map of 60+ algorithms organized by type.

Download it, print it and use it.

Also get exclusive access to the machine learning algorithms email mini-course.

Data plays a big part in machine learning.

It is important to understand and use the right terminology when talking about data.

How do you think about data? Think of a spreadsheet. You have columns, rows, and cells.

The statistical perspective of machine learning frames data in the context of a hypothetical function (f) that the machine learning algorithm aims to learn. Given some input variables (Input), the function answers the question of what the predicted output variable (Output) is.

Output = f(Input)

The inputs and outputs can be referred to as variables or vectors.

The computer science perspective uses a row of data to describe an entity (like a person) or an observation about an entity. As such, the columns for a row are often referred to as attributes of the observation and the rows themselves are called instances.

There is a common principle that underlies all supervised machine learning algorithms for predictive modeling.

Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y).

Y = f(X)

This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or its form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.

The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.

What is a parametric machine learning algorithm and how is it different from a nonparametric machine learning algorithm?

Assumptions can greatly simplify the learning process, but can also limit what can be learned. Algorithms that simplify the function to a known form are called parametric machine learning algorithms.

The algorithms involve two steps:

- Select a form for the function.
- Learn the coefficients for the function from the training data.

Some examples of parametric machine learning algorithms are Linear Regression and Logistic Regression.

Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

Nonparametric methods are often more flexible and can achieve better accuracy, but require a lot more data and training time.

Examples of nonparametric algorithms include Support Vector Machines, Neural Networks and Decision Trees.

Machine learning algorithms can best be understood through the lens of the bias-variance trade-off.

Bias is the set of simplifying assumptions made by a model to make the target function easier to learn.

Generally parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm’s bias.

Decision trees are an example of a low bias algorithm, whereas linear regression is an example of a high-bias algorithm.

Variance is the amount that the estimate of the target function will change if different training data were used. The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance, not zero variance.

The k-Nearest Neighbors algorithm is an example of a high-variance algorithm, whereas Linear Discriminant Analysis is an example of a low variance algorithm.

The goal of any predictive modeling machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance. The parameterization of machine learning algorithms is often a battle to balance out bias and variance.

- Increasing the bias will decrease the variance.
- Increasing the variance will decrease the bias.

Linear regression is perhaps one of the most well known and well understood algorithms in statistics and machine learning.

Isn’t it a technique from statistics?

Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.

The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).

For example:

y = B0 + B1 * x

We will predict y given the input x and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.

Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
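As a minimal sketch, here is the ordinary least squares solution for a single input variable in plain Python. The function name and the tiny dataset are assumptions for illustration; the data was generated from y = 1 + 2 * x so you can check the result by eye.

```python
def fit_simple_linear_regression(x, y):
    """Estimate B0 and B1 for y = B0 + B1 * x using ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # B1 is the covariance of x and y divided by the variance of x.
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    b1 = covariance / variance
    # B0 makes the line pass through the point (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # generated from y = 1 + 2 * x
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)

# Predict y for a new input x.
print(b0 + b1 * 6)
```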

Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible.

It is a fast and simple technique and a good first algorithm to try.

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).

Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable.

Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.

The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 0, otherwise output 1) and predict a class value.
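A small sketch of the logistic function and the snapping rule follows. The coefficients passed to `predict_class` are hypothetical (not learned from data), and the sketch assumes the usual convention that an output of 0.5 or higher maps to class 1.

```python
import math

def logistic(value):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-value))

def predict_class(b0, b1, x, threshold=0.5):
    """Predict class 0 or 1 for one input, given coefficients b0 and b1."""
    probability = logistic(b0 + b1 * x)
    return 1 if probability >= threshold else 0

# Hypothetical coefficients, chosen only to show the behavior.
print(logistic(0.0))                  # the midpoint of the S curve
print(predict_class(-4.0, 1.0, 6.0))  # large input, lands on the upper side
print(predict_class(-4.0, 1.0, 2.0))  # small input, lands on the lower side
```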

Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful on problems where you need to give more rationale for a prediction.

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other.

It’s a fast model to learn and effective on binary classification problems.

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:

- The mean value for each class.
- The variance calculated across all classes.

Predictions are made by calculating a discriminant value for each class and making a prediction for the class with the largest value.
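Here is a sketch of LDA for a single input variable. The text above does not spell out the discriminant function, so this assumes the standard textbook form: x * mean / variance - mean^2 / (2 * variance) + ln(prior). The function names and data are made up for illustration.

```python
import math

def fit_lda(x, labels):
    """Estimate per-class means and priors plus one variance shared by all classes."""
    classes = sorted(set(labels))
    n = len(x)
    means, priors = {}, {}
    squared_error = 0.0
    for k in classes:
        values = [xi for xi, label in zip(x, labels) if label == k]
        means[k] = sum(values) / len(values)
        priors[k] = len(values) / n
        squared_error += sum((v - means[k]) ** 2 for v in values)
    # Variance pooled across all classes.
    variance = squared_error / (n - len(classes))
    return means, priors, variance

def predict_lda(model, value):
    """Return the class with the largest discriminant value."""
    means, priors, variance = model
    scores = {
        k: value * means[k] / variance
           - means[k] ** 2 / (2 * variance)
           + math.log(priors[k])
        for k in means
    }
    return max(scores, key=scores.get)

x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
model = fit_lda(x, labels)
print(predict_lda(model, 2.5), predict_lda(model, 7.5))
```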

The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand.

It’s a simple and powerful method for classification predictive modeling problems.

Decision Trees are an important type of algorithm for predictive modeling machine learning.

The representation for the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.
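To make the walk concrete, here is a sketch that stores a tiny hand-built tree as nested dictionaries. The structure, key names and split values are assumptions for illustration; a real tree would be learned from training data.

```python
# Internal nodes hold an input variable index and a split point.
# Leaf nodes hold the class value to output.
tree = {
    "index": 0, "split": 5.0,
    "left": {"leaf": 0},
    "right": {
        "index": 1, "split": 2.5,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
}

def predict(node, row):
    """Walk the splits until a leaf is reached, then return its class value."""
    while "leaf" not in node:
        if row[node["index"]] < node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]

print(predict(tree, [3.0, 9.0]))
print(predict(tree, [6.0, 3.0]))
```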

Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.

Decision trees have a high variance and can yield more accurate predictions when used in an ensemble, a topic we will cover in Lesson 13 and Lesson 14.

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.

The model is comprised of two types of probabilities that can be calculated directly from your training data:

- The probability of each class.
- The conditional probability of each x value given each class.

Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem.

When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.
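Below is a minimal sketch of Naive Bayes for real-valued inputs, assuming a Gaussian distribution for each input variable within each class. The function names and the small dataset are made up for illustration.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density of x for a given mean and standard deviation."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

def fit_naive_bayes(rows, labels):
    """Store the class probability and per-variable mean and stdev for each class."""
    model = {}
    for k in set(labels):
        members = [row for row, label in zip(rows, labels) if label == k]
        columns = list(zip(*members))
        model[k] = {
            "prior": len(members) / len(rows),
            "stats": [(mean(col), stdev(col)) for col in columns],
        }
    return model

def predict_naive_bayes(model, row):
    """Multiply the class probability by each variable's density (the naive step)."""
    scores = {}
    for k, params in model.items():
        score = params["prior"]
        for value, (mu, sigma) in zip(row, params["stats"]):
            score *= gaussian_pdf(value, mu, sigma)
        scores[k] = score
    return max(scores, key=scores.get)

rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [7.0, 8.0], [7.5, 8.5], [8.0, 7.8]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(model, [1.2, 2.1]), predict_naive_bayes(model, [7.8, 8.1]))
```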

The KNN algorithm is very simple and very effective.

The model representation for KNN is the entire training dataset. Simple right?

Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.

The trick is in how to determine similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
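Here is a short sketch of KNN classification using Euclidean distance. The function names, the choice of k=3 and the dataset are assumptions for illustration.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two rows."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_rows, train_labels, new_row, k=3):
    """Find the k most similar training instances and return their most common class."""
    distances = sorted(
        (euclidean_distance(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

train_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [7.0, 8.0], [7.5, 8.5], [8.0, 7.8]]
train_labels = [0, 0, 0, 1, 1, 1]
print(euclidean_distance([0, 0], [3, 4]))
print(knn_classify(train_rows, train_labels, [1.2, 2.1]))
```

Note that there is no training step at all: the "model" is simply `train_rows` and `train_labels`.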

KNN can require a lot of memory or space to store all of the data, but only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.

The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.

The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm.

Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction.

Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.

If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.

A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.

In two-dimensions you can visualize this as a line and let’s assume that all of our input points can be completely separated by this line.

The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.

The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin.
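To make the margin concrete, here is a sketch that measures it for a given hyperplane. It assumes the standard point-to-hyperplane distance formula, which the text above does not state, and the coefficients and points are hypothetical rather than learned by an SVM.

```python
import math

def distance_to_hyperplane(coefficients, intercept, point):
    """Perpendicular distance from a point to the hyperplane w . x + b = 0."""
    weighted_sum = intercept + sum(w * x for w, x in zip(coefficients, point))
    norm = math.sqrt(sum(w ** 2 for w in coefficients))
    return abs(weighted_sum) / norm

def margin(coefficients, intercept, points):
    """The margin is the distance to the closest data point."""
    return min(distance_to_hyperplane(coefficients, intercept, p) for p in points)

# Hypothetical hyperplane 3*x1 + 4*x2 + 0 = 0 and two data points.
coefficients, intercept = [3.0, 4.0], 0.0
points = [[3.0, 4.0], [6.0, 8.0]]
print(margin(coefficients, intercept, points))
```

An SVM searches over the coefficients to make this value as large as possible; the sketch only evaluates one candidate.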

Only these points are relevant in defining the hyperplane and in the construction of the classifier.

These points are called the support vectors. They support or define the hyperplane.

In practice, an optimization algorithm is used to find the values for the coefficients that maximize the margin.

SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.

The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of samples of your data, calculate the mean of each, then average all of your mean values to give you a better estimation of the true mean value.
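A minimal sketch of the bootstrap estimate of a mean follows. It assumes each sample is drawn with replacement and is the same size as the original data, which is the usual setup; the function name, sample count and data are illustrative.

```python
import random

def bootstrap_mean(data, n_samples=1000, seed=1):
    """Average the means of many samples drawn with replacement from data."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(n_samples):
        # Draw a sample of the same size as the data, with replacement.
        sample = [rng.choice(data) for _ in data]
        sample_means.append(sum(sample) / len(sample))
    return sum(sample_means) / len(sample_means)

data = [2.0, 4.0, 6.0, 8.0, 10.0]
estimate = bootstrap_mean(data)
print(estimate)
```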

In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.

Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.

Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.

The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.

If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy to predict instances are given less weight.

Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence.

After all the trees are built, predictions are made for new data, and each tree’s prediction is weighted by how accurate the tree was on the training data.

Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.

You made it. Well done! Take a moment and look back at how far you have come:

- You discovered how to talk about data in machine learning and about the underlying principles of all predictive modeling algorithms.
- You discovered the difference between parametric and nonparametric algorithms and the difference between error introduced by bias and variance.
- You discovered three linear machine learning algorithms: Linear Regression, Logistic Regression and Linear Discriminant Analysis.
- You were introduced to 5 nonlinear algorithms: Classification and Regression Trees, Naive Bayes, K-Nearest Neighbors, Learning Vector Quantization and Support Vector Machines.
- Finally, you discovered two of the most popular ensemble algorithms: Bagging with Decision Trees and Boosting with AdaBoost.

Don’t make light of this, you have come a long way in a short amount of time. This is just the beginning of your journey with machine learning algorithms. Keep practicing and developing your skills.

Did you enjoy this mini-course?

Do you have any questions or sticking points?

Leave a comment and let me know.

The post Machine Learning Algorithms Mini-Course appeared first on Machine Learning Mastery.

]]>The post 6 Questions To Understand Any Machine Learning Algorithm appeared first on Machine Learning Mastery.

]]>You have to choose the level of detail at which you study machine learning algorithms. There is a sweet spot if you are a developer interested in applied predictive modeling.

This post describes that sweet spot and gives you a template that you can use to quickly understand any machine learning algorithm.

Let’s get started.

What do you need to know about a machine learning algorithm to be able to use it well on a classification or prediction problem?

I won’t argue that the more that you know about how and why a particular algorithm works, the better you can wield it. But I do believe that there is a point of diminishing returns where you can stop, use what you know to be effective and dive deeper into the theory and research on an algorithm if and only if you need to know more in order to get better results.

Let’s take a look at the 6 questions that will reveal how a machine learning algorithm works and how to best use it.

There are 6 questions that you can ask to get to the heart of any machine learning algorithm:

- How do you refer to the technique (*e.g. what name*)?
- How do you represent a learned model (*e.g. what coefficients*)?
- How do you learn a model (*e.g. the optimization process from data to the representation*)?
- How do you make predictions from a learned model (*e.g. apply the model to new data*)?
- How do you best prepare your data for modeling with the technique (*e.g. assumptions*)?
- How do you get more information on the technique (*e.g. where to look*)?

You will note that I have phrased all of these questions as How-To. I did this intentionally to separate the practical concerns of how from the more theoretical concerns of why. I think knowing why a technique works is less important than knowing how it works, if you are looking to use it as a tool to get results. More on this in the next section.

Let’s take a closer look at each of these questions in turn.

This is obvious but important. You need to know the canonical name of the technique.

You need to be able to recognize the classical name or the name of the method from other fields as well and know that it is the same thing. This also includes the acronym for the algorithm, because sometimes they are less than intuitive.

This will help you sort out the base algorithm from extensions and the family tree of where the algorithm fits and relates to similar algorithms.

I really like this nuts and bolts question.

This question is often overlooked in textbooks and papers and is perhaps the first question an engineer has when thinking about how a model will actually be used and deployed.

The representation is the numbers and data structure that captures the distinct details learned from data by the learning algorithm to be used by the prediction algorithm. It’s the stuff you save to disk or the database when you finalize your model. It’s the stuff you update when new training data becomes available.

Let’s make this concrete with an example. In the case of linear regression, the representation is the vector of regression coefficients. That’s it. In the case of a decision tree it is the tree itself including the nodes, how they are connected and the variables and cut-off thresholds chosen.

Given some training data, the algorithm needs to create the model or fill in the model representation. This question is about exactly how that occurs.

Often learning involves estimating parameters from the training data directly in simpler algorithms.

In most other algorithms it involves using the training data as part of a cost or loss function and an optimization algorithm to minimize the function. Simpler linear techniques may use linear algebra to achieve this result, whereas others may use a numerical optimization.

Often the way a machine learning algorithm learns a model is synonymous with the algorithm itself. This is the challenging and often time consuming part of running a machine learning algorithm.

The learning algorithm may be parameterized and it is often a good idea to list common ranges for parameter values or configuration heuristics that may be used as a starting point.

Once a model is learned, it is intended to be used to make predictions on new data. Note, we are exclusively talking about predictive modeling machine learning algorithms for classification and regression problems.

This is often the fast and even trivial part of using a machine learning algorithm. Often it is so trivial that it is not even mentioned or discussed in the literature.

It may be trivial because prediction may be as simple as filling in the inputs in an equation and calculating a prediction, or traversing a decision tree to see what leaf-node lights up. In other algorithms, like k-nearest neighbors the prediction algorithm may be the main show (k-NN has no training algorithm other than “store the whole training set”).

Machine learning algorithms make assumptions.

Even the most relaxed non-parametric methods make assumptions about your training data. It is good or even critical to review these assumptions. Even better is to translate these assumptions into specific data preparation operations that you can perform.

This question flushes out transforms that you could use on your data before modeling, or at the very least gives you pause to think about data transforms to try. What I mean by this is that it is best to treat algorithm requirements and assumptions as suggestions of things to try to get the most out of your model rather than hard and fast rules that your data must adhere to.

Just like you cannot know which algorithm will be best for your data beforehand, you cannot know the best transforms to apply to your data to get the most from an algorithm. Real data is messy and it is a good idea to try a number of presentations of your data with a number of different algorithms to see what warrants deeper investigation. The requirements and assumptions of machine learning algorithms help to point out presentations of your data to try.

Some algorithms will bubble up as generally better than others on your data problem.

When they do, you need to know where to look to get a deeper understanding of the technique. This can help with further customizing the algorithm for your data and with tuning the parameters of the learning and prediction algorithms.

It is a good idea to collect and list resources that you can reference if and when you need to dive deeper. This may include:

- Journal Articles
- Conference Papers
- Books including textbooks and monographs
- Webpages

I also think it is a good idea to know of more practical references like example tutorials and open source implementations that you can look inside to get a more concrete idea of what is going on.

For more on researching machine learning algorithms, see the post How to Research a Machine Learning Algorithm.

In this post you discovered 6 questions that you can ask of a machine learning algorithm that, if answered, will give you a very good and practical idea of how it works and how you can use it effectively.

These questions were focused on machine learning algorithms for predictive modeling problems like classification and regression.

These questions, phrased simply are:

- What are the common names of the algorithm?
- What representation is used by the model?
- How does the algorithm learn from training data?
- How can you make predictions from the model on new data?
- How can you best prepare your data for the algorithm?
- Where can you look for more information about the algorithm?

For another post along this theme of defining an algorithm description template see How to Learn a Machine Learning Algorithm.

Do you like this approach? Let me know in the comments.

The post 6 Questions To Understand Any Machine Learning Algorithm appeared first on Machine Learning Mastery.

]]>The post Boosting and AdaBoost for Machine Learning appeared first on Machine Learning Mastery.

]]>In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know:

- What the boosting ensemble method is and generally how it works.
- How to learn to boost decision trees using the AdaBoost algorithm.
- How to make predictions using the learned AdaBoost model.
- How to best prepare your data for use with the AdaBoost algorithm.

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.

Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is best used to boost the performance of decision trees on binary classification problems.

AdaBoost was originally called AdaBoost.M1 by the authors of the technique Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem.

The most suited and therefore most common algorithm used with AdaBoost is the decision tree with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.

Each instance in the training dataset is weighted. The initial weight is set to:

weight(xi) = 1/n

Where xi is the i’th training instance and n is the number of training instances.

A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.

The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:

error = (N - correct) / N

Where error is the misclassification rate, correct is the number of training instances predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly, the error or misclassification rate would be (100-78)/100 or 0.22.

This is modified to use the weighting of the training instances:

error = sum(w(i) * terror(i)) / sum(w)

This is the weighted misclassification rate, where w(i) is the weight for training instance i and terror(i) is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.

For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. If the predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1 and 0. The misclassification rate would be calculated as:

error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)

or

error = 0.704
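The weighted error calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `weighted_error` is made up for this example.

```python
def weighted_error(weights, predictions, targets):
    """Weighted misclassification rate: sum(w(i) * terror(i)) / sum(w)."""
    # terror is 1 for a misclassified instance and 0 for a correct one.
    terrors = [0 if p == y else 1 for p, y in zip(predictions, targets)]
    return sum(w * t for w, t in zip(weights, terrors)) / sum(weights)


# The three-instance example from the text.
error = weighted_error([0.01, 0.5, 0.2], [-1, -1, -1], [-1, 1, -1])
print(round(error, 3))  # 0.704
```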

A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1-error) / error)

Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.
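As a rough sketch, the stage value is a one-line function. The name `stage_value` is hypothetical, and the function assumes the error is strictly between 0 and 1, because the formula is undefined at either extreme.

```python
import math


def stage_value(error):
    """stage = ln((1 - error) / error), for 0 < error < 1."""
    return math.log((1.0 - error) / error)


# A model that is right half the time gets no say in the final prediction,
# and lower error gives a larger stage value.
for error in (0.5, 0.22, 0.1):
    print(error, round(stage_value(error), 3))
```

Note that an error above 0.5 produces a negative stage value, which is one reason weak learners are expected to do at least slightly better than chance.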

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.

For example, the weight of one training instance (w) is updated using:

w = w * exp(stage * terror)

Where w is the weight for a specific training instance, exp() is the numerical constant e or Euler’s number raised to a power, stage is the stage value for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

terror = 0 if(y == p), otherwise 1

Where y is the output variable for the training instance and p is the prediction from the weak learner.

This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.
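The update rule can be sketched as follows. The function name `update_weights` and the four-instance toy numbers are assumptions made for illustration. The weights are not re-normalized here, which is fine for this description because the weighted error formula divides by the sum of the weights.

```python
import math


def update_weights(weights, predictions, targets, stage):
    """Apply w = w * exp(stage * terror) to every training instance."""
    updated = []
    for w, p, y in zip(weights, predictions, targets):
        terror = 0 if p == y else 1  # 1 only when misclassified
        updated.append(w * math.exp(stage * terror))
    return updated


# Toy case: 4 equally weighted instances, the second one is misclassified,
# so the weighted error is 0.25 and the stage value is ln(0.75 / 0.25).
stage = math.log((1.0 - 0.25) / 0.25)
updated = update_weights([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1], stage)
print([round(w, 3) for w in updated])  # [0.25, 0.75, 0.25, 0.25]
```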

Weak models are added sequentially, trained using the weighted training data.

The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.

Once completed, you are left with a pool of weak learners each with a stage value.
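Putting the training steps together, a minimal sketch of the whole loop might look like the code below. Several details are assumptions made for the example and are not prescribed by the description above: the stump is a simple threshold rule on one variable, training stops early if a stump is no better than chance, and the names `fit_stump`, `fit_adaboost` and `adaboost_predict` and the toy dataset are made up.

```python
import math


def fit_stump(X, y, w):
    """Pick the one-variable threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1.0, -1.0):
                preds = [sign if row[j] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi) / sum(w)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best  # (weighted error, variable index, threshold, sign)


def stump_predict(stump, row):
    _, j, t, sign = stump
    return sign if row[j] <= t else -sign


def fit_adaboost(X, y, n_models=10):
    w = [1.0 / len(X)] * len(X)  # weight(xi) = 1/n
    ensemble = []
    for _ in range(n_models):
        stump = fit_stump(X, y, w)
        error = stump[0]
        if error >= 0.5:
            break  # assumption: stop when the stump is no better than chance
        # Guard against division by zero for a perfect stump.
        stage = math.log((1.0 - error) / max(error, 1e-10))
        ensemble.append((stage, stump))
        if error == 0.0:
            break  # training set predicted perfectly
        # w = w * exp(stage * terror)
        w = [wi * math.exp(stage * (0 if stump_predict(stump, row) == yi else 1))
             for wi, row, yi in zip(w, X, y)]
    return ensemble


def adaboost_predict(ensemble, row):
    total = sum(stage * stump_predict(stump, row) for stage, stump in ensemble)
    return 1.0 if total > 0 else -1.0


# Toy data that no single stump can separate.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
ensemble = fit_adaboost(X, y, n_models=3)
print([adaboost_predict(ensemble, row) for row in X])
```

On this toy dataset the best single stump gets 2 of the 6 instances wrong, while the weighted combination of three stumps reproduces all of the training labels.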

Predictions are made by calculating the weighted average of the weak classifiers.

For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learner's stage value. The prediction for the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, then the first class is predicted; if negative, the second class is predicted.

For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.
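The five-classifier example above can be checked with a short sketch. The function name `ensemble_predict` is hypothetical, and treating a sum of exactly zero as the second class is an assumption, since the description does not say how ties are handled.

```python
def ensemble_predict(predictions, stages):
    """Return the class (+1.0 or -1.0) and the stage-weighted sum."""
    total = sum(s * p for s, p in zip(stages, predictions))
    # Assumption: a sum of exactly zero falls to the second class.
    return (1.0 if total > 0 else -1.0), total


predictions = [1.0, 1.0, -1.0, 1.0, -1.0]
stages = [0.2, 0.5, 0.8, 0.2, 0.9]
label, total = ensemble_predict(predictions, stages)
print(round(total, 1), label)  # -0.8 -1.0
```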

This section lists some heuristics for best preparing your data for AdaBoost.

- **Quality Data**: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality.
- **Outliers**: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
- **Noisy Data**: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.

Below are some machine learning texts that describe AdaBoost from a machine learning perspective.

- An Introduction to Statistical Learning: with Applications in R, page 321
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
- Applied Predictive Modeling, pages 203 and 389

Below are some seminal and good overview research articles on the method that may be useful if you are looking to dive deeper into the theoretical underpinnings of the method:

- A decision-theoretic generalization of on-line learning and an application to boosting, 1995
- Improved Boosting Algorithms Using Confidence-rated Predictions, 1999
- Explaining Adaboost, Chapter from Empirical Inference, 2013
- A Short Introduction to Boosting, 1999

In this post you discovered the Boosting ensemble method for machine learning. You learned about:

- Boosting and how it is a general technique that keeps adding weak learners to correct classification errors.
- AdaBoost as the first successful boosting algorithm for binary classification problems.
- Learning the AdaBoost model by weighting training instances and the weak learners themselves.
- Predicting with AdaBoost by weighting predictions from weak learners.
- Where to look for more theoretical background on the AdaBoost algorithm.

If you have any questions about this post or about the Boosting or AdaBoost algorithms, ask in the comments and I will do my best to answer.

The post Boosting and AdaBoost for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Bagging and Random Forest Ensemble Algorithms for Machine Learning appeared first on Machine Learning Mastery.

]]>In this post you will discover the Bagging ensemble algorithm and the Random Forest algorithm for predictive modeling. After reading this post you will know about:

- The bootstrap method for estimating statistical quantities from samples.
- The Bootstrap Aggregation algorithm for creating multiple different models from a single training dataset.
- The Random Forest algorithm that makes a small tweak to Bagging and results in a very powerful classifier.

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.

If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

Before we get to Bagging, let’s take a quick look at an important foundation technique called the bootstrap.

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.

We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:

- Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
- Calculate the mean of each sub-sample.
- Calculate the average of all of our collected means and use that as our estimated mean for the data.

For example, let’s say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these we could take the estimated mean of the data to be 3.367.

This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients.
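The bootstrap procedure for the mean can be sketched with the Python standard library. The function name `bootstrap_means`, the seeds and the synthetic sample of 100 values are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=1):
    """Mean of each of n_resamples sub-samples drawn with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # The same value can be selected more than once.
        resample = rng.choices(sample, k=len(sample))
        means.append(statistics.mean(resample))
    return means


# A made-up sample of 100 values.
data_rng = random.Random(7)
sample = [data_rng.gauss(50.0, 10.0) for _ in range(100)]

means = bootstrap_means(sample)
estimate = statistics.mean(means)  # average of the collected means
spread = statistics.stdev(means)   # how much the mean varies across resamples
print(round(estimate, 2), round(spread, 2))
```

The spread of the collected means is also useful, as it gives a rough sense of how uncertain the estimated mean is.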

Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method.

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of those algorithms that have high variance. Decision trees, like classification and regression trees (CART), are an example of a high-variance algorithm.

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

- Create many (e.g. 100) random sub-samples of our dataset with replacement.
- Train a CART model on each sample.
- Given a new dataset, calculate the average prediction across all of the models.

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
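The bagging procedure can be sketched as below. To keep the example short and self-contained, the base learner is a 1-nearest-neighbor stand-in rather than CART; any function that takes training rows and labels and returns a predictor could be passed in its place. All function names and the toy data are made up for this illustration.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Most frequent class among the sub-model predictions."""
    return Counter(predictions).most_common(1)[0][0]


def fit_1nn(rows, labels):
    """Stand-in base learner (1-nearest neighbor), used here instead of CART."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in rows]
        return labels[dists.index(min(dists))]
    return predict


def bagging_fit(rows, labels, fit, n_models=100, seed=1):
    """Train one model per bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def bagging_predict(models, x):
    return majority_vote([model(x) for model in models])


# The five-tree example from the text.
print(majority_vote(["blue", "blue", "red", "blue", "red"]))  # blue

# A made-up two-class dataset.
rows = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
labels = ["red", "red", "red", "blue", "blue", "blue"]
models = bagging_fit(rows, labels, fit_1nn, n_models=25)
print(bagging_predict(models, [1.0, 0.9]))
```

For a regression problem, the majority vote would be replaced by the mean of the sub-model predictions.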

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

The only parameter when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement (e.g. on a cross validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data.

Just like the decision trees themselves, Bagging can be used for classification and regression problems.

Random Forests are an improvement over bagged decision trees.

A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions.

Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated.

Random forest changes the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search.

The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross validation.

- For classification a good default is: m = sqrt(p)
- For regression a good default is: m = p/3

Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then:

- m = sqrt(25)
- m = 5
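As a small sketch, the defaults and the random draw of features at a split point could look like this. The function names are hypothetical, and rounding down to a whole number of features is an assumption.

```python
import math
import random


def default_m(p, task):
    """Default number of features to search at each split point."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))  # m = sqrt(p), rounded down
    return max(1, p // 3)                 # m = p/3, rounded down


def candidate_features(p, m, rng):
    """Indices of the m features a single split point is allowed to search."""
    return rng.sample(range(p), m)


m = default_m(25, "classification")
print(m)  # 5

rng = random.Random(1)
features = candidate_features(25, m, rng)
print(sorted(features))  # a fresh random subset is drawn for every split
```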

For each bootstrap sample taken from the training data, there will be samples left behind that were not included. These samples are called Out-Of-Bag samples or OOB.

The performance of each model on its left out samples when averaged can provide an estimated accuracy of the bagged models. This estimated performance is often called the OOB estimate of performance.

These performance measures are a reliable estimate of test error and correlate well with cross validation estimates.
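A minimal sketch of the out-of-bag idea is shown below. The function names are made up, and each model is assumed to be a callable that maps an input row to a predicted class. Models whose bootstrap sample happened to leave nothing out are skipped.

```python
import random


def bootstrap_with_oob(n, rng):
    """Return the in-bag indices and the indices that were left out."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def oob_estimate(models, oob_sets, rows, labels):
    """Average each model's accuracy on its own out-of-bag samples."""
    scores = []
    for model, oob in zip(models, oob_sets):
        if not oob:
            continue  # nothing was left out for this model
        correct = sum(1 for i in oob if model(rows[i]) == labels[i])
        scores.append(correct / len(oob))
    return sum(scores) / len(scores)


rng = random.Random(1)
in_bag, oob = bootstrap_with_oob(1000, rng)
print(len(oob) / 1000)  # typically a little over a third of the data
```

For large datasets roughly 37% of the instances are left out of any given bootstrap sample, so every model has a reasonably sized set of unseen data to be scored on.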

As the Bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point.

In regression problems this may be the drop in sum squared error and in classification this might be the Gini score.

These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance.
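One way to sketch this averaging is shown below. It assumes each tree has already recorded a (variable, error drop) pair for every split it made. The function name, the variable names and the numbers are all made up for illustration.

```python
from collections import defaultdict


def variable_importance(splits_per_tree):
    """Average the total error drop per variable across all trees."""
    totals = defaultdict(float)
    for splits in splits_per_tree:
        for variable, drop in splits:
            totals[variable] += drop
    n_trees = len(splits_per_tree)
    return {variable: total / n_trees for variable, total in totals.items()}


# Hypothetical error drops recorded while building two trees.
splits_per_tree = [
    [("age", 0.30), ("income", 0.10), ("age", 0.05)],
    [("income", 0.20), ("age", 0.15)],
]
importance = variable_importance(splits_per_tree)
for variable, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(variable, round(score, 2))
```

In this made-up case age ends up with a larger average drop than income, so it would be ranked as the more important variable.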

These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest possible feature selection experiments you could perform where some features are removed from the dataset.

Bagging is a simple technique that is covered in most introductory machine learning texts. Some examples are listed below.

- An Introduction to Statistical Learning: with Applications in R, Chapter 8.
- Applied Predictive Modeling, Chapter 8 and Chapter 14.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 15

In this post you discovered the Bagging ensemble machine learning algorithm and the popular variation called Random Forest. You learned:

- How to estimate statistical quantities from a data sample.
- How to combine the predictions from multiple high-variance models using bagging.
- How to tweak the construction of decision trees when bagging to de-correlate their predictions, a technique called Random Forests.

Do you have any questions about this post or the Bagging or Random Forest Ensemble algorithms?

Leave a comment and ask your question and I will do my best to answer it.

The post Bagging and Random Forest Ensemble Algorithms for Machine Learning appeared first on Machine Learning Mastery.

]]>