The post A Gentle Introduction to Concept Drift in Machine Learning appeared first on Machine Learning Mastery.

This problem of the changing underlying relationships in the data is called concept drift in the field of machine learning.

In this post, you will discover the problem of concept drift and ways you may be able to address it in your own predictive modeling problems.

After completing this post, you will know:

- The problem of data changing over time.
- What is concept drift and how it is defined.
- How to handle concept drift in your own predictive modeling problems.

Let’s get started.

This post is divided into 3 parts; they are:

- Changes to Data Over Time
- What is Concept Drift?
- How to Address Concept Drift

Predictive modeling is the problem of learning a model from historical data and using the model to make predictions on new data where we do not know the answer.

Technically, predictive modeling is the problem of approximating a mapping function (f) given input data (X) to predict an output value (y).

y = f(X)

Often, this mapping is assumed to be static, meaning that the mapping learned from historical data is just as valid in the future on new data and that the relationships between input and output data do not change.

This is true for many problems, but not all problems.

In some cases, the relationships between input and output data can change over time, meaning that in turn there are changes to the unknown underlying mapping function.

The changes may be consequential: predictions made by a model trained on older historical data may no longer be correct, or may be less accurate than those of a model trained on more recent historical data.

These changes may be detectable, and if detected, it may be possible to update the learned model to reflect them.

… many data mining methods assume that discovered patterns are static. However, in practice patterns in the database evolve over time. This poses two important challenges. The first challenge is to detect when concept drift occurs. The second challenge is to keep the patterns up-to-date without inducing the patterns from scratch.

— Page 10, Data Mining and Knowledge Discovery Handbook, 2010.

Concept drift in machine learning and data mining refers to the change in the relationships between input and output data in the underlying problem over time.

In other domains, this change may be called “*covariate shift*,” “*dataset shift*,” or “*nonstationarity*.”

In most challenging data analysis applications, data evolve over time and must be analyzed in near real time. Patterns and relations in such data often evolve over time, thus, models built for analyzing such data quickly become obsolete over time. In machine learning and data mining this phenomenon is referred to as concept drift.

— An overview of concept drift applications, 2016.

A concept in “*concept drift*” refers to the unknown and hidden relationship between inputs and output variables.

For example, one concept in weather data may be the season that is not explicitly specified in temperature data, but may influence temperature data. Another example may be customer purchasing behavior over time that may be influenced by the strength of the economy, where the strength of the economy is not explicitly specified in the data. These elements are also called a “hidden context”.

A difficult problem with learning in many real-world domains is that the concept of interest may depend on some hidden context, not given explicitly in the form of predictive features. A typical example is weather prediction rules that may vary radically with the season. […] Often the cause of change is hidden, not known a priori, making the learning task more complicated.

— The problem of concept drift: definitions and related work, 2004.

The change to the data could take any form. It is conceptually easier to consider the case where there is some temporal consistency to the change such that data collected within a specific time period show the same relationship and that this relationship changes smoothly over time.

Note that this is not always the case and this assumption should be challenged. Some other types of changes may include:

- A gradual change over time.
- A recurring or cyclical change.
- A sudden or abrupt change.

Different concept drift detection and handling schemes may be required for each situation. Often, recurring change and long-term trends are considered systematic and can be explicitly identified and handled.

Concept drift may be present in supervised learning problems where predictions are made and data is collected over time. These are traditionally called online learning problems, given the change expected in the data over time.

There are domains where predictions are ordered by time, such as time series forecasting and predictions on streaming data where the problem of concept drift is more likely and should be explicitly tested for and addressed.

A common challenge when mining data streams is that the data streams are not always strictly stationary, i.e., the concept of data (underlying distribution of incoming data) unpredictably drifts over time. This has encouraged the need to detect these concept drifts in the data streams in a timely manner

— Concept Drift Detection for Streaming Data, 2015.

Indre Zliobaite in the 2010 paper titled “Learning under Concept Drift: An Overview” provides a framework for thinking about concept drift and the decisions required by the machine learning practitioner, as follows:

- **Future assumption**: a designer needs to make an assumption about the future data source.
- **Change type**: a designer needs to identify possible change patterns.
- **Learner adaptivity**: based on the change type and the future assumption, a designer chooses the mechanisms which make the learner adaptive.
- **Model selection**: a designer needs a criterion to choose a particular parametrization of the selected learner at every time step (e.g. the weights for ensemble members, the window size for variable window method).

This framework may help in thinking about the decision points available to you when addressing concept drift on your own predictive modeling problems.

There are many ways to address concept drift; let’s take a look at a few.

The most common way is to not handle it at all and assume that the data does not change.

This allows you to develop a single “best” model once and use it on all future data.

This should be your starting point and baseline for comparison to other methods. If you believe your dataset may suffer concept drift, you can use a static model in two ways:

- **Concept Drift Detection**. Monitor skill of the static model over time and if skill drops, perhaps concept drift is occurring and some intervention is required.
- **Baseline Performance**. Use the skill of the static model as a baseline to compare to any intervention you make.
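As a rough illustration of the monitoring idea, the sketch below tracks accuracy over a sliding window of recent predictions and flags the first window that falls well below a baseline. The function names, window size, and tolerance are assumptions made for this example; they do not come from any particular library.

```python
from collections import deque


def rolling_accuracy(y_true, y_pred, window=50):
    """Return the accuracy of each full window of the most recent predictions."""
    recent = deque(maxlen=window)
    accuracies = []
    for truth, pred in zip(y_true, y_pred):
        recent.append(1 if truth == pred else 0)
        if len(recent) == window:
            accuracies.append(sum(recent) / window)
    return accuracies


def first_drift_alarm(accuracies, baseline, tolerance=0.1):
    """Index of the first window whose accuracy is more than `tolerance`
    below `baseline`, or None if no window drops that far."""
    for i, acc in enumerate(accuracies):
        if acc < baseline - tolerance:
            return i
    return None


# Made-up stream: the static model is right for 60 steps, then always wrong.
y_true = [1] * 100
y_pred = [1] * 60 + [0] * 40
accs = rolling_accuracy(y_true, y_pred, window=20)
alarm = first_drift_alarm(accs, baseline=1.0, tolerance=0.25)
```

An alarm only suggests that something changed; whether it is concept drift or a data quality problem still needs investigation.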

A good first-level intervention is to periodically update your static model with more recent historical data.

For example, perhaps you can update the model each month or each year with the data collected from the prior period.

This may also involve back-testing the model in order to select a suitable amount of historical data to include when re-fitting the static model.

In some cases, it may be appropriate to only include a small portion of the most recent historical data to best capture the new relationships between inputs and outputs (e.g. the use of a sliding window).

Some machine learning models can be updated.

This is an efficiency over the previous approach (periodically re-fit) where instead of discarding the static model completely, the existing state is used as the starting point for a fit process that updates the model fit using a sample of the most recent historical data.

For example, this approach is suitable for most machine learning algorithms that use weights or coefficients such as regression algorithms and neural networks.
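To make the idea concrete, here is a small sketch that continues fitting a one-variable linear model from its existing coefficients using stochastic gradient descent on recent data. The starting coefficients, learning rate, and data are all assumptions chosen for illustration, and the update rule is plain squared-error SGD rather than any specific library's method.

```python
def update_linear_model(w, b, xs, ys, lr=0.1, epochs=50):
    """Continue fitting y = w*x + b, starting from the given coefficients."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            w -= lr * error * x  # gradient step for the slope
            b -= lr * error      # gradient step for the intercept
    return w, b


# Existing model learned on older data (assumed): y = 2x.
w_old, b_old = 2.0, 0.0

# Recent data (made up) follows y = 3x instead.
recent_x = [i / 10 for i in range(1, 11)]
recent_y = [3.0 * x for x in recent_x]

w_new, b_new = update_linear_model(w_old, b_old, recent_x, recent_y)
```

The updated slope moves from the old value toward the one implied by the recent data, without refitting from scratch.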

Some algorithms allow you to weigh the importance of input data.

In this case, you can use a weighting that is inversely proportional to the age of the data such that more attention is paid to the most recent data (higher weight) and less attention is paid to the least recent data (smaller weight).
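The weighting scheme can be sketched as follows. This example assumes the observations are ordered oldest to newest, gives the newest observation an age of 1, and uses a deliberately simple model (a line through the origin) so the effect of the weights is easy to see. The function names are hypothetical.

```python
def age_weights(n):
    """Weights inversely proportional to age; the newest observation has age 1."""
    return [1.0 / age for age in range(n, 0, -1)]


def weighted_slope(xs, ys, weights):
    """Weighted least-squares slope for the model y = slope * x."""
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs))
    return num / den


# Made-up data: older points follow y = 2x, newer points follow y = 3x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 9.0, 12.0]

unweighted = weighted_slope(xs, ys, [1.0] * len(xs))
weighted = weighted_slope(xs, ys, age_weights(len(xs)))
```

With these numbers the weighted slope sits closer to the newer relationship than the unweighted one does. Other decay schemes, such as exponential decay, are possible.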

An ensemble approach can be used where the static model is left untouched, but a new model learns to correct the predictions from the static model based on the relationships in more recent data.

This may be thought of as a boosting type ensemble (in spirit only) where subsequent models correct the predictions from prior models. The key difference here is that subsequent models are fit on different and more recent data, as opposed to a weighted form of the same dataset, as in the case of AdaBoost and gradient boosting.

For some problem domains it may be possible to design systems to detect changes and choose a specific and different model to make predictions.

This may be appropriate for domains that expect abrupt changes that may have occurred in the past and can be checked for in the future. It also assumes that it is possible to develop skillful models to handle each of the detectable changes to the data.

For example, the abrupt change may be a specific observation or observations in a range, or the change in the distribution of one or more input variables.

In some domains, such as time series problems, the data may be expected to change over time.

In these types of problems, it is common to prepare the data in such a way as to remove the systematic changes to the data over time, such as trends and seasonality by differencing.

This is so common that it is built into classical linear methods like the ARIMA model.
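As a small illustration of differencing, the sketch below subtracts from each observation the value `lag` steps earlier. The function name and the toy series are assumptions for this example.

```python
def difference(series, lag=1):
    """Return series[i] - series[i - lag] for every i >= lag."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# A linear trend becomes a constant after first-order differencing.
trend = [3 * t + 5 for t in range(6)]
detrended = difference(trend)

# A repeating pattern of period 3 is removed by differencing at lag 3.
seasonal = [10, 20, 30, 10, 20, 30]
deseasonalized = difference(seasonal, lag=3)
```

Predictions made on the differenced scale need to be inverted (by adding back the earlier observation) to return to the original scale.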

Typically, we do not consider systematic change to the data as a problem of concept drift because it can be dealt with directly. Rather, these examples may be a useful way of thinking about your problem and may help you anticipate change and prepare data in a specific way using standardization, scaling, projections, and more to mitigate or at least reduce the effects of change to input variables in the future.

This section provides more resources on the topic if you are looking to go deeper.

- Learning in the Presence of Concept Drift and Hidden Contexts, 1996.
- The problem of concept drift: definitions and related work, 2004.
- Concept Drift Detection for Streaming Data, 2015.
- Learning under Concept Drift: an Overview, 2010.
- An overview of concept drift applications, 2016.
- What Is Concept Drift and How to Measure It?, 2010.
- Understanding Concept Drift, 2017.

- Concept drift on Wikipedia
- Handling Concept Drift: Importance, Challenges and Solutions, 2011 (slides).

In this post, you discovered the problem of concept drift in changing data for applied machine learning.

Specifically, you learned:

- The problem of data changing over time.
- What is concept drift and how it is defined.
- How to handle concept drift in your own predictive modeling problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Stop Coding Machine Learning Algorithms From Scratch appeared first on Machine Learning Mastery.

…

Stop.

Are you implementing a machine learning algorithm at the moment?

Why?

Implementing algorithms from scratch is one of the biggest mistakes I see beginners make.

In this post you will discover:

- The algorithm implementation trap that beginners fall into.
- The very real difficulty of engineering world-class implementations of machine learning algorithms.
- Why you should be using off-the-shelf implementations.

Let’s get started.

Here’s a snippet of an email I received:

… I am really struggling. Why do I have to implement algorithms from scratch?

It seems that a lot of developers get caught in this challenge.

They are told, or infer, that:

**Algorithms must be implemented
before being used.**

Or that:

**You can only learn machine learning by
implementing algorithms.**

Here are some similar questions I stumbled across:

- *Why is there a need to manually implement machine learning algorithms when there are many advanced APIs like* tensorflow *available?* (on Quora)
- *Is there any value implementing machine learning algorithms by yourself or should you use libraries?* (on Quora)
- *Is it useful to implement machine learning algorithms?* (on Quora)
- *Which programming language should I use to implement Machine Learning algorithms?* (on Quora)
- *Why do you and other people sometimes implement machine learning algorithms from scratch?* (on GitHub)

You don’t have to implement machine learning algorithms from scratch.

This is a part of the bottom-up approach traditionally used to teach machine learning.

- Learn Math.
- Learn Theory.
- Implement Algorithm From Scratch.
- *??? (magic happens here)*.
- Apply Machine Learning.

It is a lot easier to apply machine learning algorithms to a problem and get a result than it is to implement them from scratch.

**A Lot Easier!**

Learning how to use an algorithm rather than implement an algorithm is not only easier, it is a more valuable skill. A skill that you can start using to make a real impact very quickly.

There’s a lot of low-hanging fruit that you can pick with applied machine learning.

…is Really Hard!

Algorithms that you use to solve business problems need to be **fast** and **correct**.

The more sophisticated nonlinear methods require a lot more data than their linear counterparts.

This means they need to do a lot of work, which may take a long time.

Algorithms need to be fast to process through all of this data. Especially, at scale.

This may require a re-interpretation of the linear algebra that underlies the method in such a way that best suits a specific matrix operation in an underlying library.

It may require specialized knowledge of caching to make the most of your hardware.

These are not ad hoc tricks that come together after you get a “*hello world*” implementation working. These are engineering challenges that encompass the algorithm implementation project.

Machine learning algorithms will give you a result, even when their implementation is crippled.

You get a number. An output. A prediction.

Sometimes the prediction is correct and sometimes it is not.

Machine learning algorithms use randomness. They are stochastic algorithms.

Verifying correctness is therefore not just a matter of unit tests; it requires a deep understanding of the technique and devising cases to prove the implementation behaves as expected and edge cases are handled.

You may be an excellent engineer.

But your “*hello world*” implementation of an algorithm will probably not cut it when compared to an off-the-shelf implementation.

Your implementation will probably be based on a textbook description, meaning it will be naive and slow. And you may or may not have the expertise to devise tests to ensure the correctness of your implementation.

Off-the-shelf implementations in open source libraries are built for speed and/or robustness.

**How could you not use a standard machine learning library?**

They may be tailored to a very narrow problem type intended to be as fast as possible. They may also be intended for general purpose use, ensuring they operate correctly on a wide range of problems, beyond those you have considered.

Not all algorithm implementations you download off the Internet are created equal.

The code snippet from GitHub may be a grad student’s “*hello world*” implementation, or it may be the highly optimized implementation contributed to by the entire research team at a large organization.

You need to evaluate the source of the code you are using. Some sources are better or more reliable than others.

General-purpose libraries are often more robust at the cost of some speed.

Lightning-fast implementations by hacker-engineers often suffer from poor documentation and are highly pedantic when it comes to their expectations.

Consider this when you pick your implementation.

When asked, I typically recommend one of three platforms:

- **Weka**. A graphical user interface that does not require any code. Perfect if you want to focus on the machine learning first and learning how to work through problems.
- **Python**. The ecosystem including pandas and scikit-learn. Excellent for stitching together a solution to a machine learning problem in development that is robust enough to also be deployed into operations.
- **R**. The more advanced platform that, although it has an esoteric language and sometimes buggy packages, offers access to state-of-the-art methods written directly by academics. Great for one-off projects and R&D.

These are just my recommendations; there are many more machine learning platforms to choose from.

You do not have to implement machine learning algorithms when getting started in machine learning.

But you can.

And there can be very good reasons for doing so.

For example here are 3 big reasons:

- You want to implement to learn how the algorithm works.
- There is no available implementation of the algorithm you need.
- There is no suitable (fast enough, etc.) implementation of the algorithm you need.

The first is my favorite. It’s the one that may have confused you.

You can implement machine learning algorithms to learn how they work. I recommend it. It’s very efficient for developers to learn this way.

But.

You do not have to **start** by implementing machine learning algorithms. You will build your confidence and skill in machine learning a lot faster by learning how to use machine learning algorithms before implementing them.

The implementation and any research required to complete the implementation would then be an improvement on your understanding. An addition that would help you to get better results the next time you used that algorithm.

In this post, you discovered that beginners fall into the trap of implementing machine learning algorithms from scratch.

**They are told that it’s the only way.**

You discovered that engineering fast and robust implementations of machine learning algorithms is a tough challenge.

You learned that it is much easier and more desirable to learn how to use machine learning algorithms before implementing them. You also learned that implementing algorithms is a great way to learn more about how they work and get more from them, but only after you know how to use them.

**Have you been caught in this trap?**

*Share your experiences in the comments.*

- 5 Mistakes Programmers Make when Starting in Machine Learning
- Understand Machine Learning Algorithms By Implementing Them From Scratch
- Benefits of Implementing Machine Learning Algorithms From Scratch

The post Embrace Randomness in Machine Learning appeared first on Machine Learning Mastery.

Applied machine learning is a tapestry of breakthroughs and mindset shifts.

Understanding the role of randomness in machine learning algorithms is one of those breakthroughs.

Once you get it, you will see things differently. In a whole new light. Things like choosing between one algorithm and another, hyperparameter tuning and reporting results.

You will also start to see the abuses everywhere. The criminally unsupported performance claims.

In this post, I want to gently open your eyes to the role of random numbers in machine learning. I want to give you the tools to embrace this uncertainty. To give you a breakthrough.

Let’s dive in.

(*special thanks to Xu Zhang and Nil Fero who promoted this post*)

A lot of people ask this question or variants of this question.

**You are not alone!**

I get an email along these lines once per week.

Here are some similar questions posted to Q&A sites:

- Why do I get different results each time I run my algorithm?
- Cross-Validation gives different result on the same data
- Randomness in Artificial Intelligence & Machine Learning
- Why are the weights different in each running after convergence?
- Does the same neural network with the same learning data and same test data in two computers give different results?

Machine learning algorithms make use of randomness.

Trained with different data, machine learning algorithms will construct different models. How much they differ depends on the algorithm. How different a model is with different data is called the model variance (as in the bias-variance trade-off).

So, the data itself is a source of randomness. Randomness in the collection of the data.

The order that the observations are exposed to the model affects internal decisions.

Some algorithms are especially susceptible to this, like neural networks.

It is good practice to randomly shuffle the training data before each training iteration. Even if your algorithm is not susceptible. It’s a best practice.

Algorithms harness randomness.

An algorithm may be initialized to a random state. Such as the initial weights in an artificial neural network.

Votes that end in a draw (and other internal decisions) during training in a deterministic method may rely on randomness to resolve.

We may have too much data to reasonably work with.

In which case, we may work with a random subsample to train the model.

We sample when we evaluate an algorithm.

We use techniques like splitting the data into a random training and test set or use k-fold cross validation that makes k random splits of the data.

The result is an estimate of the performance of the model (and process used to create it) on unseen data.

There’s no doubt, randomness plays a big part in applied machine learning.

**The randomness that we can control, should be controlled.**

Run an algorithm on a dataset and get a model.

Can you get the same model again given the same data?

You should be able to. It should be a requirement that is high on the list for your modeling project.

We achieve reproducibility in applied machine learning by using the exact same **code**, **data** and **sequence of random numbers**.

Random numbers are generated in software using a pseudorandom number generator. It’s a simple math function that generates a sequence of numbers that are random enough for most applications.

This math function is deterministic. If it uses the same starting point called a seed number, it will give the same sequence of random numbers.
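The determinism of a seeded generator is easy to check. The sketch below uses Python's standard `random` module; the helper name `random_sequence` and the seed values are arbitrary choices for illustration.

```python
import random


def random_sequence(seed, n=5):
    """Draw n pseudorandom numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # a private generator, seeded explicitly
    return [rng.random() for _ in range(n)]


same_a = random_sequence(42)
same_b = random_sequence(42)   # same seed, so the same sequence as same_a
different = random_sequence(43)  # a different seed gives a different sequence
```

Using a dedicated `random.Random` instance, rather than the module-level functions, keeps the sequence from being disturbed by other code that also draws random numbers.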

**Problem solved.**

**Mostly.**

We can get reproducible results by fixing the random number generator’s seed before each model we construct.

In fact, this is a best practice.

We should be doing this if not already.

In fact, we should be giving the same sequence of random numbers to each algorithm we compare and each technique we try.

It should be a default part of each experiment we run.

If a machine learning algorithm gives a different model with a different sequence of random numbers, then which model do we pick?

Ouch. There’s the rub.

I get asked this question from time to time and I love it.

It’s a sign that someone really gets to the meat of all this applied machine learning stuff – or is about to.

- Different runs of an algorithm with…
- Different random numbers give…
- Different models with…
- Different performance characteristics…

But the differences are within a range.

A fancy name for this difference or random behavior within a range is stochastic.

Machine learning algorithms are stochastic in practice.

- Expect them to be stochastic.
- Expect there to be a range of models to choose from and not a single model.
- Expect the performance to be a range and not a single value.

**These are very real expectations that you MUST address in practice.**

What tactics can you think of to address these expectations?

Thankfully, academics have been struggling with this challenge for a long time.

There are 2 simple strategies that you can use:

- Reduce the Uncertainty.
- Report the Uncertainty.

If we get different models essentially every time we run an algorithm, what can we do?

How about we try running the algorithm many times and gather a population of performance measures.

We already do this if we use *k*-fold cross validation. We build *k* different models.

We can increase *k* and build even more models, as long as the data within each fold remains representative of the problem.

We can also repeat our evaluation process *n* times to get even more numbers in our population of performance measures.

**This tactic is called random repeats or random restarts.**

It is more prevalent with stochastic optimization and neural networks, but is just as relevant generally. Try it.
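One way to gather such a population of performance measures is repeated *k*-fold cross validation. The sketch below only generates the splits and collects one score per split; the `evaluate` callback, which would train a model on the training indices and score it on the test indices, is a placeholder you would supply. All names here are assumptions for illustration.

```python
import random


def repeated_kfold(n_samples, k, repeats, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test_idx = folds[i]
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train_idx, test_idx


def collect_scores(evaluate, n_samples, k=5, repeats=3, seed=0):
    """Call evaluate(train_idx, test_idx) on every split and return all scores."""
    return [evaluate(train, test)
            for train, test in repeated_kfold(n_samples, k, repeats, seed)]
```

With `k=5` and `repeats=3` this produces 15 scores instead of a single number, which is the population used in the reporting step.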

Never report the performance of your machine learning algorithm with a single number.

If you do, you’ve most likely made an error.

You have gathered a population of performance measures. Use statistics on this population.

**This tactic is called report summary statistics.**

The distribution of results is often approximately Gaussian, so a good start is to report the mean and standard deviation of performance. Include the highest and lowest performance observed.
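A short sketch of such a summary using Python's standard `statistics` module. The scores are made-up accuracy values standing in for the results of repeated evaluation runs, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarize(scores):
    """Summary statistics for a population of performance measures."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Made-up accuracy scores from five evaluation runs.
scores = [0.80, 0.82, 0.78, 0.81, 0.79]
summary = summarize(scores)
```

If the scores are clearly skewed, the median and percentiles may describe the population better than the mean and standard deviation.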

In fact, this is a best practice.

You can then compare populations of result measures when you’re performing model selection. Such as:

- Choosing between algorithms.
- Choosing between configurations for one algorithm.

You can see that this has important implications on the processes you follow. Such as: to select which algorithm to use on your problem and for tuning and choosing algorithm hyperparameters.

Lean on statistical significance tests. Statistical tests can determine whether one population of result measures is significantly different from a second population of results.

Report the significance as well.

This too is a best practice that, sadly, is not widely adopted.
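The post does not name a specific test. As one simple option that needs no external libraries, the sketch below runs an exact paired permutation (sign-flip) test. It assumes the two sets of scores are paired, for example produced on the same folds and seeds, and that there are few enough pairs to enumerate all sign combinations. The function name is hypothetical.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Two-sided p-value for the mean paired difference being zero,
    computed by enumerating every sign flip of the differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Made-up scores: algorithm A beats algorithm B by about 0.05 on every run.
scores_a = [0.80 + 0.01 * i for i in range(10)]
scores_b = [a - 0.05 for a in scores_a]
p_value = paired_permutation_test(scores_a, scores_b)
```

A small p-value suggests the difference is unlikely to be due to chance alone. For larger populations, a library implementation of a t-test or a rank-based test would be the more usual choice.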

The final model is the one prepared on the entire training dataset, once we have chosen an algorithm and configuration.

It’s the model we intend to use to make predictions or deploy into operations.

We also get a different final model with different sequences of random numbers.

I’ve had some students ask:

Should I create many final models and select the one with the best accuracy on a hold out validation dataset?

“*No*” I replied.

This would be a fragile process, highly dependent on the quality of the held out validation dataset. You are selecting random numbers that optimize for a small sample of data.

**Sounds like a recipe for overfitting.**

In general, I would rely on the confidence gained from the above tactics on reducing and reporting uncertainty. Often I just take the first model, it’s just as good as any other.

Sometimes your application domain makes you care more.

In this situation, I would tell you to build an ensemble of models, each trained with a different random number seed.

Use a simple voting ensemble. Each model makes a prediction and the mean of all predictions is reported as the final prediction.

Make the ensemble as big as you need to. I think 10, 30 or 100 are nice round numbers.

Maybe keep adding new models until the predictions become stable. For example, continue until the variance of the predictions tightens up on some holdout set.
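A toy sketch of the ensemble idea: the same learner is trained several times, each with a different random seed, and the predictions are averaged. The learner here is a deliberately simple stand-in (a single weight trained by a few passes of stochastic gradient descent with random initialization and random presentation order); the data, names, and settings are all assumptions for illustration.

```python
import random
from statistics import mean


def train_model(xs, ys, seed, epochs=3, lr=0.05):
    """Fit y = w*x with SGD; the seed controls the initial weight and shuffle order."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)      # random presentation order
        for i in order:
            w -= lr * (w * xs[i] - ys[i]) * xs[i]
    return w


def ensemble_predict(weights, x):
    """Average the predictions of all ensemble members."""
    return mean(w * x for w in weights)


# Made-up training data following y = 3x.
xs = [0.2, 0.4, 0.6, 0.8, 1.0]
ys = [3.0 * x for x in xs]

weights = [train_model(xs, ys, seed) for seed in range(10)]
prediction = ensemble_predict(weights, 2.0)
```

Each seed yields a slightly different model, and the averaged prediction is less sensitive to any one seed than a single model would be.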

In this post, you discovered why random numbers are integral to applied machine learning. You can’t really escape them.

You learned about tactics that you can use to ensure that your results are reproducible.

You learned about techniques that you can use to embrace the stochastic nature of machine learning algorithms when selecting models and reporting results.

For more information on the importance of reproducible results in machine learning and techniques that you can use, see the post:

Do you have any questions about random numbers in machine learning or about this post?

Ask your question in the comments and I will do my best to answer.

The post Machine Learning Algorithms Mini-Course appeared first on Machine Learning Mastery.

You have to understand how they work to make any progress in the field.

In this post you will discover a 14-part machine learning algorithms mini course that you can follow to finally understand machine learning algorithms.

We are going to cover a lot of ground in this course and you are going to have a great time. Let’s get started.

Before we get started, let’s make sure you are in the right place.

- This course is for beginners curious about machine learning algorithms.
- This course does not assume you know how to write code.
- This course does not assume a background in mathematics.
- This course does not assume a background in machine learning theory.

This mini-course will take you on a guided tour of machine learning algorithms from foundations and through 10 top techniques.

We will visit each algorithm to give you a sense of how it works, but not go into too much depth to keep things moving.

Let’s take a look at what we’re going to cover over the next 14 lessons.

You may need to come back to this post again and again, so you may want to bookmark it.

This mini-course is broken down into four parts: Algorithm Foundations, Linear Algorithms, Nonlinear Algorithms and Ensemble Algorithms.

- **Lesson 1**: How To Talk About Data in Machine Learning
- **Lesson 2**: Principle That Underpins All Algorithms
- **Lesson 3**: Parametric and Nonparametric Algorithms
- **Lesson 4**: Bias, Variance and the Trade-off

- **Lesson 5**: Linear Regression
- **Lesson 6**: Logistic Regression
- **Lesson 7**: Linear Discriminant Analysis

- **Lesson 8**: Classification and Regression Trees
- **Lesson 9**: Naive Bayes
- **Lesson 10**: k-Nearest Neighbors
- **Lesson 11**: Learning Vector Quantization
- **Lesson 12**: Support Vector Machines

- **Lesson 13**: Bagging and Random Forest
- **Lesson 14**: Boosting and AdaBoost

Data plays a big part in machine learning.

It is important to understand and use the right terminology when talking about data.

How do you think about data? Think of a spreadsheet. You have columns, rows, and cells.

The statistical perspective of machine learning frames data in the context of a hypothetical function (f) that the machine learning algorithm aims to learn. Given some input variables (Input), the function answers the question of what the predicted output variable (Output) is.

Output = f(Input)

The inputs and outputs can be referred to as variables or vectors.

The computer science perspective uses a row of data to describe an entity (like a person) or an observation about an entity. As such, the columns for a row are often referred to as attributes of the observation and the rows themselves are called instances.

There is a common principle that underlies all supervised machine learning algorithms for predictive modeling.

Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y).

Y = f(X)

This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or its form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.

The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.

What is a parametric machine learning algorithm and how is it different from a nonparametric machine learning algorithm?

Assumptions can greatly simplify the learning process, but can also limit what can be learned. Algorithms that simplify the function to a known form are called parametric machine learning algorithms.

The algorithms involve two steps:

- Select a form for the function.
- Learn the coefficients for the function from the training data.

Some examples of parametric machine learning algorithms are Linear Regression and Logistic Regression.

Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

Non-parametric methods are often more flexible, achieve better accuracy but require a lot more data and training time.

Examples of nonparametric algorithms include Support Vector Machines, Neural Networks and Decision Trees.

Machine learning algorithms can best be understood through the lens of the bias-variance trade-off.

Bias is the set of simplifying assumptions made by a model to make the target function easier to learn.

Generally parametric algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm’s bias.

Decision trees are an example of a low bias algorithm, whereas linear regression is an example of a high-bias algorithm.

Variance is the amount that the estimate of the target function will change if different training data was used. The target function is estimated from the training data by a machine learning algorithm, so we should expect the algorithm to have some variance, not zero variance.

The k-Nearest Neighbors algorithm is an example of a high-variance algorithm, whereas Linear Discriminant Analysis is an example of a low variance algorithm.

The goal of any predictive modeling machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance. The parameterization of machine learning algorithms is often a battle to balance out bias and variance.

- Increasing the bias will decrease the variance.
- Increasing the variance will decrease the bias.

Linear regression is perhaps one of the most well known and well understood algorithms in statistics and machine learning.

Isn’t it a technique from statistics?

Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.

The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).

For example:

y = B0 + B1 * x

We will predict y given the input x and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.

Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
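As a minimal sketch of the ordinary least squares approach for a single input variable, the coefficients B0 and B1 can be estimated directly from the means, variance and covariance of the data. The function names and the small dataset below are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Estimate B0 and B1 for y = B0 + B1 * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # B1 is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    b1 = covariance / variance
    # B0 makes the line pass through the point (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, x):
    """Predict y for a new x using the learned coefficients."""
    return b0 + b1 * x


# Hypothetical data that lies exactly on the line y = 1 + 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)
print(predict(b0, b1, 5.0))
```

With more than one input variable the same idea is usually expressed with linear algebra or solved with gradient descent, as mentioned above.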

Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible.

It is a fast and simple technique and good first algorithm to try.

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).

Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable.

Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.

The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 0, otherwise output 1) and predict a class value.

Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful on problems where you need to give more rationale for a prediction.
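The logistic function and the thresholding rule can be sketched in a few lines. This assumes a single input variable, the usual convention that values below 0.5 map to class 0, and hypothetical coefficient values that have not been learned from any real data.

```python
import math


def logistic(value):
    """Squash any real value into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_probability(b0, b1, x):
    """Probability of class 1 for one input variable x."""
    return logistic(b0 + b1 * x)


def predict_class(b0, b1, x, threshold=0.5):
    """Snap the probability to a crisp class value of 0 or 1."""
    return 1 if predict_probability(b0, b1, x) >= threshold else 0


# Hypothetical coefficients, chosen only to illustrate the calculation.
b0, b1 = -4.0, 2.0
for x in (1.0, 3.0):
    print(x, predict_probability(b0, b1, x), predict_class(b0, b1, x))
```

Keeping the probability and the class prediction as separate functions makes it easy to report the probability when more rationale for a prediction is needed.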

Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other.

It’s a fast model to learn and effective on binary classification problems.

Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.

The representation of LDA is pretty straight forward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:

- The mean value for each class.
- The variance calculated across all classes.

Predictions are made by calculating a discriminant value for each class and making a prediction for the class with the largest value.

The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand.

It’s a simple and powerful method for classification predictive modeling problems.

Decision Trees are an important type of algorithm for predictive modeling machine learning.

The representation for the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.
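To make the representation concrete, here is a small sketch of walking a binary tree to a leaf. The nested-dictionary layout, the variable names x1 and x2, and the split values are all assumptions made for illustration; a real library would store the tree differently.

```python
# A hypothetical, hand-built tree. Each internal node holds one input
# variable and a split point; each leaf holds the class value to output.
tree = {
    "variable": "x1",
    "split": 5.0,
    "left": {
        "variable": "x2",
        "split": 2.0,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
    "right": {"leaf": 1},
}


def predict(node, row):
    """Walk the splits until a leaf is reached and return its class value."""
    while "leaf" not in node:
        if row[node["variable"]] < node["split"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


print(predict(tree, {"x1": 3.0, "x2": 1.0}))
print(predict(tree, {"x1": 7.0, "x2": 0.0}))
```

Prediction only ever touches one path from the root to a leaf, which is why trees are very fast at making predictions.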

Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.

Decision trees have a high variance and can yield more accurate predictions when used in an ensemble, a topic we will cover in Lesson 13 and Lesson 14.

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.

The model is comprised of two types of probabilities that can be calculated directly from your training data:

- The probability of each class.
- The conditional probability of each x value given each class.

Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem.

When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data, nevertheless, the technique is very effective on a large range of complex problems.

The KNN algorithm is very simple and very effective.

The model representation for KNN is the entire training dataset. Simple right?

Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.

The trick is in how to determine similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
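A rough sketch of KNN classification with Euclidean distance follows. It assumes all attributes are numeric and on the same scale; the function names and the tiny dataset are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Distance computed from the differences between each input variable."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train_rows, train_labels, new_row, k=3):
    """Predict the most common class among the k most similar instances."""
    scored = [
        (euclidean_distance(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Sort by distance only, so labels never need to be comparable.
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
train_rows = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_classify(train_rows, train_labels, (1.5, 1.5)))
print(knn_classify(train_rows, train_labels, (6.5, 6.5)))
```

For regression, the same neighbor search could be reused and the final step replaced by the mean of the neighbors' output values.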

KNN can require a lot of memory or space to store all of the data, but only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.

The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset.

The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.

The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm.

Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction.

Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.

If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.

A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1.

In two-dimensions you can visualize this as a line and let’s assume that all of our input points can be completely separated by this line.

The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.

The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin.

Only these points are relevant in defining the hyperplane and in the construction of the classifier.

These points are called the support vectors. They support or define the hyperplane.

In practice, an optimization algorithm is used to find the values for the coefficients that maximize the margin.

SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.

The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of samples of your data, calculate the mean of each, then average all of your mean values to give you a better estimation of the true mean value.

In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees.

Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.

Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.

The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.

If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy to predict instances are given less weight.

Models are created sequentially one after the other, each updating the weights on the training instances that affect the learning performed by the next tree in the sequence.

After all the trees are built, predictions are made for new data, and each tree’s prediction is weighted by how accurate the tree was on the training data.

Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.

You made it. Well done! Take a moment and look back at how far you have come:

- You discovered how to talk about data in machine learning and about the underlying principles of all predictive modeling algorithms.
- You discovered the difference between parametric and nonparametric algorithms and the difference between error introduced by bias and variance.
- You discovered three linear machine learning algorithms: Linear Regression, Logistic Regression and Linear Discriminant Analysis.
- You were introduced to 5 nonlinear algorithms: Classification and Regression Trees, Naive Bayes, K-Nearest Neighbors, Learning Vector Quantization and Support Vector Machines.
- Finally, you discovered two of the most popular ensemble algorithms: Bagging with Decision Trees and Boosting with AdaBoost.

Don’t make light of this, you have come a long way in a short amount of time. This is just the beginning of your journey with machine learning algorithms. Keep practicing and developing your skills.

Did you enjoy this mini-course?

Do you have any questions or sticking points?

Leave a comment and let me know.

The post Machine Learning Algorithms Mini-Course appeared first on Machine Learning Mastery.

The post 6 Questions To Understand Any Machine Learning Algorithm appeared first on Machine Learning Mastery.

You have to choose the level of detail at which you study machine learning algorithms. There is a sweet spot if you are a developer interested in applied predictive modeling.

This post describes that sweet spot and gives you a template that you can use to quickly understand any machine learning algorithm.

Let’s get started.

What do you need to know about a machine learning algorithm to be able to use it well on a classification or prediction problem?

I won’t argue that the more that you know about how and why a particular algorithm works, the better you can wield it. But I do believe that there is a point of diminishing returns where you can stop, use what you know to be effective and dive deeper into the theory and research on an algorithm if and only if you need to know more in order to get better results.

Let’s take a look at the 6 questions that will reveal how a machine learning algorithm works and how to best use it.

There are 6 questions that you can ask to get to the heart of any machine learning algorithm:

- How do you refer to the technique (*e.g. what name*)?
- How do you represent a learned model (*e.g. what coefficients*)?
- How do you learn a model (*e.g. the optimization process from data to the representation*)?
- How do you make predictions from a learned model (*e.g. apply the model to new data*)?
- How do you best prepare your data for modeling with the technique (*e.g. assumptions*)?
- How do you get more information on the technique (*e.g. where to look*)?

You will note that I have phrased all of these questions as How-To. I did this intentionally to separate the practical concerns of how from the more theoretical concerns of why. I think knowing why a technique works is less important than knowing how it works, if you are looking to use it as a tool to get results. More on this in the next section.

Let’s take a closer look at each of these questions in turn.

This is obvious but important. You need to know the canonical name of the technique.

You need to be able to recognize the classical name or the name of the method from other fields as well and know that it is the same thing. This also includes the acronym for the algorithm, because sometimes they are less than intuitive.

This will help you sort out the base algorithm from extensions and the family tree of where the algorithm fits and relates to similar algorithms.

I really like this nuts and bolts question.

This question is often overlooked in textbooks and papers and is perhaps the first question an engineer has when thinking about how a model will actually be used and deployed.

The representation is the numbers and data structure that captures the distinct details learned from data by the learning algorithm to be used by the prediction algorithm. It’s the stuff you save to disk or the database when you finalize your model. It’s the stuff you update when new training data becomes available.

Let’s make this concrete with an example. In the case of linear regression, the representation is the vector of regression coefficients. That’s it. In the case of a decision tree it is the tree itself including the nodes, how they are connected and the variables and cut-off thresholds chosen.

Given some training data, the algorithm needs to create the model or fill in the model representation. This question is about exactly how that occurs.

Often learning involves estimating parameters from the training data directly in simpler algorithms.

In most other algorithms it involves using the training data as part of a cost or loss function and an optimization algorithm to minimize the function. Simpler linear techniques may use linear algebra to achieve this result, whereas others may use a numerical optimization.

Often the way a machine learning algorithm learns a model is synonymous with the algorithm itself. This is the challenging and often time consuming part of running a machine learning algorithm.

The learning algorithm may be parameterized and it is often a good idea to list common ranges for parameter values or configuration heuristics that may be used as a starting point.

Once a model is learned, it is intended to be used to make predictions on new data. Note, we are exclusively talking about predictive modeling machine learning algorithms for classification and regression problems.

This is often the fast and even trivial part of using a machine learning algorithm. Often it is so trivial that it is not even mentioned or discussed in the literature.

It may be trivial because prediction may be as simple as filling in the inputs in an equation and calculating a prediction, or traversing a decision tree to see what leaf-node lights up. In other algorithms, like k-nearest neighbors the prediction algorithm may be the main show (k-NN has no training algorithm other than “store the whole training set”).

Machine learning algorithms make assumptions.

Even the most relaxed non-parametric methods make assumptions about your training data. It is good or even critical to review these assumptions. Even better is to translate these assumptions into specific data preparation operations that you can perform.

This question flushes out transforms that you could use on your data before modeling, or at the very least gives you pause to think about data transforms to try. What I mean by this is that it is best to treat algorithm requirements and assumptions as suggestions of things to try to get the most out of your model rather than hard and fast rules that your data must adhere to.

Just like you cannot know which algorithm will be best for your data beforehand, you cannot know the best transforms to apply to your data to get the most from an algorithm. Real data is messy and it is a good idea to try a number of presentations of your data with a number of different algorithms to see what warrants deeper investigation. The requirements and assumptions of machine learning algorithms help to point out presentations of your data to try.

Some algorithms will bubble up as generally better than others on your data problem.

When they do, you need to know where to look to get a deeper understanding of the technique. This can help with further customizing the algorithm for your data and with tuning the parameters of the learning and prediction algorithms.

It is a good idea to collect and list resources that you can reference if and when you need to dive deeper. This may include:

- Journal Articles
- Conference Papers
- Books including textbooks and monographs
- Webpages

I also think it is a good idea to know of more practical references like example tutorials and open source implementations that you can look inside to get a more concrete idea of what is going on.

For more on researching machine learning algorithms, see the post How to Research a Machine Learning Algorithm.

In this post you discovered 6 questions that you can ask of a machine learning algorithm that, if answered, will give you a very good and practical idea of how it works and how you can use it effectively.

These questions were focused on machine learning algorithms for predictive modeling problems like classification and regression.

These questions, phrased simply are:

- What are the common names of the algorithm?
- What representation is used by the model?
- How does the algorithm learn from training data?
- How can you make predictions from the model on new data?
- How can you best prepare your data for the algorithm?
- Where can you look for more information about the algorithm?

For another post along this theme of defining an algorithm description template see How to Learn a Machine Learning Algorithm.

Do you like this approach? Let me know in the comments.

The post Boosting and AdaBoost for Machine Learning appeared first on Machine Learning Mastery.

In this post you will discover the AdaBoost Ensemble method for machine learning. After reading this post, you will know:

- What the boosting ensemble method is and generally how it works.
- How to learn to boost decision trees using the AdaBoost algorithm.
- How to make predictions using the learned AdaBoost model.
- How to best prepare your data for use with the AdaBoost algorithm

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting.

Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is best used to boost the performance of decision trees on binary classification problems.

AdaBoost was originally called AdaBoost.M1 by the authors of the technique Freund and Schapire. More recently it may be referred to as discrete AdaBoost because it is used for classification rather than regression.

AdaBoost can be used to boost the performance of any machine learning algorithm. It is best used with weak learners. These are models that achieve accuracy just above random chance on a classification problem.

The most suited and therefore most common algorithms used with AdaBoost are decision trees with one level. Because these trees are so short and only contain one decision for classification, they are often called decision stumps.

Each instance in the training dataset is weighted. The initial weight is set to:

weight(xi) = 1/n

Where xi is the i’th training instance and n is the number of training instances.

A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.

The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:

error = (N - correct) / N

Where error is the misclassification rate, correct is the number of training instances predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly the error or misclassification rate would be (100-78)/100 or 0.22.

This is modified to use the weighting of the training instances:

error = sum(w(i) * terror(i)) / sum(w)

Which is the weighted sum of the misclassification rate, where w is the weight for training instance i and terror is the prediction error for training instance i which is 1 if misclassified and 0 if correctly classified.

For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. If the predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0. The misclassification rate would be calculated as:

error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)

or

error = 0.704
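The weighted misclassification rate above can be checked with a short sketch. The function name is hypothetical; the weights, predictions and actual values are the ones from the worked example.

```python
def weighted_error(weights, predictions, actuals):
    """Weighted misclassification rate: sum(w * terror) / sum(w)."""
    # terror is 1 for a misclassified instance and 0 for a correct one.
    terrors = [0 if p == y else 1 for p, y in zip(predictions, actuals)]
    return sum(w * t for w, t in zip(weights, terrors)) / sum(weights)


weights = [0.01, 0.5, 0.2]
predictions = [-1, -1, -1]
actuals = [-1, 1, -1]
error = weighted_error(weights, predictions, actuals)
print(round(error, 3))  # 0.704
```

Note how a single mistake on a heavily weighted instance dominates the error, even though two of the three instances were predicted correctly.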

A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1-error) / error)

Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.
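A small sketch of the stage calculation shows how the value behaves for different error rates. The function name and the sample error values are assumptions for illustration only.

```python
import math


def stage_value(error):
    """Stage value for a weak learner: ln((1 - error) / error)."""
    return math.log((1.0 - error) / error)


# Hypothetical error rates: a good learner, a coin flip, and a poor learner.
for error in (0.2, 0.5, 0.704):
    print(error, stage_value(error))
```

A model with an error of exactly 0.5 gets a stage value of 0 and so contributes nothing, while an error above 0.5 yields a negative stage value. The formula is undefined for an error of exactly 0 or 1, so a practical implementation would need to guard against those cases.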

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.

For example, the weight of one training instance (w) is updated using:

w = w * exp(stage * terror)

Where w is the weight for a specific training instance, exp() is the numerical constant e or Euler’s number raised to a power, stage is the stage value for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

terror = 0 if(y == p), otherwise 1

Where y is the output variable for the training instance and p is the prediction from the weak learner.

This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.
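The weight update can be sketched as follows, using hypothetical data in which one of four equally weighted instances is misclassified. Many implementations also rescale the weights afterwards so they sum to 1; that step is not described above and is left out of this sketch.

```python
import math


def update_weights(weights, predictions, actuals, stage):
    """Apply w = w * exp(stage * terror) to every training instance."""
    updated = []
    for w, p, y in zip(weights, predictions, actuals):
        terror = 0 if y == p else 1  # 1 only when the instance was misclassified
        updated.append(w * math.exp(stage * terror))
    return updated


# Hypothetical example: 4 instances, the second one is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
predictions = [1, 1, -1, -1]
actuals = [1, -1, -1, -1]
error = 0.25                             # weighted error for this stump
stage = math.log((1.0 - error) / error)  # ln(3)
new_weights = update_weights(weights, predictions, actuals, stage)
print(new_weights)
```

Correctly classified instances keep their weight of 0.25, while the misclassified instance is scaled by exp(ln(3)), roughly tripling its weight.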

Weak models are added sequentially, trained using the weighted training data.

The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.

Once completed, you are left with a pool of weak learners each with a stage value.

Predictions are made by calculating the weighted average of the weak classifiers.

For a new input instance, each weak learner calculates a predicted value as either +1.0 or -1.0. The predicted values are weighted by each weak learner’s stage value. The prediction for the ensemble model is taken as the sum of the weighted predictions. If the sum is positive, then the first class is predicted, if negative the second class is predicted.

For example, 5 weak classifiers may predict the values 1.0, 1.0, -1.0, 1.0, -1.0. From a majority vote, it looks like the model will predict a value of 1.0 or the first class. These same 5 weak classifiers may have the stage values 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. Calculating the weighted sum of these predictions results in an output of -0.8, which would be an ensemble prediction of -1.0 or the second class.
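The worked example above can be reproduced with a short sketch. The function name is hypothetical, and treating a sum of exactly zero as the second class is an assumption, since the text does not say how ties are handled.

```python
def ensemble_predict(predictions, stages):
    """Return the stage-weighted sum of votes and the resulting class value."""
    weighted_sum = sum(s * p for p, s in zip(predictions, stages))
    # Assumption: a sum of exactly zero is assigned to the second class.
    label = 1.0 if weighted_sum > 0 else -1.0
    return weighted_sum, label


predictions = [1.0, 1.0, -1.0, 1.0, -1.0]
stages = [0.2, 0.5, 0.8, 0.2, 0.9]
weighted_sum, label = ensemble_predict(predictions, stages)
print(round(weighted_sum, 1), label)  # -0.8 -1.0
```

Although three of the five weak classifiers vote for the first class, the two dissenting classifiers carry the largest stage values and decide the outcome.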

This section lists some heuristics for best preparing your data for AdaBoost.

- **Quality Data**: Because the ensemble method continues to attempt to correct misclassifications in the training data, you need to be careful that the training data is of high quality.
- **Outliers**: Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. These could be removed from the training dataset.
- **Noisy Data**: Noisy data, specifically noise in the output variable, can be problematic. If possible, attempt to isolate and clean these from your training dataset.

Below are some machine learning texts that describe AdaBoost from a machine learning perspective.

- An Introduction to Statistical Learning: with Applications in R, page 321
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 10
- Applied Predictive Modeling, pages 203 and 389

Below are some seminal and good overview research articles on the method that may be useful if you are looking to dive deeper into the theoretical underpinnings of the method:

- A decision-theoretic generalization of on-line learning and an application to boosting, 1995
- Improved Boosting Algorithms Using Confidence-rated Predictions, 1999
- Explaining Adaboost, Chapter from Empirical Inference, 2013
- A Short Introduction to Boosting, 1999

In this post you discovered the Boosting ensemble method for machine learning. You learned about:

- Boosting and how it is a general technique that keeps adding weak learners to correct classification errors.
- AdaBoost as the first successful boosting algorithm for binary classification problems.
- Learning the AdaBoost model by weighting training instances and the weak learners themselves.
- Predicting with AdaBoost by weighting predictions from weak learners.
- Where to look for more theoretical background on the AdaBoost algorithm.

If you have any questions about this post, boosting or the AdaBoost algorithm, ask in the comments and I will do my best to answer.

The post Boosting and AdaBoost for Machine Learning appeared first on Machine Learning Mastery.

The post Bagging and Random Forest Ensemble Algorithms for Machine Learning appeared first on Machine Learning Mastery.

In this post you will discover the Bagging ensemble algorithm and the Random Forest algorithm for predictive modeling. After reading this post you will know about:

- The bootstrap method for estimating statistical quantities from samples.
- The Bootstrap Aggregation algorithm for creating multiple different models from a single training dataset.
- The Random Forest algorithm that makes a small tweak to Bagging and results in a very powerful classifier.

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.

If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

Before we get to Bagging, let’s take a quick look at an important foundation technique called the bootstrap.

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.

We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:

- Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
- Calculate the mean of each sub-sample.
- Calculate the average of all of our collected means and use that as our estimated mean for the data.

For example, let’s say we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these we could take the estimated mean of the data to be 3.367.
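The bootstrap steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bootstrap_mean`, the default of 1000 resamples and the fixed seed are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_mean(sample, n_resamples=1000, seed=0):
    """Average the means of many resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement, so the same value can be picked twice.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    # The estimate is the average of all collected means.
    return statistics.fmean(means)

# The three-resample example from the text.
print(round(statistics.fmean([2.3, 4.5, 3.3]), 3))  # 3.367
```

Calling `bootstrap_mean(x)` on a list of 100 values follows the three steps listed above. Swapping `statistics.fmean` for another statistic, such as `statistics.stdev`, would estimate that quantity instead.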

This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients.

Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method.

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance. Decision trees, like classification and regression trees (CART), are an example of a high-variance algorithm.

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

- Create many (e.g. 100) random sub-samples of our dataset with replacement.
- Train a CART model on each sample.
- Given a new input, collect the prediction from each model and combine them (the average for regression, the most frequent class for classification).

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
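The bagging procedure can be sketched as follows. This is only an outline under stated assumptions: `train_fn` is a hypothetical stand-in for any learner (such as CART) that takes a list of rows and returns a callable model, and the helper names are made up for this example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def bagging_fit(rows, train_fn, n_models=100, seed=0):
    """Train one model per bootstrap sample. train_fn is a hypothetical learner."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(rows, rng)) for _ in range(n_models)]

def bagging_predict_class(models, x):
    """Classification: take the most frequent predicted class."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def bagging_predict_value(models, x):
    """Regression: take the average of the predictions."""
    preds = [model(x) for model in models]
    return sum(preds) / len(preds)

# The five-tree vote from the text.
votes = ["blue", "blue", "red", "blue", "red"]
print(Counter(votes).most_common(1)[0][0])  # blue
```

The voting and averaging functions do not care how the sub-models were built, which is why the same combination step works for any high-variance learner.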

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

The only parameter when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement (e.g. on a cross validation test harness). Very large numbers of models may take a long time to prepare, but will not overfit the training data.

Just like the decision trees themselves, Bagging can be used for classification and regression problems.

Random Forests are an improvement over bagged decision trees.

A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions.

Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or, at worst, weakly correlated.

Random forest changes the algorithm for the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search.

The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross validation.

- For classification a good default is: m = sqrt(p)
- For regression a good default is: m = p/3

Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables. For example, if a dataset had 25 input variables for a classification problem, then:

- m = sqrt(25)
- m = 5
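As a small sketch, the defaults above and the random choice of features at a split point could look like this. The function names are hypothetical, and rounding the square root and using integer division for p/3 are assumptions, since the text does not say how to handle non-integer results.

```python
import math
import random

def default_m(p, task="classification"):
    """Default number of features to consider at each split point."""
    if task == "classification":
        return max(1, round(math.sqrt(p)))  # m = sqrt(p), rounded (assumption)
    return max(1, p // 3)                   # m = p/3, rounded down (assumption)

def candidate_features(p, m, rng):
    """Pick m distinct feature indexes out of p to search at one split point."""
    return rng.sample(range(p), m)

print(default_m(25, "classification"))  # 5
print(default_m(25, "regression"))      # 8
```

A fresh call to `candidate_features` would be made at every split point, which is what keeps the trees from all choosing the same strong variables.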

For each bootstrap sample taken from the training data, there will be samples left behind that were not included. These samples are called Out-Of-Bag samples or OOB.

The performance of each model on its left out samples when averaged can provide an estimated accuracy of the bagged models. This estimated performance is often called the OOB estimate of performance.

These performance measures are a reliable estimate of test error and correlate well with cross validation estimates.
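To make the idea of left-out samples concrete, here is a minimal sketch that draws one bootstrap sample by row index and reports which rows were never selected. The function name `bootstrap_with_oob` is an assumption for this example.

```python
import random

def bootstrap_with_oob(n_rows, rng):
    """Return (in-bag row indexes, out-of-bag row indexes) for one bootstrap sample."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # sampled with replacement
    chosen = set(in_bag)
    oob = [i for i in range(n_rows) if i not in chosen]      # rows never selected
    return in_bag, oob

rng = random.Random(0)
in_bag, oob = bootstrap_with_oob(1000, rng)
# On average about 1/e (roughly 37%) of the rows end up out-of-bag.
print(len(oob) / 1000)
```

Each model would be scored only on its own `oob` rows, and averaging those scores across all models gives the OOB estimate described above.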

As the Bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point.

In regression problems this may be the drop in sum squared error and in classification this might be the Gini score.

These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance.
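For classification, the drop at a single split can be sketched with the Gini impurity. This is an illustration under assumptions: the helper names are hypothetical, and `split_records` stands for whatever (feature, drop) pairs you collect while the trees are built.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_drop(parent, left, right):
    """How much the size-weighted impurity falls when parent is split into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

def average_importance(split_records):
    """Average the recorded drops per feature across all splits in all trees."""
    totals, counts = defaultdict(float), defaultdict(int)
    for feature, drop in split_records:
        totals[feature] += drop
        counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# A perfect split of two classes removes all impurity.
print(gini_drop([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
```

For regression, the same bookkeeping would apply with the drop in sum squared error in place of `gini_drop`.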

These outputs can help identify subsets of input variables that may be most or least relevant to the problem and suggest possible feature selection experiments you could perform where some features are removed from the dataset.

Bagging is a simple technique that is covered in most introductory machine learning texts. Some examples are listed below.

- An Introduction to Statistical Learning: with Applications in R, Chapter 8.
- Applied Predictive Modeling, Chapter 8 and Chapter 14.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 15

In this post you discovered the Bagging ensemble machine learning algorithm and the popular variation called Random Forest. You learned:

- How to estimate statistical quantities from a data sample.
- How to combine the predictions from multiple high-variance models using bagging.
- How to tweak the construction of decision trees when bagging to de-correlate their predictions, a technique called Random Forests.

Do you have any questions about this post or the Bagging or Random Forest Ensemble algorithms?

Leave a comment and ask your question and I will do my best to answer it.

The post Bagging and Random Forest Ensemble Algorithms for Machine Learning appeared first on Machine Learning Mastery.

The post Support Vector Machines for Machine Learning appeared first on Machine Learning Mastery.

Support Vector Machines were extremely popular around the time they were developed in the 1990s and continue to be the go-to method for a high-performing algorithm with little tuning.

In this post you will discover the Support Vector Machine (SVM) machine learning algorithm. After reading this post you will know:

- How to disentangle the many names used to refer to support vector machines.
- The representation used by SVM when the model is actually stored on disk.
- How a learned SVM model representation can be used to make predictions for new data.
- How to learn an SVM model from training data.
- How to best prepare your data for the SVM algorithm.
- Where you might look to get more information on SVM.

SVM is an exciting algorithm and the concepts are relatively simple. This post was written for developers with little or no background in statistics and linear algebra.

As such we will stay high-level in this description and focus on the specific implementation concerns. The questions of why specific equations are used or how they were derived are not covered, and you may want to dive deeper in the further reading section.

Let’s get started.

The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in practice.

The numeric input variables (x) in your data (the columns) form an n-dimensional space. For example, if you had two input variables, this would form a two-dimensional space.

A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1. In two-dimensions you can visualize this as a line and let’s assume that all of our input points can be completely separated by this line. For example:

B0 + (B1 * X1) + (B2 * X2) = 0

Where the coefficients (B1 and B2) that determine the slope of the line and the intercept (B0) are found by the learning algorithm, and X1 and X2 are the two input variables.

You can make classifications using this line. By plugging in input values into the line equation, you can calculate whether a new point is above or below the line.

- Above the line, the equation returns a value greater than 0 and the point belongs to the first class (class 0).
- Below the line, the equation returns a value less than 0 and the point belongs to the second class (class 1).
- A value close to the line returns a value close to zero and the point may be difficult to classify.
- If the magnitude of the value is large, the model may have more confidence in the prediction.
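The classification rule in the list above can be written out directly. The coefficient values used here are made up purely for illustration; in practice they would be found by the learning algorithm.

```python
def line_value(x1, x2, b0, b1, b2):
    """Evaluate B0 + (B1 * X1) + (B2 * X2) for one point."""
    return b0 + b1 * x1 + b2 * x2

def classify(x1, x2, b0, b1, b2):
    """Convention from the text: above the line (> 0) is class 0, below is class 1."""
    # A value of exactly 0 lies on the line; assigning it to class 1 is an arbitrary choice.
    return 0 if line_value(x1, x2, b0, b1, b2) > 0 else 1

# Hypothetical coefficients: B0 = -1, B1 = 1, B2 = 1.
print(line_value(2, 2, -1, 1, 1), classify(2, 2, -1, 1, 1))  # 3 0
print(line_value(0, 0, -1, 1, 1), classify(0, 0, -1, 1, 1))  # -1 1
```

The magnitude returned by `line_value` is what gives the rough sense of confidence mentioned in the last bullet.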

The distance between the line and the closest data points is referred to as the margin. The best or optimal line that can separate the two classes is the line that has the largest margin. This is called the Maximal-Margin hyperplane.

The margin is calculated as the perpendicular distance from the line to only the closest points. Only these points are relevant in defining the line and in the construction of the classifier. These points are called the support vectors. They support or define the hyperplane.

The hyperplane is learned from training data using an optimization procedure that maximizes the margin.

In practice, real data is messy and cannot be separated perfectly with a hyperplane.

The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft margin classifier. This change allows some points in the training data to violate the separating line.

An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables. This increases the complexity of the model, as there are more parameters for the model to fit to the data.

A tuning parameter, called simply C, is introduced that defines the magnitude of the wiggle allowed across all dimensions. The C parameter defines the amount of violation of the margin allowed. C=0 allows no violation, and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value of C, the more violations of the hyperplane are permitted. Note that C is described here as a budget for violations; libraries such as LIBSVM and scikit-learn define C as a penalty on violations, so in those libraries the relationship runs the other way.

During the learning of the hyperplane from data, all training instances that lie within the distance of the margin will affect the placement of the hyperplane and are referred to as support vectors. And as C affects the number of instances that are allowed to fall within the margin, C influences the number of support vectors used by the model.

- The smaller the value of C, the more sensitive the algorithm is to the training data (higher variance and lower bias).
- The larger the value of C, the less sensitive the algorithm is to the training data (lower variance and higher bias).

The SVM algorithm is implemented in practice using a kernel.

The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra, which is out of the scope of this introduction to SVM.

A powerful insight is that the linear SVM can be rephrased using the inner product of any two given observations, rather than the observations themselves. The inner product between two vectors is the sum of the multiplication of each pair of input values.

For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6 or 28.

The equation for making a prediction for a new input using the dot product between the input (x) and each support vector (xi) is calculated as follows:

f(x) = B0 + sum(ai * (x,xi))

This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in training data. The coefficients B0 and ai (for each input) must be estimated from the training data by the learning algorithm.
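The prediction equation above can be sketched as follows. The support vectors, coefficients and intercept below are invented for illustration only; a real model would learn them from training data, and the sketch assumes each coefficient already carries the sign of its class.

```python
def dot(a, b):
    """Inner product: the sum of the products of each pair of values."""
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_output(x, support_vectors, alphas, b0):
    """f(x) = B0 + sum(ai * (x, xi)) over all support vectors xi."""
    return b0 + sum(a * dot(x, sv) for a, sv in zip(alphas, support_vectors))

# The inner product example from the text.
print(dot([2, 3], [5, 6]))  # 28

# Hypothetical model with two support vectors.
print(svm_output([2, 3], [[1, 1], [-1, -1]], [0.5, -0.5], 0.0))  # 5.0
```

Only the support vectors appear in the sum, which is why the rest of the training data does not need to be kept to make predictions.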

The dot-product is called the kernel and can be re-written as:

K(x, xi) = sum(x * xi)

The kernel defines the similarity or a distance measure between new data and the support vectors. The dot product is the similarity measure used for linear SVM or a linear kernel because the distance is a linear combination of the inputs.

Other kernels can be used that transform the input space into higher dimensions such as a Polynomial Kernel and a Radial Kernel. This is called the Kernel Trick.

It is desirable to use more complex kernels as it allows lines to separate the classes that are curved or even more complex. This in turn can lead to more accurate classifiers.

Instead of the dot-product, we can use a polynomial kernel, for example:

K(x,xi) = (1 + sum(x * xi))^d

Where the degree of the polynomial (d) must be specified by hand to the learning algorithm. When d=1 this is the same as the linear kernel, apart from the added constant. The polynomial kernel allows for curved lines in the input space.

Finally, we can also have a more complex radial kernel. For example:

K(x,xi) = exp(-gamma * sum((x - xi)^2))

Where gamma is a parameter that must be specified to the learning algorithm. A good default value for gamma is 0.1, where gamma is often 0 < gamma < 1. The radial kernel is very local and can create complex regions within the feature space, like closed polygons in two-dimensional space.
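The three kernels discussed in this section can be written as small functions. This is a sketch for intuition, with the polynomial kernel written in its common form (1 + sum(x * xi))^d and with default values for `d` and `gamma` chosen only for the example.

```python
import math

def linear_kernel(x, xi):
    """Dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    """(1 + dot product) raised to the degree d."""
    return (1 + linear_kernel(x, xi)) ** d

def radial_kernel(x, xi, gamma=0.1):
    """exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

print(linear_kernel([2, 3], [5, 6]))             # 28
print(polynomial_kernel([2, 3], [5, 6], d=2))    # 841
print(round(radial_kernel([0, 0], [1, 1]), 4))   # 0.8187
```

Notice that the radial kernel returns 1 for identical points and falls toward 0 as points move apart, which is what makes it so local.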

The SVM model needs to be solved using an optimization procedure.

You can use a numerical optimization procedure to search for the coefficients of the hyperplane. This is inefficient and is not the approach used in widely used SVM implementations like LIBSVM. If implementing the algorithm as an exercise, you could use stochastic gradient descent.
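If you want to try the exercise, a linear SVM trained with sub-gradient steps on the hinge loss might look like the sketch below. This is not how LIBSVM works, and several things here are assumptions made for the example: the classes are encoded as +1 and -1 rather than 0 and 1, the rows are visited in a fixed order so the result is repeatable (a true stochastic version would shuffle them), and `lr`, `lam` and `epochs` are arbitrary illustrative settings.

```python
def train_linear_svm_sgd(X, y, lr=0.1, lam=0.01, epochs=100):
    """Fit weights w and intercept b for a linear SVM. Labels in y must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin or misclassified: move toward the correct side.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with room to spare: only apply the regularization shrink.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# A tiny, clearly separable toy dataset.
X = [[2, 2], [3, 3], [-2, -2], [-3, -3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm_sgd(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```

The `lam` term plays the role of the trade-off between a wide margin and margin violations discussed earlier.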

There are specialized optimization procedures that re-formulate the optimization problem to be a Quadratic Programming problem. The most popular method for fitting SVM is the Sequential Minimal Optimization (SMO) method that is very efficient. It breaks the problem down into sub-problems that can be solved analytically (by calculating) rather than numerically (by searching or optimizing).

This section lists some suggestions for how to best prepare your training data when learning an SVM model.

- **Numerical Inputs**: SVM assumes that your inputs are numeric. If you have categorical inputs you may need to convert them to binary dummy variables (one variable for each category).
- **Binary Classification**: Basic SVM as described in this post is intended for binary (two-class) classification problems, although extensions have been developed for regression and multi-class classification.

Support Vector Machines are a huge area of study. There are numerous books and papers on the topic. This section lists some of the seminal and most useful results if you are looking to dive deeper into the background and theory of the technique.

Vladimir Vapnik, one of the inventors of the technique, has two books that are considered seminal on the topic. They are very mathematical and rigorous.

- The Nature of Statistical Learning Theory, Vapnik, 1995
- Statistical Learning Theory, Vapnik, 1998

Any good book on machine learning will cover SVM. Below are some of my favorites.

- An Introduction to Statistical Learning: with Applications in R, Chapter 9
- Applied Predictive Modeling, Chapter 13
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Chapter 12

There are countless tutorials and journal articles on SVM. Below is a link to a seminal paper on SVM by Cortes and Vapnik and another to an excellent introductory tutorial.

- Support-Vector Networks [PDF] by Cortes and Vapnik 1995
- A Tutorial on Support Vector Machines for Pattern Recognition [PDF] 1998

Wikipedia provides some good (although dense) information on the topic:

Finally, there are a lot of posts on Q&A sites asking for simple explanations of SVM. Below are two picks that you might find useful.

- What does support vector machine (SVM) mean in layman’s terms?
- Please explain Support Vector Machines (SVM) like I am a 5 year old

In this post you discovered the Support Vector Machine Algorithm for machine learning. You learned about:

- The Maximal-Margin Classifier that provides a simple theoretical model for understanding SVM.
- The Soft Margin Classifier which is a modification of the Maximal-Margin Classifier to relax the margin to handle noisy class boundaries in real data.
- Support Vector Machines and how the learning algorithm can be reformulated as a dot-product kernel and how other kernels like Polynomial and Radial can be used.
- How you can use numerical optimization to learn the hyperplane and that efficient implementations use an alternate optimization scheme called Sequential Minimal Optimization.

Do you have any questions about SVM or this post?

Ask in the comments and I will do my best to answer.

The post Support Vector Machines for Machine Learning appeared first on Machine Learning Mastery.

The post Learning Vector Quantization for Machine Learning appeared first on Machine Learning Mastery.

The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to hang onto and learns exactly what those instances should look like.

In this post you will discover the Learning Vector Quantization algorithm. After reading this post you will know:

- The representation used by the LVQ algorithm that you actually save to a file.
- The procedure that you can use to make predictions with a learned LVQ model.
- How to learn an LVQ model from training data.
- The data preparation to use to get the best performance from the LVQ algorithm.
- Where to look for more information on LVQ.

This post was written for developers and assumes no background in statistics or mathematics. The post focuses on how the algorithm works and how to use it for predictive modeling problems.

If you have any questions on LVQ, leave a comment and I will do my best to answer.

Let’s get started.

The representation for LVQ is a collection of codebook vectors.

LVQ was developed and is best understood as a classification algorithm. It supports both binary (two-class) and multi-class classification problems.

A codebook vector is a list of numbers that has the same input and output attributes as your training data. For example, if your problem is a binary classification with classes 0 and 1, and the inputs width, length and height, then a codebook vector would comprise all four attributes: width, length, height and class.

The model representation is a fixed pool of codebook vectors, learned from the training data. They look like training instances, but the values of each attribute have been adapted based on the learning procedure.

In the language of neural networks, each codebook vector may be called a neuron, each attribute on a codebook vector is called a weight and the collection of codebook vectors is called a network.

Predictions are made using the LVQ codebook vectors in the same way as K-Nearest Neighbors.

Predictions are made for a new instance (x) by searching through all codebook vectors for the K most similar instances and summarizing the output variable for those K instances. For classification this is the mode (or most common) class value.

Typically predictions are made with K=1, and the codebook vector that matches is called the Best Matching Unit (BMU).

To determine which codebook vectors are most similar to a new input, a distance measure is used. For real-valued input variables, the most popular distance measure is Euclidean distance. Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (xi) for each attribute j.

EuclideanDistance(x, xi) = sqrt( sum( (xj - xij)^2 ) )
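Making a prediction with K=1 then comes down to finding the closest codebook vector. Below is a minimal sketch; storing each codebook vector as an (inputs, class) pair is an assumption made for this example.

```python
import math

def euclidean_distance(x, xi):
    """Square root of the sum of squared differences across attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xi)))

def best_matching_unit(codebooks, x):
    """Return the (inputs, class) pair whose inputs are closest to x."""
    return min(codebooks, key=lambda cb: euclidean_distance(cb[0], x))

def predict(codebooks, x):
    """With K=1 the prediction is simply the class of the BMU."""
    return best_matching_unit(codebooks, x)[1]

# Two hypothetical codebook vectors, one per class.
codebooks = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
print(predict(codebooks, [0.9, 0.8]))  # 1
```

Because only the codebook vectors are searched, prediction cost depends on the size of the pool rather than the size of the training dataset.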

The LVQ algorithm learns the codebook vectors from the training data.

You must choose the number of codebook vectors to use, such as 20 or 40. You can find the best number of codebook vectors to use by testing different configurations on your training dataset.

The learning algorithm starts with a pool of random codebook vectors. These could be randomly selected instances from the training data, or randomly generated vectors with the same scale as the training data. Codebook vectors have the same number of input attributes as the training data. They also have an output class variable.

The instances in the training dataset are processed one at a time. For a given training instance, the most similar codebook vector is selected from the pool.

If the codebook vector has the same output as the training instance, the codebook vector is moved closer to the training instance. If it does not match, it is moved further away. The amount that the vector is moved is controlled by an algorithm parameter called the learning_rate.

For example, the input variable (x) of a codebook vector is moved closer to the training input value (t) by the amount in the learning_rate if the classes match as follows:

x = x + learning_rate * (t - x)

The opposite case of moving the input variables of a codebook vector away from a training instance is calculated as:

x = x - learning_rate * (t - x)

This would be repeated for each input variable.
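Both update rules can be captured in one small function that applies the change to every input variable of the selected codebook vector. The function name and argument layout are assumptions made for this sketch.

```python
def update_codebook(codebook, codebook_class, train_row, train_class, learning_rate):
    """Move the codebook inputs toward the training row if classes match, away otherwise."""
    sign = 1.0 if codebook_class == train_class else -1.0
    return [x + sign * learning_rate * (t - x) for x, t in zip(codebook, train_row)]

# Same class: the codebook vector moves a quarter of the way toward the training row.
print(update_codebook([0.0, 1.0], 0, [1.0, 0.0], 0, 0.25))  # [0.25, 0.75]
# Different class: it moves the same amount in the opposite direction.
print(update_codebook([0.0, 1.0], 0, [1.0, 0.0], 1, 0.25))  # [-0.25, 1.25]
```

Only the inputs are changed; the class attached to a codebook vector stays fixed throughout training.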

Because one codebook vector is selected for modification for each training instance the algorithm is referred to as a winner-take-all algorithm or a type of competitive learning.

This process is repeated for each instance in the training dataset. One iteration of the training dataset is called an epoch. The process is completed for a number of epochs that you must choose (max_epoch), such as 200.

You must also choose an initial learning rate (such as alpha=0.3). The learning rate is decreased with each epoch, starting at the large value you specify in the first epoch, which makes the most change to the codebook vectors, and finishing with a small value near zero in the last epoch, which makes very minor changes to the codebook vectors.

The learning rate for each epoch is calculated as:

learning_rate = alpha * (1 - (epoch/max_epoch))

Where learning_rate is the learning rate for the current epoch (0 to max_epoch-1), alpha is the learning rate specified to the algorithm at the start of the training run and max_epoch is the total number of epochs to run the algorithm also specified at the start of the run.
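The decay schedule is easy to check with a few values. This short sketch uses the example settings mentioned in the text (alpha=0.3 and 200 epochs); the function name is an assumption.

```python
def learning_rate_for_epoch(alpha, epoch, max_epoch):
    """Linear decay from alpha at epoch 0 toward zero at the final epoch."""
    return alpha * (1.0 - epoch / max_epoch)

for epoch in (0, 100, 199):
    print(epoch, round(learning_rate_for_epoch(0.3, epoch, 200), 4))
# 0 0.3
# 100 0.15
# 199 0.0015
```

Because epochs are counted from 0 to max_epoch-1, the rate never quite reaches zero, so even the last epoch makes a very small adjustment.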

The intuition for the learning process is that the pool of codebook vectors is a compression of the training dataset to the points that best characterize the separation of the classes.

Generally, it is a good idea to prepare data for LVQ in the same way as you would prepare it for K-Nearest Neighbors.

- **Classification**: LVQ is a classification algorithm that works for both binary (two-class) and multi-class classification problems. The technique has been adapted for regression.
- **Multiple-Passes**: Good technique with LVQ involves performing multiple passes of the training dataset over the codebook vectors (e.g. multiple learning runs). The first run uses a higher learning rate to settle the pool of codebook vectors and the second run uses a small learning rate to fine tune the vectors.
- **Multiple Best Matches**: Extensions of LVQ select multiple best matching units to modify during learning, such as one of the same class and one of a different class which are drawn toward and away from a training sample respectively. Other extensions use a custom learning rate for each codebook vector. These extensions can improve the learning process.
- **Normalize Inputs**: Traditionally, inputs are normalized (rescaled) to values between 0 and 1. This is to prevent one attribute from dominating the distance measure. If the input data is normalized, then the initial values for the codebook vectors can be selected as random values between 0 and 1.
- **Feature Selection**: Feature selection that reduces the dimensionality of the input variables can improve the accuracy of the method. LVQ suffers from the same curse of dimensionality in making predictions as K-Nearest Neighbors.

The technique was developed by Kohonen who wrote the seminal book on LVQ and the sister method Self-Organizing Maps called: Self-Organizing Maps.

I highly recommend this book if you are interested in LVQ.

- Learning Vector Quantization on Wikipedia.
- Learning Vector Quantization chapter from my book Nature Inspired Algorithms.
- LVQ_PAK the official software implementation of LVQ by Kohonen.
- LVQ as plug-in for WEKA (that I created years ago).

In this post you discovered the LVQ algorithm. You learned:

- The representation for LVQ is a small pool of codebook vectors, smaller than the training dataset.
- The codebook vectors are used to make predictions using the same technique as K-Nearest Neighbors.
- The codebook vectors are learned from the training dataset by moving them closer when they are a good match and further away when they are a bad match.
- The codebook vectors are a compression of the training data to best separate the classes.
- Data preparation traditionally involves normalizing the input values to the range between 0 and 1.

Do you have any questions about this post or the LVQ algorithm? Leave a comment and ask your question and I will do my best to answer it.

The post Learning Vector Quantization for Machine Learning appeared first on Machine Learning Mastery.

The post K-Nearest Neighbors for Machine Learning appeared first on Machine Learning Mastery.

- The model representation used by KNN.
- How a model is learned using KNN (hint, it’s not).
- How to make predictions using KNN.
- The many names for KNN including how different fields refer to it.
- How to prepare your data to get the most from KNN.
- Where to look to learn more about the KNN algorithm.

This post was written for developers and assumes no background in statistics or mathematics. The focus is on how the algorithm works and how to use it for predictive modeling problems. If you have any questions, leave a comment and I will do my best to answer.

Let’s get started.

The model representation for KNN is the entire training dataset.

It is as simple as that.

KNN has no model other than storing the entire dataset, so there is no learning required.

Efficient implementations can store the data using complex data structures like k-d trees to make look-up and matching of new patterns during prediction efficient.

Because the entire training dataset is stored, you may want to think carefully about the consistency of your training data. It might be a good idea to curate it, update it often as new data becomes available and remove erroneous and outlier data.

KNN makes predictions using the training dataset directly.

Predictions are made for a new instance (x) by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression this might be the mean output variable, in classification this might be the mode (or most common) class value.

To determine which of the K instances in the training dataset are most similar to a new input a distance measure is used. For real-valued input variables, the most popular distance measure is Euclidean distance.

Euclidean distance is calculated as the square root of the sum of the squared differences between a new point (x) and an existing point (xi) across all input attributes j.

EuclideanDistance(x, xi) = sqrt( sum( (xj - xij)^2 ) )

Other popular distance measures include:

- **Hamming Distance**: Calculate the distance between binary vectors.
- **Manhattan Distance**: Calculate the distance between real vectors using the sum of their absolute difference. Also called City Block Distance.
- **Minkowski Distance**: Generalization of Euclidean and Manhattan distance.

There are many other distance measures that can be used, such as Tanimoto, Jaccard, Mahalanobis and cosine distance. You can choose the best distance metric based on the properties of your data. If you are unsure, you can experiment with different distance metrics and different values of K together and see which mix results in the most accurate models.

Euclidean is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights). Manhattan distance is a good measure to use if the input variables are not similar in type (such as age, gender, height, etc.).
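The distance measures named above are short enough to write out by hand, which makes it easy to experiment with them. This is a plain Python sketch with hypothetical function names.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences (City Block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalization: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))           # 5.0
print(manhattan([0, 0], [3, 4]))           # 7
print(hamming([1, 0, 1, 1], [1, 1, 0, 1])) # 2
```

Any of these can be passed to a nearest-neighbor search in place of Euclidean distance, so trying several alongside different values of K is cheap.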

The value for K can be found by algorithm tuning. It is a good idea to try many different values for K (e.g. values from 1 to 21) and see what works best for your problem.

The computational complexity of KNN increases with the size of the training dataset. For very large training sets, KNN can be made stochastic by taking a sample from the training dataset from which to calculate the K-most similar instances.

KNN has been around for a long time and has been very well studied. As such, different disciplines have different names for it, for example:

- **Instance-Based Learning**: The raw training instances are used to make predictions. As such, KNN is often referred to as instance-based learning or case-based learning (where each training instance is a case from the problem domain).
- **Lazy Learning**: No learning of the model is required and all of the work happens at the time a prediction is requested. As such, KNN is often referred to as a lazy learning algorithm.
- **Non-Parametric**: KNN makes no assumptions about the functional form of the problem being solved. As such, KNN is referred to as a non-parametric machine learning algorithm.

KNN can be used for regression and classification problems.

When KNN is used for regression problems the prediction is based on the mean or the median of the K-most similar instances.
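
To illustrate, here is a small sketch of KNN regression with both summaries. The data is made up and includes one unusually large output value to show how the mean and the median can differ.

```python
import math
import statistics

def knn_regress(train_X, train_y, x, k, summary=statistics.mean):
    """Summarize the outputs of the k nearest rows with the mean or the median."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))
    return summary([train_y[i] for i in order[:k]])

# Made-up data: one input per row; the output 40.0 is an outlier.
train_X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
train_y = [10.0, 12.0, 11.0, 40.0, 13.0]

print(knn_regress(train_X, train_y, [4.0], k=3))                     # mean of 40, 11, 13
print(knn_regress(train_X, train_y, [4.0], k=3, summary=statistics.median))  # 13.0
```

The median is less affected by a single extreme neighbor, which may matter when the outputs are noisy.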

When KNN is used for classification, the output can be calculated as the class with the highest frequency from the K-most similar instances. Each instance in essence votes for its class and the class with the most votes is taken as the prediction.

Class probabilities can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance. For example, in a binary classification problem (class is 0 or 1):

p(class=0) = count(class=0) / (count(class=0)+count(class=1))
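
The same calculation can be sketched for any number of classes. The helper name and the list of neighbor classes below are made up for the example.

```python
from collections import Counter

def class_probabilities(neighbor_classes):
    """Normalized frequency of each class among the K most similar instances."""
    counts = Counter(neighbor_classes)
    total = len(neighbor_classes)
    return {label: count / total for label, count in counts.items()}

# Classes of the K=5 nearest neighbors for some new instance (made-up values).
probs = class_probabilities([0, 1, 1, 0, 1])
print(probs[0], probs[1])  # 0.4 0.6
```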

If you have an even number of classes (e.g. 2), it is a good idea to choose an odd value for K to avoid a tie. The inverse is often suggested as well: use an even value for K when you have an odd number of classes. Note that with three or more classes a tie is still possible whatever the value of K.

Ties can be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.
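
A possible sketch of this tie-breaking rule is shown below. It assumes the neighbor classes are already sorted from most to least similar, and it keeps expanding K one step at a time until the tie is broken or the data runs out; repeating the step is an extension made for this example.

```python
from collections import Counter

def vote_with_tie_break(sorted_labels, k):
    """Majority vote over the k nearest labels, growing k by 1 while the top vote is tied.

    sorted_labels must be ordered from most to least similar.
    Returns the winning class and the value of k that was finally used.
    """
    while True:
        counts = Counter(sorted_labels[:k]).most_common()
        tied = len(counts) > 1 and counts[0][1] == counts[1][1]
        if not tied or k >= len(sorted_labels):
            # If the tie survives to the end, the tied class seen first wins.
            return counts[0][0], k
        k += 1

labels = ["a", "b", "b", "a", "a", "b"]
print(vote_with_tie_break(labels, k=4))  # ('a', 5)
print(vote_with_tie_break(labels, k=3))  # ('b', 3)
```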

KNN works well with a small number of input variables (p), but struggles when the number of inputs is very large.

Each input variable can be considered a dimension of a p-dimensional input space. For example, if you had two input variables x1 and x2, the input space would be 2-dimensional.

As the number of dimensions increases the volume of the input space increases at an exponential rate.

In high dimensions, points that may be similar may have very large distances. All points will be far away from each other and our intuition for distances in simple 2 and 3-dimensional spaces breaks down. This might feel unintuitive at first, but it is a general problem known as the “Curse of Dimensionality”.
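
A small experiment can make this effect visible. The sketch below draws random points in the unit hypercube (an assumption chosen for simplicity) and compares the distance from a query point to its nearest and farthest neighbors. As the number of dimensions grows, the ratio moves toward 1, meaning the nearest neighbor is barely closer than the farthest one.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=1):
    """Ratio of the nearest to the farthest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, [rng.random() for _ in range(dim)])
             for _ in range(n_points)]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(nearest_to_farthest_ratio(dim), 3))
```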

- **Rescale Data**: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
- **Address Missing Data**: Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.
- **Lower Dimensionality**: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
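
The first two preparation steps above might look like the sketch below. It assumes missing values are marked with `None` and are filled with the column mean, which is only one of several imputation options; the helper names and the data are made up.

```python
def impute_mean(rows):
    """Replace None with the mean of the observed values in that column.

    Assumes every column has at least one observed value.
    """
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen))
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def normalize(rows):
    """Rescale every column to the range [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j, v in enumerate(row)] for row in rows]

# Made-up rows of (age, income) with one missing income.
raw = [[25.0, 50_000.0], [35.0, None], [45.0, 90_000.0]]
clean = normalize(impute_mean(raw))
print(clean)  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Without the rescaling step, the income column would dominate any distance calculation simply because its values are thousands of times larger than the ages.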

If you are interested in implementing KNN from scratch in Python, check out the post:

Below are some good machine learning texts that cover the KNN algorithm from a predictive modeling perspective.

- Applied Predictive Modeling, Chapter 7 for regression, Chapter 13 for classification.
- Data Mining: Practical Machine Learning Tools and Techniques, pages 76 and 128
- Doing Data Science: Straight Talk from the Frontline, page 71
- Machine Learning, Chapter 8

Also check out K-Nearest Neighbors on Wikipedia.

In this post you discovered the KNN machine learning algorithm. You learned that:

- KNN stores the entire training dataset which it uses as its representation.
- KNN does not learn any model.
- KNN makes predictions just-in-time by calculating the similarity between an input sample and each training instance.
- There are many distance measures to choose from to match the structure of your input data.
- It is a good idea to rescale your data, such as using normalization, when using KNN.

If you have any questions about this post or the KNN algorithm, ask in the comments and I will do my best to answer.

The post K-Nearest Neighbors for Machine Learning appeared first on Machine Learning Mastery.

]]>