The post A Gentle Introduction to Model Selection for Machine Learning appeared first on Machine Learning Mastery.

The challenge of applied machine learning is how to choose among the range of different models that you can use for your problem.

Naively, you might believe that model performance is sufficient, but should you also consider other concerns, such as how long the model takes to train or how easy it is to explain to project stakeholders? These concerns become more pressing if the chosen model must be used operationally for months or years.

Also, what are you choosing exactly: just the algorithm used to fit the model or the entire data preparation and model fitting pipeline?

In this post, you will discover the challenge of model selection for machine learning.

After reading this post, you will know:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Is Model Selection
- Considerations for Model Selection
- Model Selection Techniques

Model selection is the process of selecting one final machine learning model from among a collection of candidate machine learning models for a training dataset.

Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different model hyperparameters (e.g. different kernels in an SVM).

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one?

— Page 22, Machine Learning: A Probabilistic Perspective, 2012.

For example, we may have a dataset for which we are interested in developing a classification or regression predictive model. We cannot know beforehand which model will perform best on this problem. Therefore, we fit and evaluate a suite of different models on the problem.

**Model selection** is the process of choosing one of the models as the final model that addresses the problem.

Model selection is different from **model assessment**.

For example, we evaluate or assess candidate models in order to choose the best one, and this is model selection. Whereas once a model is chosen, it can be evaluated in order to communicate how well it is expected to perform in general; this is model assessment.

The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

— Page 175, An Introduction to Statistical Learning: with Applications in R, 2017.

Fitting models is relatively straightforward, although selecting among them is the true challenge of applied machine learning.

Firstly, we need to get over the idea of a “*best*” model.

All models have some predictive error, given the statistical noise in the data, the incompleteness of the data sample, and the limitations of each different model type. Therefore, the notion of a perfect or best model is not useful. Instead, we must seek a model that is “*good enough*.”

**What do we care about when choosing a final model?**

The project stakeholders may have specific requirements, such as maintainability and limited model complexity. As such, a model that has lower skill but is simpler and easier to understand may be preferred.

Alternately, if model skill is prized above all other concerns, then the ability of the model to perform well on out-of-sample data will be preferred regardless of the computational complexity involved.

Therefore, a “*good enough*” model may refer to many things and is specific to your project, such as:

- A model that meets the requirements and constraints of project stakeholders.
- A model that is sufficiently skillful given the time and resources available.
- A model that is skillful as compared to naive models.
- A model that is skillful relative to other tested models.
- A model that is skillful relative to the state-of-the-art.

Next, we must consider what is being selected.

For example, we are not selecting a fit model, as all models will be discarded. This is because once we choose a model, we will fit a new final model on all available data and start using it to make predictions.

Therefore, are we choosing among algorithms used to fit the models on the training dataset?

Some algorithms require specialized data preparation in order to best expose the structure of the problem to the learning algorithm. Therefore, we must go one step further and consider **model selection as the process of selecting among model development pipelines**.

Each pipeline may take in the same raw training dataset and output a model that can be evaluated in the same manner, but may require different or overlapping computational steps, such as:

- Data filtering.
- Data transformation.
- Feature selection.
- Feature engineering.
- And more…

The closer you look at the challenge of model selection, the more nuance you will discover.

Now that we are familiar with some considerations involved in model selection, let’s review some common methods for selecting a model.

The best approach to model selection requires “*sufficient*” data, which may be nearly infinite depending on the complexity of the problem.

In this ideal situation, we would split the data into training, validation, and test sets, then fit candidate models on the training set, evaluate and select them on the validation set, and report the performance of the final model on the test set.

If we are in a data-rich situation, the best approach […] is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

— Page 222, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
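The three-way split described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a synthetic one-feature regression dataset and two hypothetical candidates (`fit_mean` and `fit_line`). The 60/20/20 proportions and all function names are assumptions made for the example, not a recommendation from the sources quoted.

```python
import random

def split_three_ways(rows, train_frac=0.6, val_frac=0.2, seed=1):
    """Shuffle the rows, then cut them into train, validation, and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def fit_mean(train):
    """Candidate 1: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate 2: closed-form least squares for y = intercept + slope * x."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = (sum((x - mx) * (y - my) for x, y in train)
             / sum((x - mx) ** 2 for x, _ in train))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def mse(model, rows):
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic data (an assumption for the demo): y = 2x + 1 plus Gaussian noise.
rng = random.Random(7)
data = []
for _ in range(300):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

train, val, test = split_three_ways(data)

# Fit every candidate on the training set only.
fitted = {name: fit(train) for name, fit in {"mean": fit_mean, "line": fit_line}.items()}

# Model selection: choose the candidate with the lowest validation error.
val_scores = {name: mse(model, val) for name, model in fitted.items()}
best_name = min(val_scores, key=val_scores.get)

# Model assessment: the test set is touched once, for the chosen model only.
test_mse = mse(fitted[best_name], test)
print(best_name, round(test_mse, 3))
```

The point of the structure is that the validation score drives the choice while the test score is only reported, which keeps selection and assessment separate.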

This three-way split is impractical on most predictive modeling problems, given that we rarely have sufficient data or are even able to judge what would be sufficient.

In many applications, however, the supply of data for training and testing will be limited, and in order to build good models, we wish to use as much of the available data as possible for training. However, if the validation set is small, it will give a relatively noisy estimate of predictive performance.

– Page 32, Pattern Recognition and Machine Learning, 2006.

Instead, there are two main classes of techniques to approximate the ideal case of model selection; they are:

- **Probabilistic Measures**: Choose a model via in-sample error and complexity.
- **Resampling Methods**: Choose a model via estimated out-of-sample error.

Let’s take a closer look at each in turn.

Probabilistic measures involve analytically scoring a candidate model using both its performance on the training dataset and the complexity of the model.

It is known that training error is optimistically biased, and therefore is not a good basis for choosing a model. The performance can be penalized based on how optimistic the training error is believed to be. This is typically achieved using algorithm-specific methods, often linear, that penalize the score based on the complexity of the model.

Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.

– Page 33, Pattern Recognition and Machine Learning, 2006.

A model with fewer parameters is less complex and is therefore preferred, because it is likely to generalize better on average.

Four commonly used probabilistic model selection measures include:

- Akaike Information Criterion (AIC).
- Bayesian Information Criterion (BIC).
- Minimum Description Length (MDL).
- Structural Risk Minimization (SRM).

Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. in-sample bias) is known and tractable.
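As a rough illustration of how such a penalty works, the sketch below scores two least-squares models with AIC and BIC. It assumes Gaussian errors, for which the criteria reduce (up to an additive constant) to `n * ln(RSS / n) + 2k` and `n * ln(RSS / n) + k * ln(n)`, where `RSS` is the residual sum of squares on the training data and `k` is the number of fitted parameters. The two candidates and their RSS values are hypothetical numbers invented for the example, and conventions differ on whether `k` includes the noise variance.

```python
import math

def gaussian_aic(n, rss, k):
    """AIC for a least-squares fit with Gaussian errors, constants dropped."""
    return n * math.log(rss / n) + 2 * k

def gaussian_bic(n, rss, k):
    """BIC for the same setting; the penalty grows with the sample size n."""
    return n * math.log(rss / n) + k * math.log(n)

n = 100  # number of training examples (hypothetical)

# Hypothetical candidates: (name, training RSS, number of parameters).
candidates = [
    ("2 coefficients", 52.0, 2),
    ("5 coefficients", 50.0, 5),
]

scores = {
    name: (gaussian_aic(n, rss, k), gaussian_bic(n, rss, k))
    for name, rss, k in candidates
}

# Lower is better for both criteria.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (aic, bic) in scores.items():
    print(f"{name}: AIC={aic:.2f} BIC={bic:.2f}")
```

In this made-up case the larger model has slightly lower training error, but the complexity penalty outweighs the gain, so both criteria prefer the smaller model.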

Resampling methods seek to estimate the performance of a model (or more precisely, the model development process) on out-of-sample data.

This is achieved by splitting the training dataset into sub train and test sets, fitting a model on the sub train set, and evaluating it on the test set. This process may then be repeated multiple times and the mean performance across each trial is reported.

It is a type of Monte Carlo estimate of model performance on out-of-sample data, although the trials are not strictly independent: depending on the resampling method chosen, the same data may appear multiple times in different training or test datasets.

Three common resampling model selection methods include:

- Random train/test splits.
- Cross-Validation (k-fold, LOOCV, etc.).
- Bootstrap.

Most of the time, probabilistic measures (described in the previous section) are not available, so resampling methods are used.

By far the most popular is the cross-validation family of methods that includes many subtypes.

Probably the simplest and most widely used method for estimating prediction error is cross-validation.

— Page 241, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.

An example is the widely used k-fold cross-validation that splits the training dataset into k folds where each example appears in a test set only once.

Another is the leave one out (LOOCV) where the test set is comprised of a single sample and each sample is given an opportunity to be the test set, requiring N (the number of samples in the training set) models to be constructed and evaluated.
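The mechanics of k-fold cross-validation and LOOCV can be sketched without any library. The code below is a minimal illustration in plain Python; the helper names (`kfold_indices`, `cross_val_mse`), the tiny list of target values, and the use of a predict-the-mean model as the thing being evaluated are all assumptions made for the example.

```python
import random

def kfold_indices(n_samples, k, seed=1):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_mse(targets, k):
    """Mean squared error of a predict-the-mean model, averaged over k folds."""
    fold_scores = []
    for train_idx, test_idx in kfold_indices(len(targets), k):
        # "Fit" on the sub-train set: here that is just the mean target.
        mean_y = sum(targets[i] for i in train_idx) / len(train_idx)
        # Evaluate on the held-out fold.
        errors = [(targets[i] - mean_y) ** 2 for i in test_idx]
        fold_scores.append(sum(errors) / len(errors))
    return sum(fold_scores) / len(fold_scores)

# Hypothetical target values for a tiny regression problem.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 2.0]

score_5fold = cross_val_mse(targets, k=5)
score_loocv = cross_val_mse(targets, k=len(targets))  # LOOCV: k equals N
print(round(score_5fold, 3), round(score_loocv, 3))
```

Setting `k` to the number of samples turns the same routine into LOOCV, at the cost of fitting N models.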

This section provides more resources on the topic if you are looking to go deeper.

- Probabilistic Model Selection with AIC, BIC, and MDL
- A Gentle Introduction to Statistical Sampling and Resampling
- A Gentle Introduction to Monte Carlo Sampling for Probability
- A Gentle Introduction to k-fold Cross-Validation
- What is the Difference Between Test and Validation Datasets?

- Applied Predictive Modeling, 2013.
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2017.
- An Introduction to Statistical Learning: with Applications in R, 2017.
- Pattern Recognition and Machine Learning, 2006.
- Machine Learning: A Probabilistic Perspective, 2012.

In this post, you discovered the challenge of model selection for machine learning.

Specifically, you learned:

- Model selection is the process of choosing one among many candidate models for a predictive modeling problem.
- There may be many competing concerns when performing model selection beyond model performance, such as complexity, maintainability, and available resources.
- The two main classes of model selection techniques are probabilistic measures and resampling methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Reduce Variance in a Final Machine Learning Model appeared first on Machine Learning Mastery.

A problem with most final models is that they suffer variance in their predictions.

This means that each time you fit a model, you get a slightly different set of parameters that in turn will make slightly different predictions, sometimes more and sometimes less skillful than you expected.

This can be frustrating, especially when you are looking to deploy a model into an operational environment.

In this post, you will discover how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

After reading this post, you will know:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Let’s get started.

Once you have discovered which model and model hyperparameters result in the best skill on your dataset, you’re ready to prepare a final model.

A final model is trained on all available data, e.g. the training and the test sets.

It is the model that you will use to make predictions on new data where you do not know the outcome.

The final model is the outcome of your applied machine learning project.

To learn more about preparing a final model, see the post How to Train a Final Machine Learning Model.

The bias-variance trade-off is a conceptual idea in applied machine learning to help understand the sources of error in models.

- **Bias** refers to assumptions in the learning algorithm that narrow the scope of what can be learned. This is useful as it can accelerate learning and lead to stable results, at the cost of the assumption differing from reality.
- **Variance** refers to the sensitivity of the learning algorithm to the specifics of the training data, e.g. the noise and specific observations. This is good as the model will be specialized to the data, at the cost of learning random noise and varying each time it is trained on different data.

The bias-variance tradeoff is a conceptual tool to think about these sources of error and how they are always kept in balance.

More bias in an algorithm means that there is less variance, and the reverse is also true.

You can learn more about the bias-variance tradeoff in the post Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning.

You can control this balance.

Many machine learning algorithms have hyperparameters that directly or indirectly allow you to control the bias-variance tradeoff.

One example is the *k* in *k*-nearest neighbors. A small *k* results in predictions with high variance and low bias. A large *k* results in predictions with a small variance and a large bias.
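The effect of *k* can be seen in a small simulation. The sketch below is a toy illustration in plain Python, assuming a one-dimensional regression problem where the true function is `sin(x)` with Gaussian noise. The problem, the query point `x0 = 1.5`, and the function names are all assumptions made for the demonstration. It refits a k-nearest-neighbors regressor on many freshly sampled training sets and records the mean and variance of its prediction at the same point.

```python
import math
import random
import statistics

def knn_predict(train, x0, k):
    """Average the targets of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def sample_training_set(rng, n=50):
    """Draw a fresh noisy sample from y = sin(x) + noise (an assumed toy problem)."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 6.0)
        rows.append((x, math.sin(x) + rng.gauss(0.0, 0.5)))
    return rows

def prediction_stats(k, x0=1.5, repeats=200, seed=1):
    """Mean and variance of the prediction at x0 across resampled training sets."""
    rng = random.Random(seed)
    preds = [knn_predict(sample_training_set(rng), x0, k) for _ in range(repeats)]
    return statistics.mean(preds), statistics.variance(preds)

true_value = math.sin(1.5)
mean_k1, var_k1 = prediction_stats(k=1)
mean_k25, var_k25 = prediction_stats(k=25)

print(f"k=1:  bias={mean_k1 - true_value:+.3f} variance={var_k1:.3f}")
print(f"k=25: bias={mean_k25 - true_value:+.3f} variance={var_k25:.3f}")
```

With `k=1` the prediction follows a single noisy neighbor, so it varies a lot between training sets but is centered near the true value. With `k=25` the average smooths over a wide window, which stabilizes the prediction but pulls it away from the true value at this point.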

Most final models have a problem: they suffer from variance.

Each time a model is trained by an algorithm with high variance, you will get a slightly different result.

The slightly different model in turn will make slightly different predictions, for better or worse.

This is a problem with training a final model as we are required to use the model to make predictions on real data where we do not know the answer and we want those predictions to be as good as possible.

We want the best possible version of the model that we can get.

We want the variance to play out in our favor.

If we can’t achieve that, at least we want the variance to not fall against us when making predictions.

There are two common sources of variance in a final model:

- The noise in the training data.
- The use of randomness in the machine learning algorithm.

The first type we introduced above.

The second type impacts those algorithms that harness randomness during learning.

Three common examples include:

- Choice of random split points in random forest.
- Random weight initialization in neural networks.
- Shuffling training data in stochastic gradient descent.

You can measure both types of variance in your specific model using your training data.

- **Measure Algorithm Variance**: The variance introduced by the stochastic nature of the algorithm can be measured by repeating the evaluation of the algorithm on the same training dataset and calculating the variance or standard deviation of the model skill.
- **Measure Training Data Variance**: The variance introduced by the training data can be measured by repeating the evaluation of the algorithm on different samples of training data, but keeping the seed for the pseudorandom number generator fixed, then calculating the variance or standard deviation of the model skill.

Often, the combined variance is estimated by running repeated k-fold cross-validation on a training dataset then calculating the variance or standard deviation of the model skill.
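The two measurements can be sketched as follows. This is a minimal illustration in plain Python, assuming a toy stochastic learner (`fit_sgd`, a one-feature linear model trained by stochastic gradient descent with random initialization and shuffling) and a synthetic dataset. The learner, the data generator, the use of bootstrap samples as the "different samples of training data", and the choice of 30 repeats are all assumptions made for the example.

```python
import random
import statistics

def make_data(n, seed):
    """Synthetic data (assumed for the demo): y = 3x plus Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append((x, 3.0 * x + rng.gauss(0.0, 0.3)))
    return rows

def fit_sgd(rows, seed, epochs=5, lr=0.1):
    """Toy stochastic learner: random initial weights plus shuffled SGD updates."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(model, rows):
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)

train = make_data(100, seed=1)
test = make_data(1000, seed=2)

# Algorithm variance: same training data, a different seed on every run.
algo_scores = [mse(fit_sgd(train, seed=s), test) for s in range(30)]

# Training data variance: a different bootstrap sample on every run, fixed seed.
rng = random.Random(3)
data_scores = []
for _ in range(30):
    sample = [rng.choice(train) for _ in train]
    data_scores.append(mse(fit_sgd(sample, seed=0), test))

algo_std = statistics.stdev(algo_scores)
data_std = statistics.stdev(data_scores)
print(f"algorithm std={algo_std:.5f} training data std={data_std:.5f}")
```

Holding one source of randomness fixed while varying the other is what lets the two standard deviations be attributed to separate causes.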

If we want to reduce the amount of variance in a prediction, we must add bias.

Consider the case of a simple statistical estimate of a population parameter, such as estimating the mean from a small random sample of data.

A single estimate of the mean will have high variance and low bias.

This is intuitive because if we repeated this process 30 times and calculated the standard deviation of the estimated mean values, we would see a large spread.

The solutions for reducing the variance are also intuitive.

Repeat the estimate on many different small samples of data from the domain and calculate the mean of the estimates, leaning on the central limit theorem.

The mean of the estimated means will have a lower variance. We have increased the bias by assuming that the average of the estimates will be a more accurate estimate than a single estimate.

Another approach would be to dramatically increase the size of the data sample on which we estimate the population mean, leaning on the law of large numbers.
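Both remedies can be checked with a short simulation. The sketch below is a plain Python illustration, assuming a Gaussian population with mean 50 and standard deviation 10, a "small" sample of 20 observations, 10 repeated estimates, and a "large" sample of 200 observations. All of these numbers and the helper names are assumptions chosen for the demonstration.

```python
import random
import statistics

POP_MEAN, POP_STD = 50.0, 10.0  # assumed population for the demonstration

def sample_mean(rng, n):
    """Estimate the population mean from one random sample of size n."""
    return statistics.mean([rng.gauss(POP_MEAN, POP_STD) for _ in range(n)])

def mean_of_sample_means(rng, n, n_estimates):
    """Average several small-sample estimates into a single estimate."""
    return statistics.mean([sample_mean(rng, n) for _ in range(n_estimates)])

def spread_of(estimator, repeats=300, seed=1):
    """Standard deviation of an estimator across repeated experiments."""
    rng = random.Random(seed)
    return statistics.stdev([estimator(rng) for _ in range(repeats)])

# One estimate from one small sample.
single = spread_of(lambda rng: sample_mean(rng, 20))

# The mean of 10 estimates, each from a different small sample.
mean_of_means = spread_of(lambda rng: mean_of_sample_means(rng, 20, 10))

# One estimate from a much larger sample.
large_sample = spread_of(lambda rng: sample_mean(rng, 200))

print(f"single={single:.2f} mean of means={mean_of_means:.2f} large={large_sample:.2f}")
```

In this setup the averaged estimate and the large-sample estimate use the same total number of observations, so their spreads should come out similar, and both should be well below the spread of a single small-sample estimate.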

The principles used to reduce the variance for a population statistic can also be used to reduce the variance of a final model.

We must add bias.

Depending on the specific form of the final model (e.g. tree, weights, etc.) you can get creative with this idea.

Below are three approaches that you may want to try.

If possible, I recommend designing a test harness to experiment and discover an approach that works best or makes the most sense for your specific data set and machine learning algorithm.

Instead of fitting a single final model, you can fit multiple final models.

Together, the group of final models may be used as an ensemble.

For a given input, each model in the ensemble makes a prediction and the final output prediction is taken as the average of the predictions of the models.

A sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.
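A minimal sketch of this idea follows, in plain Python. It assumes a toy stochastic learner (`fit_sgd`, a one-feature linear model trained by stochastic gradient descent) on synthetic data. The learner, the data, the ensemble size of 10, and the 30 repeats are hypothetical choices for illustration. The code compares how much the prediction for one new input varies across repeated fits of a single final model with how much it varies across repeated fits of a whole ensemble.

```python
import random
import statistics

def make_data(n, seed):
    """Synthetic data (assumed for the demo): y = 3x plus Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        rows.append((x, 3.0 * x + rng.gauss(0.0, 0.3)))
    return rows

def fit_sgd(rows, seed, epochs=5, lr=0.1):
    """Toy stochastic learner: random initial weights plus shuffled SGD updates."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def predict(model, x):
    w, b = model
    return w * x + b

def ensemble_predict(models, x):
    """The ensemble output is the average of the member predictions."""
    return statistics.mean([predict(m, x) for m in models])

train = make_data(100, seed=1)  # all available data
x_new = 0.5                     # a new input to predict

# Spread of the prediction when a single final model is refit 30 times.
single_preds = [predict(fit_sgd(train, seed=s), x_new) for s in range(30)]

# Spread of the prediction when a 10-member ensemble is refit 30 times.
ensemble_preds = []
for r in range(30):
    members = [fit_sgd(train, seed=1000 + 10 * r + m) for m in range(10)]
    ensemble_preds.append(ensemble_predict(members, x_new))

single_std = statistics.stdev(single_preds)
ensemble_std = statistics.stdev(ensemble_preds)
print(f"single std={single_std:.4f} ensemble std={ensemble_std:.4f}")
```

Repeating the last loop with different ensemble sizes is one way to run the sensitivity analysis mentioned above.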

As above, multiple final models can be created instead of a single final model.

Instead of calculating the mean of the predictions from the final models, a single final model can be constructed as an ensemble of the parameters of the group of final models.

This would only make sense in cases where each model has the same number of parameters, such as neural network weights or regression coefficients.

For example, consider a linear regression model with three coefficients [b0, b1, b2]. We could fit a group of linear regression models and calculate a final b0 as the average of b0 parameters in each model, and repeat this process for b1 and b2.

Again, a sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.
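The coefficient-averaging example can be sketched as below. This is a minimal illustration in plain Python with a two-coefficient model `[b0, b1]` for brevity. Because closed-form least squares is deterministic, the sketch assumes the group of models is produced by fitting on bootstrap samples of the data, which is one possible choice and is not specified in the text above. The synthetic data and the function name `fit_line` are also assumptions.

```python
import math
import random

def fit_line(rows):
    """Closed-form least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    b1 = (sum((x - mx) * (y - my) for x, y in rows)
          / sum((x - mx) ** 2 for x, _ in rows))
    return my - b1 * mx, b1

# Synthetic data (assumed for the demo): y = 1 + 2x plus Gaussian noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)))

# Fit a group of models, each on a bootstrap sample of the data (an assumption).
models = [fit_line([rng.choice(data) for _ in data]) for _ in range(10)]

# Average each coefficient across the group to get one final model.
b0_avg = sum(b0 for b0, _ in models) / len(models)
b1_avg = sum(b1 for _, b1 in models) / len(models)

# For a linear model, predicting with averaged coefficients matches
# averaging the predictions of the individual models.
x_new = 4.0
pred_from_avg_params = b0_avg + b1_avg * x_new
avg_of_preds = sum(b0 + b1 * x_new for b0, b1 in models) / len(models)
print(round(b0_avg, 3), round(b1_avg, 3), round(pred_from_avg_params, 3))
```

The equivalence between the two predictions holds because the model is linear in its parameters. For nonlinear models such as neural networks, averaging weights and averaging predictions generally give different results.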

Leaning on the law of large numbers, perhaps the simplest approach to reduce the model variance is to fit the model on more training data.

In those cases where more data is not readily available, perhaps data augmentation methods can be used instead.

A sensitivity analysis of training dataset size to prediction variance is recommended to find the point of diminishing returns.

There are approaches to preparing a final model that aim to get the variance in the final model to work for you rather than against you.

The commonality in these approaches is that they seek a single best final model.

Two examples include:

- **Why not fix the random seed?** You could fix the random seed when fitting the final model. This will constrain the variance introduced by the stochastic nature of the algorithm.
- **Why not use early stopping?** You could check the skill of the model against a holdout set during training and stop training when the skill of the model on the holdout set starts to degrade.

I would argue that these approaches and others like them are fragile.

Perhaps you can gamble and aim for the variance to play out in your favor. This might be a good approach for machine learning competitions where there is no real downside to losing the gamble.

I won’t.

I think it’s safer to aim for the best average performance and limit the downside.

I think that the trick with navigating the bias-variance tradeoff for a final model is to think in samples, not in terms of single models, and to optimize for average model performance.

This section provides more resources on the topic if you are looking to go deeper.

- How to Train a Final Machine Learning Model
- Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning
- Bias-Variance Tradeoff on Wikipedia
- Checkpoint Ensembles: Ensemble Methods from a Single Training Process, 2017.

In this post, you discovered how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

Specifically, you learned:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How To Know if Your Machine Learning Model Has Good Performance appeared first on Machine Learning Mastery.

How do you know whether your machine learning model has good performance? This is a common question I am asked by beginners.

As a beginner, you often seek an answer to this question, e.g. you want someone to tell you whether an accuracy of *x*% or an error score of *x* is good or not.

In this post, you will discover how to answer this question for yourself definitively and know whether your model skill is good or not.

After reading this post, you will know:

- That a baseline model can be used to discover the bedrock in performance on your problem by which all other models can be evaluated.
- That all predictive models contain errors and that a perfect score is not possible in practice given the stochastic nature of data and algorithms.
- That the true job of applied machine learning is to explore the space of possible models and discover what a good model score looks like relative to the baseline on your specific dataset.

Let’s get started.

This post is divided into 4 parts; they are:

- Model Skill Is Relative
- Baseline Model Skill
- What Is the Best Score?
- Discover Limits of Model Skill

Your predictive modeling problem is unique.

This includes the specific data you have, the tools you’re using, and the skill you will achieve.

Your predictive modeling problem has not been solved before. Therefore, we cannot know what a good model looks like or what skill it might have.

You may have ideas of what a skillful model looks like based on knowledge of the domain, but you don’t know whether those skill scores are achievable.

The best that we can do is to compare the performance of machine learning models on your specific data to other models also trained on the same data.

Machine learning model performance is relative and ideas of what score a good model can achieve only make sense and can only be interpreted in the context of the skill scores of other models also trained on the same data.

Because machine learning model performance is relative, it is critical to develop a robust baseline.

A baseline is a simple and well understood procedure for making predictions on your predictive modeling problem. The skill of this model provides the bedrock for the lowest acceptable performance of a machine learning model on your specific dataset.

The results for the baseline model provide the point from which the skill of all other models trained on your data can be evaluated.

Three examples of baseline models include:

- Predict the mean outcome value for a regression problem.
- Predict the mode outcome value for a classification problem.
- Predict the input as the output (called persistence) for a univariate time series forecasting problem.

The baseline performance on your problem can then be used as the yardstick by which all other models can be compared and evaluated.

If a model achieves a performance below the baseline, something is wrong (e.g. there’s a bug) or the model is not appropriate for your problem.
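The three baselines listed above are simple enough to write by hand. The sketch below is a plain Python illustration; the function names, the tiny example datasets, and the choice of accuracy and mean absolute error as measures are assumptions made for the example.

```python
from collections import Counter

def mean_baseline(train_targets):
    """Regression baseline: always predict the mean training outcome."""
    return sum(train_targets) / len(train_targets)

def mode_baseline(train_labels):
    """Classification baseline: always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def persistence_forecast(series):
    """Time series baseline: predict each value as the previous observation.

    Returns (predictions, actuals) aligned one-to-one.
    """
    return series[:-1], series[1:]

def accuracy(preds, actuals):
    return sum(p == a for p, a in zip(preds, actuals)) / len(actuals)

def mean_absolute_error(preds, actuals):
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(actuals)

# Regression (hypothetical data): predict the mean of the training targets.
reg_prediction = mean_baseline([2.0, 4.0, 6.0])
reg_mae = mean_absolute_error([reg_prediction] * 2, [3.0, 7.0])

# Classification (hypothetical data): predict the mode of the training labels.
label = mode_baseline(["spam", "ham", "ham", "ham", "spam"])
test_labels = ["ham", "spam", "ham", "ham"]
clf_accuracy = accuracy([label] * len(test_labels), test_labels)

# Univariate time series (hypothetical data): persistence.
preds, actuals = persistence_forecast([10, 12, 11, 13])
ts_mae = mean_absolute_error(preds, actuals)

print(reg_mae, clf_accuracy, ts_mae)
```

Scores like these become the yardstick: any candidate model evaluated on the same data with the same measure should do better before it is taken seriously.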

If you are working on a classification problem, the best score is 100% accuracy.

If you are working on a regression problem, the best score is 0.0 error.

These scores are impossible-to-achieve upper and lower bounds. All predictive modeling problems have prediction error. Expect it. The error comes from a range of sources such as:

- Incompleteness of data sample.
- Noise in the data.
- Stochastic nature of the modeling algorithm.

You cannot achieve the best score, but it is good to know what the best possible performance is for your chosen measure. You know that true model performance will fall within a range between the baseline and the best possible score.

Instead, you must search the space of possible models on your dataset and discover what good and bad scores look like.

Once you have the baseline, you can explore the extent of model performance on your predictive modeling problem.

In fact, this is the hard work and the objective of the project: to find a model that you can demonstrate works reliably well in making predictions on your specific dataset.

There are many strategies to this problem; two that you may wish to consider are:

- **Start High**. Select a machine learning method that is sophisticated and known to perform well on a range of predictive model problems, such as random forest or gradient boosting. Evaluate the model on your problem and use the result as an approximate top-end benchmark, then find the simplest model that achieves similar performance.
- **Exhaustive Search**. Evaluate all of the machine learning methods that you can think of on the problem and select the method that achieves the best performance relative to the baseline.

The “*Start High*” approach is fast and can help you define the bounds of model skill to expect on the problem and find a simple model (in the spirit of Occam’s Razor) that can achieve similar results. It can also help you quickly find out whether the problem is solvable/predictable, which is important because not all problems are predictable.

The “*Exhaustive Search*” is slow and is really intended for long-running projects where model skill is more important than almost any other concern. I often perform variations of this approach testing suites of similar methods in batches and call it the spot-checking approach.

Both methods will give you a population of model performance scores that you can compare to the baseline.

You will know what a good score looks like and what a bad score looks like.

This section provides more resources on the topic if you are looking to go deeper.

- How to Make Baseline Predictions for Time Series Forecasting with Python
- How To Implement Baseline Machine Learning Algorithms From Scratch With Python
- Machine Learning Performance Improvement Cheat Sheet

In this post, you discovered that your predictive modeling problem is unique and that good model performance scores can only be known relative to a baseline performance.

Specifically, you learned:

- That a baseline model can be used to discover the bedrock in performance on your problem by which all other models can be evaluated.
- That all predictive models contain error and that a perfect score is not possible in practice given the stochastic nature of data and algorithms.
- That the true job of applied machine learning is to explore the space of possible models and discover what a good model score looks like relative to the baseline on your specific dataset.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post The Model Performance Mismatch Problem (and what to do about it) appeared first on Machine Learning Mastery.

The procedure when evaluating machine learning models is to fit and evaluate them on training data, then verify that the model has good skill on a held-back test dataset.

Often, you will get a very promising performance when evaluating the model on the training dataset and poor performance when evaluating the model on the test set.

In this post, you will discover techniques and issues to consider when you encounter this common problem.

After reading this post, you will know:

- The problem of model performance mismatch that may occur when evaluating machine learning algorithms.
- The causes of overfitting, under-representative data samples, and stochastic algorithms.
- Ways to harden your test harness to avoid the problem in the first place.

This post was based on a reader question; thanks! Keep the questions coming!

Let’s get started.

This post is divided into 4 parts; they are:

- Model Evaluation
- Model Performance Mismatch
- Possible Causes and Remedies
- More Robust Test Harness

When developing a model for a predictive modeling problem, you need a test harness.

The test harness defines how the sample of data from the domain will be used to evaluate and compare candidate models for your predictive modeling problem.

There are many ways to structure a test harness, and no single best way for all projects.

One popular approach is to use a portion of data for fitting and tuning the model and a portion for providing an objective estimate of the skill of the tuned model on out-of-sample data.

The data sample is split into a training and test dataset. The model is evaluated on the training dataset using a resampling method such as k-fold cross-validation, and the set itself may be further divided into a validation dataset used to tune the hyperparameters of the model.

The test set is held back and used to evaluate and compare tuned models.

For more on training, validation, and test sets, see the post What is the Difference Between Test and Validation Datasets?

The resampling method will give you an estimate of the skill of your model on unseen data by using the training dataset.

The test dataset provides a second data point and ideally an objective idea of how well the model is expected to perform, corroborating the estimated model skill.

What if the estimate of model skill on the training dataset does not match the skill of the model on the test dataset?

The scores will not match in general. We do expect some differences because some small overfitting of the training dataset is inevitable given hyperparameter tuning, making the training scores optimistic.

But what if the difference is worryingly large?

- Which score do you trust?
- Can you still compare models using the test dataset?
- Is the model tuning process invalidated?

It is a challenging and very common situation in applied machine learning.

We can call this concern the “*model performance mismatch*” problem.

**Note**: ideas of “*large differences*” in model performance are relative to your chosen performance measures, datasets, and models. We cannot talk objectively about differences in general, only relative differences that you must interpret yourself.
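One way to make "worryingly large" concrete for your own project is to compare the gap between the two scores with the spread of the cross-validation scores. The sketch below is a plain Python illustration. The function name `mismatch_report`, the threshold of two standard deviations, and the fold scores are all hypothetical, and the threshold is a judgment call that you would need to set for your own measure and data, not an established rule.

```python
import statistics

def mismatch_report(cv_scores, test_score, n_std=2.0):
    """Compare a test score with the mean and spread of cross-validation scores.

    Flags the result as suspicious when the gap exceeds n_std standard
    deviations of the fold scores (an arbitrary, adjustable threshold).
    """
    cv_mean = statistics.mean(cv_scores)
    cv_std = statistics.stdev(cv_scores)
    gap = test_score - cv_mean
    return {
        "cv_mean": cv_mean,
        "cv_std": cv_std,
        "gap": gap,
        "suspicious": abs(gap) > n_std * cv_std,
    }

# Hypothetical accuracy scores from 5-fold cross-validation on the training set.
cv_scores = [0.81, 0.79, 0.83, 0.80, 0.82]

close_match = mismatch_report(cv_scores, test_score=0.80)
large_gap = mismatch_report(cv_scores, test_score=0.65)

print(close_match["suspicious"], large_gap["suspicious"])
```

A flag from a check like this is only a prompt to investigate the causes discussed in this post; it does not say which of the two scores to trust.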

There are many possible causes for the model performance mismatch problem.

Ultimately, your goal is to have a test harness that you know allows you to make good decisions regarding which model and model configuration to use as a final model.

In this section, we will look at some possible causes, diagnostics, and techniques you can use to investigate the problem.

Let’s look at three main areas: model overfitting, the quality of the data sample, and the stochastic nature of the learning algorithm.

Perhaps the most common cause is that you have overfit the training data.

You have hit upon a model, a set of model hyperparameters, a view of the data, or a combination of these elements and more that just so happens to give a good skill estimate on the training dataset.

The use of k-fold cross-validation will help to some degree. Tuning models with a separate validation dataset will help too. Nevertheless, it is possible to keep pushing and overfit the training dataset.

If this is the case, the test skill may be more representative of the true skill of the chosen model and configuration.

One simple (but not easy) way to diagnose whether you have overfit the training dataset is to get another data point on model skill. Evaluate the chosen model on another set of data. For example, some ideas to try include:

- Try a k-fold cross-validation evaluation of the model on the test dataset.
- Try fitting the model on the training dataset and evaluating it on both the test dataset and a new data sample.

If you’re overfit, you have options.

- Perhaps you can scrap your current training dataset and collect a new training dataset.
- Perhaps you can re-split your sample into train/test in a softer approach to getting a new training dataset.

I would suggest that the results you have obtained to date are suspect and should be reconsidered, especially those where you may have spent too long tuning.

Overfitting may be the ultimate cause for the discrepancy in model scores, though it may not be the area to attack first.

It is possible that your training or test datasets are an unrepresentative sample of data from the domain.

This means that the sample size is too small or the examples in the sample do not effectively “cover” the cases observed in the broader domain.

This can be easy to spot if you see noisy model performance results. For example:

- A large variance on cross-validation scores.
- A large variance on similar model types on the test dataset.

In addition, you will see the discrepancy between train and test scores.

Another good check is to compare summary statistics for each variable on the train and test sets, and ideally on the cross-validation folds. You are looking for large differences in sample means and standard deviations.
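A summary-statistics check of this kind might look like the sketch below. It assumes NumPy and scikit-learn, uses synthetic data in place of your own sample, and expresses the difference in means in units of the training standard deviation, which is just one convenient choice for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for your own sample (an assumption for illustration).
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Per-variable summary statistics for each split.
train_mean, test_mean = X_train.mean(axis=0), X_test.mean(axis=0)
train_std, test_std = X_train.std(axis=0), X_test.std(axis=0)

# Express each difference in means in units of the training standard deviation.
mean_shift = np.abs(train_mean - test_mean) / train_std

for i in range(X.shape[1]):
    print(
        f"var {i}: mean {train_mean[i]:+.3f} vs {test_mean[i]:+.3f}, "
        f"std {train_std[i]:.3f} vs {test_std[i]:.3f}, "
        f"shift {mean_shift[i]:.3f}"
    )
```

What counts as a worrying shift is relative to your data and problem; the numbers are only a prompt for closer inspection.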

The remedy is often to get a larger and more representative sample of data from the domain. Alternatively, use more discriminating methods when preparing the data sample and splits. Think stratified k-fold cross-validation, but applied to input variables in an attempt to maintain population means and standard deviations for real-valued variables in addition to the distribution of categorical variables.

Often when I see overfitting on a project, it is because the test harness is not as robust as it should be, not because of hill climbing the test dataset.

It is possible that you are seeing a discrepancy in model scores because of the stochastic nature of the algorithm.

Many machine learning algorithms involve a stochastic component. For example, the random initial weights in a neural network, the shuffling of data and in turn the gradient updates in stochastic gradient descent, and much more.

This means that each time the same algorithm is run on the same data, different sequences of random numbers are used and, in turn, a different model with different skill will result.
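To make this concrete, here is a small sketch, assuming scikit-learn, that fits the same algorithm on the same data with five different random seeds. The choice of `SGDClassifier` and the synthetic dataset are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# The same fixed dataset and split are used for every run.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

scores = []
for seed in range(5):
    # Only the seed changes; it controls the shuffling of data during training.
    model = SGDClassifier(max_iter=1000, tol=1e-3, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

for seed, score in enumerate(scores):
    print(f"seed={seed}: test accuracy {score:.3f}")
print(f"range: {max(scores) - min(scores):.3f}")
```

Any spread in the printed scores comes from the learning algorithm alone, because the data and the split are held constant.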

You can learn more about this in the post “Embrace Randomness in Machine Learning”.

This issue can be seen by the variance in model skill scores from cross-validation, much like having an unrepresentative data sample.

The difference here is that the variance can be cleared up by repeating the model evaluation process, e.g. cross-validation, in order to control for the randomness in training the model.

This is often called repeated k-fold cross-validation and is used for neural networks and stochastic optimization algorithms, when resources permit.
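One way to do this, assuming scikit-learn, is with `RepeatedStratifiedKFold`. The sketch below uses 10 folds and 3 repeats on a synthetic dataset; both numbers and the `SGDClassifier` model are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# 10-fold cross-validation repeated 3 times with different splits: 30 scores in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# No random_state on the model, so each fit uses fresh randomness.
model = SGDClassifier(max_iter=1000, tol=1e-3)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"Scores collected: {len(scores)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over the repeats gives a steadier estimate of skill than any single run of k-fold cross-validation.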

I have more on this approach to evaluating models in the post “How to Evaluate the Skill of Deep Learning Models”.

A lot of these problems can be addressed early by designing a robust test harness and then gathering evidence to demonstrate that indeed your test harness is robust.

This might include running experiments before you get started evaluating models for real. Experiments such as:

- A sensitivity analysis of train/test splits.
- A sensitivity analysis of k values for cross-validation.
- A sensitivity analysis of a given model’s behavior.
- A sensitivity analysis on the number of repeats.

On this last point, see the post “Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms”.

You are looking for:

- Low variance and consistent mean in evaluation scores between tests in a cross-validation.
- Correlated population means between model scores on train and test sets.

Use statistical tools like standard error and significance tests if needed.

Use a modern and un-tuned model that performs well in general for such testing, such as random forest.

- If you discover a difference in skill scores between training and test sets, and it is consistent, that may be fine. You know what to expect.
- If you measure a variance in mean skill scores within a given test, you have error bars you can use to interpret the results.
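As a rough sketch of one such experiment, the code below runs a sensitivity analysis of k values for cross-validation using an un-tuned random forest and reports the mean and standard error for each k. It assumes scikit-learn and NumPy; the dataset and the values of k are arbitrary. Note that the standard error is only approximate here, because cross-validation folds share training data and are not independent.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# An un-tuned model that tends to perform well in general.
model = RandomForestClassifier(n_estimators=50, random_state=1)

results = {}
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Approximate standard error of the mean across folds.
    std_err = scores.std(ddof=1) / np.sqrt(len(scores))
    results[k] = (scores.mean(), std_err)
    print(f"k={k:2d}: mean accuracy {scores.mean():.3f}, standard error {std_err:.3f}")
```

If the mean is stable across values of k and the error bars are small, that is some evidence the harness is behaving consistently.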

I would go so far as to say that without a robust test harness, the results you achieve will be a mess. You will not be able to effectively interpret them. There will be an element of risk (or fraud, if you’re an academic) in the presentation of the outcomes from a fragile test harness. And reproducibility/robustness is a massive problem in numerical fields like applied machine learning.

Finally, avoid using the test dataset too much. Once you have strong evidence that your harness is robust, do not touch the test dataset until it comes time for final model selection.

This section provides more resources on the topic if you are looking to go deeper.

- What is the Difference Between Test and Validation Datasets?
- Embrace Randomness in Machine Learning
- Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms
- How to Evaluate the Skill of Deep Learning Models

In this post, you discovered the model performance mismatch problem where model performance differs greatly between training and test sets, and techniques to diagnose and address the issue.

Specifically, you learned:

- The problem of model performance mismatch that may occur when evaluating machine learning algorithms.
- The causes of overfitting, under-representative data samples, and stochastic algorithms.
- Ways to harden your test harness to avoid the problem in the first place.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post The Model Performance Mismatch Problem (and what to do about it) appeared first on Machine Learning Mastery.

The post So, You are Working on a Machine Learning Problem… appeared first on Machine Learning Mastery.

I want to really nail down where you’re at right now.

Let me make some guesses…

So you have a problem that you need to solve.

Maybe it’s your problem, an idea you have, a question, or something you want to address.

Or maybe it is a problem that was provided to you by someone else, such as a supervisor or boss.

This problem involves some historical data you have or can access. It also involves some predictions required from new or related data in the future.

Let’s dig deeper.

Let’s look at your problem in more detail.

You have historical data.

You have observations about something, like customers, voltages, prices, etc. collected over time.

You also have some outcome related to each observation, maybe a label like “*good*” or “*bad*” or maybe a quantity like *50.1*.

The problem you want to solve is, given new observations in the future, what is the most likely related outcome?

So far so good?

You need a program. A piece of software.

You need a thing that will take observational data as input and give you the most likely outcome as output.

The outcomes provided by the program need to be right, or really close to right. The program needs to be skillful at providing good outcomes for observations.

With such a piece of software, you could run it multiple times for each observation you have.

You could integrate it into some other software, like an app or webpage, and make use of it.

Am I right?

You want to solve this problem with machine learning or artificial intelligence, or something.

Someone told you to use machine learning or you just think it is the right tool for this job.

But, it’s confusing.

- How do you use machine learning on problems like this?
- Where do you start?
- What math do you need to know before solving this problem?

Does this describe you?

Or maybe you’ve started working on your problem, but you’re stuck.

- What data transforms should you use?
- What algorithm should you use?
- What algorithm configurations should you use?

Is this a better fit for where you’re at?

I am thinking about writing a step-by-step playbook that will walk you through the process of defining your problem, preparing your data, selecting algorithms, and ultimately developing a final model that you can use to make predictions for your problem.

But to make this playbook as useful as possible, I need to know where you are having trouble in this process.

Please, describe where you’re stuck in the comments below.

Share your story. Or even just a small piece.

I promise to read every single one, and even offer advice where possible.

If you are struggling, I strongly recommend following this process when working through a predictive modeling problem:

The post A Gentle Introduction to Applied Machine Learning as a Search Problem appeared first on Machine Learning Mastery.

There is no best training data or best algorithm for your problem, only the best that you can discover.

The application of machine learning is best thought of as a search problem for the best mapping of inputs to outputs given the knowledge and resources available to you for a given project.

In this post, you will discover the conceptualization of applied machine learning as a search problem.

After reading this post, you will know:

- That applied machine learning is the problem of approximating an unknown underlying mapping function from inputs to outputs.
- That design decisions such as the choice of data and choice of algorithm narrow the scope of possible mapping functions that you may ultimately choose.
- That the conceptualization of machine learning as a search helps to rationalize the use of ensembles, the spot checking of algorithms and the understanding of what is happening when algorithms learn.

Let’s get started.

This post is divided into 5 parts; they are:

- Problem of Function Approximation
- Function Approximation as Search
- Choice of Data
- Choice of Algorithm
- Implications of Machine Learning as Search

Applied machine learning is the development of a learning system to address a specific learning problem.

The learning problem is characterized by observations comprised of input data and output data and some unknown but coherent relationship between the two.

The goal of the learning system is to learn a generalized mapping between input and output data such that skillful predictions can be made for new instances drawn from the domain where the output variable is unknown.

In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (*f*) given input data (*X*) and associated output data (*y*).

y = f(X)

We have a sample of *X* and *y* and do our best to come up with a function that approximates *f*, e.g. *fprime*, such that we can make predictions (*yhat*) given new examples (*Xhat*) in the future.

yhat = fprime(Xhat)

As such, applied machine learning can be thought of as the problem of function approximation.

The learned mapping will be imperfect.
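The idea can be illustrated with a toy sketch, assuming NumPy. Here the "unknown" function *f* is a sine wave that we pretend not to know, and *fprime* is a degree-5 polynomial fit to noisy samples. Both choices are assumptions made purely for illustration; in a real problem *f* is never available for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)


def f(X):
    # The true mapping. In practice this is unknown; it is defined here
    # only so that the approximation error can be measured.
    return np.sin(X)


# A noisy sample of inputs X and outputs y.
X = rng.uniform(0, 2 * np.pi, size=200)
y = f(X) + rng.normal(0, 0.1, size=200)

# Approximate f with a polynomial, giving fprime.
fprime = np.poly1d(np.polyfit(X, y, deg=5))

# Make predictions yhat for new examples Xhat.
Xhat = np.linspace(0, 2 * np.pi, 50)
yhat = fprime(Xhat)

# The learned mapping is imperfect: measure how far it is from f.
rmse = np.sqrt(np.mean((yhat - f(Xhat)) ** 2))
print(f"RMSE of fprime against the true f: {rmse:.4f}")
```

The error is small but not zero, which is the sense in which the learned mapping is an approximation.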

The problem of designing and developing a learning system is the problem of learning a useful approximate of the unknown underlying function that maps the input variables to the output variables.

We do not know the form of the function, because if we did, we would not need a learning system; we could specify the solution directly.

Because we do not know the true underlying function, we must approximate it, meaning we do not know and may never know how close of an approximation the learning system is to the true mapping.

We must search for an approximation of the true underlying function that is good enough for our purposes.

There are many sources of noise that introduce error into the learning process that can make the process more challenging and in turn result in a less useful mapping. For example:

- The choice of the framing of the learning problem.
- The choice of the observations used to train the system.
- The choice of how the training data is prepared.
- The choice of the representational form for the predictive model.
- The choice of the learning algorithm to fit the model on the training data.
- The choice of the performance measure by which to evaluate predictive skill.

And so much more.

You can see that there are many decision points in the development of a learning system, and none of the answers are known beforehand.

You can think of all possible learning systems for a learning problem as a huge search space, where each decision point narrows the search.

For example, if the learning problem was to predict the species of flowers, one of millions of possible learning systems could be narrowed down as follows:

- Choose to frame the problem as predicting a species class label, e.g. classification.
- Choose measurements of the flowers of a given species and their associated sub-species.
- Choose flowers in one specific nursery to measure in order to collect training data.
- Choose a decision tree model representation so that predictions can be explained to stakeholders.
- Choose the CART algorithm to fit the decision tree model.
- Choose classification accuracy to evaluate the skill of models.

And so on.
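The narrowed-down learning system above can be sketched with scikit-learn's built-in iris flowers dataset, which is an assumption standing in for the nursery measurements. scikit-learn's `DecisionTreeClassifier` is documented as using an optimized version of CART; the depth limit and the split size are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Framing: classification of a species label from flower measurements.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Representation and algorithm: a shallow decision tree fit with CART.
model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(X_train, y_train)

# Performance measure: classification accuracy.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Classification accuracy: {accuracy:.3f}")
```

Each line of the sketch corresponds to one decision point in the list, and changing any one of them selects a different learning system from the search space.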

You can also see that there may be a natural hierarchy for many of the decisions involved in developing a learning system, each of which further narrows the space of possible learning systems that we could build.

This narrowing introduces a useful bias that intentionally selects one subset of possible learning systems over another with the goal of getting closer to a useful mapping that we can use in practice. This biasing applies both at the top level in the framing of the problem and at low levels, such as the choice of machine learning algorithm or algorithm configuration.

The chosen framing of the learning problem and the data used to train the system are a big point of leverage in the development of your learning system.

You do not have access to all data: that is all pairs of inputs and outputs. If you did, you would not need a predictive model in order to make output predictions for new input observations.

You do have some historical input-output pairs. If you didn’t, you would not have any data with which to train a predictive model.

But maybe you have a lot of data and you need to select only some of it for training. Or maybe you have the freedom to generate data at will and are challenged by what and how much data to generate or collect.

The data that you choose to model your learning system on must sufficiently capture the relationship between the input and output data for both the data that you have available and data that the model will be expected to make predictions on in the future.

You must choose the representation of the model and the algorithm used to fit the model on the training data. This, again, is another big point of leverage on the development of your learning system.

Often this decision is simplified to the selection of an algorithm, although it is common for the project stakeholders to impose constraints on the project, such as the model being able to explain predictions which in turn imposes constraints on the form of the final model representation and in turn on the scope of mappings that you can search.

This conceptualization of developing learning systems as a search problem helps to make clear many related concerns in applied machine learning.

This section looks at a few.

The algorithm used to learn the mapping will impose further constraints, and it, along with the chosen algorithm configuration, will control how the space of possible candidate mappings is navigated as the model is fit (e.g. for machine learning algorithms that learn iteratively).

Here, we can see that the act of learning from training data by a machine learning algorithm is in effect navigating the space of possible mappings for the learning system, hopefully moving from poor mappings to better mappings (e.g. hill climbing).

This provides a conceptual rationale for the role of optimization algorithms in the heart of machine learning algorithms to get the most out of the model representation for the specific training data.

We can also see that different model representations will occupy quite different locations in the space of all possible function mappings, and in turn have quite different behavior when making predictions (e.g. uncorrelated prediction errors).

This provides a conceptual rationale for the role of ensemble methods that combine the predictions from different but skillful predictive models.
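A small sketch of this idea, assuming scikit-learn, combines three models with different representations using hard voting. The member models and the synthetic dataset are illustrative choices, and there is no guarantee the ensemble outperforms every member on a given problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Three models with quite different representations.
members = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier(random_state=1)),
]

# Combine their predictions by majority vote.
ensemble = VotingClassifier(estimators=members, voting="hard")

scores = {}
for name, model in members + [("ensemble", ensemble)]:
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```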

Different algorithms with different representations may start in different positions in the space of possible function mappings, and will navigate the space differently.

If the constrained space that these algorithms are navigating is well specified by an appropriate framing and good data, then most algorithms will likely discover good and similar mapping functions.

We can also see how a good framing and careful selection of training data can open up a space of candidate mappings that may be found by a suite of modern powerful machine learning algorithms.

This provides rationale for spot checking a suite of algorithms on a given machine learning problem and doubling down on the one that shows the most promise, or selecting the most parsimonious solution (e.g. Occam’s razor).
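Spot checking might look like the sketch below, assuming scikit-learn. The suite of algorithms, their default configurations, and the synthetic dataset are assumptions for the example; on a real project the suite would reflect your own problem and constraints.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# A small suite of algorithms with different representations.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Evaluate every model on the same test harness.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Rank from most to least skillful.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

best = ranked[0][0]
print(f"Most promising: {best}")
```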

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 2, Machine Learning, 1997.
- Generalization as Search, 1982.
- Chapter 1, Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- On algorithm selection, with an application to combinatorial search problems, 2012.
- Algorithm Selection on Wikipedia

In this post, you discovered the conceptualization of applied machine learning as a search problem.

Specifically, you learned:

- That applied machine learning is the problem of approximating an unknown underlying mapping function from inputs to outputs.
- That design decisions such as the choice of data and choice of algorithm narrow the scope of possible mapping functions that you may ultimately choose.
- That the conceptualization of machine learning as search helps to rationalize the use of ensembles, the spot checking of algorithms and the understanding of what is happening when algorithms learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Why Applied Machine Learning Is Hard appeared first on Machine Learning Mastery.

Applied machine learning is challenging.

You must make many decisions where there is no known “*right answer*” for your specific problem, such as:

- What framing of the problem to use?
- What input and output data to use?
- What learning algorithm to use?
- What algorithm configuration to use?

This is challenging for beginners who expect to be able to calculate, or be told, what data to use or how best to configure an algorithm.

In this post, you will discover the intractable nature of designing learning systems and how to deal with it.

After reading this post, you will know:

- How to develop a clear definition of your learning problem for yourself and others.
- The 4 decision points you must consider when designing a learning system for your problem.
- The 3 strategies that you can use to specifically address the intractable problem of designing learning systems in practice.

Let’s get started.

This post is divided into 6 sections inspired by chapter 1 of Tom Mitchell’s excellent 1997 book Machine Learning; they are:

- Well-Posed Learning Problems
- Choose the Training Data
- Choose the Target Function
- Choose a Representation of the Target Function
- Choose a Learning Algorithm
- How to Design Learning Systems

We can define a general learning task in the field of applied machine learning as a program that learns from experience on some task against a specific performance measure.

Tom Mitchell in his 1997 book Machine Learning states this clearly as:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

— Page 2, Machine Learning, 1997.

We take this as a general definition for the types of learning tasks that we may be interested in for applied machine learning such as predictive modeling. Tom lists a few examples to make this clear, such as:

- Learning to recognize spoken words.
- Learning to drive an autonomous vehicle.
- Learning to classify new astronomical structures.
- Learning to play world-class backgammon.

We can use the above definition to define our own predictive modeling problem. Once defined, the task becomes that of designing a learning system to address it.

Designing a learning system, i.e. an application of machine learning, involves four design choices:

- Choosing the training data.
- Choosing the target function.
- Choosing the representation.
- Choosing the learning algorithm.

There might be a best set of choices that you can make for a given problem given infinite resources, but we don’t have infinite time, compute resources, and knowledge about the domain or learning systems.

Therefore, although we can prepare a well-posed description of a learning problem, designing the best possible learning system is intractable.

The best we can do is use knowledge, skill, and available resources to work through the design choices.

Let’s look at each of these design choices in more detail.

You must choose the data your learning system will use as experience from which to learn.

This is the data from past observations.

The type of training experience available can have a significant impact on success or failure of the learner.

— Page 5, Machine Learning, 1997.

It is rarely well formatted and ready to use; often you must collect the data you need (or think you might need) for the learning problem.

This may mean:

- Scraping documents.
- Querying databases.
- Processing files.
- Collating disparate sources.
- Consolidating entities.

You need to get all of the data together and into a normalized form such that one observation represents one entity for which an outcome is available.

Next, you must choose the framing of the learning problem.

Machine learning is really a problem of learning a mapping function (f) from inputs (X) to outputs (y).

y = f(X)

This function can then be used on new data in the future in order to predict the most likely output.

The goal of the learning system is to prepare a function that best maps inputs to outputs given the resources available. The underlying function that actually exists is unknown. If we knew the form of this function, we could use it directly and we would not need machine learning to learn it.

More generally, this is a problem called function approximation. The result will be an approximation, meaning that it will have error. We will do our best to minimize this error, but some error will always exist given noise in the data.

… we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V. It may be very difficult in general to learn such an operational form of V perfectly. In fact, we often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.

— Page 8, Machine Learning, 1997.

This step is about selecting exactly what data to use as input to the function, e.g. the input features or input variables and what exactly will be predicted, e.g. the output variable.

Often, I refer to this as the framing of your learning problem. Choosing the inputs and outputs essentially chooses the nature of the target function we will seek to approximate.

Next, you must choose the representation you wish to use for the mapping function.

Think of this as the type of final model you wish to have that you can then use to make predictions. You must choose the form of this model, the data structure if you’d like.

Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function Vprime that it will learn.

— Page 8, Machine Learning, 1997.

For example:

- Perhaps your project requires a decision tree that is easy to understand and explain to stakeholders.
- Perhaps your stakeholders prefer a linear model that the stats guys can easily interpret.
- Perhaps your stakeholders don’t care about anything other than model performance so all model representations are up for grabs.

The choice of representation will impose constraints on the types of learning algorithms that you can use to learn the mapping function.

Finally, you must choose the learning algorithm that will take the input and output data and learn a model of your preferred representation.

If there were few constraints on the choice of representation, as is often the case, then you may be able to evaluate a suite of different algorithms and representations.

If there were strong constraints on the choice of function representation, e.g. a weighted sum linear model or a decision tree, then the choice of algorithms will be limited to those that can operate on the specific representations.

The choice of algorithm may impose its own constraints, such as specific data preparation transforms like data normalization.
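For instance, k-nearest neighbors relies on distances between observations, so it is common to normalize the inputs first. The sketch below, assuming scikit-learn, bundles the transform and the algorithm in a `Pipeline` so that the normalization is fit only on the training folds. The dataset and the choice of `MinMaxScaler` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# The data preparation required by the algorithm travels with the algorithm.
pipeline = Pipeline(
    steps=[
        ("normalize", MinMaxScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
)

# The scaler is re-fit on the training portion of each fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f}")
```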

Developing a learning system is challenging.

No one can tell you the best answer to each decision along the way; the best answer is unknown for your specific learning problem.

Mitchell helps to clarify this with a depiction of the choices made in designing a learning system for playing checkers.

The choices act as points of constraint on the design process. Mitchell goes on to say:

These design choices have constrained the learning task in a number of ways. We have restricted the type of knowledge that can be acquired to a single linear evaluation function. Furthermore, we have constrained this evaluation function to depend on only the six specific board features provided. If the true target function V can indeed be represented by a linear combination of these particular features, then our program has a good chance to learn it. If not, then the best we can hope for is that it will learn a good approximation, since a program can certainly never learn anything that it cannot at least represent.

— Pages 13-14, Machine Learning, 1997.

I like this passage as it really drives home both the importance of these constraints to simplify the problem, and the risk of making choices that limit or prevent the system from learning the problem sufficiently.

Generally, you cannot analytically calculate the answer to these choices, e.g. what data to use, what algorithm to use, or what algorithm configuration to use.

Nevertheless, all is not lost; here are 3 tactics that you can use in practice:

- **Copy**. Look to the literature or experts for learning systems on problems the same or similar to your problem and copy the design of the learning system. It is very likely you are not the first to work on a problem of a given type. At the very worst, the copied design provides a starting point for your own design.
- **Search**. List available options at each decision point and empirically evaluate each to see what works best on your specific data. This may be the most robust and most practiced approach in applied machine learning.
- **Design**. After completing many projects via the Copy and Search methods above, you will develop an intuition for how to design machine learning systems.
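As one possible sketch of the Search tactic, assuming scikit-learn, a grid search can list options at two decision points (the data preparation and the algorithm configuration) and evaluate every combination empirically. The specific options in the grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", KNeighborsClassifier()),
    ]
)

# Options at each decision point: which data transform, which configuration.
param_grid = {
    "scale": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "model__n_neighbors": [1, 5, 15],
}

# Evaluate all 9 combinations with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(f"Best combination: {search.best_params_}")
print(f"Best mean accuracy: {search.best_score_:.3f}")
```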

Developing learning systems is not a science; it is engineering.

Developing new machine learning algorithms and describing how and why they work is a science, and this is often not required when developing a learning system.

Developing a learning system is a lot like developing software. You must combine (1) copies of past designs that work, (2) prototypes that show promising results, and (3) design experience when developing a new system in order to get the best results.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 1, Machine Learning, 1997.
- Tom Mitchell’s homepage
- Well Posed Problem on Wikipedia
- Intractability on Wikipedia
- How to Define Your Machine Learning Problem
- Applied Machine Learning Process

In this post, you discovered the intractable nature of designing learning systems in applied machine learning and how to deal with it.

Specifically, you learned:

- How to develop a clear definition of your learning problem for yourself and others.
- The 4 decision points you must consider when designing a learning system for your problem.
- The 3 strategies that you can use to specifically address the intractable problem of designing learning systems in practice.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Plan and Run Machine Learning Experiments Systematically appeared first on Machine Learning Mastery.

This gives you a lot of time to think and plan for additional experiments to perform.

In addition, the average applied machine learning project may require tens to hundreds of discrete experiments in order to find a data preparation, model, and model configuration that gives good or great performance.

The drawn-out nature of the experiments means that you need to carefully plan and manage the order and type of experiments that you run.

You need to be systematic.

In this post, you will discover a simple approach to plan and manage your machine learning experiments.

With this approach, you will be able to:

- Stay on top of the most important questions and findings in your project.
- Keep track of what experiments you have completed and would like to run.
- Zoom in on the data preparations, models, and model configurations that give the best performance.

Let’s dive in.

I like to run experiments overnight. Lots of experiments.

This is so that when I wake up, I can check results, update my ideas of what is working (and what is not), and kick off the next round of experiments, then spend some time analyzing the findings.

I hate wasting time.

And I hate running experiments that do not get me closer to the goal of finding the most skillful model, given the time and resources I have available.

It is easy to lose track of where you’re up to. Especially after you have results, analysis, and findings from hundreds of experiments.

Poor management of your experiments can lead to bad situations where:

- You’re watching experiments run.
- You’re trying to come up with good ideas of experiments to run right after a current batch has finished.
- You run an experiment that you had already run before.

You never want to be in any of these situations!

If you are on top of your game, then:

- You know exactly what experiments you have run at a glance and what the findings were.
- You have a long list of experiments to run, ordered by their expected payoff.
- You have the time to dive into the analysis of results and think up new and wild ideas to try.

But how can we stay on top of hundreds of experiments?

One way that I have found to help me be systematic with experiments on a project is to use a spreadsheet.

Manage the experiments you have done, that are running, and that you want to run in a spreadsheet.

It is simple and effective.

It is simple in that I or anyone can access it from anywhere and see where we’re at.

I use Google Docs to host the spreadsheet.

There’s no code. No notebook. No fancy web app.

Just a spreadsheet.

It’s effective because it only contains the information needed with one line per experiment and one column for each piece of information to track on the experiment.

Experiments that are done can be separated from those that are planned.

Only experiments that are planned are set up and run, and their order ensures that the most important experiments are run first.

You will be surprised at how much such a simple approach can free up your time and get you thinking deeply about your project.

Let’s look at an example.

We can imagine a spreadsheet with the columns below.

These are just an example from the last project I worked on. I recommend adapting these to your own needs.

- **Sub-Project**: A subproject may be a group of ideas you are exploring, a technique, a data preparation, and so on.
- **Context**: The context may be the specific objective such as beating a baseline, tuning, a diagnostic, and so on.
- **Setup**: The setup is the fixed configuration of the experiment.
- **Name**: The name is the unique identifier, perhaps the filename of the script.
- **Parameter**: The parameter is the thing being varied or looked at in the experiment.
- **Values**: The value is the value or values of the parameter that are being explored in the experiment.
- **Status**: The status is the status of the experiment, such as planned, running, or done.
- **Skill**: The skill is the North Star metric that really matters on the project, like accuracy or error.
- **Question**: The question is the motivating question the experiment seeks to address.
- **Finding**: The finding is the one line summary of the outcome of the experiment, the answer to the question.
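The spreadsheet itself needs no code, but if you ever export it as CSV, the columns above map onto a very small data structure. The sketch below is only an illustration: the helper names `write_log` and `planned` are hypothetical, and the two rows are contrived.

```python
import csv
import io

# Column headings taken from the list above.
COLUMNS = ["Sub-Project", "Context", "Setup", "Name", "Parameter",
           "Values", "Status", "Skill", "Question", "Finding"]

# Contrived rows for illustration only; names, values, and scores are made up.
experiments = [
    {"Sub-Project": "Baseline", "Context": "Beat baseline", "Setup": "MLP, 10 epochs",
     "Name": "exp001.py", "Parameter": "learning_rate", "Values": "0.1, 0.01, 0.001",
     "Status": "done", "Skill": "0.81",
     "Question": "Which learning rate works best?",
     "Finding": "0.01 gives the best accuracy"},
    {"Sub-Project": "Baseline", "Context": "Tuning", "Setup": "MLP, learning_rate=0.01",
     "Name": "exp002.py", "Parameter": "epochs", "Values": "10, 50, 100",
     "Status": "planned", "Skill": "",
     "Question": "Does training longer help?", "Finding": ""},
]


def write_log(rows):
    """Serialize the experiment log to CSV text, one line per experiment."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def planned(rows):
    """Return experiments still to run; row order is kept because it encodes priority."""
    return [row for row in rows if row["Status"] == "planned"]


log_text = write_log(experiments)
next_up = planned(experiments)
print(log_text)
print("Next experiment:", next_up[0]["Name"])
```

Keeping the planned rows in their original order preserves the idea that the row order is the priority order.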

To make this concrete, below is a screenshot of a Google Doc spreadsheet with these column headings and a contrived example.

I cannot say how much time this approach has saved me, or how many assumptions it has proved wrong in the pursuit of getting top results.

In fact, I’ve discovered that deep learning methods are often quite hostile to assumptions and defaults. Keep this in mind when designing experiments!

Below are some tips that will help you get the most out of this simple approach on your project.

- **Brainstorm**: Make the time to frequently review findings and list new questions and experiments to answer them.
- **Challenge**: Challenge assumptions and challenge previous findings. Play the scientist and design experiments that would falsify your findings or expectations.
- **Sub-Projects**: Consider the use of sub-projects to structure your investigation where you follow leads or investigate specific methods.
- **Experimental Order**: Use the row order as a priority to ensure that the most important experiments are run first.
- **Deeper Analysis**: Save deeper analysis of results and aggregated findings to another document; the spreadsheet is not the place.
- **Experiment Types**: Don’t be afraid to mix in different experiment types such as grid searching, spot checks, and model diagnostics.

You will know that this approach is working well when:

- You are scouring API documentation and papers for more ideas of things to try.
- You have far more experiments queued up than resources to run them.
- You are thinking seriously about hiring a ton more EC2 instances.

In this post, you discovered how you can effectively manage hundreds of experiments that have run, are running, and that you want to run in a spreadsheet.

You discovered that a simple spreadsheet can help you:

- Keep track of what experiments you have run and what you discovered.
- Keep track of what experiments you want to run and what questions they will answer.
- Zoom in on the most effective data preparation, model, and model configuration for your predictive modeling problem.

Do you have any questions about this approach? Have you done something similar yourself?

Let me know in the comments below.

]]>The post What is the Difference Between a Parameter and a Hyperparameter? appeared first on Machine Learning Mastery.

]]>There are so many terms to use and many of the terms may not be used consistently. This is especially true if you have come from another field of study that may use some of the same terms as machine learning, but they are used differently.

For example: the terms “*model parameter*” and “*model hyperparameter*.”

Not having a clear definition for these terms is a common struggle for beginners, especially those that have come from the fields of statistics or economics.

In this post, we will take a closer look at these terms.

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.

- They are required by the model when making predictions.
- Their values define the skill of the model on your problem.
- They are estimated or learned from data.
- They are often not set manually by the practitioner.
- They are often saved as part of the learned model.

Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.

In classical machine learning literature, we may think of the model as the hypothesis and the parameters as the tailoring of the hypothesis to a specific set of data.

Often model parameters are estimated using an optimization algorithm, which is a type of efficient search through possible parameter values.

- **Statistics**: In statistics, you may assume a distribution for a variable, such as a Gaussian distribution. Two parameters of the Gaussian distribution are the mean (*mu*) and the standard deviation (*sigma*). This holds in machine learning, where these parameters may be estimated from data and used as part of a predictive model.
- **Programming**: In programming, you may pass a parameter to a function. In this case, a parameter is a function argument that could have one of a range of values. In machine learning, the specific model you are using is the function and requires parameters in order to make a prediction on new data.

Whether a model has a fixed or variable number of parameters determines whether it may be referred to as “*parametric*” or “*nonparametric*”.

Some examples of model parameters include:

- The weights in an artificial neural network.
- The support vectors in a support vector machine.
- The coefficients in a linear regression or logistic regression.
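As a minimal sketch of what "estimated from data" means, the snippet below estimates the two parameters of an assumed Gaussian distribution and the two coefficients of a simple linear regression. The data values are made up for illustration, and ordinary least squares is just one of several ways the coefficients could be estimated.

```python
import statistics

# Contrived sample; the numbers are made up for illustration.
heights = [1.62, 1.75, 1.68, 1.80, 1.71, 1.66]

# Parameters of an assumed Gaussian distribution, estimated from the data.
mu = statistics.mean(heights)
sigma = statistics.stdev(heights)

# Coefficients of a simple linear regression y = b0 + b1 * x,
# estimated with the closed-form least squares solution.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_mean, y_mean = statistics.mean(x), statistics.mean(y)
b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
      / sum((xi - x_mean) ** 2 for xi in x))
b0 = y_mean - b1 * x_mean

print(f"mu={mu:.3f}, sigma={sigma:.3f}")
print(f"b0={b0:.3f}, b1={b1:.3f}")
```

Nothing here was set by hand: `mu`, `sigma`, `b0`, and `b1` all come from the data, which is what makes them parameters.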

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.

- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner.
- They can often be set using heuristics.
- They are often tuned for a given predictive modeling problem.

We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error.

When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, then you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.

Many models have important parameters which cannot be directly estimated from the data. For example, in the K-nearest neighbor classification model … This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.

— Page 64-65, Applied Predictive Modeling, 2013
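To make the idea of tuning concrete, here is a rough sketch of a grid search over k for a k-nearest neighbors classifier, scored on a held-out validation set. Everything in it is an assumption for illustration: the synthetic one-dimensional data, the candidate values of k, and the helper names `make_data`, `knn_predict`, and `accuracy`.

```python
import random

rng = random.Random(1)


def make_data(n):
    """Synthetic 1-D, two-class data: class 0 centered at 0.0, class 1 at 2.0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def knn_predict(train, query, k):
    """Predict the majority class among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)


def accuracy(train, valid, k):
    """Fraction of validation examples classified correctly for a given k."""
    correct = sum(knn_predict(train, x, k) == label for x, label in valid)
    return correct / len(valid)


train, valid = make_data(200), make_data(100)

# Grid search: k cannot be estimated from the data directly, so try each candidate.
# Odd values avoid tied votes in a two-class problem.
scores = {k: accuracy(train, valid, k) for k in [1, 3, 5, 7, 9]}
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: validation accuracy={score:.3f}")
print("best k:", best_k)
```

The value of k is chosen by trial and error against held-out data, not computed by a formula, which is the point the quote above makes.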

Model hyperparameters are often referred to as model parameters which can make things confusing. A good rule of thumb to overcome this confusion is as follows:

**If you have to specify a model parameter manually then it is probably a model hyperparameter.**

Some examples of model hyperparameters include:

- The learning rate for training a neural network.
- The C and sigma hyperparameters for support vector machines.
- The k in k-nearest neighbors.

- Hyperparameter on Wikipedia
- What are hyperparameters in machine learning? on Quora
- What is the difference between model hyperparameters and model parameters? on StackExchange
- What is considered a hyperparameter? on Reddit

In this post, you discovered the clear definitions and the difference between model parameters and model hyperparameters.

In summary, model parameters are estimated from data automatically and model hyperparameters are set manually and are used in processes to help estimate model parameters.

Model hyperparameters are often referred to as parameters because they are the parts of the machine learning algorithm that must be set manually and tuned.

Did this post help you clear up the confusion?

Let me know in the comments below.

Are there model parameters or hyperparameters that you are still unsure about?

Post them in the comments and I’ll do my best to help clear things up further.

]]>The post How Much Training Data is Required for Machine Learning? appeared first on Machine Learning Mastery.

]]>This is a fact, but does not help you if you are at the pointy end of a machine learning project.

A common question I get asked is:

*How much data do I need?*

I cannot answer this question directly for you, or for anyone. But I can give you a handful of ways of thinking about this question.

In this post, I lay out a suite of methods that you can use to think about how much training data you need to apply machine learning to your problem.

My hope is that one or more of these methods may help you understand the difficulty of the question and how it is tightly coupled with the heart of the induction problem that you are trying to solve.

Let’s dive into it.

Note: Do you have your own heuristic methods for deciding how much data is required for machine learning? Please share them in the comments.

It is important to know why you are asking about the required size of the training dataset.

The answer may influence your next step.

For example:

- **Do you have too much data?** Consider developing some learning curves to find out just how big a representative sample is (below). Or, consider using a big data framework in order to use all available data.
- **Do you have too little data?** Consider confirming that you indeed have too little data. Consider collecting more data, or using data augmentation methods to artificially increase your sample size.
- **Have you not collected data yet?** Consider collecting some data and evaluating whether it is enough. Or, if it is for a study or data collection is expensive, consider talking to a domain expert and a statistician.

More generally, you may have more pedestrian questions such as:

- How many records should I export from the database?
- How many samples are required to achieve a desired level of performance?
- How large must the training set be to achieve a sufficient estimate of model performance?
- How much data is required to demonstrate that one model is better than another?
- Should I use a train/test split or k-fold cross validation?

It may be these latter questions that the suggestions in this post seek to address.

In practice, I answer this question myself using learning curves (see below), using resampling methods on small datasets (e.g. k-fold cross validation and the bootstrap), and by adding confidence intervals to final results.
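As one example of adding a confidence interval to a final result, the sketch below uses the normal approximation to the binomial for a classification accuracy. This is an assumption on my part about which interval to use, not the only option, and the accuracy of 0.80 and the test set sizes are made up. It shows how the interval narrows as the number of test examples grows.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n examples.

    z=1.96 corresponds to roughly 95% confidence.
    """
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# The same measured accuracy is far less certain on a small test set.
intervals = {n: accuracy_interval(0.80, n) for n in [50, 500, 5000]}
for n, (low, high) in intervals.items():
    print(f"n={n}: accuracy 0.80, interval [{low:.3f}, {high:.3f}]")
```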

What is your reason for asking about the number of samples required for machine learning?

Please let me know in the comments.

*So, how much data do you need?*

No one can tell you how much data you need for your predictive modeling problem.

It is unknowable: an intractable problem that you must discover answers to through empirical investigation.

The amount of data required for machine learning depends on many factors, such as:

- **The complexity of the problem**, nominally the unknown underlying function that best relates your input variables to the output variable.
- **The complexity of the learning algorithm**, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.

This is our starting point.

And “*it depends*” is the answer that most practitioners will give you the first time you ask.

A lot of people have worked on a lot of applied machine learning problems before you.

Some of them have published their results.

Perhaps you can look at studies on problems similar to yours as an estimate for the amount of data that may be required.

Similarly, it is common to perform studies on how algorithm performance scales with dataset size. Perhaps such studies can inform you how much data you require to use a specific algorithm.

Perhaps you can average over multiple studies.

Search for papers on Google, Google Scholar, and Arxiv.

You need a sample of data from your problem that is representative of the problem you are trying to solve.

In general, the examples must be independent and identically distributed.

Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.

This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

Use your domain knowledge, or find a domain expert and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.

There are statistical heuristic methods available that allow you to calculate a suitable sample size.

Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.

Here are some examples you may consider:

- **Factor of the number of classes**: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
- **Factor of the number of input features**: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
- **Factor of the number of model parameters**: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).
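The three heuristics above are simple arithmetic, which the sketch below spells out. The function names and the example inputs (3 classes, 20 features, 61 parameters) are hypothetical and chosen only to show the calculation.

```python
def by_classes(n_classes, x):
    """x independent examples for each class."""
    return n_classes * x


def by_features(n_features, x_percent):
    """x% more examples than input features, rounded up to a whole example."""
    return n_features + (n_features * x_percent + 99) // 100


def by_parameters(n_parameters, x):
    """x independent examples for each model parameter."""
    return n_parameters * x


print("by classes:", by_classes(3, 500))
print("by features:", by_features(20, 10))
print("by parameters:", by_parameters(61, 10))
```

Note how far apart the three answers can be for the same problem, which is one reason to treat them as rough starting points.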

They all look like ad hoc scaling factors to me.

Have you used any of these heuristics?

How did it go? Let me know in the comments.

In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule and the difficulty is characterized in the context of the curse of dimensionality; that is, there is an exponential increase in difficulty of the problem as the number of input features is increased.

For example:

- Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
- Dimensionality and sample size considerations in pattern recognition practice, 1982

Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high dimensional problems (e.g. few samples and many input features).

For a kinder discussion of this topic, see:

- Section 2.5 Local Methods in High Dimensions, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2008.
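The section of The Elements of Statistical Learning referenced above illustrates the problem with a unit hypercube: to capture a fraction r of uniformly distributed data in p dimensions, a sub-cube needs an edge length of r^(1/p). The sketch below reproduces that calculation; the function names are my own.

```python
def edge_length(fraction, dims):
    """Edge length of a sub-cube capturing `fraction` of uniform data in a unit hypercube."""
    return fraction ** (1.0 / dims)


def samples_for_same_density(n_one_dim, dims):
    """Sample size needed in `dims` dimensions to match the density of n_one_dim points in 1-D."""
    return n_one_dim ** dims


for dims in [1, 2, 10]:
    print(f"dims={dims}: "
          f"1% of data needs edge {edge_length(0.01, dims):.2f}, "
          f"10% of data needs edge {edge_length(0.10, dims):.2f}")

print("samples to match 100 points in 1-D, with 10 inputs:",
      samples_for_same_density(100, 10))
```

With 10 inputs, a "local" neighborhood holding just 1% of the data spans well over half the range of every input, so it is no longer local.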

The more powerful machine learning algorithms are often referred to as nonlinear algorithms.

By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.

These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

In fact, some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.

If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest, or an artificial neural network.

It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.

These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.

I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.

Design a study that evaluates model skill versus the size of the training dataset.

Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.

This graph is called a learning curve.

From this graph, you may be able to project the amount of data that is required to develop a skillful model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.

I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.
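Below is a minimal sketch of such a study. To keep it dependency-free, it swaps in a simple nearest-centroid classifier and a synthetic two-class dataset rather than random forest and your own data; the training sizes, number of repeats, and helper names are all assumptions for illustration.

```python
import random
import statistics

rng = random.Random(7)


def make_data(n):
    """Synthetic two-class problem: each class is a 2-D Gaussian blob."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append(([rng.gauss(1.5 * label, 1.0), rng.gauss(1.5 * label, 1.0)], label))
    return rows


def fit_centroids(train):
    """Learn one mean vector per class (these are the model parameters)."""
    centroids = {}
    for label in {lab for _, lab in train}:
        points = [x for x, lab in train if lab == label]
        centroids[label] = [statistics.mean(col) for col in zip(*points)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def accuracy(centroids, test):
    return sum(predict(centroids, x) == lab for x, lab in test) / len(test)


pool, test = make_data(2000), make_data(1000)
sizes = [10, 30, 100, 300, 1000]
repeats = 10  # average over several random training samples of each size

curve = []
for size in sizes:
    scores = [accuracy(fit_centroids(rng.sample(pool, size)), test)
              for _ in range(repeats)]
    curve.append((size, statistics.mean(scores)))

for size, skill in curve:
    print(f"train size={size:5d}  mean accuracy={skill:.3f}")
```

Plotting `curve` with training size on the x-axis and mean accuracy on the y-axis gives the learning curve; averaging over repeats smooths out the noise from any single small training sample.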

You need lots of data when applying machine learning algorithms.

Often, you need more data than you may reasonably require in classical statistics.

I often answer the question of how much data is required with the flippant response:

*Get and use as much data as you can.*

If pressed with the question, and with zero knowledge of the specifics of your problem, I would say something naive like:

- You need thousands of examples.
- No fewer than hundreds.
- Ideally, tens or hundreds of thousands for “average” modeling problems.
- Millions or tens-of-millions for “hard” problems like those tackled by deep learning.

Again, this is just more ad hoc guesstimating, but it’s a starting point if you need it. So get started!

Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.

Some problems require big data, all the data you have. For example, simple statistical machine translation:

If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problems and your chosen model/s to see where that point is.

Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.

Now, stop getting ready to model your problem, and model it.

Do not let the problem of the training set size stop you from getting started on your predictive modeling problem.

In many cases, I see this question as a reason to procrastinate.

Get all the data you can, use what you have, and see how effective models are on your problem.

Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.

This section provides more resources on the topic if you are looking to go deeper.

There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.

- How large a training set is needed?
- Training set size for neural networks considering curse of dimensionality
- How to decrease training set size?
- Does increase in training set size help in increasing the accuracy perpetually or is there a saturation point?
- How to choose the training, cross-validation, and test set sizes for small sample-size data?
- How few training examples is too few when training a neural network?
- What is the recommended minimum training dataset size to train a deep neural network?

I expect that there are some great statistical studies on this question; here are a few I could find.

- Sample size planning for classification models, 1991
- Dimensionality and sample size considerations in pattern recognition practice, 1982
- Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
- Predicting Sample Size Required for Classification Performance, 2012

Other related articles.

- How much training data do you need?
- Do We Need More Training Data?
- The Unreasonable Effectiveness of Data, (and Peter Norvig’s talk)

If you know of more, please let me know in the comments below.

In this post, you discovered a suite of ways to think and reason about the problem of answering the common question:

*How much training data do I need for machine learning?*

Did any of these methods help?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Except, of course, the question of how much data that *you* specifically need.

]]>