The post How to Reduce Variance in a Final Machine Learning Model appeared first on Machine Learning Mastery.

A problem with most final models is that they suffer from variance in their predictions.

This means that each time you fit a model, you get a slightly different set of parameters that in turn will make slightly different predictions: sometimes more and sometimes less skillful than you expected.

This can be frustrating, especially when you are looking to deploy a model into an operational environment.

In this post, you will discover how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

After reading this post, you will know:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Let’s get started.

Once you have discovered which model and model hyperparameters result in the best skill on your dataset, you’re ready to prepare a final model.

A final model is trained on all available data, e.g. the training and the test sets.

It is the model that you will use to make predictions on new data where you do not know the outcome.

The final model is the outcome of your applied machine learning project.

To learn more about preparing a final model, see the post:

The bias-variance trade-off is a conceptual idea in applied machine learning to help understand the sources of error in models.

- **Bias** refers to assumptions in the learning algorithm that narrow the scope of what can be learned. This is useful as it can accelerate learning and lead to stable results, at the cost of the assumption differing from reality.
- **Variance** refers to the sensitivity of the learning algorithm to the specifics of the training data, e.g. the noise and specific observations. This is good as the model will be specialized to the data, at the cost of learning random noise and varying each time it is trained on different data.

The bias-variance tradeoff is a conceptual tool to think about these sources of error and how they are always kept in balance.

More bias in an algorithm means that there is less variance, and the reverse is also true.

You can learn more about the bias-variance tradeoff in this post:

You can control this balance.

Many machine learning algorithms have hyperparameters that directly or indirectly allow you to control the bias-variance tradeoff.

The *k* in *k*-nearest neighbors is one example. A small *k* results in predictions with high variance and low bias. A large *k* results in predictions with a small variance and a large bias.
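As a rough illustration, the spread of cross-validation scores makes this trade-off visible. This is a sketch only: the synthetic dataset, scikit-learn tooling, and the two values of *k* are my own assumptions, not from the post.

```python
# Sketch: the spread (variance) of fold scores for small vs. large k in kNN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative synthetic dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

for k in (1, 25):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    # A small k typically shows a larger spread (variance) across folds.
    print("k=%d: mean=%.3f std=%.3f" % (k, scores.mean(), scores.std()))
```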

Most final models have a problem: they suffer from variance.

Each time a model is trained by an algorithm with high variance, you will get a slightly different result.

The slightly different model in turn will make slightly different predictions, for better or worse.

This is a problem with training a final model, as we are required to use the model to make predictions on real data where we do not know the answer, and we want those predictions to be as good as possible.

We want the best possible version of the model that we can get.

We want the variance to play out in our favor.

If we can’t achieve that, at least we want the variance to not fall against us when making predictions.

There are two common sources of variance in a final model:

- The noise in the training data.
- The use of randomness in the machine learning algorithm.

The first type we introduced above.

The second type impacts those algorithms that harness randomness during learning.

Three common examples include:

- Choice of random split points in random forest.
- Random weight initialization in neural networks.
- Shuffling training data in stochastic gradient descent.

You can measure both types of variance in your specific model using your training data.

- **Measure Algorithm Variance**: The variance introduced by the stochastic nature of the algorithm can be measured by repeating the evaluation of the algorithm on the same training dataset and calculating the variance or standard deviation of the model skill.
- **Measure Training Data Variance**: The variance introduced by the training data can be measured by repeating the evaluation of the algorithm on different samples of training data while keeping the seed for the pseudorandom number generator fixed, then calculating the variance or standard deviation of the model skill.

Often, the combined variance is estimated by running repeated k-fold cross-validation on a training dataset then calculating the variance or standard deviation of the model skill.
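The two measurements above can be sketched in code. The model, dataset, repeat count, and use of bootstrap samples here are illustrative assumptions on my part; substitute your own algorithm and data.

```python
# Sketch: measuring algorithm variance vs. training-data variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 1) Algorithm variance: same training data, different random seeds.
algo_scores = []
for seed in range(10):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    algo_scores.append(accuracy_score(y_test, model.predict(X_test)))

# 2) Training-data variance: fixed seed, different bootstrap samples.
rng = np.random.RandomState(1)
data_scores = []
for _ in range(10):
    ix = rng.randint(0, len(X_train), len(X_train))  # bootstrap sample
    model = RandomForestClassifier(n_estimators=50, random_state=1)
    model.fit(X_train[ix], y_train[ix])
    data_scores.append(accuracy_score(y_test, model.predict(X_test)))

print("algorithm variance (std): %.4f" % np.std(algo_scores))
print("data variance (std): %.4f" % np.std(data_scores))
```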

If we want to reduce the amount of variance in a prediction, we must add bias.

Consider the case of a simple statistical estimate of a population parameter, such as estimating the mean from a small random sample of data.

A single estimate of the mean will have high variance and low bias.

This is intuitive because if we repeated this process 30 times and calculated the standard deviation of the estimated mean values, we would see a large spread.

The solutions for reducing the variance are also intuitive.

Repeat the estimate on many different small samples of data from the domain and calculate the mean of the estimates, leaning on the central limit theorem.

The mean of the estimated means will have a lower variance. We have increased the bias by assuming that the average of the estimates will be a more accurate estimate than a single estimate.

Another approach would be to dramatically increase the size of the data sample on which we estimate the population mean, leaning on the law of large numbers.
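The population-mean example can be demonstrated directly. The population parameters and sample sizes below are illustrative assumptions, but the effect (averaging many small-sample estimates stabilizes the estimate) holds generally.

```python
# Sketch: averaging many small-sample mean estimates reduces variance.
import numpy as np

rng = np.random.RandomState(1)
population = rng.normal(loc=50.0, scale=10.0, size=100000)

# Thirty single estimates of the mean, each from a small sample of 10.
single_estimates = [rng.choice(population, size=10).mean() for _ in range(30)]

# The mean of the estimated means is far more stable than any one estimate.
averaged_estimate = np.mean(single_estimates)

print("std of single estimates: %.3f" % np.std(single_estimates))
print("averaged estimate: %.3f" % averaged_estimate)
```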

The principles used to reduce the variance for a population statistic can also be used to reduce the variance of a final model.

We must add bias.

Depending on the specific form of the final model (e.g. tree, weights, etc.) you can get creative with this idea.

Below are three approaches that you may want to try.

If possible, I recommend designing a test harness to experiment and discover an approach that works best or makes the most sense for your specific data set and machine learning algorithm.

Instead of fitting a single final model, you can fit multiple final models.

Together, the group of final models may be used as an ensemble.

For a given input, each model in the ensemble makes a prediction and the final output prediction is taken as the average of the predictions of the models.

A sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.
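A minimal sketch of this idea follows. The choice of decision trees with random split points (so the only difference between members is the algorithm's randomness) and the ensemble size of 10 are my own illustrative assumptions.

```python
# Sketch: an "ensemble of final models" that averages predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_new = X[:5]  # stand-in for new data where the outcome is unknown

# Fit several final models, each differing only in the algorithm's randomness.
members = [
    DecisionTreeRegressor(splitter="random", random_state=seed).fit(X, y)
    for seed in range(10)
]

# The ensemble prediction is the average of the member predictions.
yhat = np.mean([m.predict(X_new) for m in members], axis=0)
print(yhat)
```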

As above, multiple final models can be created instead of a single final model.

Instead of calculating the mean of the predictions from the final models, a single final model can be constructed as an ensemble of the parameters of the group of final models.

This would only make sense in cases where each model has the same number of parameters, such as neural network weights or regression coefficients.

For example, consider a linear regression model with three coefficients [b0, b1, b2]. We could fit a group of linear regression models and calculate a final b0 as the average of b0 parameters in each model, and repeat this process for b1 and b2.

Again, a sensitivity analysis can be used to measure the impact of ensemble size on prediction variance.
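The coefficient-averaging idea can be sketched for linear regression. Fitting each member on a bootstrap sample of the data is an assumption I add for illustration; the post only requires a group of fitted models with the same parameter structure.

```python
# Sketch: build one final model from the averaged parameters of many.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=2, noise=5.0, random_state=1)
rng = np.random.RandomState(1)

coefs, intercepts = [], []
for _ in range(10):
    ix = rng.randint(0, len(X), len(X))  # bootstrap sample
    m = LinearRegression().fit(X[ix], y[ix])
    coefs.append(m.coef_)
    intercepts.append(m.intercept_)

# A single final model whose [b0, b1, b2] are the averaged parameters.
final = LinearRegression()
final.coef_ = np.mean(coefs, axis=0)
final.intercept_ = np.mean(intercepts)
final.n_features_in_ = X.shape[1]

pred = final.predict(X[:3])
print(pred)
```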

Leaning on the law of large numbers, perhaps the simplest approach to reduce the model variance is to fit the model on more training data.

In those cases where more data is not readily available, perhaps data augmentation methods can be used instead.

A sensitivity analysis of training dataset size to prediction variance is recommended to find the point of diminishing returns.
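Such a sensitivity analysis might look like the sketch below. The sample sizes, repeat count, and choice of decision tree are illustrative assumptions; the pattern to look for is the shrinking spread of skill as training size grows.

```python
# Sketch: sensitivity of prediction variance to training dataset size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, random_state=1)
X_test, y_test = X[4000:], y[4000:]  # hold back for evaluation
rng = np.random.RandomState(1)

for n in (100, 500, 2000):
    scores = []
    for _ in range(10):
        ix = rng.randint(0, 4000, n)  # random training sample of size n
        model = DecisionTreeClassifier().fit(X[ix], y[ix])
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    # Variance in skill typically shrinks as n grows.
    print("n=%d: mean=%.3f std=%.4f" % (n, np.mean(scores), np.std(scores)))
```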

There are approaches to preparing a final model that aim to get the variance in the final model to work for you rather than against you.

The commonality in these approaches is that they seek a single best final model.

Two examples include:

- **Why not fix the random seed?** You could fix the random seed when fitting the final model. This will constrain the variance introduced by the stochastic nature of the algorithm.
- **Why not use early stopping?** You could check the skill of the model against a holdout set during training and stop training when the skill of the model on the holdout set starts to degrade.

I would argue that these approaches and others like them are fragile.

Perhaps you can gamble and aim for the variance to play out in your favor. This might be a good approach for machine learning competitions where there is no real downside to losing the gamble.

I won’t.

I think it’s safer to aim for the best average performance and limit the downside.

I think that the trick with navigating the bias-variance tradeoff for a final model is to think in samples, not in terms of single models. To optimize for average model performance.

This section provides more resources on the topic if you are looking to go deeper.

- How to Train a Final Machine Learning Model
- Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning
- Bias-Variance Tradeoff on Wikipedia
- Checkpoint Ensembles: Ensemble Methods from a Single Training Process, 2017.

In this post, you discovered how to think about model variance in a final model and techniques that you can use to reduce the variance in predictions from a final model.

Specifically, you learned:

- The problem with variance in the predictions made by a final model.
- How to measure model variance and how variance is addressed generally when estimating parameters.
- Techniques you can use to reduce the variance in predictions made by a final model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How To Know if Your Machine Learning Model Has Good Performance appeared first on Machine Learning Mastery.

This is a common question I am asked by beginners.

As a beginner, you often seek an answer to this question, e.g. you want someone to tell you whether an accuracy of *x*% or an error score of *x* is good or not.

In this post, you will discover how to answer this question for yourself definitively and know whether your model skill is good or not.

After reading this post, you will know:

- That a baseline model can be used to discover the bedrock in performance on your problem by which all other models can be evaluated.
- That all predictive models contain errors and that a perfect score is not possible in practice given the stochastic nature of data and algorithms.
- That the true job of applied machine learning is to explore the space of possible models and discover what a good model score looks like relative to the baseline on your specific dataset.

Let’s get started.

This post is divided into 4 parts; they are:

- Model Skill Is Relative
- Baseline Model Skill
- What Is the Best Score?
- Discover Limits of Model Skill

Your predictive modeling problem is unique.

This includes the specific data you have, the tools you’re using, and the skill you will achieve.

Your predictive modeling problem has not been solved before. Therefore, we cannot know what a good model looks like or what skill it might have.

You may have ideas of what a skillful model looks like based on knowledge of the domain, but you don’t know whether those skill scores are achievable.

The best that we can do is to compare the performance of machine learning models on your specific data to other models also trained on the same data.

Machine learning model performance is relative: ideas of what score a good model can achieve only make sense, and can only be interpreted, in the context of the skill scores of other models also trained on the same data.

Because machine learning model performance is relative, it is critical to develop a robust baseline.

A baseline is a simple and well understood procedure for making predictions on your predictive modeling problem. The skill of this model provides the bedrock for the lowest acceptable performance of a machine learning model on your specific dataset.

The results for the baseline model provide the point from which the skill of all other models trained on your data can be evaluated.

Three examples of baseline models include:

- Predict the mean outcome value for a regression problem.
- Predict the mode outcome value for a classification problem.
- Predict the input as the output (called persistence) for a univariate time series forecasting problem.
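The first two baselines above can be sketched with scikit-learn's dummy estimators (my choice of tooling; the post does not name a library). The datasets are illustrative.

```python
# Sketch: mean and mode baselines via scikit-learn's Dummy models.
from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor

Xc, yc = make_classification(n_samples=200, random_state=1)
Xr, yr = make_regression(n_samples=200, noise=10.0, random_state=1)

# Predict the mode outcome value for classification.
clf = DummyClassifier(strategy="most_frequent").fit(Xc, yc)
# Predict the mean outcome value for regression.
reg = DummyRegressor(strategy="mean").fit(Xr, yr)

print("baseline accuracy: %.3f" % clf.score(Xc, yc))
print("baseline R^2: %.3f" % reg.score(Xr, yr))  # ~0.0 by construction
```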

The baseline performance on your problem can then be used as the yardstick by which all other models can be compared and evaluated.

If a model achieves a performance below the baseline, something is wrong (e.g. there’s a bug) or the model is not appropriate for your problem.

If you are working on a classification problem, the best score is 100% accuracy.

If you are working on a regression problem, the best score is 0.0 error.

These scores are an impossible-to-achieve upper/lower bound. All predictive modeling problems have prediction error. Expect it. The error comes from a range of sources such as:

- Incompleteness of data sample.
- Noise in the data.
- Stochastic nature of the modeling algorithm.

You cannot achieve the best score, but it is good to know what the best possible performance is for your chosen measure. You know that true model performance will fall within a range between the baseline and the best possible score.

Instead, you must search the space of possible models on your dataset and discover what good and bad scores look like.

Once you have the baseline, you can explore the extent of model performance on your predictive modeling problem.

In fact, this is the hard work and the objective of the project: to find a model that you can demonstrate works reliably well in making predictions on your specific dataset.

There are many strategies to this problem; two that you may wish to consider are:

- **Start High**. Select a machine learning method that is sophisticated and known to perform well on a range of predictive model problems, such as random forest or gradient boosting. Evaluate the model on your problem and use the result as an approximate top-end benchmark, then find the simplest model that achieves similar performance.
- **Exhaustive Search**. Evaluate all of the machine learning methods that you can think of on the problem and select the method that achieves the best performance relative to the baseline.

The “*Start High*” approach is fast and can help you define the bounds of model skill to expect on the problem and find a simple (e.g. Occam’s Razor) model that can achieve similar results. It can also help you find out quickly whether the problem is solvable/predictable, which is important because not all problems are predictable.

The “*Exhaustive Search*” is slow and is really intended for long-running projects where model skill is more important than almost any other concern. I often perform variations of this approach testing suites of similar methods in batches and call it the spot-checking approach.

Both methods will give you a population of model performance scores that you can compare to the baseline.

You will know what a good score looks like and what a bad score looks like.

This section provides more resources on the topic if you are looking to go deeper.

- How to Make Baseline Predictions for Time Series Forecasting with Python
- How To Implement Baseline Machine Learning Algorithms From Scratch With Python
- Machine Learning Performance Improvement Cheat Sheet

In this post, you discovered that your predictive modeling problem is unique and that good model performance scores can only be known relative to a baseline performance.

Specifically, you learned:

- That a baseline model can be used to discover the bedrock in performance on your problem by which all other models can be evaluated.
- That all predictive models contain error and that a perfect score is not possible in practice given the stochastic nature of data and algorithms.
- That the true job of applied machine learning is to explore the space of possible models and discover what a good model score looks like relative to the baseline on your specific dataset.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post The Model Performance Mismatch Problem (and what to do about it) appeared first on Machine Learning Mastery.

The procedure when evaluating machine learning models is to fit and evaluate them on training data, then verify that the model has good skill on a held-back test dataset.

Often, you will get a very promising performance when evaluating the model on the training dataset and poor performance when evaluating the model on the test set.

In this post, you will discover techniques and issues to consider when you encounter this common problem.

After reading this post, you will know:

- The problem of model performance mismatch that may occur when evaluating machine learning algorithms.
- The causes of overfitting, under-representative data samples, and stochastic algorithms.
- Ways to harden your test harness to avoid the problem in the first place.

This post was based on a reader question; thanks! Keep the questions coming!

Let’s get started.

This post is divided into 4 parts; they are:

- Model Evaluation
- Model Performance Mismatch
- Possible Causes and Remedies
- More Robust Test Harness

When developing a model for a predictive modeling problem, you need a test harness.

The test harness defines how the sample of data from the domain will be used to evaluate and compare candidate models for your predictive modeling problem.

There are many ways to structure a test harness, and no single best way for all projects.

One popular approach is to use a portion of data for fitting and tuning the model and a portion for providing an objective estimate of the skill of the tuned model on out-of-sample data.

The data sample is split into a training and test dataset. The model is evaluated on the training dataset using a resampling method such as k-fold cross-validation, and the set itself may be further divided into a validation dataset used to tune the hyperparameters of the model.

The test set is held back and used to evaluate and compare tuned models.
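The harness just described can be sketched as follows. The model, split size, and fold count are illustrative assumptions; the structure (resampling on the training set, a single held-back test set) is the point.

```python
# Sketch: a train/test split with k-fold cross-validation on the train set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000)

# Estimate skill on the training set with 10-fold cross-validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

# Fit on all training data; corroborate with the held-back test set.
test_score = model.fit(X_train, y_train).score(X_test, y_test)
print("cv mean=%.3f, test=%.3f" % (cv_scores.mean(), test_score))
```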

For more on training, validation, and test sets, see the post:

The resampling method will give you an estimate of the skill of your model on unseen data by using the training dataset.

The test dataset provides a second data point and ideally an objective idea of how well the model is expected to perform, corroborating the estimated model skill.

What if the estimate of model skill on the training dataset does not match the skill of the model on the test dataset?

The scores will not match in general. We do expect some differences because some small overfitting of the training dataset is inevitable given hyperparameter tuning, making the training scores optimistic.

But what if the difference is worryingly large?

- Which score do you trust?
- Can you still compare models using the test dataset?
- Is the model tuning process invalidated?

It is a challenging and very common situation in applied machine learning.

We can call this concern the “*model performance mismatch*” problem.

**Note**: ideas of “*large differences*” in model performance are relative to your chosen performance measures, datasets, and models. We cannot talk objectively about differences in general, only relative differences that you must interpret yourself.

There are many possible causes for the model performance mismatch problem.

Ultimately, your goal is to have a test harness that you know allows you to make good decisions regarding which model and model configuration to use as a final model.

In this section, we will look at some possible causes, diagnostics, and techniques you can use to investigate the problem.

Let’s look at three main areas: model overfitting, the quality of the data sample, and the stochastic nature of the learning algorithm.

Perhaps the most common cause is that you have overfit the training data.

You have hit upon a model, a set of model hyperparameters, a view of the data, or a combination of these elements and more that just so happens to give a good skill estimate on the training dataset.

The use of k-fold cross-validation will help to some degree. The use of tuning models with a separate dataset too will help. Nevertheless, it is possible to keep pushing and overfit on the training dataset.

If this is the case, the test skill may be more representative of the true skill of the chosen model and configuration.

One simple (but not easy) way to diagnose whether you have overfit the training dataset, is to get another data point on model skill. Evaluate the chosen model on another set of data. For example, some ideas to try include:

- Try a k-fold cross-validation evaluation of the model on the test dataset.
- Try a fit of the model on the training dataset and an evaluation on the test and a new data sample.

If you’re overfit, you have options.

- Perhaps you can scrap your current training dataset and collect a new training dataset.
- Perhaps you can re-split your sample into train/test in a softer approach to getting a new training dataset.

I would suggest that the results that you have obtained to-date are suspect and should be re-considered. Especially those where you may have spent too long tuning.

Overfitting may be the ultimate cause for the discrepancy in model scores, though it may not be the area to attack first.

It is possible that your training or test datasets are an unrepresentative sample of data from the domain.

This means that the sample size is too small or the examples in the sample do not effectively “cover” the cases observed in the broader domain.

This can be obvious to spot if you see noisy model performance results. For example:

- A large variance on cross-validation scores.
- A large variance on similar model types on the test dataset.

In addition, you will see the discrepancy between train and test scores.

A good second test is to check summary statistics for each variable on the train and test sets, and ideally on the cross-validation folds. You are looking for a large variance in sample means and standard deviation.
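That summary-statistic check can be sketched in a few lines. The synthetic data and the simple absolute-gap comparison are my own illustrative assumptions; a large gap on any variable suggests an unrepresentative split.

```python
# Sketch: compare per-variable means and stds between train and test splits.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=1)
X_train, X_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)

mean_gap = np.abs(X_train.mean(axis=0) - X_test.mean(axis=0))
std_gap = np.abs(X_train.std(axis=0) - X_test.std(axis=0))

# Large gaps on any variable suggest an unrepresentative sample.
print("max mean gap: %.3f" % mean_gap.max())
print("max std gap: %.3f" % std_gap.max())
```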

The remedy is often to get a larger and more representative sample of data from the domain. Alternately, to use more discriminating methods in preparing the data sample and splits. Think stratified k-fold cross validation, but applied to input variables in an attempt to maintain population means and standard deviations for real-valued variables in addition to the distribution of categorical variables.

Often when I see overfitting on a project, it is because the test harness is not as robust as it should be, not because of hill climbing the test dataset.

It is possible that you are seeing a discrepancy in model scores because of the stochastic nature of the algorithm.

Many machine learning algorithms involve a stochastic component. For example, the random initial weights in a neural network, the shuffling of data and in turn the gradient updates in stochastic gradient descent, and much more.

This means that each time the same algorithm is run on the same data, different sequences of random numbers are used and, in turn, a different model with different skill will result.

You can learn more about this in the post:

This issue can be seen by the variance in model skill scores from cross-validation, much like having an unrepresentative data sample.

The difference here is that the variance can be cleared up by repeating the model evaluation process, e.g. cross-validation, in order to control for the randomness in training the model.

This is often called the multiple repeats k-fold cross-validation and is used for neural networks and stochastic optimization algorithms, when resources permit.
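Repeated k-fold cross-validation is available directly in scikit-learn; the sketch below assumes that library, plus an illustrative model and repeat/fold counts.

```python
# Sketch: repeated k-fold cross-validation to average out algorithm randomness.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)

# 10 folds repeated 3 times yields 30 skill scores to summarize.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print("%d scores: mean=%.3f std=%.3f" % (len(scores), scores.mean(), scores.std()))
```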

I have more on this approach to evaluating models in the post:

A lot of these problems can be addressed early by designing a robust test harness and then gathering evidence to demonstrate that indeed your test harness is robust.

This might include running experiments before you get started evaluating models for real. Experiments such as:

- A sensitivity analysis of train/test splits.
- A sensitivity analysis of k values for cross-validation.
- A sensitivity analysis of a given model’s behavior.
- A sensitivity analysis on the number of repeats.

On this last point, see the post:

You are looking for:

- Low variance and consistent mean in evaluation scores between tests in a cross-validation.
- Correlated population means between model scores on train and test sets.

Use statistical tools like standard error and significance tests if needed.

Use a modern and un-tuned model that performs well in general for such testing, such as random forest.

- If you discover a difference in skill scores between training and test sets, and it is consistent, that may be fine. You know what to expect.
- If you measure a variance in mean skill scores within a given test, you have error bars you can use to interpret the results.

I would go so far as to say that without a robust test harness, the results you achieve will be a mess. You will not be able to effectively interpret them. There will be an element of risk (or fraud, if you’re an academic) in the presentation of the outcomes from a fragile test harness. And reproducibility/robustness is a massive problem in numerical fields like applied machine learning.

Finally, avoid using the test dataset too much. Once you have strong evidence that your harness is robust, do not touch the test dataset until it comes time for final model selection.

This section provides more resources on the topic if you are looking to go deeper.

- What is the Difference Between Test and Validation Datasets?
- Embrace Randomness in Machine Learning
- Estimate the Number of Experiment Repeats for Stochastic Machine Learning Algorithms
- How to Evaluate the Skill of Deep Learning Models

In this post, you discovered the model performance mismatch problem where model performance differs greatly between training and test sets, and techniques to diagnose and address the issue.

Specifically, you learned:

- The problem of model performance mismatch that may occur when evaluating machine learning algorithms.
- The causes of overfitting, under-representative data samples, and stochastic algorithms.
- Ways to harden your test harness to avoid the problem in the first place.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Get the Most From Your Machine Learning Data appeared first on Machine Learning Mastery.

Data and the framing of your problem may be the point of biggest leverage on your project.

Choosing the wrong data or the wrong framing for your problem may lead to a model with poor performance or, at worst, a model that cannot converge.

It is not possible to analytically calculate what data to use or how to use it, but it is possible to use a trial-and-error process to discover how to best use the data that you have.

In this post, you will discover how to get the most from your data on your machine learning project.

After reading this post, you will know:

- The importance of exploring alternate framings of your predictive modeling problem.
- The need to develop a suite of “*views*” on your input data and to systematically test each.
- The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.

Let’s get started.

This post is divided into 8 parts; they are:

- Problem Framing
- Collect More Data
- Study Your Data
- Training Data Sample Size
- Feature Selection
- Feature Engineering
- Data Preparation
- Go Further

Brainstorm multiple ways to frame your predictive modeling problem.

The framing of the problem means the combination of:

- Inputs
- Outputs
- Problem Type

For example:

- Can you use more or less data as inputs to the model?
- Can you predict something else instead?
- Can you change the problem to be regression/classification/sequence/etc.?

The more creative you get, the better.

Use ideas from other projects, papers, and the domain itself.

Brainstorm. Write down all of the ideas, even if they are crazy.

I have some frameworks that will help with brainstorming the framing here:

I talk a little about changing the problem type in this post:

Get more data than you need, even data that is tangentially related to the outcome being predicted.

We cannot know how much data will be needed.

Data is the currency spent during model development. It is the oxygen needed by the project to breathe. Each time you use some data, there is less data available for other tasks.

You need to spend data on tasks like:

- Model training.
- Model evaluation.
- Model tuning.
- Model validation.

Further, the project is new. No one has done your specific project before, modeled your specific data. You don’t really know what features will be useful yet. You might have ideas, but you don’t know. Collect them all; make them all available at this stage.

Use every data visualization you can think of to look at your data from every angle.

- Looking at raw data helps. You will notice things.
- Looking at summary statistics helps. Again, you will notice things.
- Data visualization is like a beautiful combination of these two ways of learning. You will notice a lot more things.

Spend a long time with your raw data and summary statistics. Then move on to the visualizations last as they can take more time to prepare.

Use every data visualization you can think of and glean from books and papers on your data.

- Review plots.
- Save plots.
- Annotate plots.
- Show plots to domain experts.

You are seeking a little more insight into the data. Ideas that you can use to help better select, engineer, and prepare data for modeling. It will pay off.

Perform a sensitivity analysis with your data sample to see how much (or little) data you actually need.

You do not have all observations. If you did, you would not need to make predictions for new data.

Instead, you are working with a sample of the data. Therefore, there is an open question as to how much data will be needed to fit the model.

Don’t assume that more is better. Test.

- Design experiments to see how model skill changes with sample size.
- Use statistics to see how important trends and tendencies change with sample size.

Without this knowledge, you won’t know enough about your test harness to comment on model skill sensibly.

Learn more about sample size in this post:

Create many different views of your input features and test each one.

You don’t know what variables will be helpful or most helpful in your predictive modeling problem.

- You can guess.
- You can use advice from domain experts.
- You can even use suggestions from feature selection methods.

But they are all just guesses.

Each set of suggested input features is a “view” on your problem. An idea on what features might be useful for modeling and predicting the output variable.

Brainstorm, compute, and collect as many different views of your input data as you can.

Design experiments and carefully test and compare each view. Use data to inform you which features and which view are the most predictive.

For more on feature selection, see this post:

Use feature engineering to create additional features and views on your predictive modeling problem.

Sometimes you have all of the data you can get, but a given feature or set of features encodes knowledge too densely for the machine learning methods to learn and map to the outcome variable.

Examples include:

- Date/Times.
- Transactions.
- Descriptions.

Break down these data into simpler additional component features, such as counts, flags, and other elements.

Make things as simple as you can for the modeling process.
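For the date/time example, this decomposition can be sketched with pandas; the column names are hypothetical:

```python
# Sketch: break a date/time column into simpler component features.
import pandas as pd

df = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2017-01-03 09:15', '2017-06-21 18:40', '2017-12-25 07:05'])})

# Each component becomes a simple feature a model can use directly.
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['dayofweek'] = df['timestamp'].dt.dayofweek   # Monday = 0
df['hour'] = df['timestamp'].dt.hour
df['is_weekend'] = (df['timestamp'].dt.dayofweek >= 5).astype(int)

print(df)
```

The same idea applies to transactions (counts, totals, recency flags) and descriptions (lengths, keyword indicators).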

For more on feature engineering, see the post:

Pre-process data every way you can think of to meet the expectations of algorithms and more.

Pre-processing data, like feature selection and feature engineering, creates additional views on your input features.

Some algorithms have preferences regarding pre-processing, such as:

- Normalized input features.
- Standardized input features.
- Stationary input features.

Prepare the data in anticipation of these expectations, but then go further.

Apply every data pre-processing method you can think of on your data. Keep creating new views on your problem and test them with one or a suite of models to see what works best.

Your objective here is to discover the view on the data that best exposes the unknown underlying structure of the mapping problem to the learning algorithm.
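One way to test pre-processed views side by side, sketched with scikit-learn pipelines; the choice of scalers and model is illustrative:

```python
# Sketch: create several pre-processed views of the same data and
# evaluate each with the same model and harness.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=500, random_state=1)

views = {
    'raw': None,
    'normalized': MinMaxScaler(),
    'standardized': StandardScaler(),
}

results = {}
for name, scaler in views.items():
    model = KNeighborsClassifier()
    pipe = model if scaler is None else make_pipeline(scaler, model)
    scores = cross_val_score(pipe, X, y, cv=5)
    results[name] = float(np.mean(scores))
    print(name, round(results[name], 3))
```

Using a pipeline keeps the scaler fit inside each cross-validation fold, avoiding leakage from the test folds.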

You can always go further.

There is usually more data you can collect, more views you can create on your data.

Brainstorm.

One easy win once you feel like you are at the end of the road is to begin investigating ensembles of models created from different views of your modeling problem.

It’s simple and highly effective, especially if the views expose different structures of the underlying mapping problem (e.g. the models have uncorrelated errors).
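A toy sketch of this idea: one model per view, predictions averaged. The views here are contrived column splits; in practice they would come from your feature selection and engineering work:

```python
# Sketch: a simple averaging ensemble over models fit on different
# feature views; most useful when their errors are weakly correlated.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

views = [list(range(0, 4)), list(range(4, 8)), [0, 2, 4, 6]]

# Fit one model per view, then average their predicted probabilities.
probs = []
for cols in views:
    m = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    probs.append(m.predict_proba(X_test[:, cols])[:, 1])

yhat = (np.mean(probs, axis=0) > 0.5).astype(int)
print('ensemble accuracy:', float(np.mean(yhat == y_test)))
```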

This section provides more resources on the topic if you are looking to go deeper.

- Why Applied Machine Learning Is Hard
- A Gentle Introduction to Applied Machine Learning as a Search Problem
- How to Define Your Machine Learning Problem
- Machine Learning Performance Improvement Cheat Sheet
- How Much Training Data is Required for Machine Learning?
- An Introduction to Feature Selection
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It

In this post, you discovered techniques that you can use to get the most out of your data on your predictive modeling problem.

Specifically, you learned:

- The importance of exploring alternate framings of your predictive modeling problem.
- The need to develop a suite of “views” on your input data and to systematically test each.
- The notion that feature selection, engineering, and preparation are ways of creating more views on your problem.

Are there more ideas that you have for getting the most out of your data?

What do you normally do on a project?

Let me know in the comments below.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Get the Most From Your Machine Learning Data appeared first on Machine Learning Mastery.

]]>The post So, You are Working on a Machine Learning Problem… appeared first on Machine Learning Mastery.

]]>I want to really nail down where you’re at right now.

Let me make some guesses…

So you have a problem that you need to solve.

Maybe it’s your problem, an idea you have, a question, or something you want to address.

Or maybe it is a problem that was provided to you by someone else, such as a supervisor or boss.

This problem involves some historical data you have or can access. It also involves some predictions required from new or related data in the future.

Let’s dig deeper.

Let’s look at your problem in more detail.

You have historical data.

You have observations about something, like customers, voltages, prices, etc. collected over time.

You also have some outcome related to each observation, maybe a label like “*good*” or “*bad*” or maybe a quantity like *50.1*.

The problem you want to solve is, given new observations in the future, what is the most likely related outcome?

So far so good?

You need a program. A piece of software.

You need a thing that will take observational data as input and give you the most likely outcome as output.

The outcomes provided by the program need to be right, or really close to right. The program needs to be skillful at providing good outcomes for observations.

With such a piece of software, you could run it multiple times for each observation you have.

You could integrate it into some other software, like an app or webpage, and make use of it.

Am I right?

You want to solve this problem with machine learning or artificial intelligence, or something.

Someone told you to use machine learning or you just think it is the right tool for this job.

But, it’s confusing.

- How do you use machine learning on problems like this?
- Where do you start?
- What math do you need to know before solving this problem?

Does this describe you?

Or maybe you’ve started working on your problem, but you’re stuck.

- What data transforms should you use?
- What algorithm should you use?
- What algorithm configurations should you use?

Is this a better fit for where you’re at?

I am thinking about writing a step-by-step playbook that will walk you through the process of defining your problem, preparing your data, selecting algorithms, and ultimately developing a final model that you can use to make predictions for your problem.

But to make this playbook as useful as possible, I need to know where you are having trouble in this process.

Please, describe where you’re stuck in the comments below.

Share your story. Or even just a small piece.

I promise to read every single one, and even offer advice where possible.

If you are struggling, I strongly recommend following this process when working through a predictive modeling problem:

The post So, You are Working on a Machine Learning Problem… appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Applied Machine Learning as a Search Problem appeared first on Machine Learning Mastery.

]]>There is no best training data or best algorithm for your problem, only the best that you can discover.

The application of machine learning is best thought of as a search problem for the best mapping of inputs to outputs, given the knowledge and resources available to you on a given project.

In this post, you will discover the conceptualization of applied machine learning as a search problem.

After reading this post, you will know:

- That applied machine learning is the problem of approximating an unknown underlying mapping function from inputs to outputs.
- That design decisions such as the choice of data and choice of algorithm narrow the scope of possible mapping functions that you may ultimately choose.
- That the conceptualization of machine learning as a search helps to rationalize the use of ensembles, the spot checking of algorithms and the understanding of what is happening when algorithms learn.

Let’s get started.

This post is divided into 5 parts; they are:

- Problem of Function Approximation
- Function Approximation as Search
- Choice of Data
- Choice of Algorithm
- Implications of Machine Learning as Search

Applied machine learning is the development of a learning system to address a specific learning problem.

The learning problem is characterized by observations comprised of input data and output data and some unknown but coherent relationship between the two.

The goal of the learning system is to learn a generalized mapping between input and output data such that skillful predictions can be made for new instances drawn from the domain where the output variable is unknown.

In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (*f*) given input data (*X*) and associated output data (*y*).

y = f(X)

We have a sample of *X* and *y* and do our best to come up with a function that approximates *f*, e.g. *fprime*, such that we can make predictions (*yhat*) given new examples (*Xhat*) in the future.

yhat = fprime(Xhat)

As such, applied machine learning can be thought of as the problem of function approximation.
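Function approximation in miniature might look like the sketch below. Note the cheat: here the true *f* is known so the approximation can be checked; in a real problem it never is.

```python
# Sketch: learn fprime from a noisy sample of (X, y), then predict
# yhat for new examples Xhat. The true f is known here only so we
# can see how good the approximation is.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 3.0 * x + 2.0             # the unknown underlying mapping
X = rng.uniform(0, 10, 50)
y = f(X) + rng.normal(0, 1.0, 50)       # our sample, with noise

# Approximate f from the sample (here, a least-squares line).
slope, intercept = np.polyfit(X, y, 1)
fprime = lambda x: slope * x + intercept

Xhat = np.array([2.0, 5.0])             # new examples in the future
yhat = fprime(Xhat)
print(yhat)
```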

The learned mapping will be imperfect.

The problem of designing and developing a learning system is the problem of learning a useful approximation of the unknown underlying function that maps the input variables to the output variables.

We do not know the form of the function, because if we did, we would not need a learning system; we could specify the solution directly.

Because we do not know the true underlying function, we must approximate it, meaning we do not know, and may never know, how close an approximation the learning system is to the true mapping.

We must search for an approximation of the true underlying function that is good enough for our purposes.

There are many sources of noise that introduce error into the learning process that can make the process more challenging and in turn result in a less useful mapping. For example:

- The choice of the framing of the learning problem.
- The choice of the observations used to train the system.
- The choice of how the training data is prepared.
- The choice of the representational form for the predictive model.
- The choice of the learning algorithm to fit the model on the training data.
- The choice of the performance measure by which to evaluate predictive skill.

And so much more.

You can see that there are many decision points in the development of a learning system, and none of the answers are known beforehand.

You can think of all possible learning systems for a learning problem as a huge search space, where each decision point narrows the search.

For example, if the learning problem was to predict the species of flowers, one of millions of possible learning systems could be narrowed down as follows:

- Choose to frame the problem as predicting a species class label, e.g. classification.
- Choose measurements of the flowers of a given species and their associated sub-species.
- Choose flowers in one specific nursery to measure in order to collect training data.
- Choose a decision tree model representation so that predictions can be explained to stakeholders.
- Choose the CART algorithm to fit the decision tree model.
- Choose classification accuracy to evaluate the skill of models.

And so on.

You can also see that there may be a natural hierarchy for many of the decisions involved in developing a learning system, each of which further narrows the space of possible learning systems that we could build.

This narrowing introduces a useful bias that intentionally selects one subset of possible learning systems over another with the goal of getting closer to a useful mapping that we can use in practice. This biasing applies both at the top level in the framing of the problem and at low levels, such as the choice of machine learning algorithm or algorithm configuration.

The chosen framing of the learning problem and the data used to train the system are a big point of leverage in the development of your learning system.

You do not have access to all data: that is all pairs of inputs and outputs. If you did, you would not need a predictive model in order to make output predictions for new input observations.

You do have some historical input-output pairs. If you didn’t, you would not have any data with which to train a predictive model.

But maybe you have a lot of data and you need to select only some of it for training. Or maybe you have the freedom to generate data at will and are challenged by what and how much data to generate or collect.

The data that you choose to model your learning system on must sufficiently capture the relationship between the input and output data for both the data that you have available and data that the model will be expected to make predictions on in the future.

You must choose the representation of the model and the algorithm used to fit the model on the training data. This, again, is another big point of leverage on the development of your learning system.

Often this decision is simplified to the selection of an algorithm, although it is common for project stakeholders to impose constraints, such as requiring that the model be able to explain its predictions, which in turn constrains the form of the final model representation and the scope of mappings that you can search.

This conceptualization of developing learning systems as a search problem helps to make clear many related concerns in applied machine learning.

This section looks at a few.

The algorithm used to learn the mapping will impose further constraints, and it, along with the chosen algorithm configuration, will control how the space of possible candidate mappings is navigated as the model is fit (e.g. for machine learning algorithms that learn iteratively).

Here, we can see that the act of learning from training data by a machine learning algorithm is in effect navigating the space of possible mappings for the learning system, hopefully moving from poor mappings to better mappings (e.g. hill climbing).

This provides a conceptual rationale for the role of optimization algorithms in the heart of machine learning algorithms to get the most out of the model representation for the specific training data.

We can also see that different model representations will occupy quite different locations in the space of all possible function mappings, and in turn have quite different behavior when making predictions (e.g. uncorrelated prediction errors).

This provides a conceptual rationale for the role of ensemble methods that combine the predictions from different but skillful predictive models.

Different algorithms with different representations may start in different positions in the space of possible function mappings, and will navigate the space differently.

If the constrained space that these algorithms are navigating is well specified by an appropriate framing and good data, then most algorithms will likely discover good and similar mapping functions.

We can also see how a good framing and careful selection of training data can open up a space of candidate mappings that may be found by a suite of modern powerful machine learning algorithms.

This provides rationale for spot checking a suite of algorithms on a given machine learning problem and doubling down on the one that shows the most promise, or selecting the most parsimonious solution (e.g. Occam’s razor).

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 2, Machine Learning, 1997.
- Generalization as Search, 1982.
- Chapter 1, Data Mining: Practical Machine Learning Tools and Techniques, 2016.
- On algorithm selection, with an application to combinatorial search problems, 2012.
- Algorithm Selection on Wikipedia

In this post, you discovered the conceptualization of applied machine learning as a search problem.

Specifically, you learned:

- That applied machine learning is the problem of approximating an unknown underlying mapping function from inputs to outputs.
- That design decisions such as the choice of data and choice of algorithm narrow the scope of possible mapping functions that you may ultimately choose.
- That the conceptualization of machine learning as search helps to rationalize the use of ensembles, the spot checking of algorithms and the understanding of what is happening when algorithms learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Applied Machine Learning as a Search Problem appeared first on Machine Learning Mastery.

]]>The post Why Applied Machine Learning Is Hard appeared first on Machine Learning Mastery.

]]>Applied machine learning is challenging.

You must make many decisions where there is no known “*right answer*” for your specific problem, such as:

- What framing of the problem to use?
- What input and output data to use?
- What learning algorithm to use?
- What algorithm configuration to use?

This is challenging for beginners who expect to be able to calculate, or be told, what data to use or how to best configure an algorithm.

In this post, you will discover the intractable nature of designing learning systems and how to deal with it.

After reading this post, you will know:

- How to develop a clear definition of your learning problem for yourself and others.
- The 4 decision points you must consider when designing a learning system for your problem.
- The 3 strategies that you can use to specifically address the intractable problem of designing learning systems in practice.

Let’s get started.

This post is divided into 6 sections inspired by chapter 1 of Tom Mitchell’s excellent 1997 book Machine Learning; they are:

- Well-Posed Learning Problems
- Choose the Training Data
- Choose the Target Function
- Choose a Representation of the Target Function
- Choose a Learning Algorithm
- How to Design Learning Systems

We can define a general learning task in the field of applied machine learning as a program that learns from experience on some task against a specific performance measure.

Tom Mitchell in his 1997 book Machine Learning states this clearly as:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

— Page 2, Machine Learning, 1997.

We take this as a general definition for the types of learning tasks that we may be interested in for applied machine learning such as predictive modeling. Tom lists a few examples to make this clear, such as:

- Learning to recognize spoken words.
- Learning to drive an autonomous vehicle.
- Learning to classify new astronomical structures.
- Learning to play world-class backgammon.

We can use the above definition to define our own predictive modeling problem. Once defined, the task becomes that of designing a learning system to address it.

Designing a learning system, i.e. an application of machine learning, involves four design choices:

- Choosing the training data.
- Choosing the target function.
- Choosing the representation.
- Choosing the learning algorithm.

There might be a best set of choices that you can make for a given problem given infinite resources, but we don’t have infinite time, compute resources, and knowledge about the domain or learning systems.

Therefore, although we can prepare a well-posed description of a learning problem, designing the best possible learning system is intractable.

The best we can do is use knowledge, skill, and available resources to work through the design choices.

Let’s look at each of these design choices in more detail.

You must choose the data your learning system will use as experience from which to learn.

This is the data from past observations.

The type of training experience available can have a significant impact on success or failure of the learner.

— Page 5, Machine Learning, 1997.

It is rarely well formatted and ready to use; often you must collect the data you need (or think you might need) for the learning problem.

This may mean:

- Scraping documents.
- Querying databases.
- Processing files.
- Collating disparate sources.
- Consolidating entities.

You need to get all of the data together and into a normalized form such that one observation represents one entity for which an outcome is available.

Next, you must choose the framing of the learning problem.

Machine learning is really a problem of learning a mapping function (f) from inputs (X) to outputs (y).

y = f(X)

This function can then be used on new data in the future in order to predict the most likely output.

The goal of the learning system is to prepare a function that best maps inputs to outputs given the resources available. The underlying function that actually exists is unknown. If we knew the form of this function, we could use it directly and we would not need machine learning to learn it.

More generally, this is a problem called function approximation. The result will be an approximation, meaning that it will have error. We will do our best to minimize this error, but some error will always exist given noise in the data.

… we have reduced the learning task in this case to the problem of discovering an operational description of the ideal target function V. It may be very difficult in general to learn such an operational form of V perfectly. In fact, we often expect learning algorithms to acquire only some approximation to the target function, and for this reason the process of learning the target function is often called function approximation.

— Page 8, Machine Learning, 1997.

This step is about selecting exactly what data to use as input to the function, e.g. the input features or input variables and what exactly will be predicted, e.g. the output variable.

Often, I refer to this as the framing of your learning problem. Choosing the inputs and outputs essentially chooses the nature of the target function we will seek to approximate.

Next, you must choose the representation you wish to use for the mapping function.

Think of this as the type of final model you wish to have that you can then use to make predictions. You must choose the form of this model, the data structure if you’d like.

Now that we have specified the ideal target function V, we must choose a representation that the learning program will use to describe the function Vprime that it will learn.

— Page 8, Machine Learning, 1997.

For example:

- Perhaps your project requires a decision tree that is easy to understand and explain to stakeholders.
- Perhaps your stakeholders prefer a linear model that the stats guys can easily interpret.
- Perhaps your stakeholders don’t care about anything other than model performance so all model representations are up for grabs.

The choice of representation will impose constraints on the types of learning algorithms that you can use to learn the mapping function.

Finally, you must choose the learning algorithm that will take the input and output data and learn a model of your preferred representation.

If there were few constraints on the choice of representation, as is often the case, then you may be able to evaluate a suite of different algorithms and representations.

If there were strong constraints on the choice of function representation, e.g. a weighted sum linear model or a decision tree, then the choice of algorithms will be limited to those that can operate on the specific representations.

The choice of algorithm may impose its own constraints, such as specific data preparation transforms like data normalization.

Developing a learning system is challenging.

No one can tell you the best answer to each decision along the way; the best answer is unknown for your specific learning problem.

Mitchell helps to clarify this with a depiction of the choices made in designing a learning system for playing checkers.

The choices act as points of constraint on the design process. Mitchell goes on to say:

These design choices have constrained the learning task in a number of ways. We have restricted the type of knowledge that can be acquired to a single linear evaluation function. Furthermore, we have constrained this evaluation function to depend on only the six specific board features provided. If the true target function V can indeed be represented by a linear combination of these particular features, then our program has a good chance to learn it. If not, then the best we can hope for is that it will learn a good approximation, since a program can certainly never learn anything that it cannot at least represent.

— Pages 13-14, Machine Learning, 1997.

I like this passage as it really drives home both the importance of these constraints to simplify the problem, and the risk of making choices that limit or prevent the system from learning the problem sufficiently.

Generally, you cannot analytically calculate the answer to these choices, e.g. what data to use, what algorithm to use, or what algorithm configuration to use.

Nevertheless, all is not lost; here are 3 tactics that you can use in practice:

- **Copy**. Look to the literature or experts for learning systems on problems the same or similar to your problem and copy the design of the learning system. It is very likely you are not the first to work on a problem of a given type. At the very worst, the copied design provides a starting point for your own design.
- **Search**. List available options at each decision point and empirically evaluate each to see what works best on your specific data. This may be the most robust and most practiced approach in applied machine learning.
- **Design**. After completing many projects via the Copy and Search methods above, you will develop an intuition for how to design machine learning systems.

Developing learning systems is not a science; it is engineering.

Developing new machine learning algorithms and describing how and why they work is a science, and this is often not required when developing a learning system.

Developing a learning system is a lot like developing software. You must combine (1) copies of past designs that work, (2) prototypes that show promising results, and (3) design experience when developing a new system in order to get the best results.

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 1, Machine Learning, 1997.
- Tom Mitchell’s homepage
- Well Posed Problem on Wikipedia
- Intractability on Wikipedia
- How to Define Your Machine Learning Problem
- Applied Machine Learning Process

In this post, you discovered the intractable nature of designing learning systems in applied machine learning and how to deal with it.

Specifically, you learned:

- How to develop a clear definition of your learning problem for yourself and others.
- The 4 decision points you must consider when designing a learning system for your problem.
- The 3 strategies that you can use to specifically address the intractable problem of designing learning systems in practice.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Why Applied Machine Learning Is Hard appeared first on Machine Learning Mastery.

]]>The post How to Plan and Run Machine Learning Experiments Systematically appeared first on Machine Learning Mastery.

]]>This gives you a lot of time to think and plan for additional experiments to perform.

In addition, the average applied machine learning project may require tens to hundreds of discrete experiments in order to find a data preparation model and model configuration that gives good or great performance.

The drawn-out nature of the experiments means that you need to carefully plan and manage the order and type of experiments that you run.

You need to be systematic.

In this post, you will discover a simple approach to plan and manage your machine learning experiments.

With this approach, you will be able to:

- Stay on top of the most important questions and findings in your project.
- Keep track of what experiments you have completed and would like to run.
- Zoom in on the data preparations, models, and model configurations that give the best performance.

Let’s dive in.

I like to run experiments overnight. Lots of experiments.

This is so that when I wake up, I can check results, update my ideas of what is working (and what is not), and kick off the next round of experiments, then spend some time analyzing the findings.

I hate wasting time.

And I hate running experiments that do not get me closer to the goal of finding the most skillful model, given the time and resources I have available.

It is easy to lose track of where you’re up to. Especially after you have results, analysis, and findings from hundreds of experiments.

Poor management of your experiments can lead to bad situations where:

- You’re watching experiments run.
- You’re trying to come up with good ideas of experiments to run right after a current batch has finished.
- You run an experiment that you had already run before.

You never want to be in any of these situations!

If you are on top of your game, then:

- You know exactly what experiments you have run at a glance and what the findings were.
- You have a long list of experiments to run, ordered by their expected payoff.
- You have the time to dive into the analysis of results and think up new and wild ideas to try.

But how can we stay on top of hundreds of experiments?

One way that I have found to help me be systematic with experiments on a project is to use a spreadsheet.

Manage the experiments you have done, that are running, and that you want to run in a spreadsheet.

It is simple and effective.

It is simple in that I or anyone can access it from anywhere and see where we’re at.

I use Google Docs to host the spreadsheet.

There’s no code. No notebook. No fancy web app.

Just a spreadsheet.

It’s effective because it only contains the information needed with one line per experiment and one column for each piece of information to track on the experiment.

Experiments that are done can be separated from those that are planned.

Only experiments that are planned are set up and run, and their order ensures that the most important experiments are run first.

You will be surprised at how much such a simple approach can free up your time and get you thinking deeply about your project.

Let’s look at an example.

We can imagine a spreadsheet with the columns below.

These are just an example from the last project I worked on. I recommend adapting these to your own needs.

- **Sub-Project**: A subproject may be a group of ideas you are exploring, a technique, a data preparation, and so on.
- **Context**: The context may be the specific objective such as beating a baseline, tuning, a diagnostic, and so on.
- **Setup**: The setup is the fixed configuration of the experiment.
- **Name**: The name is the unique identifier, perhaps the filename of the script.
- **Parameter**: The parameter is the thing being varied or looked at in the experiment.
- **Values**: The value is the value or values of the parameter that are being explored in the experiment.
- **Status**: The status is the status of the experiment, such as planned, running, or done.
- **Skill**: The skill is the North Star metric that really matters on the project, like accuracy or error.
- **Question**: The question is the motivating question the experiment seeks to address.
- **Finding**: The finding is the one-line summary of the outcome of the experiment, the answer to the question.

To make this concrete, below is a screenshot of a Google Doc spreadsheet with these column headings and a contrived example.
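If you prefer a plain file to a hosted spreadsheet, the same log can be kept as a CSV. The column names follow the scheme above; the row content is contrived:

```python
# Sketch: an experiment log as a CSV file. The experiment row is a
# made-up example, not a real result.
import csv
import io

fieldnames = ['Sub-Project', 'Context', 'Setup', 'Name', 'Parameter',
              'Values', 'Status', 'Skill', 'Question', 'Finding']

rows = [
    {'Sub-Project': 'baseline', 'Context': 'beat naive model',
     'Setup': '5-fold CV', 'Name': 'exp001.py', 'Parameter': 'model',
     'Values': 'persistence', 'Status': 'done', 'Skill': '0.61',
     'Question': 'What skill does a naive model achieve?',
     'Finding': 'Baseline accuracy is 0.61; all models must beat this.'},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```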

I cannot say how much time this approach has saved me. And the number of assumptions that it proved wrong in the pursuit of getting top results.

In fact, I’ve discovered that deep learning methods are often quite hostile to assumptions and defaults. Keep this in mind when designing experiments!

Below are some tips that will help you get the most out of this simple approach on your project.

- **Brainstorm**: Make the time to frequently review findings and list new questions and experiments to answer them.
- **Challenge**: Challenge assumptions and challenge previous findings. Play the scientist and design experiments that would falsify your findings or expectations.
- **Sub-Projects**: Consider the use of sub-projects to structure your investigation where you follow leads or investigate specific methods.
- **Experimental Order**: Use the row order as a priority to ensure that the most important experiments are run first.
- **Deeper Analysis**: Save deeper analysis of results and aggregated findings to another document; the spreadsheet is not the place.
- **Experiment Types**: Don’t be afraid to mix in different experiment types such as grid searching, spot checks, and model diagnostics.

You will know that this approach is working well when:

- You are scouring API documentation and papers for more ideas of things to try.
- You have far more experiments queued up than resources to run them.
- You are thinking seriously about hiring a ton more EC2 instances.

In this post, you discovered how you can effectively manage hundreds of experiments that have run, are running, and that you want to run in a spreadsheet.

You discovered that a simple spreadsheet can help you:

- Keep track of what experiments you have run and what you discovered.
- Keep track of what experiments you want to run and what questions they will answer.
- Zoom in on the most effective data preparation, model, and model configuration for your predictive modeling problem.

Do you have any questions about this approach? Have you done something similar yourself?

Let me know in the comments below.

The post How to Plan and Run Machine Learning Experiments Systematically appeared first on Machine Learning Mastery.

The post Why One-Hot Encode Data in Machine Learning? appeared first on Machine Learning Mastery.

Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.

One good example is to use a one-hot encoding on categorical data.

- Why is a one-hot encoding required?
- Why can’t you fit a model on your data directly?

In this post, you will discover the answer to these important questions and better understand data preparation in general in applied machine learning.

Let’s get started.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

- A “*pet*” variable with the values: “*dog*” and “*cat*”.
- A “*color*” variable with the values: “*red*”, “*green*”, and “*blue*”.
- A “*place*” variable with the values: “*first*”, “*second*”, and “*third*”.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “*place*” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

This involves two steps:

- Integer Encoding
- One-Hot Encoding

As a first step, each unique category value is assigned an integer value.

For example, “*red*” is 1, “*green*” is 2, and “*blue*” is 3.

This is called a label encoding or an integer encoding and is easily reversible.

For some variables, this may be enough.

The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

For example, ordinal variables like the “*place*” variable above are a good case where a label encoding alone is sufficient.

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “*color*” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

For example:

red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1

The binary variables are often called “dummy variables” in other fields, such as statistics.
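Both steps can be sketched in a few lines of plain Python; the column order is fixed to match the color example above:

```python
# Step 1: integer encoding, with a fixed category order.
categories = ["red", "green", "blue"]
to_int = {c: i for i, c in enumerate(categories)}
encoded = [to_int[c] for c in categories]  # [0, 1, 2]

# Step 2: one-hot encoding, one binary variable per category,
# with a "1" in the position of the encoded integer.
onehot = [[1 if i == e else 0 for i in range(len(categories))]
          for e in encoded]

for row in onehot:
    print(row)  # [1, 0, 0] then [0, 1, 0] then [0, 0, 1]
```

In practice you would use a library implementation rather than hand-rolling this, as the tutorials linked below show.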

If you are looking for tutorials on how to one-hot encode your data in Python, see:

- Data Preparation for Gradient Boosting with XGBoost in Python
- How to One Hot Encode Sequence Data in Python

- Categorical variable on Wikipedia
- Nominal category on Wikipedia
- Dummy variable on Wikipedia

In this post, you discovered why categorical data often must be encoded when working with machine learning algorithms.

Specifically:

- That categorical data is defined as variables with a finite set of label values.
- That most machine learning algorithms require numerical input and output variables.
- That an integer encoding and a one-hot encoding can be used to convert categorical data to a numerical form.

Do you have any questions?

Post your questions to comments below and I will do my best to answer.

The post What is the Difference Between a Parameter and a Hyperparameter? appeared first on Machine Learning Mastery.

There are so many terms to use and many of the terms may not be used consistently. This is especially true if you have come from another field of study that may use some of the same terms as machine learning, but they are used differently.

For example: the terms “*model parameter*” and “*model hyperparameter*.”

Not having a clear definition for these terms is a common struggle for beginners, especially those that have come from the fields of statistics or economics.

In this post, we will take a closer look at these terms.

A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.

- They are required by the model when making predictions.
- Their values define the skill of the model on your problem.
- They are estimated or learned from data.
- They are often not set manually by the practitioner.
- They are often saved as part of the learned model.

Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.

In classical machine learning literature, we may think of the model as the hypothesis and the parameters as the tailoring of the hypothesis to a specific set of data.

Often model parameters are estimated using an optimization algorithm, which is a type of efficient search through possible parameter values.

- **Statistics**: In statistics, you may assume a distribution for a variable, such as a Gaussian distribution. Two parameters of the Gaussian distribution are the mean (*mu*) and the standard deviation (*sigma*). This holds in machine learning, where these parameters may be estimated from data and used as part of a predictive model.
- **Programming**: In programming, you may pass a parameter to a function. In this case, a parameter is a function argument that could have one of a range of values. In machine learning, the specific model you are using is the function and requires parameters in order to make a prediction on new data.
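For the statistics analogy, estimating the two Gaussian parameters from data is a one-liner each with Python's standard library; the sample below is contrived:

```python
import statistics

# Contrived sample of observations assumed to be Gaussian.
sample = [4.8, 5.1, 5.0, 4.9, 5.2]

mu = statistics.mean(sample)      # estimated mean
sigma = statistics.stdev(sample)  # estimated standard deviation

print(mu, sigma)
```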

Whether a model has a fixed or variable number of parameters determines whether it may be referred to as “*parametric*” or “*nonparametric*“.

Some examples of model parameters include:

- The weights in an artificial neural network.
- The support vectors in a support vector machine.
- The coefficients in a linear regression or logistic regression.
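To illustrate the last of these, here is a minimal sketch in plain Python: the two parameters of a simple linear regression, the intercept b0 and slope b1, are estimated directly from data by the least-squares formulas rather than set by the practitioner. The data is contrived:

```python
# Contrived data generated by the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Least-squares estimates:
# b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
b1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
      / sum((x - x_mean) ** 2 for x in xs))
b0 = y_mean - b1 * x_mean

print(b0, b1)  # the learned model parameters: 1.0 and 2.0
```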

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.

- They are often used in processes to help estimate model parameters.
- They are often specified by the practitioner.
- They can often be set using heuristics.
- They are often tuned for a given predictive modeling problem.

We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other problems, or search for the best value by trial and error.

When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.

Many models have important parameters which cannot be directly estimated from the data. For example, in the K-nearest neighbor classification model … This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.

— Page 64-65, Applied Predictive Modeling, 2013
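Continuing the k-nearest neighbor example from the quote, below is a minimal sketch of tuning the hyperparameter k by grid search. The 1-D dataset, the helper functions, and the candidate values are all contrived for illustration:

```python
def knn_predict(train, query, k):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def loo_accuracy(data, k):
    """Leave-one-out accuracy for a candidate value of the hyperparameter k."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

# Contrived 1-D dataset: two well-separated classes.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"),
        (0.9, "b"), (1.0, "b"), (1.1, "b")]

# Grid search: evaluate each candidate k and keep the most skillful.
candidates = [1, 3, 5]
best_k = max(candidates, key=lambda k: loo_accuracy(data, k))
print(best_k)
```

Note the division of labor: k is set from the outside and searched over, while the "parameters" of k-NN are simply the stored training observations.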

Model hyperparameters are often referred to as model parameters which can make things confusing. A good rule of thumb to overcome this confusion is as follows:

**If you have to specify a model parameter manually, then it is probably a model hyperparameter.**

Some examples of model hyperparameters include:

- The learning rate for training a neural network.
- The C and sigma hyperparameters for support vector machines.
- The k in k-nearest neighbors.

- Hyperparameter on Wikipedia
- What are hyperparameters in machine learning? on Quora
- What is the difference between model hyperparameters and model parameters? on StackExchange
- What is considered a hyperparameter? on Reddit

In this post, you discovered the clear definitions and the difference between model parameters and model hyperparameters.

In summary, model parameters are estimated from data automatically and model hyperparameters are set manually and are used in processes to help estimate model parameters.

Model hyperparameters are often referred to as parameters because they are the parts of the machine learning algorithm that must be set manually and tuned.

Did this post help you clear up the confusion?

Let me know in the comments below.

Are there model parameters or hyperparameters that you are still unsure about?

Post them in the comments and I’ll do my best to help clear things up further.
