Overfitting and Underfitting With Machine Learning Algorithms

The cause of poor performance in machine learning is either overfitting or underfitting the data.

In this post, you will discover the concept of generalization in machine learning and the problems of overfitting and underfitting that go along with it.

Let’s get started.

Photo by Ian Carroll, some rights reserved.

Approximate a Target Function in Machine Learning

Supervised machine learning is best understood as approximating a target function (f) that maps input variables (X) to an output variable (Y).

Y = f(X)

This characterization describes the range of classification and prediction problems and the machine learning algorithms that can be used to address them.

An important consideration in learning the target function from the training data is how well the model generalizes to new data. Generalization is important because the data we collect is only a sample: it is incomplete and noisy.
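As a concrete illustration, here is a minimal sketch that learns an approximation of a known target function from noisy samples and checks how well it holds up on fresh inputs. The use of scikit-learn, the sine function, and the noise level are my own choices for illustration, not anything prescribed above.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Pretend f is unknown: we only ever observe noisy samples of it.
rng = np.random.RandomState(1)
X_train = rng.uniform(0, 1, size=(100, 1))
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.1, 100)

# Learn an approximation of f from the sample.
model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Generalization: how close is the approximation on new, unseen inputs?
X_new = rng.uniform(0, 1, size=(100, 1))
y_new = np.sin(2 * np.pi * X_new).ravel()
print("mean absolute error on new data:", np.abs(model.predict(X_new) - y_new).mean())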

Generalization in Machine Learning

In machine learning, we describe the learning of the target function from training data as inductive learning.

Induction refers to learning general concepts from specific examples, which is exactly the problem that supervised machine learning aims to solve. This is different from deduction, which works the other way around and derives specific conclusions from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific examples not seen by the model when it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain. This allows us to make predictions in the future on data the model has never seen.

Two terms are used in machine learning to describe how well a model learns and generalizes to new data: overfitting and underfitting.

Overfitting and underfitting are the two biggest causes of poor performance in machine learning algorithms.

Statistical Fit

In statistics, a fit refers to how well you approximate a target function.

This is good terminology to use in machine learning, because supervised machine learning algorithms seek to approximate the unknown underlying mapping function for the output variables given the input variables.

Statistics often describes goodness of fit, which refers to measures used to estimate how well the approximation of the function matches the target function.

Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of these techniques assume we know the form of the target function we are approximating, which is not the case in machine learning.

If we knew the form of the target function, we would use it directly to make predictions, rather than trying to learn an approximation from samples of noisy training data.
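For example, residual errors are simply the differences between observed values and the model's predictions. A minimal sketch, where the data and the choice of scikit-learn are my own assumptions:

import numpy as np
from sklearn.linear_model import LinearRegression

# Noisy samples of a roughly linear relationship.
X = np.arange(10).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.RandomState(0).normal(0, 1, 10)

model = LinearRegression().fit(X, y)

# Residuals: observed minus predicted. Large or strongly patterned
# residuals suggest the approximation matches the target function poorly.
residuals = y - model.predict(X)
print(residuals)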

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, the decision tree is a nonparametric machine learning algorithm that is very flexible and is subject to overfitting the training data. This problem can be addressed by pruning a tree after it has learned, in order to remove some of the detail it has picked up.
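As a sketch of this idea, scikit-learn's decision trees support cost-complexity pruning via the ccp_alpha parameter (available in recent versions); the dataset and the alpha value below are arbitrary choices of mine for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree is free to memorize noise in the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning trades training-set detail for a simpler tree
# that tends to generalize better.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("full:   train %.3f, test %.3f" % (full.score(X_train, y_train), full.score(X_test, y_test)))
print("pruned: train %.3f, test %.3f" % (pruned.score(X_train, y_train), pruned.score(X_test, y_test)))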

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model, and this will be obvious: it will have poor performance on the training data.

Underfitting is often not discussed because it is easy to detect given a good performance metric. The remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a good contrast to the problem of overfitting.
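For instance, a model that is too simple for the data scores poorly even on the very data it was trained on. A minimal sketch, where the quadratic data and the use of scikit-learn are my own choices:

import numpy as np
from sklearn.linear_model import LinearRegression

# A straight line cannot capture a quadratic relationship.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.2, 200)

model = LinearRegression().fit(X, y)

# Underfitting shows up directly as poor skill on the training data itself.
print("R^2 on the training data:", model.score(X, y))  # close to zero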

A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine learning algorithm over time as it learns from training data. We can plot both the skill on the training data and the skill on a test dataset we have held back from the training process.

Over time, as the algorithm learns, the error for the model on the training data goes down, and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time, the error for the test set starts to rise again as the model's ability to generalize decreases.

The sweet spot is the point just before the error on the test dataset starts to increase, where the model has good skill on both the training dataset and the unseen test dataset.
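A minimal sketch of this experiment is below. It uses growing tree depth as a stand-in for training time, and the dataset and the use of scikit-learn are my own choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# As capacity grows, training error keeps falling, while test error
# eventually turns back upward: the sweet spot is just before that turn.
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("depth=%2d  train error=%.3f  test error=%.3f" % (
        depth, 1 - model.score(X_train, y_train), 1 - model.score(X_test, y_test)))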

You can perform this experiment with your favorite machine learning algorithms. It is often not a useful technique in practice, however, because by choosing the stopping point for training based on the skill on the test dataset, the test set is no longer “unseen” or a standalone objective measure. Some knowledge (a lot of useful knowledge) about that data has leaked into the training procedure.

There are two additional techniques you can use to help find the sweet spot in practice: resampling methods and a validation dataset.

How To Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:

  1. Use a resampling technique to estimate model accuracy.
  2. Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of the model on unseen data.
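A minimal k-fold cross validation sketch; the dataset, the model, and the choice of scikit-learn are my own assumptions:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train and test the model k=5 times, each time holding out a different
# fold, to build up an estimate of performance on unseen data.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))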

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
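One way to hold back such a dataset is sketched below; the 20% split and the use of scikit-learn are arbitrary choices of mine:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Set aside a validation set that the algorithms never see while you
# select and tune models; evaluate on it once, at the very end.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)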

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.

Summary

In this post, you discovered that machine learning solves problems by the method of induction.

You learned that generalization describes how well the concepts learned by a model apply to new data. Finally, you learned about the two terms used in machine learning to describe generalization, overfitting and underfitting:

  • Overfitting: Good performance on the training data, poor generalization to other data.
  • Underfitting: Poor performance on the training data and poor generalization to other data.

Do you have any questions about overfitting, underfitting or this post? Leave a comment and ask your question and I will do my best to answer it.

19 Responses to Overfitting and Underfitting With Machine Learning Algorithms

  1. Kieron March 21, 2016 at 5:08 am #

    Loving the website Jason! We are currently looking at building a platform to connect data scientists and the like to companies http://www.kortx.co.uk

  2. Gary August 25, 2016 at 8:03 pm #

    Hi Jason,
    Are resampling and hold-out (the two options for limiting overfitting) mutually exclusive, or do they often get used together? Would using both render one of the techniques redundant?
    Thanks

    • Jason Brownlee August 26, 2016 at 10:32 am #

      Hi Gary,

      Typically you want to pick one method to estimate the performance of your algorithm. Generally, k-fold cross validation is the recommended method. Using two such approaches in conjunction does not really make sense, at least to me off the cuff.

  3. Bruno August 29, 2016 at 6:27 pm #

    Thank you Jason for this article,

    I applied your recipe quite successfully! Nevertheless, I noticed that cross validation does not prevent overfitting.
    Depending on the data and algorithm, it can be very easy to get a low error rate using cross validation while still overfitting.
    Did you see that already?

    This is the reason why I’m using two steps:
    1. compare the mean of the train set score with the test set score to check that I do not overfit too much, and adjust algorithm parameters
    2. compute cross validation to get the general performance using the previous parameters.

    Am I wrong in doing this procedure?
    Many thanks!

    • Jason Brownlee August 30, 2016 at 8:27 am #

      I agree Bruno, CV is a technique to reduce overfitting, but it must be employed carefully (e.g. the number of folds).

      The human is biased, so we must also limit the number of human-in-the-loop iterations, because we will encourage the method to overfit, even with CV. Therefore it is also a good idea to hold back a validation dataset that is only evaluated once, for final model selection.

      Your procedure looks fine; consider adding a validation dataset for use after CV.

  4. Bruno September 5, 2016 at 6:43 pm #

    Jason, I can’t figure out how to use the validation set.
    Do you use it to check for performance agreement/no overfitting with the CV score?

    Which score (and error) do you use as the model performance: the one computed from the validation set or the one from the CV?

    Thanks a lot

    • Jason Brownlee September 6, 2016 at 9:46 am #

      Hi Bruno,

      I would suggest using the CV to estimate the skill of a model.

      I would further suggest that the validation dataset can be used as a smoke test. For example, the error should be within 2-3 standard deviations of the mean error estimate of the CV, e.g. “reasonable”.

      Finally, once you pick a model, you can go ahead and train it on all your training data and start using it to make predictions.

      I hope that helps.

  5. Bruno September 7, 2016 at 5:39 pm #

    Hi Jason,
    ‘got it!
    Many thanks for your clear answers and your time.

  6. Lijo November 2, 2016 at 7:23 am #

    Hi Jason,
    I was wondering why we can’t use a validation dataset and then find the sweet spot by comparing our model on the training set and the validation dataset. What are the disadvantages of this procedure?

    • Jason Brownlee November 2, 2016 at 9:13 am #

      Hi Lijo,

      It’s hard. Your approach may work fine, but on some problems it may lead to overfitting. Experiment and see.

      The knowledge of how well the model does on the held out (invisible) validation set is being fed back into model selection, influencing the training process.

  7. Wan. November 18, 2016 at 11:43 am #

    My constant value is around 111.832; is that called overfitting? I’m doing logistic regression to predict malware detection on data traffic with 5,000 records. I did feature selection in RapidMiner, extracting 7 features out of 56, and ran the statistical logistic regression in SPSS. Three significant features were selected out of the 7. At last, I need to draw a threshold graph where the cutoff is 80% of the probability value. Like I said, my constant is high and performance is 97%. Please advise.

    • Jason Brownlee November 19, 2016 at 8:40 am #

      Hi Wan,

      Overfitting refers to learning the training dataset so well that it costs you performance on new, unseen data. It means the model cannot generalize as well to new examples.

      You can evaluate this by evaluating your model on new data, or by using resampling techniques like k-fold cross validation to estimate the performance on new data.

      Does that help?

  8. John January 27, 2017 at 9:42 pm #

    Great explanation, as always. If you feel like correcting a small typo:
    “Underfitting refers to a model that can neither model the training data **not** generalize to new data.” (I’m not a native English speaker but think it should be “nor”).

    • Jason Brownlee January 28, 2017 at 7:39 am #

      Thanks John, I fixed a few typos including the one you pointed out.

  9. Saqib Qamar May 1, 2017 at 5:28 pm #

    Hi Jason,
    Great tutorial regarding overfitting…

    Thanks a lot

  10. hang May 11, 2017 at 5:35 am #

    Is the solution to an XOR problem an overfit?

    It cannot be solved with 2 units and one output?

    • Jason Brownlee May 11, 2017 at 8:35 am #

      Perhaps underfit – as in under-provisioned to be able to solve it. Or even ill-suited.
