How to use Learning Curves to Diagnose Machine Learning Model Performance

A learning curve is a plot of model learning performance over experience or time.

Learning curves are a widely used diagnostic tool in machine learning for algorithms that learn from a training dataset incrementally. The model can be evaluated on the training dataset and on a hold-out validation dataset after each update during training, and plots of the measured performance can be created to show learning curves.

Reviewing learning curves of models during training can be used to diagnose problems with learning, such as an underfit or overfit model, as well as whether the training and validation datasets are suitably representative.

In this post, you will discover learning curves and how they can be used to diagnose the learning and generalization behavior of machine learning models, with example plots showing common learning problems.

After reading this post, you will know:

  • Learning curves are plots that show changes in learning performance over time in terms of experience.
  • Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
  • Learning curves of model performance can be used to diagnose whether the train or validation datasets are not relatively representative of the problem domain.

Discover how to train faster, reduce overfitting, and make better predictions with deep learning models in my new book, with 26 step-by-step tutorials and full source code.

Let’s get started.

A Gentle Introduction to Learning Curves for Diagnosing Deep Learning Model Performance
Photo by Mike Sutherland, some rights reserved.

Overview

This tutorial is divided into three parts; they are:

  1. Learning Curves
  2. Diagnosing Model Behavior
  3. Diagnosing Unrepresentative Datasets

Learning Curves in Machine Learning

Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

Learning curves (LCs) are deemed effective tools for monitoring the performance of workers exposed to a new task. LCs provide a mathematical representation of the learning process that takes place as task repetition occurs.

— Learning curve models and applications: Literature review and research directions, 2011.

For example, if you were learning a musical instrument, your skill on the instrument could be evaluated and assigned a numerical score each week for one year. A plot of the scores over the 52 weeks is a learning curve and would show how your learning of the instrument has changed over time.

  • Learning Curve: Line plot of learning (y-axis) over experience (x-axis).

Learning curves are widely used in machine learning for algorithms that learn (optimize their internal parameters) incrementally over time, such as deep learning neural networks.

The metric used to evaluate learning could be maximizing, meaning that better scores (larger numbers) indicate more learning. An example would be classification accuracy.

It is more common to use a score that is minimizing, such as loss or error, whereby better scores (smaller numbers) indicate more learning, and a value of 0.0 indicates that the training dataset was learned perfectly and no mistakes were made.

During the training of a machine learning model, the current state of the model at each step of the training algorithm can be evaluated. It can be evaluated on the training dataset to give an idea of how well the model is “learning.” It can also be evaluated on a hold-out validation dataset that is not part of the training dataset. Evaluation on the validation dataset gives an idea of how well the model is “generalizing.”

  • Train Learning Curve: Learning curve calculated from the training dataset that gives an idea of how well the model is learning.
  • Validation Learning Curve: Learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.

It is common to create dual learning curves for a machine learning model during training on both the training and validation datasets.

In some cases, it is also common to create learning curves for multiple metrics, such as in the case of classification predictive modeling problems, where the model may be optimized according to cross-entropy loss and model performance is evaluated using classification accuracy. In this case, two plots are created, one for the learning curves of each metric, and each plot can show two learning curves, one for each of the train and validation datasets.

  • Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
  • Performance Learning Curves: Learning curves calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
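
For example, the sketch below shows one way to record and plot both kinds of curve with Keras and matplotlib. It is only a sketch: the compiled model and the arrays X_train, y_train, X_val, and y_val are assumed to already exist, and the history keys 'accuracy'/'val_accuracy' assume a recent version of tf.keras (older versions use 'acc'/'val_acc').

```python
# Sketch: plotting optimization (loss) and performance (accuracy) learning
# curves from the Keras training history. Assumes `model` is a compiled
# classifier and X_train, y_train, X_val, y_val are prepared arrays.
from matplotlib import pyplot

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, verbose=0)

# optimization learning curves: the metric being minimized (loss)
pyplot.subplot(2, 1, 1)
pyplot.title('Cross-Entropy Loss')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()

# performance learning curves: the metric used to evaluate the model (accuracy)
pyplot.subplot(2, 1, 2)
pyplot.title('Classification Accuracy')
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='validation')
pyplot.legend()

pyplot.tight_layout()
pyplot.show()
```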

Now that we are familiar with the use of learning curves in machine learning, let’s look at some common shapes observed in learning curve plots.

Want Better Results with Deep Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Diagnosing Model Behavior

The shape and dynamics of a learning curve can be used to diagnose the behavior of a machine learning model and, in turn, suggest the type of configuration changes that may be made to improve learning and/or performance.

There are three common dynamics that you are likely to observe in learning curves; they are:

  • Underfit.
  • Overfit.
  • Good Fit.

We will take a closer look at each with examples. The examples will assume that we are looking at a minimizing metric, meaning that smaller relative scores on the y-axis indicate more or better learning.

Underfit Learning Curves

Underfitting refers to a model that cannot learn the training dataset.

Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.

— Page 111, Deep Learning, 2016.

An underfit model can be identified from the learning curve of the training loss only.

It may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all.

An example of this is provided below and is common when the model does not have a suitable capacity for the complexity of the dataset.

Example of Training Learning Curve Showing an Underfit Model That Does Not Have Sufficient Capacity

An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot.

This indicates that the model is capable of further learning, and possibly of further improvement, and that the training process was halted prematurely.

Example of Training Learning Curve Showing an Underfit Model That Requires Further Training

A plot of learning curves shows underfitting if either of the following holds (a sketch of the first case follows this list):

  • The training loss remains flat regardless of training, suggesting the model lacks sufficient capacity.
  • The training loss continues to decrease until the end of training, suggesting training was halted too early.
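
As a rough illustration of the first case, the snippet below fits a deliberately low-capacity network on a synthetic two-circles problem. The dataset, layer sizes, and epoch count are illustrative assumptions, not the configuration used to produce the plots in this post.

```python
# Sketch: producing an underfit training curve with a deliberately
# low-capacity model on a synthetic binary classification problem.
from sklearn.datasets import make_circles
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot

# synthetic dataset that a single hidden unit cannot separate well
X, y = make_circles(n_samples=1000, noise=0.2, random_state=1)
X_train, y_train = X[:500], y[:500]
X_val, y_val = X[500:], y[500:]

# one hidden unit gives the model too little capacity for this problem
model = Sequential()
model.add(Dense(1, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])

history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=100, verbose=0)

# training loss stays relatively high and flat: a sign of underfitting
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()
```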

Overfit Learning Curves

Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.

… fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.

— Page 22, An Introduction to Statistical Learning: with Applications in R, 2013.

The problem with overfitting is that the more specialized the model becomes to the training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.

This is an example of overfitting the data, […]. It is an undesirable situation because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set.

— Page 24, An Introduction to Statistical Learning: with Applications in R, 2013.

This often occurs if the model has more capacity than is required for the problem, and, in turn, too much flexibility. It can also occur if the model is trained for too long.

A plot of learning curves shows overfitting if:

  • The plot of training loss continues to decrease with experience.
  • The plot of validation loss decreases to a point and begins increasing again.

The inflection point in validation loss may be the point at which training could be halted, as experience after that point shows the dynamics of overfitting.
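
In practice, this halting point is often found automatically rather than by eye. A minimal sketch using the Keras EarlyStopping callback is shown below, assuming the same model and data placeholders as the earlier sketches; the patience value is an illustrative choice.

```python
# Sketch: stopping training near the validation-loss inflection point with
# the Keras EarlyStopping callback. Assumes `model`, X_train, y_train,
# X_val, and y_val are defined as in the earlier sketches.
from tensorflow.keras.callbacks import EarlyStopping

# stop when validation loss has not improved for 10 consecutive epochs,
# then restore the weights from the best epoch seen
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=500, verbose=0,
                    callbacks=[early_stop])
```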

The example plot below demonstrates a case of overfitting.

Example of Train and Validation Learning Curves Showing an Overfit Model

Good Fit Learning Curves

A good fit is the goal of the learning algorithm and exists between an overfit and underfit model.

A good fit is identified by a training and validation loss that decreases to a point of stability with a minimal gap between the two final loss values.

The loss of the model will almost always be lower on the training dataset than the validation dataset. This means that we should expect some gap between the train and validation loss learning curves. This gap is referred to as the “generalization gap.”
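
If you want to put a number on this gap, a minimal sketch is shown below, assuming the `history` object returned by a Keras fit such as the one in the earlier sketch.

```python
# Sketch: quantifying the generalization gap at the end of training,
# assuming `history` is the History object from an earlier model.fit() call.
final_train_loss = history.history['loss'][-1]
final_val_loss = history.history['val_loss'][-1]
generalization_gap = final_val_loss - final_train_loss
print('generalization gap: %.3f' % generalization_gap)
```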

A plot of learning curves shows a good fit if:

  • The plot of training loss decreases to a point of stability.
  • The plot of validation loss decreases to a point of stability and has a small gap with the training loss.

Continued training of a good fit will likely lead to an overfit.

The example plot below demonstrates a case of a good fit.

Example of Train and Validation Learning Curves Showing a Good Fit

Diagnosing Unrepresentative Datasets

Learning curves can also be used to diagnose properties of a dataset and whether it is relatively representative.

An unrepresentative dataset is one that does not capture the statistical characteristics of the problem domain relative to another dataset drawn from the same domain, such as a train dataset compared to a validation dataset. This commonly occurs when the number of samples in one dataset is too small relative to the other.

There are two common cases that could be observed; they are:

  • Training dataset is relatively unrepresentative.
  • Validation dataset is relatively unrepresentative.
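
Before turning to the curves themselves, a simple sanity check is to compare basic statistics of the two splits directly. The sketch below assumes NumPy arrays X and y with integer class labels for a classification problem; the stratified split and the printed comparisons are only rough indicators, not a definitive test.

```python
# Sketch: a quick check of whether train and validation splits look
# statistically similar. Assumes X and y are NumPy arrays for a
# classification problem with integer class labels.
import numpy as np
from sklearn.model_selection import train_test_split

# a stratified split keeps the class balance similar in both sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# compare class proportions between the two splits
print('train class balance:', np.bincount(y_train) / len(y_train))
print('val   class balance:', np.bincount(y_val) / len(y_val))

# compare per-feature means; large differences hint at unrepresentative splits
print('feature mean difference:',
      np.abs(X_train.mean(axis=0) - X_val.mean(axis=0)))
```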

Unrepresentative Train Dataset

An unrepresentative training dataset means that the training dataset does not provide sufficient information to learn the problem, relative to the validation dataset used to evaluate it.

This may occur if the training dataset has too few examples as compared to the validation dataset.

This situation can be identified by a learning curve for training loss that shows improvement and a learning curve for validation loss that also shows improvement, but with a large gap remaining between the two curves.

Example of Train and Validation Learning Curves Showing a Training Dataset That May Be too Small Relative to the Validation Dataset

Unrepresentative Validation Dataset

An unrepresentative validation dataset means that the validation dataset does not provide sufficient information to evaluate the ability of the model to generalize.

This may occur if the validation dataset has too few examples as compared to the training dataset.

This case can be identified by a learning curve for training loss that looks like a good fit (or other fits) and a learning curve for validation loss that shows noisy movements around the training loss.

Example of Train and Validation Learning Curves Showing a Validation Dataset That May Be too Small Relative to the Training Dataset

It may also be identified by a validation loss that is lower than the training loss. In this case, it indicates that the validation dataset may be easier for the model to predict than the training dataset.

Example of Train and Validation Learning Curves Showing a Validation Dataset That Is Easier to Predict Than the Training Dataset


Summary

In this post, you discovered learning curves and how they can be used to diagnose the learning and generalization behavior of machine learning models.

Specifically, you learned:

  • Learning curves are plots that show changes in learning performance over time in terms of experience.
  • Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
  • Learning curves of model performance can be used to diagnose whether the train or validation datasets are not relatively representative of the problem domain.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Develop Better Deep Learning Models Today!

Better Deep Learning

Train Faster, Reduce Overfitting, and Ensembles

…with just a few lines of Python code

Discover how in my new Ebook:
Better Deep Learning

It provides self-study tutorials on topics like: weight decay, batch normalization, dropout, model stacking and much more…

Bring better deep learning to your projects!

Skip the Academics. Just Results.

Click to learn more.


65 Responses to How to use Learning Curves to Diagnose Machine Learning Model Performance

  1. Roland Fernandez February 28, 2019 at 4:09 am #

    Thanks for article on this core ML technique. First learning curve shown seems a poor example of underfutting since loss on y axis is already so low. Also, maybe condition shown on 2nd plot should be called “under trained” to avoid confusion with “having trouble learning more” condition of under fitting. Also the summary paragraph for underfitting has typo and data “overfitting”.

  2. Roland Fernandez February 28, 2019 at 4:11 am #

    My own typo :). 2nd to last word above should be “says”

  3. Angelos Angelidakis February 28, 2019 at 5:26 am #

    Very informative!

    • Jason Brownlee February 28, 2019 at 6:46 am #

      Thanks.

      • phz April 3, 2019 at 7:18 pm #

        Theres still a typo here:
        A plot of learning curves shows overfitting if:

        The training loss remains flat regardless of training.
        The training loss continues to decrease until the end of training.

        => this is underfitting.

  4. Ashish March 6, 2019 at 1:10 am #

    The methods like genaralization are used for these conditions only or not?

    • Jason Brownlee March 6, 2019 at 7:56 am #

      Sorry, I don’t understand, can you please elaborate or rephrase the question?

  5. Adrien Kinart March 21, 2019 at 8:06 pm #

    I would have said that the error from the training set should increase to converge to the error from the validation set to indicate good fit. What do you think about that? (https://www.dataquest.io/blog/learning-curves-machine-learning )

    • Jason Brownlee March 22, 2019 at 8:24 am #

      Does not happen in practice in my experience because often the test/val are smaller and less representative than the train and have different error profile.

  6. George April 3, 2019 at 6:22 pm #

    Hi Jason and thanks for the post.

    I have one question not related with this post though and I wanted your opinion.

    Lets’s say I have I am training some data and during the preprocessing I am cleaning that data. I remove some weird/wrong values from it.

    Now, when I am going to use the predict to the unseen new data, do I need to apply the same cleaning to that data before making the prediction?

    Are there any caveats for doing or not doing this?

    I guess I should the same cleaning but it confuses me that we have unseen data and it can be anything..

    (I am not talking about scaling or that kind of preprocessing which I already apply to the train and unseen data)

    Thank you very much!

    George

    • Jason Brownlee April 4, 2019 at 7:41 am #

      Great question.

      Yes, if you can use generic but domain-specific knowledge to prepare/filter data, then it is a good idea to use this process consistently when fitting and evaluating a model, as well as when making predictions in the future.

      The risk is data leakage, e.g. using knowledge about “unseen”/test data to help better fit the model. This might help (and be a bit too strict):
      https://machinelearningmastery.com/data-leakage-machine-learning/

  7. JG April 3, 2019 at 9:35 pm #

    Great post Jason. Tahnks.

    – My summary, that I appreciate if you can evaluate if am I right about all this stuff is:

    overfitting appears when we learn so much details that are irrelevant to the main stream ideas to be learned (general concepts). This can be the situation when you have, on one side a very big complex model (with many layers and many weight to be adjusted.i.e. with a very “hight entropic information capacity”) and on the other side a few amount of data to be trained …so the solution could be the simplify the model or increase de train dataset.
    On the other side underfitting appears when we need more experience (more epochs) to train the model, so learning curves trend are continually down..until you get the right stabilization with the appropriate set of epochs …

    – My second question it is , how do you interpret the case when validation data get better performance (high level) than training data…is it a good indication of good generalization ?.

    thank you Jason to allow us to share your knowledge !!

    • Jason Brownlee April 4, 2019 at 7:56 am #

      Yes, but you can underfit if the model does not have sufficient capacity to learn from the data. This can be from epochs or from model complexity/size.

      It is a sign that the validation dataset is too small and not representative of the problem – very common.

  8. Jakub May 21, 2019 at 8:27 pm #

    Great post!
    Thank you very much.

  9. Pritam June 29, 2019 at 10:15 pm #

    Sir, though is something of the track question, still felt like asking. How can I “mathematically” explain the benefit of centered and scaled data for machine learning models instead of raw data. Accuracy and convergence no doubt improves for the normalized data, but can I show it mathematically?

  10. Frank July 4, 2019 at 3:32 am #

    It is correct to create a learning curve graph using three sets of data (training, validation, and testing). Using the “training” set to train the model and use the “validation” and “test” sets to generate the learning curves?

  11. Chen July 5, 2019 at 12:25 pm #

    Thank you for your post!! It helps a lot!! Could you please help me to check the learning curve I got (http://zhuchen.org.cn/wp-content/uploads/2019/07/lc.png), is it underfitted? It’s a multi-classification problem using random forest.

  12. zeinab July 22, 2019 at 9:11 am #

    A very great and useful tutorial, thank you

  13. zeinab July 22, 2019 at 10:54 am #

    Can I ask about the meaning of “flat line” in case of under-fitting?

    • Jason Brownlee July 22, 2019 at 2:05 pm #

      It suggests the model does not have sufficient capacity for the problem.

  14. zeinab July 23, 2019 at 12:58 am #

    If the loss increases then decreases then increases then decreases and so on..
    What does this means?
    Does it means that the data is unrepresentative in that model? or
    Does it means that an overfitting happens?

    • Jason Brownlee July 23, 2019 at 8:04 am #

      Great question!

      It could mean that the data is noisy/unrepresentative or that the model is unstable (e.g. the batch size or scaling of input data).

  15. zeinab July 23, 2019 at 1:43 pm #

    I use Pearson correlation coefficient as the accuracy metric for a regression problem.

    Can I use the correlation coefficient as the Optimization learning curve?

  16. jake July 27, 2019 at 3:28 am #

    Hi Jason.

    I post two pictures of my training model here

    https://stackoverflow.com/questions/57224353/is-my-training-data-set-too-complex-for-my-neural-network

    would you be able to tell me if my model is over fitting or under fitting. I believe it is under fitting.

    how can i fix this problems?

    Thanks once again Jaso, You dont know how much you have helped me

  17. zeinab August 4, 2019 at 11:40 pm #

    can I ask you about the need for the performance learning curve?
    I understand from this tutorial that the optimization learning curves are used for checking the model fitness?
    But what is the importance of the performance learning curves?

    • Jason Brownlee August 5, 2019 at 6:53 am #

      What do you mean by performance learning curve?

      • zeinab August 5, 2019 at 12:23 pm #

        performance learning curve that represent the accuracy over epochs

        • Jason Brownlee August 5, 2019 at 2:04 pm #

          I see, good question.

          The performance curve can give you an idea of whether changes in loss connect with real tangible gains in skill on the problem.

  18. zeinab August 4, 2019 at 11:41 pm #

    should I stop training the model when the it reaches the minimum loss?

    • Jason Brownlee August 5, 2019 at 6:53 am #

      Yes, on the validation set.

      • Zeinab August 5, 2019 at 8:22 pm #

        If I reaches the minimum validation loss value,
        However, the validation accuracy value is not high.
        In this case, Have I stop learning?

        • Jason Brownlee August 6, 2019 at 6:35 am #

          Minimum loss is 0, if you hit zero loss it suggests the problem is trivial (ML is not needed) or the model has overfit.

          • zeinab August 6, 2019 at 11:16 pm #

            Sorry, I want to say, if I reach a minimum validation loss value (not 0) but at this epoch the validation accuracy is not the highest value(after this epoch, the validation accuracy is higher).

            At this situation, should I stop training?

          • Jason Brownlee August 7, 2019 at 7:57 am #

            Perhaps try it and see.

  19. zeinab August 5, 2019 at 12:26 pm #

    Can I measure the model fitness from the accuracy learning curves instead of the loss learning curves?

    • Jason Brownlee August 5, 2019 at 2:04 pm #

      Sure. It just may not be as helpful in diagnosing learning dynamics.

      • zeinab August 5, 2019 at 10:50 pm #

        what do you mean by learning dynamics ?

        • Jason Brownlee August 6, 2019 at 6:38 am #

          How the model learns over time, reflected in the learning curve.

  20. zeinab August 5, 2019 at 12:37 pm #

    Is there is a problem , if the loss curve is a straight line that decreases over the epochs?

  21. zeinab August 5, 2019 at 12:38 pm #

    If you please, Can you suggest for me a good reference to read more about learning curves?

  22. Zeinab August 5, 2019 at 8:01 pm #

    Does the validation loss value must be lower than the training loss value?

    • Jason Brownlee August 6, 2019 at 6:34 am #

      For a well fit model, validation and training loss should be very similar.

  23. zeinab August 6, 2019 at 4:22 am #

    which is preferred using:
    – the early stopping or
    – analyzing the output to find the minimum validation loss

    • Jason Brownlee August 6, 2019 at 6:41 am #

      It depends on the model and on the dataset.

      Perhaps experiment and see what is reliable for your specific scenario.

  24. Zeinab August 6, 2019 at 11:19 am #

    Which is preferred using early stop with low patencie value or high value

    • Jason Brownlee August 6, 2019 at 2:05 pm #

      It depends on your choice of model and the dataset. Perhaps experiment?

  25. Zeinab August 6, 2019 at 11:22 am #

    If I reaches the minimum validation loss value, while at this epoch there is a gap between the training accuracy and the validation accuracy.
    Should i stop learning or not?

  26. zeinab August 6, 2019 at 11:19 pm #

    Why should I stop when I reaches a minimum validation loss and not when I reaches the minimum gap between the validation and training loss?

    • Jason Brownlee August 7, 2019 at 7:58 am #

      Try a range of approaches and see what results in a robust and skillful model for your dataset.

      In general, you want to stop training when the train and validation loss is lowest and before validation loss starts to rise.

  27. Jim Peyton August 17, 2019 at 12:12 am #

    Great tutorial!

    On the second graph showing an undertrained model, it seems like the validation data loss should track higher than the training data loss, which is different then what the graph shows. Perhaps an editing error?

    Again, great work here. Thanks for sharing.

    • Jason Brownlee August 17, 2019 at 5:48 am #

      No error, the val set in that case was perhaps under-representative. The important point was the shape of the train/val curves showing that more meaningful training is very possible.
