A Simple Intuition for Overfitting, or Why Testing on Training Data is a Bad Idea

When you first start out with machine learning, you load a dataset and try models. You might think to yourself: why can’t I just build a model with all of the data and evaluate it on the same dataset?

It seems reasonable. More data to train the model is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is, right?

Wrong.

In this post you will discover the difficulties with this reasoning and develop an intuition for why it is important to test a model on unseen data.

Train and Test on the Same Dataset

If you have a dataset, say the iris flower dataset, what is the best model of that dataset?

Irises. Photo by dottieg2007, some rights reserved.

The best model is the dataset itself. If you take a given data instance and ask for its classification, you can look that instance up in the dataset and report the correct result every time.

This is the problem you are solving when you train and test a model on the same dataset.

You are asking the model to make predictions on data that it has “seen” before: the very data that was used to create the model. The best model for this problem is the look-up model described above.
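
To make this concrete, here is a minimal sketch of such a look-up “model” (assuming Python and scikit-learn for the iris data); it memorizes every training instance and therefore scores perfectly on the data it was built from:

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# "Train" the look-up model: memorize every instance and its label.
lookup = {tuple(row): label for row, label in zip(X, y)}

# "Evaluate" on the same data: every instance is found in the table.
predictions = [lookup[tuple(row)] for row in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)  # 1.0 -- a perfect, and perfectly useless, score
```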

Descriptive Model

There are some circumstances where you do want to train a model and evaluate it with the same dataset.

You may want a simplified explanation of the variable you are predicting in terms of the data you have. For example, you may want a set of simple rules or a decision tree that best describes the observations you have collected.

In this case, you are building a descriptive model.

These models can be very useful and can help you in your project or your business to better understand how the attributes relate to the predicted value. You can add meaning to the results with the domain expertise that you have.
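
As a small illustration of a descriptive model, a shallow decision tree fit on the entire iris dataset can be printed as a set of human-readable rules that summarize the observations (a sketch assuming scikit-learn; the depth limit is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Fit on all of the data: the goal here is only to describe these observations.
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(iris.data, iris.target)

# Print the tree as simple if/then rules a domain expert can review.
print(export_text(tree, feature_names=list(iris.feature_names)))
```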

The important limitation of a descriptive model is that it is limited to describing the data on which it was trained. You have no idea how accurate a predictive model it would be.

Modeling a Target Function

Consider a made-up classification problem, the goal of which is to classify data instances as either red or green.

Modeling a Target Function. Photo by seantoyer, some rights reserved.

For this problem, assume that there exists a perfect model: a perfect function that can correctly discriminate any data instance from the domain as red or green. In the context of a specific problem, this perfect discrimination function very likely has profound meaning to the domain experts. We want to tap into that perspective and deliver that result.

Our goal when making a predictive model for this problem is to best approximate this perfect discrimination function.

We build our approximation of the perfect discrimination function using sample data collected from the domain. It is not all possible data; it is a sample or subset of all possible data. If we had all the data, there would be no need to make predictions because the answers could simply be looked up.

The data we use to build our approximate model contains structure pertaining to the ideal discrimination function. Your goal with data preparation is to best expose that structure to the modeling algorithm. The data also contains things that are irrelevant to the discrimination function, such as biases from the selection of the data and random noise that perturbs and hides the structure. The model you select to approximate the function must navigate these obstacles.
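
As a toy illustration of this framing, you can think of the sample as draws from the true discrimination function with some random label noise mixed in (a hypothetical sketch; the rule, sample size, and noise rate are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def true_function(x):
    # The unknown, perfect discrimination function we are trying to approximate.
    return "red" if x[0] + x[1] > 1.0 else "green"

# A finite sample from the domain...
X = rng.uniform(0, 1, size=(100, 2))
y = np.array([true_function(x) for x in X])

# ...perturbed by random noise that hides the underlying structure.
noise = rng.random(len(y)) < 0.1  # flip roughly 10% of the labels
y[noise] = np.where(y[noise] == "red", "green", "red")
```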

This framework helps us understand the deeper difference between a descriptive and a predictive model.

Descriptive vs Predictive Models

The descriptive model is only concerned with modeling the structure in the observed data. It makes sense to train and evaluate it on the same dataset.

The predictive model is attempting a much more difficult problem: approximating the true discrimination function from a sample of data. We want to use algorithms that do not pick out and model all of the noise in our sample. We do want to choose algorithms that generalize beyond the observed data. It follows that we can only evaluate the model’s ability to generalize by using data that it has not seen during training.

The best descriptive model is accurate on the observed data. The best predictive model is accurate on unobserved data.

Overfitting

The flaw in evaluating a predictive model on training data is that it does not tell you how well the model generalizes to new, unseen data.

A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized; it has specialized to the structure in the training dataset. This is called overfitting, and it’s more insidious than you think.

For example, you may want to stop training your model once the accuracy stops improving. In this situation, there will be a point where the accuracy on the training set continues to improve but the accuracy on unseen data starts to degrade.
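
You can see this effect with a small experiment: as a decision tree is allowed to grow deeper, accuracy on the training data keeps climbing while accuracy on held-out data eventually stalls or degrades (a sketch assuming scikit-learn and a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=1)

for depth in range(1, 16):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    # Training accuracy keeps improving; test accuracy peaks and then degrades.
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))
```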

You may be thinking to yourself: “so I’ll train on the training dataset and peek at the test dataset as I go”. A fine idea, but now the test dataset is no longer unseen data, as it has been involved in and has influenced the training process.

Tackling Overfitting

You must test your model on unseen data to counter overfitting.

Tackling Overfitting. Photo by Adrian Fallace Design & Photography, some rights reserved.

A 66%/34% split of the data into training and test datasets is a good start. Using cross validation is better, and using multiple runs of cross validation is better again. You want to spend the time to get the best possible estimate of the model’s accuracy on unseen data.
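
A sketch of all three evaluation strategies (assuming scikit-learn, with a placeholder model and a synthetic dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=1)
model = LogisticRegression(max_iter=1000)

# 1. A simple 66%/34% train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=1)
print(model.fit(X_train, y_train).score(X_test, y_test))

# 2. 10-fold cross validation.
print(cross_val_score(model, X, y, cv=10).mean())

# 3. Multiple runs of cross validation (10 folds, repeated 3 times).
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
print(cross_val_score(model, X, y, cv=cv).mean())
```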

You can also counter overfitting, and often improve accuracy on unseen data, by decreasing the complexity of your model.

In the case of decision trees, for example, you can prune the tree (remove leaves or whole branches) after training. This will decrease the amount of specialisation to the specific training dataset and increase generalisation to unseen data. If you are using regression, for example, you can use regularisation to constrain the complexity (the magnitude of the coefficients) during the training process.
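
Both ideas are available off the shelf in scikit-learn, for example cost-complexity pruning for trees and ridge (L2) regularisation for linear regression (a sketch; the parameter values are illustrative only):

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Pruning: a larger ccp_alpha removes more of the grown tree, trading
# training-set specialisation for generalisation on unseen data.
X_cls, y_cls = load_iris(return_X_y=True)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=1)
print(cross_val_score(pruned_tree, X_cls, y_cls, cv=10).mean())

# Regularisation: a larger alpha shrinks the coefficient magnitudes,
# constraining the complexity of the regression during training.
X_reg, y_reg = load_diabetes(return_X_y=True)
ridge = Ridge(alpha=1.0)
print(cross_val_score(ridge, X_reg, y_reg, cv=10).mean())
```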

Summary

In this post you learned the important framework of phrasing the development of a predictive model as an approximation of an unknown ideal discrimination function.

Under this framework you learned that evaluating the model on training data alone is insufficient. You learned that the best and most meaningful way to evaluate the ability of a predictive model to generalize is to evaluate it on unseen data.

This intuition provided the basis for why it is critical to use train/test splits, cross validation, and ideally multiple runs of cross validation in your test harness when evaluating predictive models.

30 Responses to A Simple Intuition for Overfitting, or Why Testing on Training Data is a Bad Idea

  1. hapse July 21, 2015 at 2:19 pm #

    Want information about testing versus training data.

  2. Andy October 6, 2016 at 7:18 am #

    Great article, I just started learning machine learning and was wondering why they split the data.

    Question: So suppose I split my data as 80% Training and 20% Testing (100, 20 in numbers). The Root Mean Square Error (RMSE) of the training data is calculated using 80 observations. On the other hand, the RMSE of the test data is calculated using only 20 observations. Is that correct?

    Thanks

    • Jason Brownlee October 6, 2016 at 9:43 am #

      I’m glad you found it useful Andy.

      Yes. Training RMSE is calculated on the training dataset, testing RMSE on the test dataset. The test dataset RMSE gives you a rough idea of how well the method will perform on new data.

      Cross-validation will give an even better idea as it is more robust.

      Once you find a model that looks like it will do very well, you train it on all of your training data and start using it in production to make predictions.

      I hope that helps.
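
      A minimal sketch of that train/test RMSE calculation (assuming scikit-learn and a synthetic regression dataset in place of Andy's data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, noise=10.0, random_state=1)

# 80/20 split: 80 observations to train, 20 held back to test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Training RMSE uses the 80 training observations, test RMSE the 20 unseen ones.
train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(train_rmse, test_rmse)
```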

  3. Nikos January 11, 2017 at 12:02 am #

    Hi Jason, is there any way to evaluate how many attributes are needed to achieve the peak performance of your algorithm? Let's say you have 30,000 instances and 300 attributes in your original set; could someone use the Experimenter in Weka to perform feature selection (reduction of dimensionality)?
    Thanks

    • Jason Brownlee January 11, 2017 at 9:27 am #

      Hi Nikos,

      This is problem specific. Experimentation and trial and error would get you a clear answer.

  4. Will Roscoe January 12, 2017 at 7:08 pm #

    Jason, thanks for the post. Do you know if there is any practice to ‘finish off’ training a model with all the data? Some way to keep the generalized nature of the model but include information from all the data. I was thinking something like….
    1. Train a model using only the training data until the accuracy of the test data starts to decrease.
    2. Train the model for one last epoch with a very small learning rate to use all the data.

    I’ve tried this and it hasn’t worked for me. Wondered if you knew of a better solution.

    Thanks,

    • Jason Brownlee January 13, 2017 at 9:10 am #

      Hi Will,

      Generally, we use cross-validation and such to find the algorithm and parameters that best suit the problem.

      Then we train the model + parameters using all of the available data and start using it to make predictions.

      Does that help?
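
      A sketch of that workflow (assuming scikit-learn; the model and parameter grid are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=1)

# Use cross validation to pick the parameters that best suit the problem...
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                      cv=5)
search.fit(X, y)

# ...then fit the chosen model on all available data and use it for predictions.
final_model = RandomForestClassifier(random_state=1, **search.best_params_).fit(X, y)
```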

  5. Baouche February 1, 2017 at 9:16 pm #

    Good and very good post on machine learning. Thanks Jason Brownlee

  6. Adithya July 6, 2017 at 9:51 pm #

    When a linear model is used to predict the train dataset itself, it always turns out that the sum of the actual values equals the sum of the predicted values. Is this normal? Any reason for this behaviour?

  7. Usman September 12, 2017 at 11:15 pm #

    Thanks for the post, it’s really helpful… Please, is it standard practice to split a dataset randomly, say 70-30, and if so, what is the method called?
    Can it still be regarded as a form of cross validation?

    • Jason Brownlee September 13, 2017 at 12:32 pm #

      Yes, or similar. It is called a train/test split – a type of data resampling method.

  8. Sanjay Joshi September 26, 2017 at 4:02 am #

    HI Jason

    I am currently working on a classifier model.

    The data has been generated from a random sampling method.

    Every time we receive new data from the machine we feed it to the model; we are not splitting it into train, validate and test sets.

    We are training the model on new data on an almost daily basis.

    Our model (using xgboost) does not show drastic changes in ROC_AUC score or confusion matrix.

    Do you think this is a good method? Please advise.

    • Jason Brownlee September 26, 2017 at 5:40 am #

      There is no one best way.

      I would recommend brainstorming alternative approaches (e.g. updating the model instead of refitting it from scratch, do nothing, etc…) and compare the approaches to see how they impact the skill of predictions.

  9. William Gunawan October 4, 2017 at 2:29 pm #

    Hi Jason,

    Now I am working on a classifier model using Keras. I have 20k images, 80% for training and 20% for testing. I use a validation split as my validation data when fitting the model, but my training and validation accuracy converge very quickly: from 0.7 it goes to 0.9, and in the end my validation accuracy reaches 1.0. I don't understand whether my classifier is overfitting or not?

    • Jason Brownlee October 4, 2017 at 3:40 pm #

      Your model is overfitting if skill on the training data continues to improve while skill on the validation data gets worse.

      I would recommend collecting loss information during training and plotting learning curves to diagnose whether your model is overfitting.
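
      A sketch of that diagnostic with Keras and matplotlib (the small network and synthetic data are stand-ins for your own image model):

```python
import matplotlib.pyplot as plt
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

# Synthetic stand-in data; substitute your own images and labels.
X = np.random.rand(1000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

model = Sequential([Input(shape=(20,)),
                    Dense(64, activation="relu"),
                    Dense(1, activation="sigmoid")])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Collect loss on a held-out validation split at every epoch.
history = model.fit(X, y, validation_split=0.2, epochs=50, verbose=0)

# Overfitting shows up as training loss falling while validation loss rises.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.legend()
plt.show()
```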

  10. Raj October 30, 2017 at 1:57 pm #

    Hi Jason,

    Suppose I am building a predictive model using logistic regression. I split the data 70-30 and my model predicts fine on the validation data. But when I apply it to a new dataset, the model somehow fails to predict well (accuracy goes down).
    1) What steps should one take in this situation to improve the model's accuracy on the new dataset?
    2) What is good practice to avoid such scenarios in future?

  11. Jarad January 19, 2018 at 7:52 am #

    Hey Jason, I’m curious your take on this scenario.
    Let’s say you’re using ExtraTreesClassifier from sklearn to do binary classification. You use train/test split with 20/80 split where you’re training on only 20 %, and predicting on 80%. You don’t have max_depth specified on the model. You achieve 99%+ on both train and test.

    By definition, both scores don’t fall under the definition of over-fitting (I think). Is this a bad thing in your opinion? To take the scenario a bit further, imagine you up-sampled your dataset from 500,000 to 1,000,000 because of so few “1” (a major class imbalance).

    Curious your thoughts.

  12. Alim February 20, 2018 at 9:19 pm #

    Hello Jason,

    What is your opinion of online machine learning algorithms? I don’t think you have any posts about them. I suspect that these models are less vulnerable to overfitting. Unlike traditional algorithms that rely on batch learning methods, online models update their parameters after each training instance. I suspect that these algorithms are more capable of adapting to new information.

    Thanks

    • Jason Brownlee February 21, 2018 at 6:39 am #

      It really depends on the data.

      I hope to cover online learning in more detail, thanks for the prompt!

  13. Peter February 28, 2018 at 6:48 pm #

    Hello Jason,

    Great post! I have a related question.
    In my understanding, the 7:3 / 8:2 rule-of-thumb ratio for the training/test split is supposed to ensure that your trained model does not overfit the data.

    What is the case when the distribution of the training and test set is different (co-variate shift)?

    Intuitively, in the case of the Iris dataset, a 9:1 split would mean that the model gets over-specified: it would perform too well on a test set gathered from the same Iris dataset, while it would probably perform badly on a test set gathered from a different set of data consisting of the same attributes of the same flower types. Because the noise, and thus the distribution, is different, right?

    However, if the two datasets (the Iris, and the other one used for the test) are representative of the population enough (if they are large enough), even if the datasets are different, their distribution should be more or less the same. Thus even a 9:1 ratio wouldn’t mean overfitting, would it?

    To go even further, in the case when only the Iris dataset is representative of the population (large) enough, but the model performs well on several much smaller datasets with slightly different distributions, wouldn’t even a crazy train/test ratio of e.g. 99:1 be legitimate?

    Would be very grateful for your insights on this.

    • Jason Brownlee March 1, 2018 at 6:11 am #

      You can still overfit, it really depends on the data/project/methods used, etc.

      A chosen train/test ratio may reduce the bias in the estimate of the error. In that regard, k-fold CV does a better job in general.

      Ideally, we do seek similar distributions between the samples, e.g. we can look at univariate summary statistics.
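
      One quick way to compare those univariate summary statistics between the two samples (a sketch assuming pandas, with iris as a stand-in dataset):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris is used only as a stand-in; substitute your own dataframe.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

train, test = train_test_split(df, test_size=0.3, random_state=1)

# Compare per-column means, standard deviations, and quartiles of the two samples.
print(train.describe())
print(test.describe())
```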

      Not sure I follow your comment about extreme data splits sorry.

      • Peter March 1, 2018 at 3:39 pm #

        Thank you very much for the quick answer!

        I see, yes the data quality and the method also need to be considered.

        About the extreme data split:

        Let's say I am building a model to predict petal lengths. If I have a 99:1 train:test ratio, it would definitely cause overfitting if the training and test sets are from the same dataset.

        However, if training and the test sets are from different sources (training set is from a huge dataset A, test set is from another dataset B) and

        1. the used machine learning method is adequate for the prediction of petal length
        2. the quality of the test set (dataset B) is good (containing very few outlying petal lengths)
        3. the size of the test set is reasonably large (no smaller than, let's say, 2,000 data points)
        4. the achieved recognition accuracy is good (e.g. 80% for 3 classes)

        then even a 99:1 train:test ratio would be legitimate to prove the adequacy of my model?
        And would a good accuracy also show that the distributions of petal lengths are not only similar in datasets A and B, but that these datasets are also representative of the population?

        If you have time to comment on this one more time it is much appreciated!

        • Jason Brownlee March 2, 2018 at 5:28 am #

          I cannot agree in general.

          Generalized ideas like this do not survive in applied machine learning, each dataset is different and requires controlled experiments in order to gather data to understand what is going on.

          Indeed, it is good to fit models on data that is representative of the domain.

  14. Jesús Martínez March 7, 2018 at 10:43 pm #

    Nice take on the overfitting issue. I didn’t think of the data as the best model, but it makes so much sense! Of course, we create models or train machine learning algorithms to approach a level of accuracy, efficacy or confidence similar to the one we would have if we had each possible label for each possible instance of data from the domain we are studying.

    Certainly a refreshing point of view. Thanks!

  15. Maria Chi April 27, 2018 at 1:31 am #

    Very good article.
    Do you think having 100% test accuracy while train accuracy is 95% with logistic regression is caused by overfitting? Although with SVM, on the same datasets, test accuracy is smaller than train accuracy, which sounds more logical.

    • Jason Brownlee April 27, 2018 at 6:06 am #

      Perhaps underfitting or a good fit.

      It would be overfitting if skill on train was much better than test.
