Last Updated on August 21, 2016

When you first start out with machine learning you load a dataset and try models. You might think to yourself, why can’t I just build a model with all of the data and evaluate it on the same dataset?

It seems reasonable. More data to train the model is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is, right?

Wrong.

In this post you will discover the difficulties with this reasoning and develop an intuition for why it is important to test a model on unseen data.

## Train and Test on the Same Dataset

If you have a dataset, say the iris flower dataset, what is the best model of that dataset?

The best model is the dataset itself. If you take a given data instance and ask for its classification, you can look that instance up in the dataset and report the correct result every time.

This is the problem you are solving when you train and test a model on the same dataset.

You are asking the model to make predictions on data that it has “seen” before: data that was used to create the model. The best model for this problem is the look-up model described above.
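The look-up model can be sketched in a few lines. This is only an illustration: the three training rows are sample values in the style of the iris dataset, and the names `lookup_model` and `predict` are made up for this example.

```python
# A "look-up" model: it memorises every training instance and its label.
# The rows below are illustrative values in the style of the iris dataset.
train = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

# "Training" is nothing more than storing the data.
lookup_model = {features: label for features, label in train}


def predict(features):
    """Return the stored label, or None if the instance was never seen."""
    return lookup_model.get(features)


# Evaluated on the data it was built from, the model is perfect.
train_accuracy = sum(predict(x) == y for x, y in train) / len(train)
print(train_accuracy)  # 1.0

# A new instance is not in the table, so the model has no answer for it.
unseen = (5.9, 3.0, 5.1, 1.8)
print(predict(unseen))  # None
```

A perfect score on the training rows says nothing about the instance the table has never seen.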

## Descriptive Model

There are some circumstances where you do want to train a model and evaluate it with the same dataset.

You may want to simplify the explanation of a predicted variable from data. For example, you may want a set of simple rules or a decision tree that best describes the observations you have collected.

In this case, you are building a descriptive model.

These models can be very useful and can help you in your project or your business to better understand how the attributes relate to the predicted value. You can add meaning to the results with the domain expertise that you have.

The important limitation of a descriptive model is that it is limited to describing the data on which it was trained. You have no idea how accurate a predictive model it is.

## Modeling a Target Function

Consider a made-up classification problem, the goal of which is to classify data instances as either red or green.

For this problem, assume that there exists a perfect model, or a perfect function that can correctly discriminate any data instance from the domain as red or green. In the context of a specific problem, the perfect discrimination function very likely has profound meaning in the problem domain to the domain experts. We want to think about that and try to tap into that perspective. We want to deliver that result.

Our goal when making a predictive model for this problem is to best approximate this perfect discrimination function.

We build our approximation of the perfect discrimination function using sample data collected from the domain. It’s not all the possible data, it’s a sample or subset of all possible data. If we had all the data, there would be no need to make predictions because the answers could just be looked up.

The data we use to build our approximate model contains structure within it pertaining to the ideal discrimination function. Your goal with data preparation is to best expose that structure to the modeling algorithm. The data also contains things that are irrelevant to the discrimination function, such as biases from the selection of the data and random noise that perturbs and hides the structure. The model you select to approximate the function must navigate these obstacles.

This framework helps us understand the deeper difference between a descriptive and a predictive model.

## Descriptive vs Predictive Models

The descriptive model is only concerned with modeling the structure in the observed data. It makes sense to train and evaluate it on the same dataset.

The predictive model is attempting a much more difficult problem: approximating the true discrimination function from a sample of data. We want to use algorithms that do not pick out and model all of the noise in our sample. We do want to choose algorithms that generalize beyond the observed data. It makes sense that we can only evaluate the ability of the model to generalize from a data sample on data that it has not seen before during training.

The best descriptive model is accurate on the observed data. The best predictive model is accurate on unobserved data.

## Overfitting

The flaw with evaluating a predictive model on training data is that it does not inform you on how well the model has generalized to new unseen data.

A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting, and it’s more insidious than you think.
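A small simulation can make this concrete. The sketch below is not from the post: it assumes a one-dimensional red/green problem whose true rule is "green if x > 0.5", with 25% of labels flipped as noise, and compares a memorising 1-nearest-neighbour model against a single fitted cut-off. All names and settings are illustrative.

```python
import random

random.seed(1)

NOISE = 0.25  # assumed fraction of labels flipped by random noise


def sample(n):
    """Draw n points; the true rule is 'green if x > 0.5', plus label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        label = "green" if x > 0.5 else "red"
        if random.random() < NOISE:
            label = "red" if label == "green" else "green"
        data.append((x, label))
    return data


train, test = sample(300), sample(2000)


def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)


def nearest_neighbour(x):
    """Memorising model: copy the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_threshold(data):
    """Simple model: pick the single cut-off that best fits the training data."""
    candidates = [i / 20 for i in range(21)]
    return max(
        candidates,
        key=lambda t: accuracy(lambda x: "green" if x > t else "red", data),
    )


cut = fit_threshold(train)


def threshold_rule(x):
    return "green" if x > cut else "red"


for name, model in [("1-NN", nearest_neighbour), ("threshold", threshold_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorising model scores perfectly on the training data because it has also memorised the noise. Judged on training accuracy it would be selected, yet under these assumptions the simpler rule should do better on the held-out sample.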

For example, you may want to stop training your model once the accuracy stops improving. In this situation, there will be a point where the accuracy on the training set continues to improve but the accuracy on unseen data starts to degrade.

You may be thinking to yourself: “*so I’ll train on the training dataset and peek at the test dataset as I go*”. A fine idea, but now the test dataset is no longer unseen data, because it has been involved in, and has influenced, the training process.
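One common way around this is to hold back a separate validation dataset for the stopping decision and keep the test dataset untouched until the very end. The sketch below assumes that setup; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the epoch whose weights should be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical per-epoch losses: training keeps improving while the
# validation loss turns around after epoch 3.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
validation_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66]

print(early_stopping_epoch(validation_losses))  # 3
```

Because the stopping point is chosen using validation data only, the test dataset can still give an honest estimate afterwards.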

## Tackling Overfitting

You must test your model on unseen data to counter overfitting.

A 66%/34% split of the data into training and test datasets is a good start. Using cross validation is better, and using multiple runs of cross validation is better again. You want to spend the time and get the best estimate of the model’s accuracy on unseen data.

You can often increase the accuracy of your model on unseen data by decreasing its complexity.
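The three resampling schemes can be sketched with index lists alone. This is a minimal standard-library illustration with made-up function names, not a reference implementation; scikit-learn offers ready-made equivalents such as `train_test_split`, `KFold` and `RepeatedKFold` in `sklearn.model_selection`.

```python
import random


def train_test_split_indices(n, test_fraction=0.34, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; every index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_k_fold_indices(n, k=10, repeats=3, seed=0):
    """Repeat k-fold cross validation with a different shuffle each time."""
    for r in range(repeats):
        yield from k_fold_indices(n, k, seed + r)


train, test = train_test_split_indices(150)
print(len(train), len(test))  # 99 51
```

In each scheme the model is fit on the `train` indices and scored on the `test` indices; with cross validation the scores from all splits are averaged.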

In the case of decision trees, for example, you can prune the tree (delete leaves) after training. This will decrease the amount of specialisation to the specific training dataset and increase generalisation on unseen data. If you are using regression, for example, you can use regularisation to constrain the complexity (magnitude of the coefficients) during the training process.
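As a rough illustration of regularisation constraining coefficient magnitude, the sketch below fits a one-coefficient regression with an L2 (ridge) penalty. The data and the function name `fit_slope` are made up for this example.

```python
def fit_slope(xs, ys, penalty=0.0):
    """Least-squares slope for y = w * x with an L2 penalty on w (ridge).

    Minimises sum((y - w*x)^2) + penalty * w^2, whose closed-form
    solution is w = sum(x*y) / (sum(x*x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


# Made-up data for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the coefficient towards zero.
for penalty in (0.0, 1.0, 10.0):
    print(penalty, fit_slope(xs, ys, penalty))
```

The penalty strength is itself a setting to tune, and it should be chosen using held-out data rather than the training score.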

## Summary

In this post you learned the important framework of phrasing the development of a predictive model as an approximation of an unknown ideal discrimination function.

Under this framework you learned that evaluating the model on training data alone is insufficient. You learned that the best and most meaningful way to evaluate the ability of a predictive model to generalize is to evaluate it on unseen data.

This intuition provided the basis for why it is critical to use train/test splits, cross validation and ideally multiple runs of cross validation in your test harness when evaluating predictive models.

I want information about testing versus training data.

Great article, I just started learning machine learning and was wondering why they split the data.

Question: So suppose I split my data as 80% Training and 20% Testing (80 and 20 observations out of 100). The Root Mean Square Error (RMSE) of the training data is calculated using 80 observations. On the other hand, the RMSE of the test data is calculated using only 20 observations. Is that correct?

Thanks

I’m glad you found it useful Andy.

Yes. Training RMSE is calculated on the training dataset, testing RMSE on the test dataset. The test dataset RMSE gives you a rough idea of how well the method will perform on new data.

Cross-validation will give an even better idea as it is more robust.

Once you find a model that looks like it will do very well, you train it on all of your training data and start using it in production to make predictions.

I hope that helps.
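To make the point about the two RMSE values concrete, here is a minimal sketch. The numbers are hypothetical and the `rmse` helper is written out for illustration; each call only ever sees the rows of its own dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical targets and model predictions for a 4-row training set
# and a 2-row test set.
train_actual, train_predicted = [3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 8.0]
test_actual, test_predicted = [4.0, 8.0], [5.0, 6.0]

print(rmse(train_actual, train_predicted))  # uses only the training rows
print(rmse(test_actual, test_predicted))    # uses only the test rows
```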

Hi Jason, is there any way to evaluate how many attributes are needed to achieve the peak performance of your algorithm? Let’s say you have 30000 instances and 300 attributes in your original set: could someone use the Experimenter in Weka to perform feature selection (reduction of dimensionality)?

Thanks

Hi Nikos,

This is problem specific. Experimentation and trial and error would get you a clear answer.

Jason, thanks for the post. Do you know if there is any practice to ‘finish off’ training a model with all the data? Some way to keep the generalized nature of the model but include information from all the data. I was thinking something like….

1. Train a model using only the training data until the accuracy of the test data starts to decrease.

2. Train the model for one last epoch with a very small learning rate to use all the data.

I’ve tried this and it hasn’t worked for me. Wondered if you knew of a better solution.

Thanks,

Hi Will,

Generally, we use cross-validation and such to find the algorithm and parameters that best suit the problem.

Then we train the model + parameters using all of the available data and start using it to make predictions.

Does that help?

A very good post on machine learning. Thanks, Jason Brownlee.

You’re welcome Baouche.

How do you determine whether the model is overfitted or underfitted? You calculate RMSE or, say, a cross-validation score, but what value should the RMSE have for it not to count as an error? In other words, how low should the RMSE be for there to be no error? How do you determine this?

It is easier for iterative algorithms, like neural nets as we can plot learning curves:

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

It is much harder to diagnose with a point evaluation on a train and test set.

When a linear model is used to predict the training dataset itself, the result is always such that the sum of the actual values equals the sum of the predicted values. Is this normal? Is there any reason for this behaviour?
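For ordinary least squares with an intercept term this is expected: the fitted intercept makes the training residuals sum to zero, so the predictions and the actual values have the same total. The sketch below checks this on made-up data. It assumes a one-variable least-squares fit, which may differ from the commenter's setup.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The intercept forces the line through (mean_x, mean_y), which is
    # what makes the residuals on the training data sum to zero.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

intercept, slope = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
print(sum(ys), sum(predicted))  # equal up to floating-point rounding
```

The equality is only guaranteed on the data used for fitting; on a separate test set the two sums will generally differ.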

Thanks for the post, it’s really helpful… Please is it a standard practice to split a dataset randomly, say 70-30, and if so, what is the method called?

can it still be regarded as a form of cross validation?

Yes, or similar. It is called a train/test split – a type of data resampling method.

Hi Jason,

I am currently working on a classifier model.

The data has been generated by a random sampling method.

Every time we receive new data from the machine we feed it to the model; we are not splitting it into train, validation and test sets.

We are training the model on new data on an almost daily basis.

Our model (using xgboost) does not show drastic changes in its ROC_AUC score or confusion matrix.

Do you think this is a good method? Please advise.

There is no one best way.

I would recommend brainstorming alternative approaches (e.g. updating the model instead of refitting it from scratch, do nothing, etc…) and compare the approaches to see how they impact the skill of predictions.

Hi Jason,

I am now working on a classifier model using Keras. I have 20k images, 80% for training and 20% for testing. I use a validation split as my validation data when fitting the model, but my training and validation accuracy converge very quickly: they go from 0.7 to 0.9, and in the end my validation accuracy reaches 1.0. I don’t understand whether my classifier is overfitting or not?

Your model is overfitting if skill on the training data continues to improve while skill on the validation data gets worse.

I would recommend collecting loss information during training and plotting learning curves to diagnose whether your model is overfitting.

Hi Jason,

Suppose I am building a predictive model using logistic regression. I split the data 70-30 and my model predicts fine on validation data. Now when I apply it to a new dataset, the model somehow fails to predict (accuracy goes down).

1) What steps should one take in this situation to improve the model’s accuracy on the new dataset?

2) What is good practice to avoid such scenarios in future?

It looks like you are asking a general question about how to develop robust models.

This checklist has a suite of ideas to try:

http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

Hey Jason, I’m curious your take on this scenario.

Let’s say you’re using ExtraTreesClassifier from sklearn to do binary classification. You use train/test split with 20/80 split where you’re training on only 20 %, and predicting on 80%. You don’t have max_depth specified on the model. You achieve 99%+ on both train and test.

By definition, both scores don’t fall under the definition of over-fitting (I think). Is this a bad thing in your opinion? To take the scenario a bit further, imagine you up-sampled your dataset from 500,000 to 1,000,000 because of so few “1” (a major class imbalance).

Curious your thoughts.

Sounds fine.

Hello Jason,

What is your opinion of online machine learning algorithms? I don’t think you have any posts about them. I suspect that these models are less vulnerable to overfitting. Unlike traditional algorithms that rely on batch learning methods, online models update their parameters after each training instance. I suspect that these algorithms are more capable of adapting to new information.

Thanks

It really depends on the data.

I hope to cover online learning in more detail, thanks for the prompt!

Hello Jason,

Great post! I have a related question.

In my understanding, the 7:3 / 8:2 rule-of-thumb ratio for the training/test split is supposed to ensure that your trained model does not overfit the data.

What happens when the distributions of the training and test sets are different (covariate shift)?

Intuitively, in the case of the Iris dataset, a 9:1 split would mean that the model gets over-specified: it would perform too well on a test set gathered from the same Iris dataset, while it would probably perform badly on a test set gathered from a different set of data consisting of the same attributes of the same flower types. Because the noise, and thus the distribution, is different, right?

However, if the two datasets (the Iris, and the other one used for the test) are representative of the population enough (if they are large enough), even if the datasets are different, their distribution should be more or less the same. Thus even a 9:1 ratio wouldn’t mean overfitting, would it?

To go even further, in the case when only the Iris dataset is representative of the population (large) enough, but it performs well on several much smaller datasets with slightly different distributions, wouldn’t even a crazy train/split ratio of e.g. 99/1 be legit?

Would be very grateful for your insights on this.

You can still overfit, it really depends on the data/project/methods used, etc.

A chosen train/test ratio may reduce the bias in the estimate of the error. In that regard, k-fold CV does a better job in general.

Ideally, we do seek similar distributions between the samples, e.g. we can look at univariate summary statistics.

Not sure I follow your comment about extreme data splits sorry.

Thank you very much for the quick answer!

I see, yes the data quality and the method also need to be considered.

About the extreme data split:

Let’s say I am building a model to predict petal lengths. If I have a 99:1 train:test split ratio, it would definitely cause overfitting if the training and test sets are from the same dataset.

However, if the training and test sets are from different sources (the training set is from a huge dataset A, the test set is from another dataset B) and

1. the machine learning method used is adequate for the prediction of petal length

2. the quality of the test set (dataset B) is good (containing very few outlying petal lengths)

3. the size of the test set is reasonably large (no smaller than, let’s say, 2000 data points)

4. the achieved recognition accuracy is good (e.g. 80% for 3 classes)

then even a 99:1 train:test split ratio would be legitimate to prove the adequacy of my model?

And would a good accuracy also show that the distributions of petal lengths are not only similar in datasets A and B, but that these datasets are also representative of the population?

If you have time to comment on this one more time it is much appreciated!

I cannot agree in general.

Generalized ideas like this do not survive in applied machine learning, each dataset is different and requires controlled experiments in order to gather data to understand what is going on.

Indeed, it is good to fit models on data that is representative of the domain.

Nice take on the overfitting issue. I didn’t think of the data as the best model, but it makes so much sense! Of course, we create models or train machine learning algorithms to approach a level of accuracy, efficacy or confidence similar to the one we would have if we had each possible label for each possible instance of data from the domain we are studying.

Certainly a refreshing point of view. Thanks!

Glad it helped.

Very good article.

Do you think having test accuracy 100% while train accuracy 95% with logistic regression is caused by overfitting? Although with SVM, in the same datasets, test accuracy, is smaller than train accuracy, which sounds more logical.

Perhaps underfitting or a good fit.

It would be overfitting if skill on train was much better than test.

Hi Jason,

Good article. I have a related question. I am trying to create a model using gradient boosting, with RMSE for evaluation. The RMSE value always decreases and stays at roughly the same value across training, a 70:30 split validation, a 5:95 split validation, cross-validation with 5~10 folds, and multiple cross-validation. But when testing on unknown data, the error is always 3x bigger. What is happening?

Thank you.

Some ideas:

Perhaps the new data is too different to the training data?

Perhaps the model has overfit the training data?

I am trying to build a normal data description model from normal examples in an anomaly detection problem. I split the normal data in a 50-50 ratio for training and validation, and I also use abnormal data to validate. I get the following (train losses decrease regularly, then slowly, and then stop decreasing; Train_acc increases quickly to 100% while val_acc is low and increases slowly):

Epoch 0/10 Accuracy: 99.175, Loss: 1.144, Val_Acc: 3.840, Val_Loss: 0.666

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 16.903

Epoch 1/10 Accuracy: 100.000, Loss: 0.193, Val_Acc: 7.831, Val_Loss: 0.564

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 11.240

Epoch 2/10 Accuracy: 100.000, Loss: 0.190, Val_Acc: 9.921, Val_Loss: 0.527

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 9.423

Epoch 3/10 Accuracy: 100.000, Loss: 0.188, Val_Acc: 10.811, Val_Loss: 0.504

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.916

Epoch 4/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 11.621, Val_Loss: 0.486

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.539

Epoch 5/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 11.971, Val_Loss: 0.471

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.112

Epoch 6/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 12.381, Val_Loss: 0.458

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.527

Epoch 7/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 12.781, Val_Loss: 0.446

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.696

Epoch 8/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 13.161, Val_Loss: 0.435

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.618

Epoch 9/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 13.751, Val_Loss: 0.425

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.640

Could my model be overfitting? How can I improve my model?

Many thanks

I have some suggestions here:

http://machinelearningmastery.com/improve-deep-learning-performance/

Hi Dr.Jason,

I am trying to build a normal data description model from normal examples in training for an anomaly detection problem. I split the normal data in a 50-50 ratio for training and validation, and I also use abnormal data to validate. I get the following (all losses decrease regularly; Train_acc increases quickly while val_acc is low and increases slowly):

Epoch 0/10 Accuracy: 99.175, Loss: 1.144, Val_Acc: 3.840, Val_Loss: 0.666

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 16.903

Epoch 1/10 Accuracy: 100.000, Loss: 0.193, Val_Acc: 7.831, Val_Loss: 0.564

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 11.240

Epoch 2/10 Accuracy: 100.000, Loss: 0.190, Val_Acc: 9.921, Val_Loss: 0.527

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 9.423

Epoch 3/10 Accuracy: 100.000, Loss: 0.188, Val_Acc: 10.811, Val_Loss: 0.504

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.916

Epoch 4/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 11.621, Val_Loss: 0.486

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.539

Epoch 5/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 11.971, Val_Loss: 0.471

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 8.112

Epoch 6/10 Accuracy: 100.000, Loss: 0.187, Val_Acc: 12.381, Val_Loss: 0.458

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.527

Epoch 7/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 12.781, Val_Loss: 0.446

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.696

Epoch 8/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 13.161, Val_Loss: 0.435

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.618

Epoch 9/10 Accuracy: 100.000, Loss: 0.186, Val_Acc: 13.751, Val_Loss: 0.425

=============================== Val_Acc_abn: 100.000, Val_Loss_abn: 7.640

and result on test data:

test_normal Accuracy 6.306

test_abn1 Accuracy 100.000

test_abn2 Accuracy 100.000

test_abn3 Accuracy 100.000

Could my model be overfitting? How can I improve my model?

Many thanks

You can review learning curves of your data to see if the model has overfit.

Hi Jason

Thanks again for your wonderful blog. I built a model using 80% training and 20% test. I used multiple runs of k-fold cross-validation and controlled for the uneven classes with stratified samples between training and test and in the folds. I also used the 1SE-less-than-optimal choice of model to protect against overfitting. The training model showed 72% accuracy and the test results showed 68%, so a 4% drop. Are there any benchmarks on this drop in accuracy? I have been searching. Thanks!!

Well done!

Generally, I don’t recommend accuracy when working with imbalanced datasets, perhaps try F1, precision, recall or a similar measure.
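A quick way to see the problem with accuracy on imbalanced data is to score a model that always predicts the majority class. The sketch below uses a made-up dataset with 5 positives in 100 rows, and the helper `precision_recall_f1` is written out for illustration, returning 0.0 where a ratio is undefined.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up imbalanced labels: 5 positives, 95 negatives.
actual = [1] * 5 + [0] * 95
always_negative = [0] * 100  # a "model" that never predicts the minority class

accuracy = sum(a == p for a, p in zip(actual, always_negative)) / len(actual)
print(accuracy)  # 0.95
print(precision_recall_f1(actual, always_negative))  # (0.0, 0.0, 0.0)
```

The accuracy looks strong even though the model never finds a single positive case, which precision, recall and F1 make obvious.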

Thanks! In the end I compared Prevalence and Detected Prevalence by class, as I was trying to predict an election and wanted to see how each candidate fared in my model. However, I was interested in benchmarking how much variance I may have introduced, as evidenced by the drop in accuracy between training and testing data.

Fascinating, well done!

I have studied model building well but still have doubts about training and test datasets. What should I do? Please advise.

Perhaps this will help:

https://machinelearningmastery.com/difference-test-validation-datasets/

Hello Jason! Great post as always! I have followed the advice and split a 70/30 split with cross-validation on my highly unbalanced data set and achieve a favorable 60/90 PR at a threshold of .5 (classification problem). But, my dataset is arbitrarily limited to about 2 million records, ~1.4 million for train 600k for ‘test’. In production, the true scoring dataset will be more like 4 million (which is much larger than 600k!) and the precision just tanks from .6 to .2 (but recall stays at .9). In a perfect world I’d train on a larger true dataset, but due to available resources I can only really train on 1.4-ish million records. Do you have any tips to preserve precision when scoring on a much larger dataset compared to the model’s training data?

Some added notes:

-Recall of .9 was achieved only by random oversampling with replacement on minority obs until 1:1 for the training dataset

-Using XGBoost, 250 iterators and a depth of 8 on final model

Since I’m limited to a training/testing dataset smaller than what is actually going to be scored against, I can’t do the obvious thing and just train on a larger more realistic dataset. I’m going to start trying some ensemble approaches: training several models and either warm starting or mean voting. In the case of warm starting, my first model might be oversampled to capture the true positives’ info gain, but subsequent models maybe not. Or have subsequent models use fewer than the 200 features. Anyways, thanks for reading!

Great question, very tough.

Perhaps the dataset is too small/different from prod? Or perhaps you’re overfitting the training set?

I wonder if you can run a sensitivity analysis on the model vs dataset size and see what is going on there. Also, tuning the prediction threshold towards what you care about more – like really bias it – might help in prod.

Eager to hear how you go Eric.
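The effect of moving the prediction threshold can be sketched on a handful of scored examples. The scores and labels below are hypothetical, and the helper uses the convention that precision is 1.0 when nothing is predicted positive.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and true labels (1 = minority class).
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for threshold in (0.5, 0.75, 0.9):
    print(threshold, precision_recall_at(threshold, scores, labels))
```

On this toy data, raising the threshold from 0.5 to 0.9 lifts precision from 4/7 to 1.0 while recall falls from 1.0 to 0.5, so the threshold should be chosen to suit whichever error matters more in production.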

Hi

I’m dealing with an imbalanced dataset. When I performed classification, I got a training accuracy of 0.991667 & a test accuracy of 1.0.

However, I’m more concerned about the f1 scores & the average f1 score of the model, since it’s an imbalanced dataset. The avg f1 score of the model is 1.0.

With these measures how can I determine if my model is overfitting or underfitting. Does having f1 scores close to 1.0 for each class, mean that the model is not overfitting or underfitting?

Thanks

San

I recommend not using accuracy, here’s why:

https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/

An average F1 of 1.0 sounds great, as long as you are using repeated stratified k-fold cross-validation to collect your scores.

How can I check whether my model is overfitting or underfitting when I am using f1_score as the metric?

Look at the loss value, and follow the instructions here:

https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

Hi Jason,

Is it possible to overfit the test data? For example, let’s say I compare the performance (e.g., measured by AUC) of a logistic regression versus a CART predictive model, and then choose the model with the highest AUC on the test data. That seems fine. But if I take it to the extreme and compare the AUC performance of thousands of predictive models (e.g., random forest, elastic net, tweaking the pruning parameter in CART, etc.), and then choose the model with the highest AUC on the test data, I wouldn’t expect that same model to perform well on a new test data set. I haven’t seen this concern raised too often, so would appreciate your thoughts.

Thanks,

Brent

Yes, here’s an example:

https://machinelearningmastery.com/train-to-the-test-set-in-machine-learning/

And here:

https://machinelearningmastery.com/hill-climb-the-test-set-for-machine-learning/

This is why a robust test harness is critical, e.g. repeated stratified k-fold cross-validation, perhaps even nested.
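The effect Brent describes can be reproduced with a deliberately extreme simulation. In the sketch below (an illustration only; all names and sizes are made up) the labels are coin flips, so no model can truly beat 50% accuracy, yet picking the best of 1,000 random "models" on a small test set yields a flattering score.

```python
import random

rng = random.Random(42)


def random_labels(n):
    return [rng.randint(0, 1) for _ in range(n)]


def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Labels are pure coin flips, so no model can truly beat 50% accuracy.
test_labels = random_labels(50)
fresh_labels = random_labels(5000)

# 1,000 hypothetical "models", each of which just guesses at random.
best_test_accuracy, best_model = 0.0, None
for model_id in range(1000):
    score = accuracy(random_labels(50), test_labels)
    if score > best_test_accuracy:
        best_test_accuracy, best_model = score, model_id

# The winning model is still a random guesser when it meets new data.
fresh_accuracy = accuracy(random_labels(5000), fresh_labels)
print(best_test_accuracy, fresh_accuracy)
```

The selected score reflects the luck of the search rather than skill, which is why the final estimate should come from data that played no part in model selection.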