What is the Difference Between Test and Validation Datasets?

A validation dataset is a sample of data held back from training your model that is used to estimate model skill while tuning the model's hyperparameters.

The validation dataset is different from the test dataset, which is also held back from the training of the model, but is instead used to give an unbiased estimate of the skill of the final tuned model when comparing or selecting between final models.

There is much confusion in applied machine learning about what a validation dataset is exactly and how it differs from a test dataset.

In this post, you will discover clear definitions for train, test, and validation datasets and how to use each in your own machine learning projects.

After reading this post, you will know:

  • How experts in the field of machine learning define train, test, and validation datasets.
  • The difference between validation and test datasets in practice.
  • Procedures that you can use to make the best use of validation and test datasets when evaluating your models.

Let’s get started.


Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. What is a Validation Dataset by the Experts?
  2. Definitions of Train, Validation, and Test Datasets
  3. Validation Dataset Is Not Enough
  4. Validation and Test Datasets Disappear

What is a Validation Dataset by the Experts?

I find it useful to see exactly how datasets are described by the practitioners and experts.

In this section, we will take a look at how the train, test, and validation datasets are defined and how they differ according to some of the top machine learning texts and references.

Generally, the term “validation set” is used interchangeably with the term “test set” and refers to a sample of the dataset held back from training the model.

Evaluating model skill on the training dataset would result in a biased score. Therefore the model is evaluated on the held-out sample to give an unbiased estimate of model skill. This is typically called the train-test split approach to algorithm evaluation.

Suppose that we would like to estimate the test error associated with fitting a particular statistical learning method on a set of observations. The validation set approach […] is a very simple strategy for this task. It involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate — typically assessed using MSE in the case of a quantitative response—provides an estimate of the test error rate.

— Gareth James, et al., Page 176, An Introduction to Statistical Learning: with Applications in R, 2013.

We can see this interchangeability directly in Kuhn and Johnson's excellent text “Applied Predictive Modeling”. In this example, they are careful to point out that the final model evaluation must be performed on a held-out dataset that has not been used previously, either for training the model or for tuning the model parameters.

Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness. When a large amount of data is at hand, a set of samples can be set aside to evaluate the final model. The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to qualify performance.

— Max Kuhn and Kjell Johnson, Page 67, Applied Predictive Modeling, 2013

Perhaps traditionally the dataset used to evaluate the final model performance is called the “test set”. The importance of keeping the test set completely separate is reiterated by Russell and Norvig in their seminal AI textbook. They refer to using information from the test set in any way as “peeking”. They suggest locking the test set away completely until all model tuning is complete.

Peeking is a consequence of using test-set performance to both choose a hypothesis and evaluate it. The way to avoid this is to really hold the test set out—lock it away until you are completely done with learning and simply wish to obtain an independent evaluation of the final hypothesis. (And then, if you don’t like the results … you have to obtain, and lock away, a completely new test set if you want to go back and find a better hypothesis.)

— Stuart Russell and Peter Norvig, page 709, Artificial Intelligence: A Modern Approach, 2009 (3rd edition)

Importantly, Russell and Norvig comment that the training dataset used to fit the model can be further split into a training set and a validation set, and that it is this subset of the training dataset, called the validation set, that can be used to get an early estimate of the skill of the model.

If the test set is locked away, but you still want to measure performance on unseen data as a way of selecting a good hypothesis, then divide the available data (without the test set) into a training set and a validation set.

— Stuart Russell and Peter Norvig, page 709, Artificial Intelligence: A Modern Approach, 2009 (3rd edition)

This definition of validation set is corroborated by other seminal texts in the field. A good (and older) example is the glossary of terms in Ripley’s book “Pattern Recognition and Neural Networks.” Specifically, training, validation, and test sets are defined as follows:

– Training set: A set of examples used for learning, that is to fit the parameters of the classifier.

– Validation set: A set of examples used to tune the parameters of a classifier, for example to choose the number of hidden units in a neural network.

– Test set: A set of examples used only to assess the performance of a fully-specified classifier.

— Brian Ripley, page 354, Pattern Recognition and Neural Networks, 1996

These are the recommended definitions and usages of the terms.

A good example that these definitions are canonical is their reiteration in the famous Neural Network FAQ. In addition to reiterating Ripley’s glossary definitions, it goes on to discuss the common misuse of the terms “test set” and “validation set” in applied machine learning.

The literature on machine learning often reverses the meaning of “validation” and “test” sets. This is the most blatant example of the terminological confusion that pervades artificial intelligence research.

The crucial point is that a test set, by the standard definition in the NN [neural net] literature, is never used to choose among two or more networks, so that the error on the test set provides an unbiased estimate of the generalization error (assuming that the test set is representative of the population, etc.).

— Neural Network FAQ, subject: “What are the population, sample, training set, design set, validation set, and test set?”

Do you know of any other clear definitions or usages of these terms, e.g. quotes from papers or textbooks?
Please let me know in the comments below.

Definitions of Train, Validation, and Test Datasets

To reiterate the findings from researching the experts above, this section provides unambiguous definitions of the three terms.

  • Training Dataset: The sample of data used to fit the model.
  • Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
  • Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

We can make this concrete with a pseudocode sketch:
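For example, here is a minimal runnable sketch in Python. The toy data, the threshold "model", and the `fit`/`evaluate` helpers are illustrative assumptions, not a real learning algorithm; the point is the flow of data between the three sets.

```python
import random

random.seed(1)

# Toy data: each point x in [0, 1) is labelled 1 if x > 0.5, else 0.
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(300))]

# Split into training, validation, and test sets (60/20/20).
train, validation, test = data[:180], data[180:240], data[240:]

def fit(samples, threshold):
    # Toy "model": the decision threshold itself; a real fit() would
    # estimate parameters from the training samples.
    return threshold

def evaluate(model, samples):
    # Classification accuracy of the rule "predict 1 when x > model".
    return sum(int(x > model) == y for x, y in samples) / len(samples)

# Tune the hyperparameter using the validation dataset.
best_skill, best_params = 0.0, None
for params in [0.3, 0.4, 0.5, 0.6, 0.7]:
    model = fit(train, params)
    skill = evaluate(model, validation)
    if skill > best_skill:
        best_skill, best_params = skill, params

# Evaluate the final, tuned model once on the held-out test dataset.
final_model = fit(train, best_params)
test_skill = evaluate(final_model, test)
print(best_params, test_skill)
```

Note that the test set is touched exactly once, after tuning is complete, which is what keeps its skill estimate unbiased.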

Below are some additional clarifying notes:

  • The validation dataset may also play a role in other forms of model preparation, such as feature selection.
  • The final model could be fit on the aggregate of the training and validation datasets.

Are these definitions clear to you for your use case?
If not, please ask questions below.

Validation Dataset Is Not Enough

There are other ways of calculating an unbiased (or progressively more biased, in the case of the validation dataset) estimate of model skill on unseen data.

One popular example is to use k-fold cross-validation to tune model hyperparameters instead of a separate validation dataset.

In their book, Kuhn and Johnson have a section titled “Data Splitting Recommendations” in which they lay out the limitations of using a sole “test set” (or validation set):

As previously discussed, there is a strong technical case to be made against a single, independent test set:

– A test set is a single evaluation of the model and has limited ability to characterize the uncertainty in the results.
– Proportionally large test sets divide the data in a way that increases bias in the performance estimates.
– With small sample sizes:
  – The model may need every possible data point to adequately determine model values.
  – The uncertainty of the test set can be considerably large to the point where different test sets may produce very different results.
– Resampling methods can produce reasonable predictions of how well the model will perform on future samples.

— Max Kuhn and Kjell Johnson, Page 78, Applied Predictive Modeling, 2013

They go on to recommend, for small sample sizes, using 10-fold cross-validation in general, because of the desirable low bias and variance properties of the performance estimate, and the bootstrap method when comparing model performance, because of its low variance.

For larger sample sizes, they again recommend a 10-fold cross-validation approach, in general.
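The bootstrap method they mention draws rows with replacement; the rows never drawn ("out-of-bag") serve as the evaluation set for that resample. A hedged sketch of a single bootstrap resample (the toy data is illustrative; in practice you would fit and score a model on each resample and repeat many times):

```python
import random

random.seed(2)

# Toy dataset of 20 "rows", identified by index.
data = list(range(20))

# A bootstrap resample: draw len(data) rows with replacement.
chosen = [random.randrange(len(data)) for _ in range(len(data))]
boot_train = [data[i] for i in chosen]

# Rows never drawn are "out-of-bag" and act as the evaluation set for
# this resample; on average about 36.8% of rows are out-of-bag.
out_of_bag = [row for i, row in enumerate(data) if i not in set(chosen)]

print(len(boot_train), len(out_of_bag))
```

Averaging the out-of-bag scores over many resamples is what gives the low-variance performance estimate noted above.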

Validation and Test Datasets Disappear

It is more than likely that you will not see references to training, validation, and test datasets in modern applied machine learning.

Reference to a “validation dataset” disappears if the practitioner is choosing to tune model hyperparameters using k-fold cross-validation with the training dataset.

We can make this concrete with a pseudocode sketch as follows:
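Here is a minimal runnable sketch in Python, using the same toy threshold "model" as before (the data and helper names are illustrative assumptions): each hyperparameter value is scored by k-fold cross-validation within the training dataset, so no separate validation dataset appears.

```python
import random

random.seed(1)

# Toy data: each point x in [0, 1) is labelled 1 if x > 0.5, else 0.
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]
train, test = data[:160], data[160:]

def fit(samples, threshold):
    # Toy "model": the decision threshold itself.
    return threshold

def evaluate(model, samples):
    return sum(int(x > model) == y for x, y in samples) / len(samples)

def cv_split(samples, k, i):
    # Fold i is held out; the remaining k-1 folds are used for fitting.
    held_out = samples[i::k]
    rest = [s for j in range(k) if j != i for s in samples[j::k]]
    return rest, held_out

# Tune the hyperparameter with k-fold cross-validation on the training
# dataset; the "validation dataset" is now just the held-out fold.
k = 10
best_skill, best_params = 0.0, None
for params in [0.3, 0.4, 0.5, 0.6, 0.7]:
    scores = []
    for i in range(k):
        fold_train, fold_validation = cv_split(train, k, i)
        model = fit(fold_train, params)
        scores.append(evaluate(model, fold_validation))
    skill = sum(scores) / k
    if skill > best_skill:
        best_skill, best_params = skill, params

# Evaluate the final, tuned model once on the held-out test dataset.
final_model = fit(train, best_params)
print(best_params, evaluate(final_model, test))
```

If this whole tuning loop were itself wrapped in an outer cross-validation over the full dataset (nested cross-validation), the explicit test dataset would disappear as well.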

Reference to the “test dataset” may also disappear if the cross-validation of model hyperparameters using the training dataset is nested within a broader cross-validation of the model.

Ultimately, all you are left with is a sample of data from the domain, which we may rightly continue to refer to as the training dataset.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Do you know of any other good resources on this topic? Let me know in the comments below.


Summary

In this tutorial, you discovered that there is much confusion around the terms “validation dataset” and “test dataset,” and how to navigate these terms correctly when evaluating the skill of your own machine learning models.

Specifically, you learned:

  • That there is clear precedent for what “training dataset,” “validation dataset,” and “test dataset” refer to when evaluating models.
  • That the “validation dataset” is predominantly used to describe the evaluation of models when tuning hyperparameters and preparing data, and the “test dataset” is predominantly used to describe the evaluation of a final tuned model when comparing it to other final models.
  • That the notions of “validation dataset” and “test dataset” may disappear when adopting alternate resampling methods like k-fold cross-validation, especially when the resampling methods are nested.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

43 Responses to What is the Difference Between Test and Validation Datasets?

  1. Jacob Sanders July 14, 2017 at 5:11 pm #

    Hey nice article! I’m a fan of your posts here but whenever I try to print these articles to carry with me on the metro to read there are these big floating advertisements for your machine learning tutorials that basically make it unreadable 🙁

    • Jason Brownlee July 15, 2017 at 9:38 am #

      I’m sorry to hear that. Can you send me a photo of an example?

      • Zhang Yong December 14, 2017 at 9:23 pm #

        I printed another of your articles and ran into the same problem Jacob Sanders mentioned. Every page has an ad (“Get your start in machine learning”) which covers a large space, resulting in unreadability.
        Hence, I can only read your articles online and cannot print them.
        What a pity! : (

        • Jason Brownlee December 15, 2017 at 5:32 am #

          Sorry to hear that. They are designed to be read online.

  2. Milind Mahajani July 14, 2017 at 5:59 pm #

    What do we call the set of data on which the final model is run in the field to get answers — this is not labeled data. How do we evaluate the performance of the final model in the field?

    • Jason Brownlee July 15, 2017 at 9:40 am #

      We don’t need to evaluate the performance of the final model (unless as an ongoing maintenance task).

      Generally, we use train/test splits to estimate the skill of the final model. We do this robustly so that the estimate is as accurate as we can make it – to help choose between models and model configs. We use the estimate to know how good our final model is.

      This post will make it clearer:

  3. Helen July 26, 2017 at 6:29 am #

    Hi thank you for nice article. I want to check the model to see if the model is fair and unbiased but my professor told me with cross validation or 10-fold cross validation or any of this methods we can’t confirm if the model is valid and fair. can you please give me some hints about which method I can use for this problem?

    • Jason Brownlee July 26, 2017 at 8:03 am #

      Yes, k-fold cross validation is an excellent way to calculate an unbiased estimate of the skill of your model on unseen data.

  4. René July 26, 2017 at 5:43 pm #

    Nice article, really helped me to refresh my memories.

    One little note: In your first code example you loop over parameters but you never use params in the loop’s body. I guess it should be used in model = fit(train, params)!?

    Keep up the good work!

    • Jason Brownlee July 27, 2017 at 7:55 am #

      Glad to hear it.

      Thanks for the suggestion – updated.

  5. Steven August 1, 2017 at 11:52 am #

    Hi Jason,

    in the pseudocode of the part “Validation and Test Datasets Disappear”, I still didn’t understand how you used k-fold cross-validation to tune model hyperparameters with the training dataset.

    Could you explain the pseudocode?


    • Jason Brownlee August 2, 2017 at 7:41 am #

      Sure, each set of parameters (param) is evaluated using k-fold cross validation.

      Does that help?

  6. Krish November 14, 2017 at 8:04 am #

    Hi Jason,

    Great article!

    Want to make sure my understanding is correct. If not, please correct me.

    In general, for train-test data approach, the process is to split a given data set into 70% train data set and 30% test data set (ideally). In the training phase, we fit the model on the training data. And now to evaluate the model (i.e., to check how well the model is able to predict on unseen data), we run the model against the test data and get the predicted results. Since we already know what the expected results are, we compare/evaluate predicted and expected results to get the accuracy of the model.
    If the accuracy is not up to the desired level, we repeat the above process (i.e., train the model, test, compare, train the mode, test, compare, …) until the desired accuracy is achieved.

    But in this approach, we are indirectly using the test data to improve our model. So the idea of evaluating the model on unseen data is not achieved in the first place. Therefore ‘validation data set’ comes into picture and we follow the below approach.

    Train the model, run the model against validation data set, compare/evaluate the output results. Repeat until a desired accuracy is achieved.
    Once the desired accuracy is achieved, take the model and run it against the test data set and compare/evaluate the output results to get the accuracy.
    If this accuracy meets the desired level, the model is used for production. If not, we repeat the training process but this time we obtain a new test data instead.

    • Jason Brownlee November 14, 2017 at 10:22 am #


      It is a balancing act of not using too much influence from the “test set” to ensure we can get a final unbiased (less biased or semi-objective) estimate of model skill on unseen data.

      • PriyaSaxena December 21, 2017 at 10:52 pm #

        Hi Jason,

        Thank you for this article. I have a question, though. I’m new to ML and have been working on a case study on credit risk. My data is already divided into three different sets, one each for train, validation, and test. I would start by cleaning the train data (finding NA values, removing outliers in case of a continuous dependent variable). Do I need to clean the validation and test datasets before I proceed with the method given above for checking the model accuracy? Any help would be really appreciated. Thanks.

        • Jason Brownlee December 22, 2017 at 5:33 am #

          Generally, it is a good idea to perform the same data prep tasks on the other datasets.

          • PriyaSaxena December 22, 2017 at 9:20 pm #

            Okay. Thanks so much.

    • Max December 17, 2017 at 10:53 pm #

      @Krish Awesome summary – you hit the point. Thanks for enhancing my understanding

  7. Krish November 15, 2017 at 5:25 am #

    Thanks Jason!

  8. Magnus November 24, 2017 at 12:55 am #

    Hi Jason,

    Again a good overview. However I want to point out one problem when dividing data into these sets. If the distributions between the data sets differ in some meaningful way, the result from the test set may not be optimal. That is, one should try to have similar distributions for all sets, with similar max and min values. If not, the model may saturate or underperform for some values. I have experienced this myself, even with normalised input data. On the other hand, when dealing with multivariate data sets, this is not easy. It would be nice to read a post on this.

  9. Luxferre December 4, 2017 at 1:49 am #

    Hi Jason,

    First, Thanks for your article

    But I am confused about how to implement the train, test, and validation split in Python.

    In my implementation I use the Pandas and Sklearn packages.

    So, how do I implement this in Python?

  10. Luxferre December 4, 2017 at 9:22 pm #


    I have more questions for you because I am still confused:

    1. Do we always have to split the train dataset into train/validation sets?
    2. What happens if I don’t split the train dataset into train/validation sets?

    • Jason Brownlee December 5, 2017 at 5:43 am #

      No you don’t have to split the data, the validation set can be useful to tune the parameters of a given model.

      • Luxferre December 8, 2017 at 1:13 pm #

        Hi Jason,

        if I want to split the data into train/test (90:10), and then split train again into train/validation, does that split also have to be 90:10, or can I use any ratio?

        • Jason Brownlee December 8, 2017 at 2:30 pm #

          The idea of “right” really depends on your problem and your goals.

  11. Austin December 6, 2017 at 1:37 pm #

    Hi Jason,
    I just wish to appreciate you for the very nice explanation. I am clear on the terms.
    But I would like you explain more to me on the tuning of model hyperparameter stuff.

    • Jason Brownlee December 7, 2017 at 7:48 am #

      Thanks Austin.

      Which part of tuning do you need help with?

  12. Nil December 12, 2017 at 11:53 pm #

    Hi, Dr. Jason,

    Thank you for this post. It’s amazing; it has cleared up my misunderstanding of the validation and test sets.

    I have a doubt; maybe it is out of this context, but I think it is connected.
    I want to plot training errors and test errors graphically to see the behavior of the two curves and determine the best parameters of a neural network, but I don’t understand where and how to extract the scores to plot.

    I have already read your post: How to Implement the Backpropagation Algorithm From Scratch In Python, and learned there how to compute the scores using accuracy (I am very thankful for that).

    But now I want to plot the training and test errors in one graphic to see the behavior of the two curves. My problem is that I don’t have an idea of where and how to extract these errors (I think the training errors I can extract during the training process using MSE), but where and how can I extract the test errors to plot?

    Please help me with this if you can. I am still developing a backpropagation algorithm from scratch and evaluating it using k-fold cross-validation (which I learned from your posts).

    Best Regards.

    • Jason Brownlee December 13, 2017 at 5:37 am #

      You could collect all predictions/errors across CV folds or simply evaluate the model directly on a test set.

      • Nil December 13, 2017 at 10:23 pm #

        Hi, Dr. Jason,

        Thank you for your reply.

        I have already tried to do so, but there are two problems I find on my self:

        The first problem is that my predict function returns 0 or 1 each time I call it in the loop; with these values I can calculate the error rate and the accuracy. My predict function uses the forward function, which returns the outputs of the output layer rounded to 0 or 1, so I am getting confused: do I have to calculate these errors using the outputs from the forward function inside the predict function, before rounding to 0 or 1 (output - expected)? Or do I calculate them inside the k-fold CV function after the prediction, using the rounded values 0 or 1 (predictions - expected)?

        The second problem is that in the chart of training error, I plotted the training errors as a function of the epochs. But for the test error, I can’t imagine what the graphic will be a function of, since I need to plot these errors in the same graphic.

        My goal is to find the best point (the needed number of epochs) to stop training the neural network by seeing the training errors beside the test errors. I am using accuracy, but even when I increase or decrease the number of epochs I can’t see the effect on the accuracy, so I need to see these errors side by side to decide the number of epochs needed to train, to avoid overfitting or underfitting.

        I am really stuck at this point, trying to find a way out every day. And this is where I find solutions to most of my practical doubts.

        Best Regards

        • Jason Brownlee December 14, 2017 at 5:37 am #

          Perhaps try training multiple parallel models stopped at roughly the same time and combine their predictions in an ensemble to result in a more robust result?

          • Nil December 14, 2017 at 5:59 pm #

            Hi, Dr. Jason,

            The recommended approach is new to me and seems interesting. Do you have a post where I can see how it works in practice?

            Best regards.

          • Jason Brownlee December 15, 2017 at 5:30 am #

            Do the examples in this post help?

          • Nil December 15, 2017 at 8:55 pm #

            Let me read the post and the examples again; maybe I can find something I didn’t see on the first read.

            Best regards.

  13. samik January 9, 2018 at 11:14 pm #

    Hi Jason,

    What is the industry-standard % split of the data into the 3 data sets, i.e. train, validation, and test?

  14. samik January 10, 2018 at 2:05 pm #

    Do we have an industry-standard % for splitting the data?

    • Jason Brownlee January 10, 2018 at 3:45 pm #

      Depends on data.

      Perhaps 70% for training, 30% for test. Then the same again for splitting training into training(2) and validation.

Leave a Reply