When you first start out with machine learning you load a dataset and try models. You might think to yourself, why can’t I just build a model with all of the data and evaluate it on the same dataset?
It seems reasonable. More data to train the model is better, right? Evaluating the model and reporting results on the same dataset will tell you how good the model is, right?
In this post you will discover the difficulties with this reasoning and develop an intuition for why it is important to test a model on unseen data.
Train and Test on the Same Dataset
If you have a dataset, say the iris flower dataset, what is the best model of that dataset?
The best model is the dataset itself. If you take a given data instance and ask for its classification, you can look that instance up in the dataset and report the correct result every time.
This is the problem you are solving when you train and test a model on the same dataset.
You are asking the model to make predictions on data that it has “seen” before: data that was used to create the model. The best model for this problem is the look-up model described above.
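To make the look-up model concrete, here is a minimal sketch in Python. The three rows are illustrative values only (the feature pair and the name lookup_model are assumptions made for this example, not part of any library).

```python
# Illustrative toy dataset: (sepal_length, petal_length) -> species.
dataset = {
    (5.1, 1.4): "setosa",
    (7.0, 4.7): "versicolor",
    (6.3, 6.0): "virginica",
}

def lookup_model(instance):
    """Return the stored label for a seen instance, or None if it is unseen."""
    return dataset.get(instance)

# Evaluate on the same data that was used to "train" (fill) the model.
correct = sum(lookup_model(x) == y for x, y in dataset.items())
train_accuracy = correct / len(dataset)
print(train_accuracy)         # every seen instance is looked up correctly

unseen = (5.9, 4.2)           # an instance that is not in the dataset
print(lookup_model(unseen))   # the model has no answer for it
```

The model scores perfectly on the data it memorized, yet it has nothing useful to say about an instance it has not stored.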
There are some circumstances where you do want to train a model and evaluate it with the same dataset.
You may want to simplify the explanation of how a predicted variable arises from the data. For example, you may want a set of simple rules or a decision tree that best describes the observations you have collected.
In this case, you are building a descriptive model.
These models can be very useful and can help you in your project or your business to better understand how the attributes relate to the predicted value. You can add meaning to the results with the domain expertise that you have.
The important limitation of a descriptive model is that it is limited to describing the data on which it was trained. You have no idea how accurate a predictive model it is.
Modeling a Target Function
Consider a made-up classification problem, the goal of which is to classify data instances as either red or green.
For this problem, assume that there exists a perfect model, or a perfect function that can correctly discriminate any data instance from the domain as red or green. In the context of a specific problem, the perfect discrimination function very likely has profound meaning in the problem domain to the domain experts. We want to think about that and try to tap into that perspective. We want to deliver that result.
Our goal when making a predictive model for this problem is to best approximate this perfect discrimination function.
We build our approximation of the perfect discrimination function using sample data collected from the domain. It’s not all the possible data, it’s a sample or subset of all possible data. If we had all the data, there would be no need to make predictions because the answers could just be looked up.
The data we use to build our approximate model contains structure within it pertaining to the ideal discrimination function. Your goal with data preparation is to best expose that structure to the modeling algorithm. The data also contains things that are irrelevant to the discrimination function, such as biases from the selection of the data and random noise that perturbs and hides the structure. The model you select to approximate the function must navigate these obstacles.
This framework helps us understand the deeper difference between a descriptive and a predictive model.
Descriptive vs Predictive Models
The descriptive model is only concerned with modeling the structure in the observed data. It makes sense to train and evaluate it on the same dataset.
The predictive model is attempting a much more difficult problem: approximating the true discrimination function from a sample of data. We want to use algorithms that do not pick out and model all of the noise in our sample. We want to choose algorithms that generalize beyond the observed data. It makes sense that we can only evaluate the ability of the model to generalize from a data sample on data that it has not seen before during training.
The best descriptive model is accurate on the observed data. The best predictive model is accurate on unobserved data.
The flaw with evaluating a predictive model on training data is that it does not tell you how well the model has generalized to new unseen data.
A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting, and it’s more insidious than you think.
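The gap between training accuracy and accuracy on unseen data can be sketched with a small simulation. Everything here is an assumption made for illustration: the 1-D red/green target function, the 20% label noise, and the choice of a 1-nearest-neighbour model, which memorizes its training data much like the look-up model.

```python
import random

def target(x):
    """Hypothetical 'perfect' discrimination function for a toy 1-D domain."""
    return "red" if x < 0.5 else "green"

def sample(n, rng, noise=0.2):
    """Draw n instances; flip the true label with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = target(x)
        if rng.random() < noise:
            label = "green" if label == "red" else "red"
        rows.append((x, label))
    return rows

def nearest_neighbour(train, x):
    """Predict the label of the single closest training instance."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, rows):
    """Fraction of rows whose label matches the nearest-neighbour prediction."""
    return sum(nearest_neighbour(train, x) == y for x, y in rows) / len(rows)

rng = random.Random(0)
train, test = sample(200, rng), sample(200, rng)

train_accuracy = accuracy(train, train)   # each instance is its own nearest neighbour
test_accuracy = accuracy(train, test)     # instances the model has never seen
print("train:", train_accuracy, "test:", test_accuracy)
```

Because the model has also memorized the noise, the training score looks perfect, while the score on the unseen sample should typically come out noticeably lower. Only the second number says anything about generalization.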
For example, you may want to stop training your model once its accuracy on the training dataset stops improving. In this situation, there will be a point where the accuracy on the training set continues to improve but the accuracy on unseen data starts to degrade.
You may be thinking to yourself: “so I’ll train on the training dataset and peek at the test dataset as I go”. A fine idea, but now the test dataset is no longer unseen data, as it has been involved in and has influenced the training process.
You must test your model on unseen data to counter overfitting.
A 66%/34% split of the data into training and test datasets is a good start. Using cross validation is better, and using multiple runs of cross validation is better again. You want to spend the time and get the best estimate of the model’s accuracy on unseen data.
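The three evaluation schemes can be sketched in plain Python as follows. The synthetic data, the nearest-centroid model and the helper names (make_data, fit, split_estimate, cross_val_estimate) are all assumptions made for this example. In practice you would more likely use the equivalent utilities in your machine learning library.

```python
import random
from statistics import mean

def make_data(n, rng):
    """Hypothetical 1-D data: class 0 centred at 0.0, class 1 centred at 2.0."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * label, 1.0), label))
    return rows

def fit(train):
    """Nearest-centroid 'model': the mean x value of each class."""
    return {c: mean(x for x, y in train if y == c) for c in {y for _, y in train}}

def accuracy(model, rows):
    """Fraction of rows assigned to the class with the closest centroid."""
    hits = sum(min(model, key=lambda c: abs(model[c] - x)) == y for x, y in rows)
    return hits / len(rows)

def split_estimate(rows, train_fraction=0.66, seed=1):
    """Train on a random 66% of the rows, evaluate on the remaining 34%."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return accuracy(fit(shuffled[:cut]), shuffled[cut:])

def cross_val_estimate(rows, k=10, seed=1):
    """k-fold cross validation: every row is used for testing exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(fit(train), test))
    return mean(scores)

rows = make_data(150, random.Random(7))
print("66%/34% split:", split_estimate(rows))
print("10-fold CV:", cross_val_estimate(rows))
print("3 x 10-fold CV:", mean(cross_val_estimate(rows, seed=s) for s in range(3)))
```

In every scheme the model is scored only on rows that were held out of fitting. Repeating cross validation with different shuffles averages over more partitions of the data, which tends to give a steadier estimate.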
You can often increase the accuracy of your model on unseen data by decreasing its complexity.
In the case of decision trees, for example, you can prune the tree (delete leaves) after training. This will decrease the amount of specialisation to the specific training dataset and increase generalisation on unseen data. If you are using regression, for example, you can use regularisation to constrain the complexity (magnitude of the coefficients) during the training process.
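As a small illustration of the regression case only, the sketch below assumes one particular form of regularisation: an L2 (ridge) penalty on a single-coefficient model with no intercept. The data values and the function name ridge_coefficient are made up for the example.

```python
def ridge_coefficient(xs, ys, penalty):
    """Closed-form slope for y ~ w * x (no intercept) with an L2 penalty on w.

    Minimizes sum((y - w * x) ** 2) + penalty * w ** 2.
    """
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)

# Made-up training data, roughly y = 2 * x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger penalty pulls the coefficient towards zero.
for penalty in (0.0, 1.0, 10.0, 100.0):
    print(penalty, ridge_coefficient(xs, ys, penalty))
```

With a penalty of zero this is ordinary least squares. As the penalty grows, the coefficient shrinks in magnitude, which is the constraint on complexity described above. How much penalty is appropriate is itself something to choose using held-out data.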
In this post you learned the important framework of phrasing the development of a predictive model as an approximation of an unknown ideal discrimination function.
Under this framework you learned that evaluating the model on training data alone is insufficient. You learned that the best and most meaningful way to evaluate the ability of a predictive model to generalize is to evaluate it on unseen data.
This intuition provides the basis for why it is critical to use train/test splits, cross validation and, ideally, multiple runs of cross validation in your test harness when evaluating predictive models.