The test options you use when evaluating machine learning algorithms can mean the difference between over-learning, a mediocre result and a usable state-of-the-art result that you can confidently shout from the roof tops (you really do feel like doing that sometimes).
In this post you will discover the standard test options you can use in your algorithm evaluation test harness and how to choose the right options next time.
The root of the difficulty in choosing the right test options is randomness. Most (almost all) machine learning algorithms use randomness in some way. The randomness may be explicit in the algorithm or may be in the sample of the data selected to train the algorithm.
This does not mean that the algorithms produce random results, it means that they produce results with some noise or variance. We call this type of limited variance, stochastic and the algorithms that exploit it, stochastic algorithms.
Train and Test on Same Data
If you have a dataset, you may want to train the model on the dataset and then report the results of the model on that dataset. That’s how good the model is, right?
The problem with this approach of evaluating algorithms is that you indeed will know the performance of the algorithm on the dataset, but do not have any indication of how the algorithm will perform on data that the model was not trained on (so-called unseen data).
This matters, only if you want to use the model to make predictions on unseen data.
A simple way to use one dataset to both train and estimate the performance of the algorithm on unseen data is to split the dataset. You take the dataset, and split it into a training dataset and a test dataset. For example, you randomly select 66% of the instances for training and use the remaining 34% as a test dataset.
The algorithm is run on the training dataset and a model is created and assessed on the test dataset and you get a performance accuracy, lets say 87% classification accuracy.
Spit tests are fast and great when you have a lot of data or when training a model is expensive (it resources or time). A split test on a very very large dataset can produce an accurate estimate of the actual performance of the algorithm.
How good is the algorithm on the data? Can we confidently say it can achieve an accuracy of 87%?
A problem is that if we spit the training dataset again into a different 66%/34% split, we would get a different result from our algorithm. This is called model variance.
Multiple Split Tests
A solution to our problem with the split test getting different results on different splits of the dataset is to reduce the variance of the random process and do it many times. We can collect the results from a fair number of runs (say 10) and take the average.
For example, let’s say we split our dataset 66%/34%, ran our algorithm and got an accuracy and we did this 10 times with 10 different splits. We might have 10 accuracy scores as follows: 87, 87, 88, 89, 88, 86, 88, 87, 88, 87.
The average performance of our model is 87.5, with a standard deviation of about 0.85.
A problem with multiple split tests is that it is possible that some data instance are never included for training or testing, where as others may be selected multiple times. The effect is that this may skew results and may not give an meaningful idea of the accuracy of the algorithm.
A solution to the problem of ensuring each instance is used for training and testing an equal number of times while reducing the variance of an accuracy score is to use cross validation. Specifically k-fold cross validation, where k is the number of splits to make in the dataset.
For example, let’s choose a value of k=10 (very common). This will split the dataset into 10 parts (10 folds) and the algorithm will be run 10 times. Each time the algorithm is run, it will be trained on 90% of the data and tested on 10%, and each run of the algorithm will change which 10% of the data the algorithm is tested on.
In this example, each data instance will be used as a training instance exactly 9 times and as a test instance 1 time. The accuracy will not be a mean and a standard deviation, but instead will be an exact accuracy score of how many correct predictions were made.
The k-fold cross validation method is the go-to method for evaluating the performance of an algorithm on a dataset. You want to choose k-values that give you a good sized training and test dataset for your algorithm. Not too disproportionate (too large or small for training or test). If you have a lot of data, you may may have to resort to either sampling the data or reverting to a split test.
Cross validation does give an unbiased estimation of the algorithms performance on unseen data, but what if the algorithm itself uses randomness. The algorithm would produce different results for the same training data each time it was trained with a different random number seed (start of the sequence of pseudo-randomness). Cross validation does not account for variance in the algorithm’s predictions.
Another point of concern is that cross validation itself uses randomness to decide how to split the dataset into k folds. Cross validation does not estimate how the algorithm perform with different sets of folds.
This only matters if you want to understand how robust the algorithm is on the dataset.
Multiple Cross Validation
A way to account for the variance in the algorithm itself is to run cross validation multiple times and take the mean and the standard deviation of the algorithm accuracy from each run.
This will will give you an an estimate of the performance of the algorithm on the dataset and an estimation of how robust (the size of the standard deviation) the performance is.
If you have one mean and standard deviation for algorithm A and another mean and standard deviation for algorithm B and they differ (for example, algorithm A has a higher accuracy), how do you know if the difference is meaningful?
This only matters if you want to compare the results between algorithms.
The results from multiple runs of k-fold cross validation is a list of numbers. We like to summarize these numbers using the mean and standard deviation. You can think of these numbers as a sample from an underlying population. A statistical significance test answers the question: are two samples drawn from the same population? (no difference). If the answer is “yes”, then, even if the mean and standard deviations differ, the difference can be said to be not statistically significant.
We can use statistical significance tests to give meaning to the differences (or lack there of) between algorithm results when using multiple runs (like multiple runs of k-fold cross validation with different random number seeds). This can when we want to make accurate claims about results (algorithm A was better than algorithm B and the difference was statistically significant)
This is not the end of the story, because there are different statistical significance tests (parametric and nonparametric) and parameters to those tests (p-value). I’m going to draw the line here because if you have followed me this far, you now know enough about selecting test options to produce rigorous (publishable!) results.
In this post you have discovered the difference between the main test options available to you when designing a test harness to evaluate machine learning algorithms.
Specifically, you learned the utility and problems with:
- Training and testing on the same dataset
- Split tests
- Multiple split tests
- Cross validation
- Multiple cross validation
- Statistical significance
When in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical significance tests when you want to meaningfully compare algorithms on your dataset.