The test options you use when evaluating machine learning algorithms can mean the difference between over-learning, a mediocre result, and a usable state-of-the-art result that you can confidently shout from the rooftops (you really do feel like doing that sometimes).

In this post you will discover the standard test options you can use in your algorithm evaluation test harness and how to choose the right options next time.

## Randomness

The root of the difficulty in choosing the right test options is randomness. Most (almost all) machine learning algorithms use randomness in some way. The randomness may be explicit in the algorithm or may be in the sample of the data selected to train the algorithm.

This does not mean that the algorithms produce random results; it means that they produce results with some noise or variance. We call this type of limited variance stochastic, and the algorithms that exploit it stochastic algorithms.

## Train and Test on Same Data

If you have a dataset, you may want to train the model on the dataset and then report the results of the model on that dataset. That’s how good the model is, right?

The problem with this approach to evaluating algorithms is that you will indeed know the performance of the algorithm on the dataset, but you will have no indication of how the algorithm will perform on data that the model was not trained on (so-called unseen data).

This matters only if you want to use the model to make predictions on unseen data.
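To make the problem concrete, here is a minimal sketch using a deliberately silly "model" that simply memorises its training data. The data is hypothetical and the labels are pure noise, so there is nothing real to learn, yet the score on the training data still looks perfect.

```python
import random

rng = random.Random(1)

# Hypothetical data: each row is (input, label) and the labels are random noise.
train = [(i, rng.randint(0, 1)) for i in range(100)]
unseen = [(i, rng.randint(0, 1)) for i in range(100, 200)]

# "Training" just memorises every (input, label) pair.
memory = dict(train)


def predict(x):
    # Falls back to class 0 for any input the model has not seen before.
    return memory.get(x, 0)


def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)


train_accuracy = accuracy(train)
unseen_accuracy = accuracy(unseen)
print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on unseen data:   {unseen_accuracy:.2f}")
```

The training score says nothing about the second number, which is the one you care about if the model will be used to make predictions.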

## Split Test

A simple way to use one dataset to both train and estimate the performance of the algorithm on unseen data is to split the dataset. You take the dataset, and split it into a training dataset and a test dataset. For example, you randomly select 66% of the instances for training and use the remaining 34% as a test dataset.

The algorithm is run on the training dataset to create a model, and the model is then assessed on the test dataset to give a performance score, let's say 87% classification accuracy.

Split tests are fast and great when you have a lot of data or when training a model is expensive (in resources or time). A split test on a very large dataset can produce an accurate estimate of the actual performance of the algorithm.
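A random 66%/34% split can be sketched in a few lines. The helper name `split_dataset` and its parameters are my own for illustration, and the list of integers is just a stand-in for 100 data instances.

```python
import random


def split_dataset(rows, train_fraction=0.66, seed=None):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 data instances
train, test = split_dataset(rows, seed=1)
print(len(train), len(test))  # 66 34
```

Passing a different `seed` gives a different split, which is worth keeping in mind when you read a single accuracy score from a split test.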

How good is the algorithm on the data? Can we confidently say it can achieve an accuracy of 87%?

A problem is that if we split the dataset again into a different 66%/34% split, we would get a different result from our algorithm. This is called model variance.

## Multiple Split Tests

A solution to the problem of the split test giving different results on different splits of the dataset is to repeat the random process many times, which reduces the variance of the estimate. We can collect the results from a fair number of runs (say 10) and take the average.

For example, let’s say we split our dataset 66%/34%, ran our algorithm and recorded the accuracy, and then repeated this for a total of 10 different splits. We might have 10 accuracy scores as follows: 87, 87, 88, 89, 88, 86, 88, 87, 88, 87.

The average performance of our model is 87.5%, with a sample standard deviation of about 0.85 percentage points.
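You can check the summary of the ten scores above with the standard library. Note that `statistics.stdev` is the sample standard deviation (it divides by n - 1), which is the version that gives roughly 0.85 here.

```python
import statistics

scores = [87, 87, 88, 89, 88, 86, 88, 87, 88, 87]

mean_accuracy = statistics.mean(scores)
std_accuracy = statistics.stdev(scores)  # sample standard deviation (n - 1)

print(f"{mean_accuracy:.1f} +/- {std_accuracy:.2f}")  # 87.5 +/- 0.85
```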

A problem with multiple split tests is that some data instances may never be included for training or testing, whereas others may be selected multiple times. This may skew the results and may not give a meaningful idea of the accuracy of the algorithm.
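A quick way to see the uneven usage is to count how often each instance lands in the test set across the repeated splits. This is only a sketch: the sizes (100 instances, 10 splits, 34 test instances per split) and the seed are assumptions chosen to match the running example.

```python
import random
from collections import Counter

n_instances, n_splits, test_size = 100, 10, 34
rng = random.Random(7)

# Start every instance at zero so instances that are never tested still show up.
test_counts = Counter({i: 0 for i in range(n_instances)})
for _ in range(n_splits):
    test_counts.update(rng.sample(range(n_instances), test_size))

# How many instances were tested 0 times, 1 time, 2 times, and so on.
usage = Counter(test_counts.values())
for times_tested, how_many in sorted(usage.items()):
    print(f"tested {times_tested} time(s): {how_many} instances")
```

The counts cannot all be equal here, because 340 test slots do not divide evenly across 100 instances.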

## Cross Validation

A solution to the problem of ensuring each instance is used for training and testing an equal number of times, while also reducing the variance of the accuracy score, is to use cross validation. Specifically, we use k-fold cross validation, where k is the number of splits to make in the dataset.

For example, let’s choose a value of k=10 (very common). This will split the dataset into 10 parts (10 folds) and the algorithm will be run 10 times. Each time the algorithm is run, it will be trained on 90% of the data and tested on 10%, and each run of the algorithm will change which 10% of the data the algorithm is tested on.

In this example, each data instance will be used as a training instance exactly 9 times and as a test instance exactly once. The accuracy is then not a mean and a standard deviation, but a single score: the number of correct predictions across all folds divided by the number of instances.
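The bookkeeping behind k-fold cross validation can be sketched as a small generator of index lists. The function name `k_fold_indices` is hypothetical; libraries such as scikit-learn provide their own versions.

```python
import random


def k_fold_indices(n_instances, k=10, seed=None):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for held_out in folds:
        test_set = set(held_out)
        train = [i for i in indices if i not in test_set]
        yield train, held_out


# Count how often each instance is used for training and for testing.
train_uses = [0] * 100
test_uses = [0] * 100
for train, test in k_fold_indices(100, k=10, seed=1):
    for i in train:
        train_uses[i] += 1
    for i in test:
        test_uses[i] += 1

print(set(train_uses), set(test_uses))  # {9} {1}
```

Every instance is trained on 9 times and tested on once, whichever seed is used to shuffle the data.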

The k-fold cross validation method is the go-to method for evaluating the performance of an algorithm on a dataset. You want to choose a value of k that gives you good-sized training and test datasets for your algorithm, neither of them disproportionately large or small. If you have a lot of data, you may have to resort to either sampling the data or reverting to a split test.

Cross validation gives a nearly unbiased estimate of the algorithm's performance on unseen data, but what if the algorithm itself uses randomness? The algorithm would produce different results for the same training data each time it was trained with a different random number seed (the start of the sequence of pseudo-random numbers). Cross validation does not account for variance in the algorithm’s predictions.

Another point of concern is that cross validation itself uses randomness to decide how to split the dataset into k folds. A single run of cross validation does not estimate how the algorithm performs with different sets of folds.

This only matters if you want to understand how robust the algorithm is on the dataset.

## Multiple Cross Validation

A way to account for the variance in the algorithm itself is to run cross validation multiple times and take the mean and the standard deviation of the algorithm accuracy from each run.

This will give you an estimate of the performance of the algorithm on the dataset and an estimate of how robust that performance is (the size of the standard deviation).
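Here is a minimal sketch of repeating 10-fold cross validation with different seeds. Everything in it is an assumption for illustration: the data is synthetic, and the "algorithm" is a toy threshold classifier. Because that toy model is deterministic, the spread below comes only from the fold assignment; a stochastic algorithm would add its own variance on top.

```python
import random
import statistics


def make_data(n=200, seed=0):
    """Hypothetical two-class data with one noisy numeric feature."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    # Class 1 is shifted to the right of class 0.
    return [(rng.gauss(2.0 * y, 1.0), y) for y in labels]


def fit_threshold(train):
    """Toy model: a cut-off halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train if y == 0)
    mean1 = statistics.mean(x for x, y in train if y == 1)
    return (mean0 + mean1) / 2


def cross_val_accuracy(data, k, seed):
    """One run of k-fold cross validation, scored over all held-out instances."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    correct = 0
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train = [data[i] for i in indices if i not in held_out]
        threshold = fit_threshold(train)
        correct += sum(
            (data[i][0] > threshold) == (data[i][1] == 1) for i in test_idx
        )
    return correct / len(data)


data = make_data()
run_scores = [cross_val_accuracy(data, k=10, seed=s) for s in range(10)]
print(f"mean accuracy: {statistics.mean(run_scores):.3f}")
print(f"std deviation: {statistics.stdev(run_scores):.3f}")
```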

If you have one mean and standard deviation for algorithm A and another mean and standard deviation for algorithm B and they differ (for example, algorithm A has a higher accuracy), how do you know if the difference is meaningful?

This only matters if you want to compare the results between algorithms.

## Statistical Significance

A solution to comparing algorithm performance measures when using multiple runs of k-fold cross validation is to use statistical significance tests (like the Student’s t-test).

The results from multiple runs of k-fold cross validation are a list of numbers. We like to summarize these numbers using the mean and standard deviation. You can think of these numbers as a sample from an underlying population. A statistical significance test answers the question: are two samples drawn from the same population (that is, is there no difference)? If the answer is “yes”, then, even if the means and standard deviations differ, the difference can be said to be not statistically significant.

We can use statistical significance tests to give meaning to the differences (or lack thereof) between algorithm results when using multiple runs (like multiple runs of k-fold cross validation with different random number seeds). This can help when we want to make accurate claims about results (algorithm A was better than algorithm B and the difference was statistically significant).

This is not the end of the story, because there are different statistical significance tests (parametric and nonparametric) and parameters to those tests (such as the significance level that the p-value is compared against). I’m going to draw the line here because if you have followed me this far, you now know enough about selecting test options to produce rigorous (publishable!) results.
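As a rough sketch of what such a test looks like, the code below computes a two-sample Student's t statistic by hand. The scores for algorithm A reuse the ten numbers from the multiple split example, the scores for algorithm B are made up, and the scores are treated as independent samples, which is a simplifying assumption. The critical value 2.101 is the standard two-sided table value for a 0.05 significance level with 18 degrees of freedom.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


scores_a = [87, 87, 88, 89, 88, 86, 88, 87, 88, 87]  # from the example above
scores_b = [85, 86, 86, 87, 85, 86, 87, 86, 85, 86]  # hypothetical algorithm B

t_stat = pooled_t_statistic(scores_a, scores_b)

# Two-sided critical value for alpha = 0.05 with 10 + 10 - 2 = 18 degrees of freedom.
CRITICAL_T = 2.101
significant = abs(t_stat) > CRITICAL_T
print(f"t = {t_stat:.2f}, significant at the 0.05 level: {significant}")
```

In practice you would more likely call a library routine such as `scipy.stats.ttest_ind`, which also reports the p-value directly.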

## Summary

In this post you have discovered the difference between the main test options available to you when designing a test harness to evaluate machine learning algorithms.

Specifically, you learned the utility of, and the problems with:

- Training and testing on the same dataset
- Split tests
- Multiple split tests
- Cross validation
- Multiple cross validation
- Statistical significance

When in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical significance tests when you want to meaningfully compare algorithms on your dataset.

Great post Jason. Very clear and easy to follow.

Thanks Mickael.

Great post.

About the last part, are there situations where performance comparison for different algorithms needs to involve non parametric test?

Could you please tell us some examples for that if they exists?

Thanks for your help.

The distribution of results is often Gaussian (normally distributed).

If you believe this is not the case for some reason, I would advise you to use a non-parametric test. It is a good question and I cannot think of anything offhand. Often I will use a non-parametric test to take all guessing out of the equation.

Thanks;)

Great posts! Thanks!

You suggest running multiple k-fold cross validation w/ statistical significance tests would help draw conclusions of algorithm comparison.

I wonder how to set aside a fold for validating and tuning a model if I use k-fold cross-validation.

Is it ok to run multiple splits of training/validation/test datasets w/ statistical significance tests, even though some data may occur multiple times in the same split of datasets?

Thanks for your help!

Hi Dr Jason,

This is another great post. I have learned a lot from this post.

Keep up.

Regards,

Surajit

Hi Jason – I’m new to ML and Python. Just learnt your algorithm for implementing naive Bayes from scratch in Python. First, thanks for explaining it beautifully.

Also learnt implementation of the same algorithm with scikit-learn.

Need some help in understanding the terms precision, recall and f1 score in classification_report.

Jason … your posts are simple and great

Thanks.

thanks

You’re welcome anbu.

Thank you very much for this magnificent post, Jason. Please keep up the good work.

You’re very welcome Mohammed.

I’m trying to understand how K-fold CV could help evaluate a neural network model trained using backpropagation. As I understand it, in K-fold CV, every batch is used to train a single neural network model. Would that mean that at the end of every batch the model is more or less overfitted to that particular batch? So in a 10-fold CV, at the end of the last batch, we have a model that is potentially overfitted to the last batch, where it could perform well on the last batch's 90% training data, but sucked on the test 10%?

And would this work on time series data?

Anyway, thank you for sharing, your posts have been very useful to me.

Hi hew.

Yes, 10 different models are trained and evaluated. We report the average score and throw the models away. Cross validation is only used to estimate the performance of the model on unseen data, not to train the model.

If we are happy with the performance, we can then train the model on the entire training dataset and begin to use it.

Time series is difficult to use with cross validation. Normally, I use a train/test split and a sliding window to evaluate models on time series data.

Hi Jason,

Thank you for your reply. I'm not sure if I understand you completely. To estimate the performance of our model on unseen data? Is the definition of model here the ML method used? What does it mean if we score well on the k-fold exercise? All I would assume is that the dataset we use contains evenly spread common features that can be used for prediction on our dataset. It doesn't prove its effectiveness against oos. Perhaps the dataset that we selected for the exercise is biased in a way that the features we found never work on an oos dataset.

Great comment.

The goal of predictive modeling is to develop a model from a sample of data of the domain to perform well on data it has not seen before. To make predictions on unseen data.

If this is not the goal, then you are doing stats and developing a descriptive rather than a predictive model and trying to understand the domain.

Knowing how well a model does on unseen data is a hard problem. We can hold back a sample and use that to estimate the skill of the model. A more advanced technique is to do this many times – cross validation.

Cross validation does not prove a model or modeling methodology (data prep + model) will do well, but it gives us confidence that it will do well.

Indeed, we must be very concerned with the quality of our data sample, otherwise the ability to generalize will be compromised.

Does that help?

Hi Jason,

May I ask you what the alternative is for fitting the data when we have very few data points (say 15 to 20 only) but the number of predictors is large (e.g. 9 to 10)?

That is a harder type of problem Poonam.

You may be better served with small n statistics (e.g. statistical methods). You just don’t have enough observations for machine learning methods to learn from and generalize.