How to Evaluate Machine Learning Algorithms

Once you have defined your problem and prepared your data you need to apply machine learning algorithms to the data in order to solve your problem. You can spend a lot of time choosing, running and tuning algorithms. You want to make sure you are using your time effectively to get closer to your goal.

In this post you will step through a process to rapidly test algorithms and discover whether or not there is structure in your problem for the algorithms to learn and which algorithms are effective.

Photo attributed to NASA Webb Telescope, some rights reserved

Test Harness

You need to define a test harness. The test harness is the data you will train and test an algorithm against and the performance measure you will use to assess its performance. It is important to define your test harness well so that you can focus on evaluating different algorithms and thinking deeply about the problem.

The goal of the test harness is to be able to quickly and consistently test algorithms against a fair representation of the problem being solved. The outcome of testing multiple algorithms against the harness will be an estimation of how a variety of algorithms perform on the problem against a chosen performance measure. You will know which algorithms might be worth tuning on the problem and which should not be considered further.

The results will also give you an indication of how learnable the problem is. If a variety of different learning algorithms universally perform poorly on the problem, it may indicate a lack of structure available for the algorithms to learn. This may be because there actually is a lack of learnable structure in the selected data, or it may be an opportunity to try different transforms to expose the structure to the learning algorithms.

Performance Measure

The performance measure is the way you want to evaluate a solution to the problem. It is the measurement you will make of the predictions made by a trained model on the test dataset.

Performance measures are typically specialized to the class of problem you are working with, for example classification, regression, and clustering. Many standard performance measures will give you a score that is meaningful to your problem domain. For example, classification accuracy for classification problems is the number of correct predictions divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
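The accuracy calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a library function: the name classification_accuracy and the small spam/ham label lists are made up for the example.

```python
def classification_accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    if not actual:
        raise ValueError("need at least one prediction")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    # correct predictions / total predictions, scaled to a percentage
    return correct * 100.0 / len(actual)


# Hypothetical labels, used only to exercise the function.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
accuracy = classification_accuracy(actual, predicted)
print(f"Accuracy: {accuracy:.1f}%")
```

Four of the five hypothetical predictions match, so the score is 80%.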

You may also want a more detailed breakdown of performance. For example, on a spam classification problem you may want to know about the false positives, because each one is a good email that is marked as spam and may never be read.
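One way to get that breakdown for a two-class problem is to count each kind of outcome separately. The sketch below assumes "spam" is the positive class; the function name confusion_counts and the example labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted spam: correct if it really was spam, else a false positive.
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted ham: a miss if it really was spam, else a true negative.
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Hypothetical labels, used only to exercise the function.
actual = ["spam", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham", "ham"]
counts = confusion_counts(actual, predicted)

# Share of good email that was wrongly marked as spam.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

Here accuracy alone would hide the fact that one in four good emails was flagged as spam.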

There are many standard performance measures to choose from. You rarely have to devise a new performance measure yourself as you can generally find or adapt one that best captures the requirements of the problem being solved. Look to similar problems you uncovered and at the performance measures used to see if any can be adopted.

Test and Train Datasets

From the transformed data, you will need to select a test set and a training set. An algorithm will be trained on the training dataset and will be evaluated against the test set. This may be as simple as selecting a random split of data (66% for training, 34% for testing) or may involve more complicated sampling methods.
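A random 66%/34% split might look like the following sketch. The function name train_test_split and the fixed seed are choices made for this example (the seed only makes the split repeatable between runs).

```python
import random


def train_test_split(rows, train_fraction=0.66, seed=1):
    """Shuffle a copy of the rows and cut it into train and test lists."""
    rng = random.Random(seed)      # private generator, so the split is repeatable
    shuffled = list(rows)          # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in dataset: 100 rows identified by their index.
rows = list(range(100))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the model never sees a test row during training.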

A trained model is not exposed to the test dataset during training, and any predictions made on that dataset are designed to be indicative of the performance of the model in general. As such, you want to make sure your selected datasets are representative of the problem you are solving.

Cross Validation

A more sophisticated approach than using a test and train dataset is to use the entire transformed dataset to train and test a given algorithm. A method you could use in your test harness that does this is called cross validation.

It first involves separating the dataset into a number of equally sized groups of instances (called folds). The model is then trained on all folds except one that was left out, and the prepared model is tested on that left-out fold. The process is repeated so that each fold gets an opportunity to be left out and act as the test dataset. Finally, the performance measures are averaged across all folds to estimate the capability of the algorithm on the problem.

For example, a 3-fold cross validation would involve training and testing a model 3 times:

  • #1: Train on folds 1+2, test on fold 3
  • #2: Train on folds 1+3, test on fold 2
  • #3: Train on folds 2+3, test on fold 1

The number of folds can vary based on the size of your dataset, but common numbers are 3, 5, 7 and 10 folds. The goal is to have a good balance between the size and representation of data in your train and test sets.
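The cross validation procedure described above can be sketched as follows. All names here (k_fold_indices, cross_validate, majority_class_accuracy) are hypothetical, and the "model" is a trivial majority-class predictor chosen only so the example runs without any library.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k=3, seed=1):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k, train_and_score, seed=1):
    """Leave each fold out once, score on it, and average the scores."""
    folds = k_fold_indices(len(rows), k, seed)
    scores = []
    for held_out in range(k):
        test = [rows[i] for i in folds[held_out]]
        train = []
        for fold_number, fold in enumerate(folds):
            if fold_number != held_out:
                train.extend(rows[i] for i in fold)
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores), scores


def majority_class_accuracy(train, test):
    """Toy model: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    correct = sum(1 for _, label in test if label == majority)
    return correct * 100.0 / len(test)


# Hypothetical dataset of (features, label) rows.
rows = [((i,), "a" if i % 3 else "b") for i in range(30)]
mean_score, fold_scores = cross_validate(rows, 3, majority_class_accuracy)
print(f"Mean accuracy over {len(fold_scores)} folds: {mean_score:.1f}%")
```

Any function that takes a training set and a test set and returns a score can be passed in place of majority_class_accuracy.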

When you’re just getting started, stick with a simple split of train and test data (such as 66%/34%) and move onto cross validation once you have more confidence.

Testing Algorithms

Once you have defined a test harness you are happy with, it is time to spot check a variety of machine learning algorithms. Spot checking is useful because it allows you to very quickly see if there is any learnable structure in the data and estimate which algorithms may be effective on the problem.

Spot checking also helps you work out any issues in your test harness and make sure the chosen performance measure is appropriate.

The best first algorithm to spot check is random prediction. Plug in a random number generator to generate predictions in the appropriate range. This should be the worst “algorithm result” you achieve and will be the baseline against which all improvements can be assessed.
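A random baseline for a classification problem could be sketched like this. The function name random_baseline and the spam/ham labels are assumptions for the example; for a regression problem you would instead draw numbers between the smallest and largest values seen in the training data.

```python
import random


def random_baseline(train_labels, n_predictions, seed=1):
    """Predict by picking uniformly among the classes seen in training."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))   # the "appropriate range" for predictions
    return [rng.choice(classes) for _ in range(n_predictions)]


# Hypothetical labels, used only to exercise the function.
train_labels = ["spam", "ham", "ham", "spam", "ham", "ham"]
test_labels = ["ham", "spam", "ham", "ham"]

predictions = random_baseline(train_labels, len(test_labels))
matches = sum(1 for a, p in zip(test_labels, predictions) if a == p)
baseline = matches * 100.0 / len(test_labels)
print(f"Random baseline accuracy: {baseline:.1f}%")
```

Any algorithm that cannot beat this score on your harness is not learning anything useful from the data.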

Select 5-10 standard algorithms that are appropriate for your problem and run them through your test harness. By standard algorithms, I mean popular methods with no special configurations. Appropriate for your problem means, for example, that the algorithms can handle regression if you have a regression problem.

Choose methods from the groupings of algorithms we have already reviewed. I like to include a diverse mix and have 10-20 different algorithms drawn from a diverse range of algorithm types. Depending on the library I am using, I may spot check 50 or more popular methods to flush out promising methods quickly.
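The spot-checking loop itself can be very small. The sketch below runs every algorithm against the same split and the same performance measure, then ranks the results. It uses two toy algorithms written by hand (a majority-class "zero rule" and a 1-nearest-neighbor classifier) and a synthetic dataset, all of which are assumptions made so the example is self-contained; in practice you would plug in methods from your library of choice.

```python
import random
from collections import Counter


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) * 100.0 / len(actual)


def zero_rule(train, test_features):
    """Always predict the most common label in the training data."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return [majority for _ in test_features]


def one_nearest_neighbor(train, test_features):
    """Predict the label of the closest training row (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    predictions = []
    for features in test_features:
        nearest = min(train, key=lambda row: sq_dist(row[0], features))
        predictions.append(nearest[1])
    return predictions


def spot_check(algorithms, train, test):
    """Score every algorithm on the same split and rank them best first."""
    test_features = [features for features, _ in test]
    actual = [label for _, label in test]
    results = {name: accuracy(actual, algorithm(train, test_features))
               for name, algorithm in algorithms.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Synthetic two-class dataset: the label depends on the sign of x + y.
rng = random.Random(1)
rows = []
for _ in range(200):
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows.append(((x, y), "pos" if x + y > 0 else "neg"))
train, test = rows[:132], rows[132:]   # 66% / 34% split of already-random rows

algorithms = {"zero rule": zero_rule, "1-nearest neighbor": one_nearest_neighbor}
ranking = spot_check(algorithms, train, test)
for name, score in ranking:
    print(f"{name}: {score:.1f}%")
```

Because every algorithm shares the same function signature, adding another method to the spot check is a one-line change to the algorithms dictionary.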

If you want to run a lot of methods, you may have to revisit data preparation and reduce the size of your selected dataset. This may reduce your confidence in the results, so test with various data set sizes. You may like to use a smaller size dataset for algorithm spot checking and a fuller dataset for algorithm tuning.
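To see how much a smaller dataset changes the results, you can score the same algorithm on training sets of increasing size against a fixed test set. The following is a rough sketch under the same assumptions as before: a synthetic dataset, a hand-written 1-nearest-neighbor scorer, and hypothetical function names.

```python
import random


def nearest_neighbor_accuracy(train, test):
    """Train-and-score function: 1-nearest-neighbor accuracy as a percentage."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    correct = 0
    for features, label in test:
        nearest = min(train, key=lambda row: sq_dist(row[0], features))
        correct += nearest[1] == label
    return correct * 100.0 / len(test)


def scores_by_training_size(train, test, sizes, train_and_score):
    """Score the same algorithm on the first `size` training rows for each size."""
    return {size: train_and_score(train[:size], test) for size in sizes}


# Synthetic two-class dataset, generated in random order.
rng = random.Random(1)
rows = []
for _ in range(250):
    x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows.append(((x, y), "pos" if x + y > 0 else "neg"))
train, test = rows[:200], rows[200:]

sizes = [10, 50, 200]
scores = scores_by_training_size(train, test, sizes, nearest_neighbor_accuracy)
for size in sizes:
    print(f"{size} training rows: {scores[size]:.1f}%")
```

If the ranking of your algorithms stays stable as the training set shrinks, the smaller dataset is probably safe to use for spot checking.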


In this post you learned about the importance of setting up a trustworthy test harness that involves the selection of test and training datasets and a performance measure meaningful to your problem.

You also learned about the strategy of spot checking a diverse range of machine learning algorithms on your problem using your test harness. You discovered that this strategy can quickly highlight whether there is learnable structure in your dataset (and if not you can revisit data preparation) and which algorithms perform generally well on the problem (that may be candidates for further investigation and tuning).


If you are looking to dive deeper into this topic, you can learn more from the resources below.

20 Responses to How to Evaluate Machine Learning Algorithms

  1. Mark Simi April 17, 2014 at 5:11 am #

    Hey Jason,

    I’m really interested in thinking about this in conjunction with parameter tuning. I reached out to some of my colleagues to bounce this off of them, would love to hear your thoughts:

    “My question: is it possible to use this approach in conjunction with a grid search to tune these models’ parameters? I feel like this would be good in theory, but am not sure how to reconcile models’ varying parameters simultaneously.

    The other approach which I could take: using a test harness to ID the best out of the box performing algo, selecting that for my project, then using a grid search to tune.”

    Is either approach similar to your workflow? If not, what is your thinking on this?

  2. jasonb April 17, 2014 at 5:38 am #

    Hey Mark, great question.

    For a small problem, I’ll spot check algorithms out of the box and maybe a few popular combinations of parameters. For bigger problems and competitions, I’ll grid search all the popular algorithms and then “super-grid search” those that do well.

    It’s a time/ROI trade-off that you have to make depending on the size and importance of the problem.

  3. Mark Simi April 17, 2014 at 6:23 am #

    Okay, so assuming that I have a large level problem that can stand the time. I wasn’t even sure if it was easy / possible to use grid on a bunch of popular algos simultaneously. Guess I’d just need to evaluate the parameters correctly and build a structure to consider differences for grid searching.

    Thanks for the reply, Jason. Much appreciated.

    • jasonb April 17, 2014 at 7:42 am #

      Hey Mark,

      Platforms like scikit-learn and R have parameter tuning capabilities (grid/random search). You will want to use cross validation to get a fair estimation of a given model’s robustness.

      You can largely automate the process as you can constrain the parameter search (at least for the spot check phase) to the bounds of well known parameters (from papers/experience).

      Spot checking is not about getting the best result, just indications on where the best result may be – which methods are generally better at picking out the structure in the problem. Once found, you can go to town on that smaller subset with tuning, ensembles and advanced feature engineering.

  4. Mikle February 10, 2015 at 7:28 am #

    Hi Jason.

    Could you tell me if there is a way of testing a Machine Learning algorithm as a black box?
    Let’s consider an example where I have two applications which use Machine Learning algorithms to solve a certain task. I don’t have access to the applications’ source code and don’t even know which algorithms are used inside. The only thing I can do is play with the input data and verify the output. I need to test and choose the application that is better suited to solving the task.
    At first glance, the testing approach is straightforward: compare the accuracy of the two applications using the same data set. However, I am concerned about the fact that the accuracy of the applications might vary. For example: application A performs better on a small amount of training data; on a larger amount of training data application B gives better results than A; with further growth of training data application B starts suffering from, let’s say, overfitting.
    It is clear that the longer I test the better insight I have. But I can’t test forever; I need to stop at some point. So my question is really how to determine a sufficient size of data set to ensure that, using that set, I’ve chosen the best application in the long term?


    • Jason Brownlee February 10, 2015 at 9:30 am #

      It’s a hard open problem Mikle. Estimating accuracy using cross validation is the go and pushing the limits of the two methods. A good test I like is does accuracy degrade gracefully, or scale with dataset size.

  5. Kleyn Guerreiro May 3, 2016 at 9:06 pm #

    Cross validation, as far as I remember, includes a validation step as well, together with train and test….

  6. Yohahn May 6, 2016 at 11:59 pm #

    Hi Jason,

    I had a question about Cross-Validation. Suppose you created a hold-out test set prior to training. Then ran k-fold CV on the training set and generated a measure of performance. What does this measure of performance indicate? Is it still indicative of the model’s generalisation capabilities, or is it indicative of the quality of the training process?



  7. Tom Anderson August 7, 2016 at 3:57 pm #

    A more recent article on this blog provides an example of a test harness in Python:

  8. Rahul November 14, 2016 at 5:26 pm #

    HI Jason,

    Are there any testing techniques/methods to test an algorithm? If yes, can you please name a few or direct me to a place where I can find them?


    • Jason Brownlee November 15, 2016 at 7:48 am #

      Hi Rahul, if you mean testing for implementation correctness, I would suggest unit tests and regression tests.

  9. vaibhav kumar February 22, 2017 at 5:22 pm #

    Thanks, Jason. It was a very informative post. I am new to machine learning and gaining information through posts like yours. I have data on accident occurrences in an area, with around 200 entries along with location, time and severity information. I want to predict the incidents for each hour in the future. What is your suggestion on this? Is the volume of data enough to do prediction, and what approach should I use?

    Thanks in advance,


  10. Lehyu April 26, 2017 at 6:54 pm #

    Hi Jason, I had a question about cross-validation. Suppose I have a train set O and a test data set P for regression problem. There are too few raw features to support the forecast, so I want to add some statistic features (like mean, median). And my question is when should I generate statistic features? Before or after cross-validation?

    Suppose I split train set O into train set T and validation set V.
    1) If we add statistic features before cross-validation, I think we may have a good fit on validation set V, but it may have high variance on test data set.

    2) If we add statistic features after cross-validation, I think we may lose some information so that it may cause a high bias on test data set (the size of train set O is very small).

    Am I right? What should I do?

    Thanks in advance!

    • Lehyu April 26, 2017 at 6:59 pm #

      generate statistic features before cross-validation means that I use the whole train data set O to generate statistic features, and then do cross-validation.

      generate statistic features after cross-validation means that I split O into T and V, generate statistic features using T and apply transformation to V.

    • Jason Brownlee April 27, 2017 at 8:39 am #

      Great question, generate new features within each cross validation fold using only the training data of the fold and not the hold out test data of the fold.

      • Lehyu April 27, 2017 at 11:29 am #

        It would be a time-consuming work. But I think it will fulfill my needs. Thanks, Jason.

  11. Franco Ubaudi July 3, 2017 at 1:59 pm #

    In your article titled “How to Evaluate Machine Learning Algorithms”, there is the following paragraph:
    If you want to run a lot of methods, you may have to revisit data preparation and reduce the size of your selected dataset. This may reduce your confidence in the results, so test with various data set sizes. You may like to use a smaller size dataset for algorithm spot checking and a fuller dataset for algorithm tuning.

    which is found in the section titled “Testing Algorithms”.

    To me there seems to be some relevant but undescribed material, for example, the potential need to reduce the size of the dataset; why could this be the case? Why would testing with various data set sizes be needed, etc?

    • Jason Brownlee July 6, 2017 at 9:58 am #

      The size of the dataset may influence the quality of the function approximated by the ML algorithm.
