How to Evaluate Machine Learning Algorithms

Once you have defined your problem and prepared your data, you need to apply machine learning algorithms to the data in order to solve your problem.

You can spend a lot of time choosing, running and tuning algorithms. You want to make sure you are using your time effectively to get closer to your goal.

In this post you will step through a process to rapidly test algorithms and discover whether or not there is structure in your problem for the algorithms to learn and which algorithms are effective.

Photo attributed to NASA Webb Telescope, some rights reserved

Test Harness

You need to define a test harness. The test harness is the data you will train and test an algorithm against and the performance measure you will use to assess its performance. It is important to define your test harness well so that you can focus on evaluating different algorithms and thinking deeply about the problem.

The goal of the test harness is to be able to quickly and consistently test algorithms against a fair representation of the problem being solved. The outcome of testing multiple algorithms against the harness will be an estimation of how a variety of algorithms perform on the problem against a chosen performance measure. You will know which algorithms might be worth tuning on the problem and which should not be considered further.
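As a rough illustration of the idea, a harness can be as small as one function that takes an algorithm, a train set, a test set and a performance measure. The sketch below is in Python. The names evaluate, accuracy and majority_class, and the (features, label) row layout, are assumptions made for this example and are not prescribed by the post.

```python
def evaluate(algorithm, train, test, metric):
    """Train on `train`, predict on `test`, and score with `metric`.

    Assumed conventions (hypothetical): each row is (features, label) and
    an algorithm is a function that takes training rows and returns a
    predict(features) function.
    """
    predict = algorithm(train)
    actual = [label for _, label in test]
    predicted = [predict(features) for features, _ in test]
    return metric(actual, predicted)


def accuracy(actual, predicted):
    """Classification accuracy as a percentage."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def majority_class(train):
    """Toy stand-in algorithm: always predict the most common training label."""
    labels = [label for _, label in train]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda features: most_common


train = [([0.1], 0), ([0.2], 0), ([0.9], 1)]
test = [([0.15], 0), ([0.8], 1)]
score = evaluate(majority_class, train, test, accuracy)
print(score)
```

Because the data and the measure are fixed inside the harness, swapping in a different algorithm is a one-argument change, which is what makes comparisons between algorithms fair.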

The results will also give you an indication of how learnable the problem is. If a variety of different learning algorithms universally perform poorly on the problem, it may be an indication of a lack of structure available to algorithms to learn. This may be because there actually is a lack of learnable structure in the selected data or it may be an opportunity to try different transforms to expose the structure to the learning algorithms.

Performance Measure

The performance measure is the way you want to evaluate a solution to the problem. It is the measurement you will make of the predictions made by a trained model on the test dataset.

Performance measures are typically specialized to the class of problem you are working with, for example classification, regression, and clustering. Many standard performance measures will give you a score that is meaningful to your problem domain. For example, classification accuracy for classification (total correct predictions divided by the total predictions made, multiplied by 100 to turn it into a percentage).
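The accuracy calculation can be sketched in a few lines of Python. The function name and the toy labels are made up for illustration.

```python
def classification_accuracy(actual, predicted):
    """Percentage of predictions that match the true labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


# Toy labels (hypothetical): 4 of the 5 predictions are correct.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
result = classification_accuracy(actual, predicted)
print(result)
```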

You may also want a more detailed breakdown of performance. For example, on a spam classification problem you may want to know about the false positives, because these are good emails that will be marked as spam and never read.
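One common way to get that breakdown is to count true and false positives and negatives separately. The sketch below assumes a two-class problem where "spam" is the positive class; the function name and the data are hypothetical.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


actual = ["spam", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "spam", "ham", "ham", "ham"]
counts = confusion_counts(actual, predicted)

# Share of good email that was wrongly marked as spam.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

Here the single false positive is the good email in the third position that was marked as spam.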

There are many standard performance measures to choose from. You rarely have to devise a new performance measure yourself as you can generally find or adapt one that best captures the requirements of the problem being solved. Look to similar problems you uncovered and at the performance measures used to see if any can be adopted.

Test and Train Datasets

From the transformed data, you will need to select a test set and a training set. An algorithm will be trained on the training dataset and will be evaluated against the test set. This may be as simple as selecting a random split of data (66% for training, 34% for testing) or may involve more complicated sampling methods.
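A simple random 66%/34% split might look like the following sketch. The function name, the default fraction and the fixed seed are assumptions for the example; a fixed seed is used only so that the split is repeatable between runs.

```python
import random


def train_test_split(rows, train_fraction=0.66, seed=1):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 prepared data rows
train, test = train_test_split(rows)
print(len(train), len(test))
```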

A trained model is not exposed to the test dataset during training, and any predictions made on that dataset are designed to be indicative of the performance of the model in general. As such, you want to make sure your selected datasets are representative of the problem you are solving.

Cross Validation

A more sophisticated approach than using a test and train dataset is to use the entire transformed dataset to train and test a given algorithm. A method you could use in your test harness that does this is called cross validation.

It first involves separating the dataset into a number of equally sized groups of instances (called folds). The model is then trained on all folds except one that was left out, and the prepared model is tested on that left-out fold. The process is repeated so that each fold gets an opportunity at being left out and acting as the test dataset. Finally, the performance measures are averaged across all folds to estimate the capability of the algorithm on the problem.

For example, a 3-fold cross validation would involve training and testing a model 3 times:

  • #1: Train on folds 1+2, test on fold 3
  • #2: Train on folds 1+3, test on fold 2
  • #3: Train on folds 2+3, test on fold 1

The number of folds can vary based on the size of your dataset, but common numbers are 3, 5, 7 and 10 folds. The goal is to have a good balance between the size and representation of data in your train and test sets.
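The fold procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: k_fold_indices, cross_validate and the toy scoring function are hypothetical names, and the folds are only of near-equal size when the row count does not divide evenly by k.

```python
import random


def k_fold_indices(n_rows, k, seed=1):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k, fit_and_score, seed=1):
    """Hold out each fold once, train on the rest, and average the scores."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [rows[i] for i in range(len(rows)) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(fit_and_score(train, test))
    return sum(scores) / k


def majority_accuracy(train, test):
    """Toy scorer: predict the most common training label, return accuracy %."""
    most_common = max(sorted(set(train)), key=train.count)
    correct = sum(1 for label in test if label == most_common)
    return 100.0 * correct / len(test)


labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # rows reduced to bare labels for brevity
folds = k_fold_indices(len(labels), k=3)
mean_score = cross_validate(labels, 3, majority_accuracy)
print(folds, mean_score)
```

Every row is used for testing exactly once and for training k-1 times, which is why the averaged score uses the whole dataset.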

When you’re just getting started, stick with a simple split of train and test data (such as 66%/34%) and move on to cross validation once you have more confidence.

Testing Algorithms

When starting with a problem and having defined a test harness you are happy with, it is time to spot check a variety of machine learning algorithms. Spot checking is useful because it allows you to very quickly see if there is any learnable structure in the data and estimate which algorithms may be effective on the problem.

Spot checking also helps you work out any issues in your test harness and make sure the chosen performance measure is appropriate.

The best first algorithm to spot check is a random one. Plug in a random number generator to generate predictions in the appropriate range. This should be the worst “algorithm result” you achieve and will be the measure by which all improvements can be assessed.
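For a classification problem, a random baseline could be sketched as below, drawing each prediction uniformly from the classes seen in the training labels. The function name and the toy labels are assumptions; for regression you would instead draw numbers from the observed range of the target.

```python
import random


def random_baseline(train_labels, n_predictions, seed=1):
    """Predict by picking uniformly at random among the known classes."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n_predictions)]


train_labels = ["ham", "spam", "ham", "ham"]
test_labels = ["ham", "ham", "spam", "ham", "spam", "ham"]

predicted = random_baseline(train_labels, len(test_labels))
correct = sum(1 for a, p in zip(test_labels, predicted) if a == p)
baseline_accuracy = 100.0 * correct / len(test_labels)
print(predicted, baseline_accuracy)
```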

Select 5-10 standard algorithms that are appropriate for your problem and run them through your test harness. By standard algorithms, I mean popular methods with no special configurations. Appropriate for your problem means, for example, that the algorithms can handle regression if you have a regression problem.

Choose methods from the groupings of algorithms we have already reviewed. I like to include a diverse mix and have 10-20 different algorithms drawn from a diverse range of algorithm types. Depending on the library I am using, I may spot check 50+ popular methods to flush out promising methods quickly.
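A spot check is essentially a loop over named algorithms that reuses the same data and the same performance measure. The sketch below uses two tiny hand-written algorithms as stand-ins; in practice you would plug in the standard methods from your library of choice. All names and data here are hypothetical.

```python
import math


def majority_class(train):
    """Stand-in algorithm: always predict the most common training label."""
    labels = [label for _, label in train]
    most_common = max(sorted(set(labels)), key=labels.count)
    return lambda features: most_common


def one_nearest_neighbor(train):
    """Stand-in algorithm: predict the label of the closest training row."""
    def predict(features):
        nearest = min(train, key=lambda row: math.dist(row[0], features))
        return nearest[1]
    return predict


def accuracy(actual, predicted):
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def spot_check(algorithms, train, test):
    """Score every algorithm on the same data; return results best first."""
    actual = [label for _, label in test]
    results = {}
    for name, algorithm in algorithms.items():
        predict = algorithm(train)
        results[name] = accuracy(actual, [predict(x) for x, _ in test])
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


train = [([1.0, 1.0], 0), ([1.2, 0.9], 0), ([0.8, 1.1], 0),
         ([3.0, 3.1], 1), ([3.2, 2.9], 1)]
test = [([1.1, 1.0], 0), ([2.9, 3.0], 1), ([3.1, 3.2], 1), ([0.9, 0.8], 0)]

ranked = spot_check(
    {"majority class": majority_class,
     "1-nearest neighbor": one_nearest_neighbor},
    train, test)
for name, score in ranked:
    print(f"{name}: {score:.1f}%")
```

The ranking, not the absolute scores, is the useful output at this stage: it suggests which methods deserve tuning time.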

If you want to run a lot of methods, you may have to revisit data preparation and reduce the size of your selected dataset. This may reduce your confidence in the results, so test with various data set sizes. You may like to use a smaller size dataset for algorithm spot checking and a fuller dataset for algorithm tuning.
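One way to test with various dataset sizes is to hold the test set fixed and score the same algorithm on progressively larger training samples. The sketch below does this on synthetic data with a hand-written 1-nearest-neighbor classifier; the data generator, the sizes and all names are assumptions chosen only to make the example self-contained.

```python
import math
import random


def make_rows(n, seed=1):
    """Synthetic two-class data: class 1 is shifted by +2 on both axes."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        rows.append(([rng.gauss(2 * label, 1), rng.gauss(2 * label, 1)], label))
    return rows


def one_nn_accuracy(train, test):
    """Accuracy (%) of a 1-nearest-neighbor classifier fitted on `train`."""
    correct = 0
    for features, label in test:
        nearest = min(train, key=lambda row: math.dist(row[0], features))
        correct += nearest[1] == label
    return 100.0 * correct / len(test)


rows = make_rows(300)
test, pool = rows[:100], rows[100:]  # fixed test set, remaining rows for training

scores_by_size = {}
for size in (10, 50, 200):
    scores_by_size[size] = one_nn_accuracy(pool[:size], test)
    print(size, scores_by_size[size])
```

If the scores are still changing noticeably between the larger sizes, results from the small sample should be treated with more caution.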


In this post you learned about the importance of setting up a trustworthy test harness that involves the selection of test and training datasets and a performance measure meaningful to your problem.

You also learned about the strategy of spot checking a diverse range of machine learning algorithms on your problem using your test harness. You discovered that this strategy can quickly highlight whether there is learnable structure in your dataset (and if not you can revisit data preparation) and which algorithms perform generally well on the problem (that may be candidates for further investigation and tuning).



47 Responses to How to Evaluate Machine Learning Algorithms

  1. Mark Simi, April 17, 2014 at 5:11 am

    Hey Jason,

    I’m really interested in thinking about this in conjunction with parameter tuning. I reached out to some of my colleagues to bounce this off of them, would love to hear your thoughts:

    “My question: is it possible to use this approach in conjunction with a grid search to tune these models’ parameters? I feel like this would be good in theory, but am not sure how to reconcile models’ varying parameters simultaneously.

    The other approach which I could take: using a test harness to ID the best out of the box performing algo, selecting that for my project, then using a grid search to tune.”

    Is either approach similar to your workflow? If not, what is your thinking on this?

  2. jasonb, April 17, 2014 at 5:38 am

    Hey Mark, great question.

    For a small problem, I’ll spot check algorithms out of the box and maybe a few popular combinations of parameters. For bigger problems and competitions, I’ll grid search all the popular algorithms and then “super-grid search” those that do well.

    It’s a time/ROI trade-off that you have to make depending on the size and importance of the problem.

  3. Mark Simi, April 17, 2014 at 6:23 am

    Okay, so assuming that I have a large level problem that can stand the time. I wasn’t even sure if it was easy / possible to use grid on a bunch of popular algos simultaneously. Guess I’d just need to evaluate the parameters correctly and build a structure to consider differences for grid searching.

    Thanks for the reply, Jason. Much appreciated.

    • jasonb, April 17, 2014 at 7:42 am

      Hey Mark,

      Platforms like scikit-learn and R have parameter tuning capabilities (grid/random search). You will want to use cross validation to get a fair estimation of a given model’s robustness.

      You can largely automate the process, as you can constrain the parameter search (at least for the spot check phase) to the bounds of well known parameters (from papers/experience).

      Spot checking is not about getting the best result, just indications on where the best result may be – which methods are generally better at picking out the structure in the problem. Once found, you can go to town on that smaller subset with tuning, ensembles and advanced feature engineering.

  4. Mikle, February 10, 2015 at 7:28 am

    Hi Jason.

    Could you tell me if there is a way of testing a Machine Learning algorithm as a black box?
    Let’s consider an example where I have two applications which use Machine Learning algorithms to solve a certain task. I don’t have access to the applications’ source code and don’t even know which algorithms are used inside. The only thing I can do is to play with the input data and verify the output. I need to test and choose the application that is better suited to solving the task.
    At first glance, the testing approach is straightforward: compare the accuracy of the two applications using the same data set. However, I am concerned about the fact that the accuracy of the applications might vary. For example: application A performs better on a small amount of training data; on a larger amount of training data application B gives better results than A; with further growth of training data application B starts suffering from, let’s say, overfitting.
    It is clear that the longer I test the better insight I have. But I can’t test forever; I need to stop at some point. So, my question is more how to determine a sufficient size of data set to ensure that, using that set, I’ve chosen the best application in the long term?


    • Jason Brownlee, February 10, 2015 at 9:30 am

      It’s a hard open problem, Mikle. Estimating accuracy using cross validation is the way to go, along with pushing the limits of the two methods. A good test I like is whether accuracy degrades gracefully, or scales with dataset size.

  5. Kleyn Guerreiro, May 3, 2016 at 9:06 pm

    Cross validation, as far as I remember, includes a validation step as well, together with train and test….

  6. Yohahn, May 6, 2016 at 11:59 pm

    Hi Jason,

    I had a question about Cross-Validation. Suppose you created a hold-out test set prior to training. Then ran k-fold CV on the training set and generated a measure of performance. What does this measure of performance indicate? Is it still indicative of the model’s generalisation capabilities, or is it indicative of the quality of the training process?



  7. Tom Anderson, August 7, 2016 at 3:57 pm

    A more recent article on this blog provides an example of a test harness in Python:

  8. Rahul, November 14, 2016 at 5:26 pm

    Hi Jason,

    Are there any testing techniques/methods to test an algorithm? If yes, can you please name a few or direct me to a path where I can find them?


    • Jason Brownlee, November 15, 2016 at 7:48 am

      Hi Rahul, if you mean testing for implementation correctness, I would suggest unit tests and regression tests.

  9. vaibhav kumar, February 22, 2017 at 5:22 pm

    Thanks, Jason. It was a very informative post. I am new to machine learning and gaining information through posts like yours. I have data on accident occurrences in an area. It has around 200 entries, along with location, time and severity information. I want to predict the incidents for each hour in the future. What is your suggestion on this? Is the volume of data enough to do prediction, and what approach should I use?

    Thanks in advance,


  10. Lehyu, April 26, 2017 at 6:54 pm

    Hi Jason, I had a question about cross-validation. Suppose I have a train set O and a test data set P for regression problem. There are too few raw features to support the forecast, so I want to add some statistic features (like mean, median). And my question is when should I generate statistic features? Before or after cross-validation?

    Suppose I split train set O into train set T and validation set V.
    1) If we add statistic features before cross-validation, I think we may have a good fit on validation set V, but it may have high variance on test data set.

    2) If we add statistic features after cross-validation, I think we may lose some information so that it may cause a high bias on test data set (the size of train set O is very small).

    Am I right? What should I do?

    Thanks in advance!

    • Lehyu, April 26, 2017 at 6:59 pm

      Generating statistic features before cross-validation means that I use the whole train data set O to generate statistic features, and then do cross-validation.

      Generating statistic features after cross-validation means that I split O into T and V, generate statistic features using T and apply the transformation to V.

    • Jason Brownlee, April 27, 2017 at 8:39 am

      Great question. Generate new features within each cross validation fold using only the training portion of the fold, not the held-out test portion of the fold.

      • Lehyu, April 27, 2017 at 11:29 am

        It would be a time-consuming work. But I think it will fulfill my needs. Thanks, Jason.

  11. Franco Ubaudi, July 3, 2017 at 1:59 pm

    In your article titled “How to Evaluate Machine Learning Algorithms”, there is the following paragraph:
    If you want to run a lot of methods, you may have to revisit data preparation and reduce the size of your selected dataset. This may reduce your confidence in the results, so test with various data set sizes. You may like to use a smaller size dataset for algorithm spot checking and a fuller dataset for algorithm tuning.

    which is found in the section titled “Testing Algorithms”.

    To me there seems to be some relevant but undescribed material, for example, the potential need to reduce the size of the dataset; why could this be the case? Why would testing with various data set sizes be needed, etc.?

    • Jason Brownlee, July 6, 2017 at 9:58 am

      The size of the dataset may influence the quality of the function approximated by the ML algorithm.

  12. shahid khan, August 31, 2017 at 6:38 pm

    I have been applying principal component analysis to the continuous data variables from various datasets. My question is how effectively PCA works for classification models, since some of the data features may contain categorical or boolean values?

    Shall I consider PCA, or are there any other methodologies to implement for classification algorithms?

    • Jason Brownlee, September 1, 2017 at 6:44 am

      PCA is a data preparation method for supervised learning. It is not a supervised learning algorithm in and of itself.

  13. shahid khan, August 31, 2017 at 9:59 pm

    Hi Jason,

    How is PCA going to influence data sets that have categorical variables? Is it effective enough to use PCA to improve the performance of the algorithms?

    • Jason Brownlee, September 1, 2017 at 6:46 am

      As far as I can remember, PCA is for real-valued data only.

  14. Shabnam, November 30, 2017 at 11:46 am

    Thanks Jason for your blog. I have a question about:
    “When you’re just getting started, stick with a simple split of train and test data (such as 66%/34%) and move onto cross validation once you have more confidence”

    Are we choosing either train/validation/test or cross-validation method? Or can we mix these together?
    For example, suppose we put aside 20% of the data for testing and run cross-validation on the remaining 80%. Is that okay? Or should we run cross-validation on the whole data set?
    Which one is a better approach?

    I was doing an experiment on a set of data. I had a score of ~0.8 by using cross-validation on 80% of the data (I used randomized search for MLP in sklearn (*) ).
    However, when I used the same parameters (*) for the remaining 20% of the data, I got a prediction score of about 0.2. That is why I am not sure if it is better to mix these methods or not.

  15. Mohammad Ehtasham Billah, February 2, 2018 at 8:57 pm

    I saw in one of David Langer’s videos that he applied 10-fold cross validation 10 times on different models using the caret package. I mean he was cross-validating different models and at the same time grid searching for hyperparameters (i.e. using method=”some_model” within the train() function). My question is, is it OK to combine the two steps (cross-validation with grid search) like he did? Or should I do it in two separate steps: first cross validating different models, and then grid searching for hyperparameters? Thanks

  16. Jesús Martínez, February 9, 2018 at 11:22 am

    A very good article, Jason.

    I think that creating a test harness is a great fit for some functional programming principles, mainly higher-order functions, where each step (data sampling, algorithm and evaluation metric) could be delegated to some other function. Of course, it wouldn’t be a pure function, because multiple runs under the same conditions could yield different results as different subsections of the data are sampled.

    What do you think about it? Do you think a programming pattern such as template method would be beneficial? Thanks in advance!

    • Jason Brownlee, February 10, 2018 at 8:50 am

      Interesting, I have not thought about how this might fit into programming patterns.

      Perhaps try it and see how you go.

  17. Michael Jahrer, May 7, 2018 at 11:27 am

    Very interesting I left my comment on techtarget

    Automation top 3 worldwide tools, we go deep in accuracy and time consumption benchmark.

  18. Rom, August 28, 2020 at 2:04 am

    If my training set accuracy on a certain learner is greater than the CV accuracy then surely my model is overfitting. To combat this problem of overfitting I perform tuning on my training data using nested CV. I take the best hyperparameter combination that I obtain for the learner after nested CV, train it on the whole dataset (train+test) and observe the accuracy (or loss). Further, I also perform a bootstrap resampling procedure to approximate how my model will perform on unseen data (generalization error). After I’m confident about my learner, I use it to predict on the untouched test data. Am I following the right procedure? Please correct me if I’m wrong at any step.

    • Jason Brownlee, August 28, 2020 at 6:51 am

      I don’t agree that you must be overfitting. If you perform model selection via out of sample performance, overfitting becomes moot / irrelevant.

      • Rom, August 30, 2020 at 2:08 am

        Suppose it is overfitting. Then what are the correct steps after I have performed nested CV before I can finalize my model?

        • Jason Brownlee, August 30, 2020 at 6:45 am

          Overfitting is a possible cause of poor model performance, after you have chosen a model. It is a diagnostic you can use to understand models that perform poorly or models that learn incrementally like neural nets.

          You can simply choose a model that has better out of sample performance and it will not have “overfit”.

          Nevertheless, if you believe the poor performance you are seeing is due to overfitting, it can be addressed in many ways, most commonly via regularization, such as adding a penalty to the loss function based on model complexity.

  19. sachin, February 25, 2021 at 3:18 pm

    Hi Jason,
    I would like you to correct the diff between svm and knn algorithm

    • Jason Brownlee, February 26, 2021 at 4:55 am

      Sorry, I don’t understand. Can you elaborate please?

  20. Sue, March 6, 2021 at 1:09 am

    I have one quick question that applies to both approaches that you listed above. After splitting the data into train set and test set, I wonder if it is ok to stop training the model based on the performance on the test set (literally selecting the model that performs the best in the test set, while the test data does not feed into the training process). I have encountered a paper (cannot remember which) that only mention splitting the data into 80 (train):20(holdout/test) and doesn’t further explain dividing the train set again into train & validation set. Thanks Jason!

  21. Mark, April 26, 2021 at 9:36 am

    Hi Jason,

    I’m new to ML and a little confused. My understanding was that you train your model and fit it on training set (using 80/20 split) and then test for accuracy (y_predict vs y_test). In the end, you can use K-fold cross validation to further determine accuracy of your model.

    However, your post makes it sound like you should either use train/test split method OR a cross validation method to train the model.

    “A more sophisticated approach than using a test and train dataset is to use the entire transformed dataset to train and test a given algorithm. A method you could use in your test harness that does this is called cross validation”

    My understanding was cross validation is only used for understanding a model’s skill – not officially training the model with it.

    Could you please clarify?


    • Jason Brownlee, April 27, 2021 at 5:12 am

      You would use train/test split or k-fold cv to evaluate a model on a dataset, not both.

      If you have a lot of data, you could hold some data back as a final test set and use the rest for model selection. This would be ideal, but rarely possible given we often don’t have enough data in practice.

      To assess a model with k-fold cross-validation, we must fit k models, you can learn more here:

  22. Cian, June 1, 2021 at 4:28 pm

    Hi Jason,

    By performing the train-test split (and cross-validation) after data transformation are you not guilty of some data leakage?

    Also, if you are just using cross-validation to compare algorithms with each other, is data leakage forgivable in that context? Provided you have kept a test set outside that has not been involved in any of the data prep/transformation process, for a proper performance metric.

