How To Compare Machine Learning Algorithms in Python with scikit-learn

It is important to compare the performance of multiple different machine learning algorithms consistently.

In this post you will discover how you can create a test harness to compare multiple different machine learning algorithms in Python with scikit-learn.

You can use this test harness as a template on your own machine learning problems and add more and different algorithms to compare.

Let’s get started.

  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

How To Compare Machine Learning Algorithms in Python with scikit-learn
Photo by Michael Knight, some rights reserved.

Choose The Best Machine Learning Model

How do you choose the best model for your problem?

When you work on a machine learning project, you often end up with multiple good models to choose from. Each model will have different performance characteristics.

Using resampling methods like cross validation, you can get an estimate for how accurate each model may be on unseen data. You need to be able to use these estimates to choose one or two best models from the suite of models that you have created.

Compare Machine Learning Models Carefully

When you have a new dataset, it is a good idea to visualize the data using different techniques in order to look at the data from different perspectives.

The same idea applies to model selection. You should use a number of different ways of looking at the estimated accuracy of your machine learning algorithms in order to choose the one or two to finalize.

A way to do this is to use different visualization methods to show the average accuracy, variance and other properties of the distribution of model accuracies.

In the next section you will discover exactly how you can do that in Python with scikit-learn.

Compare Machine Learning Algorithms Consistently

The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data.

You can achieve this by forcing each algorithm to be evaluated on a consistent test harness.

In the example below 6 different algorithms are compared:

  1. Logistic Regression
  2. Linear Discriminant Analysis
  3. K-Nearest Neighbors
  4. Classification and Regression Trees
  5. Naive Bayes
  6. Support Vector Machines

The problem is a standard binary classification dataset from the UCI machine learning repository called the Pima Indians onset of diabetes problem (update: download from here). The problem has two classes and eight numeric input variables of varying scales.

The 10-fold cross validation procedure is used to evaluate each algorithm. Importantly, it is configured with the same random seed for every algorithm, so that the same splits of the training data are used and each algorithm is evaluated in precisely the same way.

Each algorithm is given a short name, useful for summarizing results afterward.
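
A minimal sketch of such a harness is given here. Treat it as an illustration rather than the post's exact listing: the variable names (`models`, `seed`, `kfold`, `cv_results`, `scoring`) follow those quoted in the comments below, the data is a synthetic stand-in generated with `make_classification` (an assumption, chosen so the sketch runs without downloading the Pima Indians file), and `max_iter=1000` is added only to help logistic regression converge. Recent scikit-learn releases reject a `random_state` on `KFold` unless `shuffle=True`, so shuffling is switched on explicitly.

```python
# Sketch of a consistent test harness for comparing classifiers.
# ASSUMPTION: synthetic data stands in for the Pima Indians diabetes file.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Two classes and eight numeric inputs, mirroring the shape of the problem.
X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# Each algorithm gets a short name for the summary.
models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier()),
    ("NB", GaussianNB()),
    ("SVM", SVC()),
]

seed = 7
scoring = "accuracy"
results = []
names = []
for name, model in models:
    # Same seed for every model, so every model sees the same 10 splits.
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```

To try it on your own problem, replace the `make_classification` call with code that loads your data into `X` and `Y`, and add or remove entries in `models`.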

Running the example prints each algorithm’s short name, its mean accuracy and the standard deviation of its accuracy.

The example also provides a box and whisker plot showing the spread of the accuracy scores across each cross validation fold for each algorithm.
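
A box and whisker plot of this kind could be drawn roughly as follows. This is a sketch under stated assumptions: `plot_comparison` is a hypothetical helper name, only two algorithms are scored to keep it short, and the data is again synthetic.

```python
# Sketch: box and whisker plot of per-fold accuracy for each algorithm.
# ASSUMPTION: plot_comparison is an illustrative helper; data is synthetic.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def plot_comparison(results, names):
    """Draw one box per algorithm from its list of per-fold scores."""
    fig, ax = plt.subplots()
    fig.suptitle("Algorithm Comparison")
    ax.boxplot(results)          # one box per entry in results
    ax.set_xticklabels(names)    # label each box with the short name
    ax.set_ylabel("Accuracy")
    return fig


X, Y = make_classification(n_samples=300, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

names = ["LR", "KNN"]
candidates = [LogisticRegression(max_iter=1000), KNeighborsClassifier()]
results = [cross_val_score(m, X, Y, cv=kfold, scoring="accuracy")
           for m in candidates]

fig = plot_comparison(results, names)
# plt.show()  # uncomment to display the figure in an interactive session
```

The box shows the middle 50% of the fold scores and the line inside it the median, so a model with a high, narrow box is both accurate and stable across folds.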

Compare Machine Learning Algorithms

These results suggest that both logistic regression and linear discriminant analysis are perhaps worthy of further study on this problem.


In this post you discovered how to evaluate multiple different machine learning algorithms on a dataset in Python with scikit-learn.

You learned how to both use the same test harness to evaluate the algorithms and how to summarize the results both numerically and using a box and whisker plot.

You can use this recipe as a template for evaluating multiple algorithms on your own problems.

Do you have any questions about evaluating machine learning algorithms in Python or about this post? Ask your questions in the comments below and I will do my best to answer them.

83 Responses to How To Compare Machine Learning Algorithms in Python with scikit-learn

  1. Sundar June 1, 2016 at 8:32 pm #

    Great article! Just a quick question, what do you think is the best method? First optimise hyper-parameters and then compare the algorithms or vice versa. Thanks.

    • Jason Brownlee June 2, 2016 at 6:09 am #

      I recommend first spot checking algorithms and comparing them, followed by tuning.

      • Sundar June 2, 2016 at 7:04 pm #

        Thank you for your recommendation…

        • Eric September 5, 2017 at 11:31 am #

          You should not just rely on this. Accuracy is only one portion of a model’s accuracy. Depending on the investigation desired, you should look at Precision and Recall because accuracy may be only a tiny portion. Also, why have you not taken an approach with ANOVA, or the Wilcoxon Test, major tests within the realm of data science and widely accepted? Additionally, 5×2 cross-validation should be done, not 10-fold (this is widely accepted). Last, what I find completely missing in this is that you have not discussed how to actually arrive at a statistically-significant decision. This is not a good representation.

          • Response to Eric February 6, 2018 at 6:01 am #

            Eric, nobody cares about your phd, whatever it is you did it in. Also, stop calling it ANOVA when all you’re doing is a regression, it doesn’t make you any smarter. And lastly, nobody cares about your phd and your academic research, this is a machine learning article for Data Scientists.

          • Jason Brownlee February 6, 2018 at 9:25 am #

            There’s always room for improvement and we have to start/stop somewhere in an introductory piece.
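
The question of statistical significance raised in this thread can be explored with a paired test on the per-fold scores. The sketch below applies the Wilcoxon signed-rank test from SciPy to two models scored on identical folds. The data is synthetic and the two models are arbitrary choices. Cross validation folds share training data, so the scores are not independent and the p-value is best read as a rough guide, not a definitive verdict.

```python
# Sketch: paired Wilcoxon signed-rank test on per-fold accuracy.
# ASSUMPTION: synthetic data; the two models are arbitrary examples.
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)

# The same splitter is used for both models, so the scores pair up by fold.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, Y, cv=kfold)
scores_b = cross_val_score(KNeighborsClassifier(), X, Y, cv=kfold)

statistic, p_value = wilcoxon(scores_a, scores_b)
print("LR mean=%.3f, KNN mean=%.3f, p=%.3f"
      % (scores_a.mean(), scores_b.mean(), p_value))
```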

      • Reean November 28, 2017 at 11:09 am #

        Hi Dr Brownlee,

        Big fan of your tutorials. I have a question regarding the compare first, then tune approach. When we plot the models on a box plot and select the best, this is all based on the default model settings, right? But once we have tuned the different settings of a given model, would the predictive performance be different?

        So the not so good models might even outperform the best model from the first-glance box plot if we had trained them more carefully. In this sense, wouldn’t it be better to train every model separately, and then, say, bring all of them together and plot their ROC curves to see which performs best?

  2. dash June 3, 2016 at 7:37 am #

    Accuracy is easily readable but, in my opinion, it should be replaced by AUC: AUC is “consistent” and “more discriminating” than accuracy (Ling et al. 2003).

    • Eric September 5, 2017 at 11:34 am #

      AUC can be examined on an ROC or Precision vs Recall curve. What should happen is weights based on misclassification, in a confusion matrix. In this case, you can tune a model to avoid certain misclassifications, as some may be more valuable to avoid. If you have zero care about which misclassification occurs, ROC is a decent metric for how you should tune parameters. ROC should be examined for hyperparameter decisions.

      • Response to Eric February 6, 2018 at 6:01 am #

        So go write your own article
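
The metrics discussed in this thread can be collected in one pass with `cross_validate`, which accepts a list of scorer names. A small sketch, assuming synthetic data and logistic regression as the example model:

```python
# Sketch: accuracy, ROC AUC, precision and recall from the same folds.
# ASSUMPTION: synthetic data; logistic regression is just an example model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

scoring = ["accuracy", "roc_auc", "precision", "recall"]
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, Y,
                            cv=kfold, scoring=scoring)

# Results are stored under "test_<scorer name>", one value per fold.
for metric in scoring:
    scores = cv_results["test_" + metric]
    print("%s: %.3f (%.3f)" % (metric, scores.mean(), scores.std()))
```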

  3. Tom Anderson August 7, 2016 at 5:09 pm #

    In the code, “seed = 7” is hard coded. Shouldn’t we have a different seed for each fold?

    • Tom Anderson August 7, 2016 at 10:44 pm #

      To answer my own question, it appears that each model is trained and tested for all folds before moving on to the next model. The seed applies to the initial state so for the above, the 10 folds will all be different from one another, but the same data split for each of the 10 folds will be presented to each algorithm.

    • Jason Brownlee August 8, 2016 at 5:41 am #

      Yes Tom, the seed ensures we have the same sequence of random numbers. The random numbers ensure we have a random split of the data into the k folds.
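
The behaviour described in this thread can be checked directly. In the sketch below, `fold_test_indices` is a hypothetical helper that lists which rows land in each test fold; calling it twice with the same seed should give identical folds, while the folds within one run never overlap.

```python
# Sketch: the same seed reproduces the same folds.
# ASSUMPTION: fold_test_indices is an illustrative helper, not library code.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy rows


def fold_test_indices(seed):
    """Return the test-row indices of each fold for a given seed."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in kfold.split(X)]


first_run = fold_test_indices(7)
second_run = fold_test_indices(7)

print(first_run == second_run)  # identical splits for identical seeds
all_test_rows = sorted(i for fold in first_run for i in fold)
print(all_test_rows == list(range(10)))  # each row is tested exactly once
```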

  4. Guillaume Martin October 28, 2016 at 4:39 pm #

    Thank you for sharing.
    I had to tweak the code a little to make it work with scikit-learn 0.18.
    The cross_validation module is deprecated. It’s replaced by model_selection.
    The KFold parameters have changed too:
    0.17: cross_validation.KFold(n, n_folds=3, shuffle=False, random_state=None)
    0.18: model_selection.KFold(n_splits=3, shuffle=False, random_state=None)

    I have a question: is it ok to train the classifier before adding it to the list? Like:
    lr = LogisticRegression().fit(X_train, y_train)

    • Jason Brownlee October 29, 2016 at 7:36 am #

      Thanks Guillaume, I will look at updating the example. I have recently updated all of my books to support the new sklearn.

      No, the structure of the example fits and evaluates each model in turn. Your example essentially unrolls the for loop.
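
One way to see why fitting beforehand is unnecessary: `cross_val_score` works on fresh clones of the estimator for each fold, so a model fitted in advance produces the same scores as an unfitted one. A small sketch on synthetic data, using a decision tree with a fixed `random_state` so the comparison is deterministic:

```python
# Sketch: fitting a model before cross validation does not change the scores.
# ASSUMPTION: synthetic data; the decision tree is an arbitrary example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=300, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

model = DecisionTreeClassifier(random_state=0)
fresh_scores = cross_val_score(model, X, Y, cv=kfold)

model.fit(X, Y)  # fit on everything first, as asked in the question
prefit_scores = cross_val_score(model, X, Y, cv=kfold)

print(np.array_equal(fresh_scores, prefit_scores))
```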

  5. Angela December 23, 2016 at 9:15 am #

    What a great article! I learned so much from your writing 🙂
    I also read your other article comparing different algorithms in R, and I noticed that you used a lot more techniques in that article:
    • Table Summary
    • Box and Whisker Plots
    • Density Plots
    • Dot Plots
    • Parallel Plots
    • Scatterplot Matrix
    • Pairwise xyPlots
    • Statistical Significance Tests
    I was wondering why you did not provide the same techniques in this Python article? Is it because these functions are more readily available in R?
    Thanks so much!

    • Jason Brownlee December 23, 2016 at 10:18 am #

      Great question.

      These capabilities are available in Python, but are spread through the scipy and statsmodels libs rather than directly available in sklearn.

      R is a more technical platform for more technical types, I tend to go into more detail in those examples.

      Is this something you would like to see more of Angela?

  6. Suleyman Sahal December 31, 2016 at 9:14 am #

    Hi Jason. Thank you for these great articles. I also read another article of yours. What I wonder about is the proper validation method. Should we conduct k-fold or repeated n*k-fold cross validation? I recently read a journal article where researchers compare around 50 models under a 5*2-fold setting, suggesting it is more robust. How should we proceed when comparing models?

    • Jason Brownlee January 1, 2017 at 5:23 am #

      Hi Suleyman,

      Using k-fold cross validation is a gold standard. The specific configuration is problem specific, but common configurations of 3, 5 and 10 folds do well on many datasets.

      On very large datasets, a train-test split may be sufficient. For complex or small datasets, if you have the resources, repeated k-fold cross validation is preferred. Often, we would like to use repeated k-fold cross validation, but the computational expense is too high.

      There is no “best”, just lots of options to tune for your given problem.

      How do you choose?

      Balance your constraints (amount of data, resources, time, ..) against your requirements (robustness of result).

      • Suleyman Sahal January 1, 2017 at 5:54 am #

        Thank you Jason. That is what I have been doing.

        • Eric September 5, 2017 at 11:35 am #

          Do stick with 5×2. Nested Cross Validation is widely accepted, and way ahead of regular k-fold.
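
Both configurations mentioned in this thread can be expressed with `RepeatedKFold`, which can be passed to `cross_val_score` in place of `KFold`. A sketch on synthetic data (the model and the repeat counts are illustrative choices):

```python
# Sketch: repeated k-fold and a 5x2 configuration with RepeatedKFold.
# ASSUMPTION: synthetic data; model and repeat counts are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
model = LogisticRegression(max_iter=1000)

# 10-fold cross validation repeated 3 times: 30 scores.
repeated_10fold = RepeatedKFold(n_splits=10, n_repeats=3, random_state=7)
scores_10x3 = cross_val_score(model, X, Y, cv=repeated_10fold)

# 2-fold cross validation repeated 5 times (the "5x2" layout): 10 scores.
five_by_two = RepeatedKFold(n_splits=2, n_repeats=5, random_state=7)
scores_5x2 = cross_val_score(model, X, Y, cv=five_by_two)

print("10x3: %.3f (%.3f)" % (scores_10x3.mean(), scores_10x3.std()))
print("5x2:  %.3f (%.3f)" % (scores_5x2.mean(), scores_5x2.std()))
```

The repeated version costs three times as many model fits as plain 10-fold, which is the computational trade-off described above.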

  7. Othmane January 5, 2017 at 3:16 am #

    Hi Jason,

    Thanks a lot for this good article.
    Could you please give some interpretations of the standard deviation values?
    Especially regarding overfitting.
    I thought that in case we have a small standard deviation of the cv results, we will have more overfitting, but I am not sure about that.


  8. Dhrubajit January 31, 2017 at 8:20 pm #

    Hi, from the boxplot, we get LR and LDA to have higher accuracy, so we select them as our models.
    So now, can I apply train_test_split to check the RMSE and the accuracy for the testing data using both these models. Whichever gives the best result, I will make that my final model?

    • Jason Brownlee February 1, 2017 at 10:47 am #

      Hi Dhrubajit,

      There are many ways to choose a final model. Often we prefer a model with better average performance rather than better absolute performance, this is because of the natural variance in the estimation of performance of the models on unseen data.

      Once you choose a final model, train it on all available data and you can start to use it to make predictions.

  9. Peter February 3, 2017 at 11:26 pm #

    Great article! Could you please explain to me why this program doesn’t work when Y is a float?

    • Jason Brownlee February 4, 2017 at 10:01 am #

      Hi Peter,

      Classification problems assume the outcome is a label.

  10. Edmond Sesay March 24, 2017 at 8:55 pm #

    great post! thanks for sharing

  11. Nitesh May 2, 2017 at 4:04 pm #

    Hi Jason,

    I have started learning and implementing Machine learning algorithms.
    One question: the above blog will tell us which machine learning algorithm to go with. However, if we are using regression, should we also check how well the regression fits the data by checking
    autocorrelation, multicollinearity and normality?

    What I have learnt from reading blogs and articles is that we all calculate a score using cross validation and then find out which model fits best. I have not seen anyone following traditional approaches such as checking autocorrelation, multicollinearity and normality. I might be wrong. Please shed some light on this.
    Thanks, Nitesh

    • Jason Brownlee May 3, 2017 at 7:31 am #

      Yes, on time series, an understanding of the autocorrelation is practically required.

      When using a linear method, an idea of multicollinearity can be helpful.

      I would suggest this type of analysis before investigating models to get a better idea of the structure of your problem.

  12. Andrea Grandi November 27, 2017 at 5:40 am #

    Hi Jason! First of all thanks for all your blog posts, they are really helping me to better understand how to work with datasets and machine learning algorithms.

    I’ve a question related to the scoring method. Before discovering the method you are using here, I was using the .score() method in this way (assume I have already split the dataset 80/20 and transformed the data):

    from sklearn.svm import LinearSVC
    lin_svc = LinearSVC().fit(train_set_scaled, train_set_labels)
    lin_svc.score(test_set_scaled, test_set_labels)

    getting a score that was similar but different to the one I obtain with the method explained in this post. What’s the difference between using score() and using cross_val_score() ?
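
The two approaches in this question can be put side by side. `.score()` after a single 80/20 split gives one accuracy estimate from one particular test set, whereas `cross_val_score()` refits the model on each of k training portions and returns k estimates, so the numbers are expected to be similar but not equal. A sketch on synthetic data, with scaling placed in a pipeline as an assumption about the preprocessing:

```python
# Sketch: a single hold-out score versus k-fold cross validation scores.
# ASSUMPTION: synthetic data; StandardScaler stands in for the scaling step.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
model = make_pipeline(StandardScaler(), LinearSVC(random_state=7))

# One estimate: fit on 80% of the rows, score on the remaining 20%.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=7)
holdout_score = model.fit(X_train, Y_train).score(X_test, Y_test)

# Ten estimates: every row is used for testing exactly once.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(model, X, Y, cv=kfold)

print("hold-out: %.3f" % holdout_score)
print("10-fold:  %.3f (%.3f)" % (cv_scores.mean(), cv_scores.std()))
```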


  13. Ruchita December 25, 2017 at 12:33 am #

    While comparing algorithms using the same code as above, I got the error ‘could not convert string to float’.
    Can you please tell me how to tackle it.

    • Jason Brownlee December 25, 2017 at 5:24 am #

      Confirm that you copied the code exactly and that you are using the same data file. Also confirm that all of your Python libraries are up to date.

      • Ruchita December 25, 2017 at 5:04 pm #

        Thanks Jason.
        Actually I am using different dataset. My dataset is about stock related. So what can I do for it while comparing the algorithms.

  14. Pedro January 4, 2018 at 10:57 pm #

    Thank you so much for this tutorial. It really helps one using Machine Learning in sklearn.
    One question: I’m trying to use this code with my dataset, but I have features that are strings and not numbers, as in your dataset.
    What can I do to change the code in order for it to work? (I’m getting an error saying “could not convert string to float”.)
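
One common way to handle string features is to one-hot encode them inside a pipeline, so the encoding is learned on each training fold. The sketch below is only an illustration: the toy columns (`colour`, `size`) and the rule generating `Y` are made up for the example, and the right encoding depends on your data.

```python
# Sketch: one-hot encoding a string column inside the cross validation loop.
# ASSUMPTION: the toy features and the labelling rule are invented here.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.RandomState(7)
n_rows = 200
colour = rng.choice(["red", "green", "blue"], size=n_rows)  # string feature
size = rng.normal(size=n_rows)                              # numeric feature
Y = ((colour == "red") | (size > 0.5)).astype(int)

# Mixed-type input: column 0 holds strings, column 1 holds numbers.
X = np.empty((n_rows, 2), dtype=object)
X[:, 0] = colour
X[:, 1] = size

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), [0]),
    ("numeric", StandardScaler(), [1]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
print("Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

Any of the models in the comparison can take the place of the `classifier` step, so the rest of the harness stays unchanged.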

  15. Vikas January 9, 2018 at 4:31 pm #

    Hello, Dr. Jason! Thank you so much for this wonderful article. I have a question for you. I noticed you have not mentioned feature selection and feature engineering in the Python mini course. So, my question is that if we were to implement both of these tasks, what should be the order with respect to this present stage of spot checking and comparison of machine learning algorithms? Should we first select one or two best performing models after comparison and then implement feature selection and feature engineering or first implement them and then perform spot checking and model comparison?

    Thank you in advance.

    • Jason Brownlee January 10, 2018 at 5:21 am #

      Generally before spot checking.

      • Vikas January 10, 2018 at 4:18 pm #

        Thank you for answering! But, that “generally” is really not helping. Can you please explain in which cases it is suggested to do that and in which cases not? I think it’s really important for me to learn.

        • Jason Brownlee January 11, 2018 at 5:47 am #

          No, it depends on the data and the project. I have to speak in generalities because I do not have the capacity to get involved in everyone’s project. Sorry.

          • Vikas January 11, 2018 at 4:47 pm #

            Oh, okay. I hope to learn it as I do more projects. Thank you!

  16. Emma Gallop January 25, 2018 at 2:04 am #

    Thank you for the brilliant tutorial! If we have split our data in to train and test sets and wanted to know the accuracy of the trained model on the held out test data, could we do:

    cv_results = model_selection.cross_val_score(model, X_test, Y_test, cv=kfold, scoring=scoring)

    Or would we need to use .fit on the training data before testing on the test data? I hope this makes sense! Thank you.

    • Jason Brownlee January 25, 2018 at 5:57 am #

      Generally, we would fit the model on all of the training set, make predictions for the test set, then evaluate the skill of the predictions on the test set.

  17. jason March 10, 2018 at 9:16 am #

    Thanks for a great tutorial! I can follow the logic, but it seems UCI has taken down the pima indians dataset.

  18. Risko Ruus April 2, 2018 at 5:09 pm #

    Thank you for the post Jason. I am curious, however, which scores are normally reported when comparing ML models fitted with slightly different features. E.g. there is a set of features that both models A and B share; A has been fitted with a single unique feature, and B has one as well, which A is not trained with. When evaluating such models for comparison, should we report and compare the scores on the test set or use cross_val_score with all the data?

    • Jason Brownlee April 3, 2018 at 6:28 am #


      Perhaps you could pick a measure that is relevant to the general domain, it could be something generic such as model accuracy or prediction error.

      • Risko Ruus April 4, 2018 at 1:51 am #

        Thank you for the response, I was thinking of either using MAE or MSE since mine is a regression problem and see, which model achieves the lowest score.
        However I don’t know, what is the general practice, should we compare the two models like:

        A: Using cross_val_score to report MSE with the regressor and X, y where X and y is the entire data (no train/test split)
        B: Doing a train/test split, fitting the model with train data and then making predictions using the test data and compare the MSE scores of the models

        • Jason Brownlee April 4, 2018 at 6:16 am #

          The choice and configuration of the test harness is a big part of the challenge of applied machine learning.

          It must be tailored to your specific problem.
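
Option A from this thread might look like the sketch below: two feature sets that share most columns are scored on identical folds using mean squared error. The data, the column indices and the choice of linear regression are all assumptions made for illustration. scikit-learn reports this scorer as a negative number (higher is better), so the sign is flipped before printing.

```python
# Sketch: comparing two feature sets by cross-validated MSE.
# ASSUMPTION: synthetic data; column choices and the model are illustrative.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0,
                       random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

shared = list(range(8))             # features both models use
feature_sets = {
    "A": shared + [8],              # model A adds one unique feature
    "B": shared + [9],              # model B adds a different one
}

mse = {}
for name, columns in feature_sets.items():
    scores = cross_val_score(LinearRegression(), X[:, columns], y,
                             cv=kfold, scoring="neg_mean_squared_error")
    mse[name] = -scores.mean()      # flip the sign back to a plain MSE
    print("%s: MSE %.2f" % (name, mse[name]))
```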

  19. Habiba May 15, 2018 at 7:28 pm #

    Hi Dr. Jason,
    Thank you for this wonderful post. I have a question about complexity. We usually use accuracy to score the performance of learning algorithms. Is there any provision for measuring the time taken (i.e. speed) for the algorithm to complete the given task? I added a timeit call to some code I used from one of your posts, as follows:

    %timeit results = cross_val_score(bcancermodelb, X, Y, cv=kfold,scoring=scoring)
    print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

    and I ended up with the following result:

    19.2 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    Accuracy: 93.420% (3.623%).

    Can this be a good measure for the time taken for a particular model?

    • Jason Brownlee May 16, 2018 at 6:01 am #

      Generally no. If this is an important consideration in your model, then you can take it into account.
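
If timing does matter for your project, `cross_validate` records how long fitting and scoring took on each fold alongside the scores, which separates training time from prediction time. A sketch on synthetic data; the absolute numbers depend entirely on the machine and the dataset:

```python
# Sketch: per-fold fit and score times from cross_validate.
# ASSUMPTION: synthetic data; the two models are arbitrary examples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.svm import SVC

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

timings = {}
for name, model in [("LR", LogisticRegression(max_iter=1000)), ("SVM", SVC())]:
    cv_results = cross_validate(model, X, Y, cv=kfold, scoring="accuracy")
    timings[name] = cv_results
    print("%s: accuracy %.3f, fit %.4fs, score %.4fs per fold"
          % (name,
             cv_results["test_score"].mean(),
             cv_results["fit_time"].mean(),
             cv_results["score_time"].mean()))
```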

  20. Inayat May 24, 2018 at 3:02 am #

    Hi Jason,
    I am a little confused about the box plot I am getting. I compared the accuracy of 9 algorithms on my dataset, which has more than 90 features. The results are a bit confusing: only LDA shows more than 90% accuracy, while the other models range from 20% to 40%. Is the plot valid, and can I consider LDA the best model?

    • Jason Brownlee May 24, 2018 at 8:20 am #

      Perhaps inspect the raw data to confirm the finding.

  21. evangelyn May 28, 2018 at 6:15 pm #

    hi i get this error , why would it be?

    ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).

    • Jason Brownlee May 29, 2018 at 6:24 am #

      Perhaps check that your input data was loaded correctly?
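
A quick way to check for this error is to count the missing values per column. If they are genuinely present in the data, one option is to fill them inside a pipeline. The sketch below injects NaN values into synthetic data on purpose and uses median imputation, which is only one of several possible strategies.

```python
# Sketch: locating NaN values and imputing them inside a pipeline.
# ASSUMPTION: synthetic data with NaNs injected; median fill is one option.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
rng = np.random.RandomState(7)
X[rng.rand(*X.shape) < 0.05] = np.nan   # knock out roughly 5% of the values

# Check: how many missing values does each column contain?
missing_per_column = np.isnan(X).sum(axis=0)
print(missing_per_column)

# The imputer is fitted on each training fold, then applied to the test fold.
model = make_pipeline(SimpleImputer(strategy="median"),
                      LogisticRegression(max_iter=1000))
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
print("Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```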

  22. evangelyn May 28, 2018 at 6:25 pm #

    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold,scoring=scoring)

    i probably have error in this line

    • Jason Brownlee May 29, 2018 at 6:24 am #

      Ensure you have copied the code exactly, preserving white space.

  23. Fritz June 23, 2018 at 10:50 pm #

    If I use the code above, I receive the following error message. (Python 3.6 – Spyder)

    for name, model in models:
    File “”, line 1
    for name, model in models:
    SyntaxError: unexpected EOF while parsing

    • Jason Brownlee June 24, 2018 at 7:32 am #

      Looks like you added a “,”.

      Be sure to copy the code exactly.

  24. Fritz June 23, 2018 at 10:57 pm #

    I found the mistake.

  25. rahul July 6, 2018 at 7:06 pm #

    explain this: ‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’

  26. Boris July 28, 2018 at 7:48 am #

    Hi Jason
    How can I get a confusion matrix per algorithm from your example? Also TP, FP, TN and FN to calculate precision and recall per algorithm? Thanks!
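
One possible approach is `cross_val_predict`, which returns an out-of-fold prediction for every row. Those predictions can then be passed to `confusion_matrix`. A sketch for two of the algorithms on synthetic data:

```python
# Sketch: a confusion matrix per algorithm from out-of-fold predictions.
# ASSUMPTION: synthetic data; only two of the algorithms are shown.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import KFold, cross_val_predict

X, Y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
]

counts = {}
for name, model in models:
    # Each row is predicted by a model that did not see it during training.
    predictions = cross_val_predict(model, X, Y, cv=kfold)
    # For binary labels 0/1 the matrix unravels as TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(Y, predictions).ravel()
    counts[name] = (tn, fp, fn, tp)
    print("%s: TN=%d FP=%d FN=%d TP=%d precision=%.3f recall=%.3f"
          % (name, tn, fp, fn, tp,
             precision_score(Y, predictions),
             recall_score(Y, predictions)))
```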

  27. Huseyin Varol Erdem August 23, 2018 at 6:52 am #

    Thank you so much for your all posts. They are extremely helpful.

  28. munaza ramzan September 22, 2018 at 5:00 pm #

    Hello Sir,

    The accuracy for my problem is low. Could you please suggest how to improve it?

    LDA: 0.581771 (0.052691)
    KNN: 0.523047 (0.054386)
    CART: 0.606641 (0.044246)
    NB: 0.562109 (0.089570)
    SVM: 0.554167 (0.099744)

  29. neama October 6, 2018 at 3:59 am #

    I have a problem with the code,
    in this line: kfold = model_selection.KFold(n_splits=6, random_state=seed)
    My dataset contains 4 numeric inputs and two classes.
    I need to know what seed and n_splits refer to.
    I need a quick reply, please.

  30. chichi October 29, 2018 at 1:32 am #

    Hello sir,
    I was trying to build a web tool as part of my project. It lets the user upload a dataset and plots the accuracy of different algorithms with this code. I tried this code with many datasets and found that it does not work for very large datasets, such as one with 2*10^6 rows and 30 columns. Can you tell me the approximate maximum size of dataset that would run properly?

    • Jason Brownlee October 29, 2018 at 5:59 am #

      I believe it will scale with the amount of RAM available.

      For larger datasets, it may be better to use progressive loading or a big data framework.
