How to Report Classifier Performance with Confidence Intervals

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.

This is important so that you can set the expectations for the model on new data.

A common mistake is to report the classification accuracy of the model alone.

In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.

Kick-start your project with my new book Statistics for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Andrew, some rights reserved.

Classification Accuracy

The skill of a classification machine learning algorithm is often reported as classification accuracy.

This is the percentage of correct predictions out of all predictions made. It is calculated as follows:

accuracy = (correct predictions / total predictions) * 100

A classifier may have an accuracy such as 60% or 90%, and how good this is only has meaning in the context of the problem domain.

Classification Error

When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.

This is because stakeholders often assume that models perform well; what they really want to know is how prone a model is to making mistakes.

You can calculate classification error as the ratio of incorrect predictions to the number of predictions made, expressed as a value between 0 and 1.

A classifier may have an error of 0.25 or 0.02.

This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
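
For readers who want to compute these quantities in code, here is a minimal Python sketch; the y_true and y_pred values are made up purely for illustration.

# Minimal sketch: classification accuracy and error from a set of predictions.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]  # made-up true labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]  # made-up predicted labels

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
total = len(y_true)

accuracy = correct / total  # proportion of correct predictions
error = 1.0 - accuracy      # proportion of incorrect predictions

print('Accuracy: %.1f%%' % (accuracy * 100.0))
print('Classification error: %.2f (%.1f%%)' % (error, error * 100.0))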

Validation Dataset

What dataset do you use to calculate model skill?

It is a good practice to hold out a validation dataset from the modeling process.

This means a sample is randomly selected and removed from the available data so that it is not used during model selection or configuration.

After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.

Confidence Interval

Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.

A confidence interval is made up of two things:

  • Range. This is the lower and upper limit on the skill that can be expected from the model.
  • Probability. This is the probability that the skill of the model will fall within the range.

In general, the confidence interval for classification error can be calculated as follows:

error +/- const * sqrt( (error * (1 - error)) / n)

Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is the normal approximation interval for a binomial proportion (not the Wilson score interval, which is a related but asymmetric interval).

The values for const are standard critical values from the Gaussian distribution, and common values used are:

  • 1.64 (90%)
  • 1.96 (95%)
  • 2.33 (98%)
  • 2.58 (99%)
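
To make the calculation concrete, here is a minimal Python sketch of the interval described above; the function name is arbitrary and the example numbers are the ones used in the worked example later in the post.

# Minimal sketch: normal-approximation confidence interval on classification error.
from math import sqrt

def error_confidence_interval(error, n, const=1.96):
    # error: classification error on the validation set (0.0 to 1.0)
    # n:     number of observations used to evaluate the model
    # const: value defining the chosen probability (1.96 gives 95%)
    margin = const * sqrt((error * (1.0 - error)) / n)
    # clip to the valid range for an error rate
    return max(0.0, error - margin), min(1.0, error + margin)

print(error_confidence_interval(0.02, 50))  # approximately (0.0, 0.0588)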

Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

  • Observations in the validation dataset were drawn from the domain independently (i.e. they are independent and identically distributed).
  • At least 30 observations were used to evaluate the model.

This is based on sampling theory: the error of a classifier is treated as a binomial proportion, we assume enough observations for the normal approximation to the binomial distribution to hold, and, via the central limit theorem, the more observations we classify, the closer we get to the true, but unknown, model skill.

Confidence Interval Example

Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).

We can calculate the 95% confidence interval (const = 1.96) as follows:

error +/- const * sqrt( (error * (1 - error)) / n)
0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50)
0.02 +/- 1.96 * 0.0198
0.02 +/- 0.0388

Or, stated another way:

There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.

Notice that the confidence interval on the classification error must be clipped to the range 0.0 to 1.0. It is impossible to have a negative error (i.e. less than 0.0) or an error of more than 1.0.

Summary

In this post, you discovered how to calculate confidence intervals for your classifier.

Specifically, you learned:

  • How to calculate classification accuracy and classification error when reporting results.
  • What dataset to use when calculating model skill that is to be reported.
  • How to calculate a lower and upper bound on classification error for a chosen level of likelihood.

Do you have any questions about classifier confidence intervals?
Ask your questions in the comments below.

86 Responses to How to Report Classifier Performance with Confidence Intervals

  1. Birkey June 2, 2017 at 3:12 pm #

    How does this (confidence interval) differ from the F1 score, which is widely used and, IMHO, easier to comprehend, since it's one score that covers both precision and recall?

    • Jason Brownlee June 3, 2017 at 7:20 am #

      The F1 is a skill measure for the model. It could be accuracy or anything else.

      In this post, we are talking about the confidence (uncertainty) on the calculated skill score.

  2. Elie Kawerk June 3, 2017 at 3:17 am #

    Hi Jason,

    Thank you for the nice post. This error confidence interval that you report corresponds to binary classification only. How about multi-class classification?

    Regards

    • Jason Brownlee June 3, 2017 at 7:25 am #

      Really great question. I expect you would use logloss or AUC and report confidence on that.

      • Elie Kawerk June 3, 2017 at 5:03 pm #

        I see,

        But then the expression of the confidence interval (for AUC or any other metric) would be different I presume since the process wouldn’t be described using the binomial distribution.

        For multi-class classification, wouldn’t the distribution be a multinomial distribution? And in this case the expression for the error confidence interval would change I presume.

        Regards
        Elie

        • Jason Brownlee June 4, 2017 at 7:49 am #

          I see, yes you are correct. I would recommend an empirical approach to summarizing the distribution using the bootstrap method (a post is scheduled).
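
          For anyone who wants to try the empirical approach described above, here is a minimal sketch of a percentile bootstrap on a held-out set; the labels, metric and number of resamples are arbitrary choices for illustration.

          # Minimal sketch: percentile bootstrap confidence interval for a metric.
          import numpy as np
          from sklearn.metrics import accuracy_score

          rng = np.random.default_rng(1)
          y_true = rng.integers(0, 3, size=200)             # made-up 3-class labels
          y_pred = np.where(rng.random(200) < 0.8, y_true,  # roughly 80% agreement
                            rng.integers(0, 3, size=200))

          scores = []
          for _ in range(1000):
              idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
              scores.append(accuracy_score(y_true[idx], y_pred[idx]))

          lower, upper = np.percentile(scores, [2.5, 97.5])
          print('95%% bootstrap interval: [%.3f, %.3f]' % (lower, upper))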

  3. Jonad June 4, 2017 at 1:23 am #

    Hi Jason,
    Really good post. But I have a question. Does the classification error differ if we use a different skill – for instance F1-score – for our model?
    Thanks

    • Jason Brownlee June 4, 2017 at 7:54 am #

      Hi Jonad,

      Different measures will evaluate skill in different ways. They will provide different perspectives on the same underlying model error.

      Does that make sense?

      • jonad June 5, 2017 at 2:45 am #

        Yes, I was thinking that the classification error formula ( incorrect predictions / total predictions) might differ depending on the evaluation metrics. Now I understand it better.
        Thanks

  4. Simone June 7, 2017 at 9:25 pm #

    Great post!
    How could I use confidence intervals and cross-validation together?

    • Jason Brownlee June 8, 2017 at 7:42 am #

      It’s a tough one, we are generally interested in the variance of model skill during model selection and during the presentation of the final model.

      Often standard deviation of CV score is used to capture model skill variance, perhaps that is generally sufficient and we can leave confidence intervals for presenting the final model or specific predictions?

      I’m open to better ideas.
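
      As a concrete illustration of summarizing CV skill with a mean and standard deviation, here is a minimal sketch; the dataset, model and k are arbitrary choices.

      # Minimal sketch: mean +/- 2 standard deviations of cross-validation accuracy.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      X, y = make_classification(n_samples=500, random_state=1)
      scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring='accuracy')
      print('Accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std() * 2))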

      • Simone June 9, 2017 at 6:32 pm #

        Ok, Thanks!
        The last question: when I’m using k-fold cv, the value of ‘n’ is equal to the number of all observations or all observations – k?

        • yerart September 15, 2019 at 7:41 am #

          @Simone the value of n is AFAIK an empirical value that is chosen to be 5 or 10. Jason explains that very well (as usual) in this post:

          https://machinelearningmastery.com/k-fold-cross-validation/

        • yerartdev September 15, 2019 at 7:45 am #

          Ah @Simone, by the way, if ‘n’ is equal to the number of all observations that is a type of cross-validation that is called LOOCV (Leave-one-out cross-validation) and uses a single observation from the original sample as the validation data, and the remaining observations as the training data.

      • yerart September 15, 2019 at 8:00 am #

        @Jason what about this? (I haven’t fully read it yet and I’m struggling to understand it but .. well, I think it might work as well):

        Mach Learn. 2018; 107(12): 1895–1922. Published online 2018 May 9. doi: 10.1007/s10994-018-5714-4. PMCID: PMC6191021, PMID: 30393425. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Ioannis Tsamardinos, Elissavet Greasidou (corresponding author), and Giorgos Borboudakis.

        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191021/

    • yerart September 15, 2019 at 7:36 am #

      Well @Simone, from the point of view of a developer if you take a look at the scikit-learn documentation and go over the section “3.1.1. Computing cross-validated metrics” (https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics) you will see that the 95% confidence interval of the score estimate is reported as Jason states in this post.

  5. Sathish February 6, 2018 at 8:47 am #

    Hi Jason
    Is there R code for calculating the CI and graphing them?

    Thanks

    • Jason Brownlee February 6, 2018 at 9:27 am #

      I bet there is, I don’t have it on hand, sorry.

  6. Thomas February 12, 2018 at 11:25 pm #

    The error is just the reverse of the accuracy, wouldn’t that be a simpler statement to make?

    This leads to the fundamental problem that accuracy or classification error itself is often a mediocre-to-useless metric because datasets are usually imbalanced. And hence the confidence interval on that error is just as useless.

    I found this post for a different reason, as I wanted to find out if anyone else does what I do, namely provide metrics grouped by class probability. What is the precision if the model has 0.9 class probability vs 0.6, for example? That can be very useful information for end users because the metric will often vary greatly based on class probability.

  7. Peipei May 3, 2018 at 12:02 am #

    Hi Jason,

    Nice post. When calculating the confidence interval for error, AUC or other metrics, the standard error of the metric is needed. How should I calculate the standard error?

    • Jason Brownlee May 3, 2018 at 6:34 am #

      Great question, here is the equation:
      https://en.wikipedia.org/wiki/Standard_error

      • Peipei May 3, 2018 at 7:48 pm #

        Thanks for replying. Does this mean I need to get multiple errors by running multiple times (bootstrap or cross-validation) to calculate the standard error?

        • Jason Brownlee May 4, 2018 at 7:43 am #

          Yes, if you are looking to calculate the standard error of the bootstrap result distribution.
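
          A minimal sketch of that calculation: the bootstrap standard error of a metric is the standard deviation of the metric over the bootstrap resamples (the scores below are made-up error values).

          import numpy as np

          # made-up classification errors collected from bootstrap resamples
          scores = [0.21, 0.19, 0.24, 0.20, 0.22, 0.18, 0.23, 0.20, 0.21, 0.19]
          standard_error = np.std(scores, ddof=1)
          print('Bootstrap standard error: %.4f' % standard_error)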

  8. Manoj May 3, 2018 at 1:33 pm #

    Hi Jason,

    I am trying to group my customers. Say GAP HK, GAP US should be under the group customer GAP.

    Few of the customers are already grouped. Say GAP HK is grouped under GAP but GAP US is not.

    I am using random forest classifier. I used already grouped customer name as training data. Group customer code is the label that I am trying to predict.

    The classifier is assigning labels as expected. The problem I am facing is that the classifier is also assigning labels or group customer codes to customers even though the customer name does not match closely with the training data. It is doing the best possible match. This is a problem for me because I need to manually ungroup these customers. Can you suggest how to overcome this problem? Is it possible to know the classifier's probability for each predicted label? If yes, then I can ignore the ones with low probability.

    Thank you in advance for advice.

  9. Anish July 12, 2018 at 3:31 am #

    Hi Jason,
    I am not sure if anyone else brought this up, but I've found one issue here. The confidence interval-based measure you suggested is not the “Wilson score interval”, according to the Wikipedia page (which is cited in that link). It's actually the “normal approximation interval”, which is above the Wilson score paragraph. Correct me if I am wrong.

    Thanks
    -Anish

  10. AB September 4, 2018 at 1:15 am #

    Hi Jason,
    I’m interested on the relation of Cross Validation and this approach.

    With 150 examples I decide to use a 100 repeated 5-fold Cross Validation to understand the behavior of my classifier. At this point I have 100×5 results and I can use the mean and std dev of the error rates to estimate the variance of the model skills:

    mean(errorRate) +/- 1.96*(std(errorRate))

    I could estimate the confidence interval of the true error (that I would obtain on the unseen data) using the average error rate:

    mean(errorRate) +/- const * sqrt( (mean(errorRate) * (1 - mean(errorRate))) / n)

    Two questions:
    1. Do you think this approach is correct?
    2. Is it correct to set n=150 in the second equation, or should I use the average number of observations used as the test set in each fold of CV?

  11. Kostas Theodor September 15, 2018 at 2:22 am #

    Hi Jason, thanks for the great posts on confidence intervals/ bootstraps for machine learning.

    Suppose you use
    A) 5-fold CV
    B) 30-fold CV

    for model evaluation. You pick the final model and train it on all the data at hand.

    What are the options one has for reporting on final model skill with a range for uncertainty in each case?
    Should one have still held out a number of datapoints for validation+binomial confidence interval?
    Is it too late to use the bootstrap confidence intervals as the final model was trained?

    Thanks

    • Jason Brownlee September 15, 2018 at 6:16 am #

      Not sure I follow your question?

      Pick a final model and use a preferred method to report expected performance. It is unrelated to how you chose that model.

      • Kostas Theodor September 16, 2018 at 10:29 pm #

        Thanks Jason. I found your other post https://machinelearningmastery.com/difference-test-validation-datasets/ very helpful.
        Can I confirm that the above procedure of reporting classifier performance with confidence intervals is relevant for the final trained model? If that is so, it seems that the validation dataset mentioned should be called test set to align with the definitions of the linked post?

  12. mars October 6, 2018 at 6:30 am #

    Hi Jason,

    Thank you for the post!

    In your example you use accuracy and error rate and calculate a confidence interval.

    Can one replace “error rate” with, say, precision, recall or f1? Why and why not?

    For example, say we have a sample size=50, f1=0.02
    Does that mean …

    there is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true F1 of the model on unseen data?

    Thanks!

    • Jason Brownlee October 6, 2018 at 11:42 am #

      Perhaps for some scores. The example in this post is specific to a ratio. I believe you can use it for other ratios like precision, recall and F1.

  13. jecy November 1, 2018 at 12:49 pm #

    Hi Jason

    Thank you for your post.

    How do I get the standard error of the AUC curve in Python?

    • Jason Brownlee November 1, 2018 at 2:33 pm #

      Not sure I follow. Standard error refers to a statistical quantity on a distribution, not sure how you would calculate it for a curve.

  14. Lorenzo Famiglini December 29, 2018 at 1:17 am #

    Hello Jason,
    I was wondering if I can compute confidence intervals for recall and precision. If yes, can you explain how I can do this?
    Thank you so much,
    best regards

    Lorenzo

    • Jason Brownlee December 29, 2018 at 5:54 am #

      Yes, I expect the bootstrap would be a good place to start.

  15. Franco Arda April 14, 2019 at 7:23 am #

    Another excellent post Jason. Thank you.

    There might be a minor typo in “[0.0, 0.0588]” – it should be [0.0, 0.0388] I think.

  16. Daniel Wigmore April 19, 2019 at 12:12 am #

    I am running a classifier with a training set of 41 and a validation set of 14 (55 total observations). I rerun this 50 times with different random slices of the data as training and test. Obviously I cannot make confidence intervals with this small validation set.

    However, because I am rerunning it with different training and validation slices, can I get the mean error rate over the 50 tests and calculate a confidence interval?

    const * sqrt( (error * (1 - error)) / n)

    N would be 700 (14*50). If I had 50 tests which averaged out to an accuracy of 77.4% (error is 0.226), the confidence interval bounds would be 0.26 and 0.2.

    Does this work? or would these confidence intervals be unreliable?

    Thanks for your excellent article

    Dan

    • Jason Brownlee April 19, 2019 at 6:12 am #

      Why would they be unreliable?

      What are you worried about exactly – the dataset selection for each trial?

      • Daniel April 23, 2019 at 5:37 am #

        Hi Mr Brownlee,

        Thank you for the quick reply and apologies for my late response. I am dealing with social science data and the validation set is rather limited. I am worried about ascertaining confidence intervals for a limited validation sample.

        A friend of mine came up with a solution in which I keep all the accuracy outputs in a vector and plot them like a histogram (I can’t seem to paste one into this reply window but can send it over if necessary by email).

        Would I be able to get the confidence intervals by looking at the 5th and 95th percentile of the accuracy vector?

        Would there be an advantage to randomly sampling (bootstrapping) with replacement over without replacement?

        Thanks again

        Dan

        • Jason Brownlee April 23, 2019 at 7:58 am #

          Just repeats without bootstrap? I think the distribution would only capture the variance in the model, not the data.

          I would encourage you to use the bootstrap to calculate the distribution of accuracy scores.

  17. Daniel April 24, 2019 at 10:03 pm #

    Thanks Jason,

    Final question,

    My trained model has the below output and the best tuned number of neighbours is 5.

    I am creating confidence intervals through creating a histogram of the accuracies in the resample. Is there a way to subset the resample results in which the model is best tuned (in this case k = 5)?

    No pre-processing
    Resampling: Bootstrapped (1000 reps)
    Summary of sample sizes: 32, 32, 32, 32, 32, 32, …
    Resampling results across tuning parameters:

    k Accuracy Kappa
    5 0.7262690 0.4593792
    7 0.6830904 0.3830819
    9 0.6655405 0.3522427

    Accuracy was used to select the optimal model using the largest value.
    The final value used for the model was k = 5.

    • Jason Brownlee April 25, 2019 at 8:15 am #

      I recommend using a new procedure with the chosen config and the bootstrap to estimate the confidence intervals on model performance.

  18. Daniel April 25, 2019 at 5:50 am #

    sorry as an update, I am selecting ‘final’ as the returnResamp argument in the train control method. I believe that this should retain the resamples from only the best-tuned model

    But typically when I check the mean of the resample, mean(model$resample$Accuracy), the mean is lower than the k=5 accuracy (typically 0.65). Is there a reason for this? I would have thought that the mean accuracy of the best tune resamples would equal the model accuracy in the results.

    After this I promise to leave you alone (and thanks for your patience so far)

  19. Kalen Gordon June 18, 2019 at 7:19 am #

    This was very helpful for classification.

    How would I go about calculating confidence intervals for regression analysis?
    Could I use the same formula, but instead of classification error use MAE, or should I use RMSE?

  20. Fajar October 28, 2019 at 9:25 pm #

    Hi Jason, my question is not too related to this topic, only slightly

    I have a neural network (MLP) for binary classification with a logistic output between 0 and 1. With each run, I have to adjust my threshold on the test set to minimize misclassifications. My question is: to present my results, should I run it multiple times, adjust the threshold each time and then take the average of other metrics, e.g. the F1 score, or should I not optimize for the threshold at all?

    • Jason Brownlee October 29, 2019 at 5:24 am #

      Hmmm, good question.

      I would take the test as an evaluation of the “system” that includes the model and automatic threshold adjusting procedure. In that case, averaging the results of the whole system is reasonable, as long as you clearly state that is what you are doing.

      Very cool!

  21. dave February 14, 2020 at 1:06 am #

    Hi @Jason I have a question related to this topic:

    In the following thesis, http://arno.uvt.nl/show.cgi?fid=147278, the author computes the AUC standard deviation as a measure of robustness.

    Let’s say I have run a repeated (10) 10-fold cross-validation experiment with predictions implemented via a Markov chain model. As a measure of robustness, I want to compute the SD of the AUC across the runs/folds for the test set.

    Intuitively, a relatively small standard deviation implies that the model produces stable results in distinguishing conversion from non-conversion.

    The project is based on a 10x10 cross-validation procedure, which as a consequence generates 100 AUCs.

    Now, to derive the AUC’s SD summarizing the model I understand the process should be (it is not well defined in the paper):

    a. The SD is computed across folds based on the actual folds' AUC values (we derive 10 SDs, one for each repetition).

    b. The SDs obtained at step a are then averaged across runs (leaving us with 1 SD).

    However, as there is not much literature on the topic, I want to know if someone can validate the reasoning above.

    Any help or suggestion appreciated

    • Jason Brownlee February 14, 2020 at 6:38 am #

      Yes.

      You would collect a sample of AUC scores across all repeats and all folds. Then calculate summary stats like the mean and standard deviation.
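
      A minimal sketch of collecting AUC scores across all repeats and folds and summarizing them; the data, model and 10x10 setup are arbitrary choices for illustration.

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

      X, y = make_classification(n_samples=500, random_state=1)
      cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
      aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='roc_auc', cv=cv)  # 100 AUC scores
      print('AUC mean: %.3f, standard deviation: %.3f' % (np.mean(aucs), np.std(aucs)))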

  22. John April 9, 2020 at 9:59 am #

    The Wilson score is different; the one you're describing is the "normal approximation interval" according to Wikipedia: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval

    Wilson score interval is asymmetric

  23. charu June 19, 2020 at 8:22 pm #

    How do you calculate the confidence interval for a binary classification problem with an imbalanced dataset, where it is not possible to balance the data?

  24. Siva Karthik Gade September 4, 2020 at 4:25 pm #

    Thank you Jason – great article!

    One follow up question regarding minimum sample size requirement for (sample) error rate to satisfy normal distribution approximation ->

    I have come across a few posts/slides around the CLT which state that, in order for the sample proportion (or mean or error rate) of a binomial distribution to be approximated by a normal distribution (to compute a confidence interval), it should satisfy the following 2 conditions:
    np > 10
    n(1-p) > 10
    e.g. ref. http://homepages.math.uic.edu/~bpower6/stat101/Sampling%20Distributions.pdf

    In the current example, for sample size (n) = 50 & error rate = 0.02, n*p (50 * 0.02 = 1) is not satisfying np > 10.

    In this case, should we increase n to satisfy above requirements (or) Is there something else I am missing here?

    • Jason Brownlee September 5, 2020 at 6:40 am #

      Thanks for sharing.

      Perhaps try increasing the sample size and see if it influences the result.

  25. Manuel Gonçalves October 3, 2020 at 11:49 am #

    How do you compute a CI with cross-validation? Do we use the CI on the mean results? Is the n value the test portion or the sum of them? The formula used in https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics, multiplying std * 2, was unclear.

    • Jason Brownlee October 3, 2020 at 12:30 pm #

      You can use the bootstrap method to get a reliable estimate.

      • Manuel Gonçalves October 4, 2020 at 11:12 am #

        Perhaps an unreliable estimate; is there a reference or paper/book where this formula came from? Cross-val will give me a list [acc1, acc2, acc3, … acc30] of accuracies, and I just compute the mean +/- 2 * std to represent the C.I. (CI = 2 * std). What about the “n” value, or the 1.96 z-value?

          • Manuel Gonçalves October 24, 2020 at 1:03 am #

            Thanks a lot for the comments… So what do you think of this scenario?

            Step one – Split train/validation 80/20 and use the train (80%) in cross-validation to get performance metrics to show as means and std.
            Step two – Train a final model with bootstrap on the 20% left and compute performances with confidence intervals.

            Is this a valid scenario? Any recommendations, or can I use nested cross-validation for this? Is it valid to do bootstrap inside a CV loop?

          • Jason Brownlee October 24, 2020 at 7:05 am #

            Not sure I like it – unless you have TONS of data. But if it works for you, go for it!

            Nested CV is good for choosing a model and hyperparameters in a single test harness.

            Final evaluation for reporting could be a bootstrap on the entire dataset you have available.

      • Manuel Gonçalves October 4, 2020 at 11:24 am #

        There is an open discussion concerning this formula here: https://github.com/scikit-learn/scikit-learn/issues/6059

  26. Manuel October 24, 2020 at 1:54 am #

    This post concludes that there is no way to compute a reliable CI inside a cross-validation schema. What about the dozens of papers that show CIs from CV results?

    • Jason Brownlee October 24, 2020 at 7:07 am #

      They are likely summarizing the mean and standard deviation of the CV process itself. This is very common.

  27. Manuel October 27, 2020 at 3:55 am #

    In your tutorial, the CI was computed on a single execution of train/test split, but what about repeated executions, e.g., 30 times?

  28. Sourabh April 19, 2021 at 9:39 pm #

    Hi Jason ,
    What does the term error * (1 - error) tell us?

    • Jason Brownlee April 20, 2021 at 5:57 am #

      You can learn more about where/how this was derived in the resources listed in the further reading section.

  29. ASP November 12, 2021 at 8:56 am #

    Hello Jason,
    If I need to come up with a CI for predictions for an existing model and the reported error is the MAE on the test samples, any suggestions on how I can go about this? Thanks for your time.

  30. Bahar February 8, 2023 at 11:20 pm #

    Hi Jason

    Thank you for your post.

    Can you guide me on how I can apply a CI to the F1 score?

    Should I employ the bootstrap, or can I use this formula?

    F1 +/- const * sqrt( (F1 * (1 - F1)) / n)

  31. Bahar March 11, 2023 at 9:14 pm #

    Hello Dear Jason,

    Does a CI only report the performance of several classifiers trained on the same dataset? I have the reverse scenario.

    There are two datasets: (i) data generated by a GAN and (ii) the original data. An SVM has been trained on both datasets.

    Now, can I use a CI to compare the performance of these two results: (i) the SVM trained on artificial data and (ii) the SVM trained on original data?
