How to Calculate Precision, Recall, F1, and More for Deep Learning Models

Once you fit a deep learning neural network model, you must evaluate its performance on a test dataset.

This is critical, as the reported performance allows you both to choose between candidate models and to communicate to stakeholders how good the model is at solving the problem.

The Keras deep learning API is very limited in terms of the metrics that you can use to report model performance.

I am frequently asked questions, such as:

  • How can I calculate the precision and recall for my model?
  • How can I calculate the F1-score or confusion matrix for my model?

In this tutorial, you will discover how to calculate metrics to evaluate your deep learning neural network model with a step-by-step example.

After completing this tutorial, you will know:

  • How to use the scikit-learn metrics API to evaluate a deep learning model.
  • How to make both the class and probability predictions with a final model that the scikit-learn API requires.
  • How to calculate precision, recall, F1-score, ROC AUC, and more with the scikit-learn API for a model.


Let’s get started.

How to Calculate Precision, Recall, F1, and More for Deep Learning Models
Photo by John, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Binary Classification Problem
  2. Multilayer Perceptron Model
  3. How to Calculate Model Metrics

Binary Classification Problem

We will use a standard binary classification problem as the basis for this tutorial, called the “two circles” problem.

It is called the two circles problem because the problem is comprised of points that when plotted, show two concentric circles, one for each class. As such, this is an example of a binary classification problem. The problem has two inputs that can be interpreted as x and y coordinates on a graph. Each point belongs to either the inner or outer circle.

The make_circles() function in the scikit-learn library allows you to generate samples from the two circles problem. The “n_samples” argument allows you to specify the number of samples to generate, divided evenly between the two classes. The “noise” argument allows you to specify how much random statistical noise is added to the inputs or coordinates of each point, making the classification task more challenging. The “random_state” argument specifies the seed for the pseudorandom number generator, ensuring that the same samples are generated each time the code is run.

The example below generates 1,000 samples, with 0.1 statistical noise and a seed of 1.
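A minimal sketch of this step is shown here, assuming scikit-learn is installed. The variable names X and y are illustrative choices rather than names taken from the original listing.

```python
from sklearn.datasets import make_circles

# generate 1,000 samples with 0.1 noise and a fixed seed of 1
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# X holds the two input coordinates, y holds the class label (0 or 1)
print(X.shape, y.shape)
```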

Once generated, we can create a plot of the dataset to get an idea of how challenging the classification task is.

The example below generates samples and plots them, coloring each point according to the class, where points belonging to class 0 (outer circle) are colored blue and points that belong to class 1 (inner circle) are colored orange.
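One way to sketch this plot is with matplotlib, which is an assumption here since the text does not name a plotting library. The output filename two_circles.png is hypothetical, and the non-interactive Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

# generate the dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# one scatter call per class: class 0 (outer) in blue, class 1 (inner) in orange
fig, ax = plt.subplots()
for class_value, color in [(0, "blue"), (1, "orange")]:
    mask = y == class_value
    ax.scatter(X[mask, 0], X[mask, 1], c=color, label=str(class_value))
ax.legend()

# hypothetical output filename; swap for plt.show() in an interactive session
fig.savefig("two_circles.png")
```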

Running the example generates the dataset and plots the points on a graph, clearly showing two concentric circles for points belonging to class 0 and class 1.

Scatter Plot of Samples From the Two Circles Problem

Multilayer Perceptron Model

We will develop a Multilayer Perceptron, or MLP, model to address the binary classification problem.

This model is not optimized for the problem, but it is skillful (better than random).

After the samples for the dataset are generated, we will split them into two equal parts: one for training the model and one for evaluating the trained model.
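The split can be sketched with simple array slicing. The names n_train, trainX, trainy, and testX are assumptions made for illustration; testy matches the name used later in this tutorial. Slicing is reasonable here because make_circles() shuffles the samples by default.

```python
from sklearn.datasets import make_circles

# generate the dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# first half for training, second half for testing
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)
```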

Next, we can define our MLP model. The model is simple, expecting 2 input variables from the dataset, a single hidden layer with 100 nodes, and a ReLU activation function, then an output layer with a single node and a sigmoid activation function.

The model will predict a value between 0 and 1 that will be interpreted as to whether the input example belongs to class 0 or class 1.

The model will be fit using the binary cross entropy loss function and we will use the efficient Adam version of stochastic gradient descent. The model will also monitor the classification accuracy metric.
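The model definition and compilation described above might be sketched as follows. This assumes the Keras API bundled with TensorFlow (tensorflow.keras); the original listing may have used the standalone keras package instead.

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

# two inputs, one hidden layer of 100 ReLU nodes, one sigmoid output node
model = Sequential([
    Input(shape=(2,)),
    Dense(100, activation="relu"),
    Dense(1, activation="sigmoid"),
])

# binary cross entropy loss, Adam optimizer, accuracy monitored during training
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```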

We will fit the model for 300 training epochs with the default batch size of 32 samples and evaluate the performance of the model at the end of each training epoch on the test dataset.

At the end of training, we will evaluate the final model once more on the train and test datasets and report the classification accuracy.

Finally, the performance of the model on the train and test sets recorded during training will be graphed using a line plot, one for each of the loss and the classification accuracy.

Tying all of these elements together, the complete code listing of training and evaluating an MLP on the two circles problem is listed below.

Running the example fits the model very quickly on the CPU (no GPU is required).

The model is evaluated, reporting the classification accuracy on the train and test sets of about 83% and 85% respectively.

Note, your specific results may vary given the stochastic nature of the training algorithm.

A figure is created showing two line plots: one for the learning curves of the loss on the train and test sets and one for the classification accuracy on the train and test sets.

The plots suggest that the model has a good fit on the problem.

Line Plot Showing Learning Curves of Loss and Accuracy of the MLP on the Two Circles Problem During Training

How to Calculate Model Metrics

Perhaps you need to evaluate your deep learning neural network model using additional metrics that are not supported by the Keras metrics API.

The Keras metrics API is limited and you may want to calculate metrics such as precision, recall, F1, and more.

One approach to calculating new metrics is to implement them yourself in the Keras API and have Keras calculate them for you during model training and during model evaluation.

For help with this approach, see the tutorial:

This can be technically challenging.

A much simpler alternative is to use your final model to make a prediction for the test dataset, then calculate any metric you wish using the scikit-learn metrics API.

Three metrics, in addition to classification accuracy, that are commonly required for a neural network model on a binary classification problem are:

  • Precision
  • Recall
  • F1 Score

In this section, we will calculate these three metrics, as well as classification accuracy, using the scikit-learn metrics API. We will also calculate three additional metrics that are less common but may be useful. They are:

  • Cohen’s Kappa
  • ROC AUC
  • Confusion Matrix

This is not a complete list of metrics for classification models supported by scikit-learn; nevertheless, calculating these metrics will show you how to calculate any metrics you may require using the scikit-learn API.

For a full list of supported metrics, see:

The example in this section will calculate metrics for an MLP model, but the same code for calculating metrics can be used for other models, such as RNNs and CNNs.

We can use the same code from the previous sections for preparing the dataset, as well as defining and fitting the model. To make the example simpler, we will put the code for these steps into simple functions.

First, we can define a function called get_data() that will generate the dataset and split it into train and test sets.
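A possible shape for get_data() is sketched here. The order of the returned values and the internal variable names are assumptions, not details confirmed by the text.

```python
from sklearn.datasets import make_circles

# generate and prepare the dataset
def get_data():
    # generate the two circles dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into two equal halves: train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # assumption: inputs and labels returned train first, then test
    return trainX, trainy, testX, testy
```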

Next, we will define a function called get_model() that will define the MLP model and fit it on the training dataset.
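A sketch of get_model() follows, assuming the Keras API bundled with TensorFlow. The epochs parameter is a hypothetical addition that defaults to the 300 epochs used earlier; it exists only so the function can be tried quickly with a smaller value.

```python
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

# define and fit the model
def get_model(trainX, trainy, epochs=300):
    # same MLP as before: 2 inputs, 100 ReLU hidden nodes, 1 sigmoid output
    model = Sequential([
        Input(shape=(2,)),
        Dense(100, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    # fit silently on the training data (default batch size of 32)
    model.fit(trainX, trainy, epochs=epochs, verbose=0)
    return model
```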

We can then call the get_data() function to prepare the dataset and the get_model() function to fit and return the model.

Now that we have a model fit on the training dataset, we can evaluate it using metrics from the scikit-learn metrics API.

First, we must use the model to make predictions. Most of the metric functions require a comparison between the true class values (e.g. testy) and the predicted class values (yhat_classes). In older Keras releases, we can predict the class values directly using the predict_classes() function on the model; this function was removed in TensorFlow 2.6, where the same result is obtained by thresholding the output of predict() at 0.5.

Some metrics, like the ROC AUC, require a prediction of class probabilities (yhat_probs). These can be retrieved by calling the predict() function on the model.

For more help with making predictions using a Keras model, see the post:

We can make the class and probability predictions with the model.

The predictions are returned in a two-dimensional array, with one row for each example in the test dataset and one column for the prediction.

The scikit-learn metrics API expects a 1D array of actual and predicted values for comparison, therefore, we must reduce the 2D prediction arrays to 1D arrays.
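The sketch below illustrates the prediction and reshaping steps with a small hand-made array standing in for the output of model.predict(testX), so the values are illustrative only. The class predictions are derived by thresholding the probabilities at 0.5, which is what predict_classes() did for a single sigmoid output.

```python
import numpy as np

# stand-in for model.predict(testX): one row per example, one column
yhat_probs = np.array([[0.91], [0.12], [0.48], [0.77], [0.50]])

# crisp class labels: probabilities strictly above 0.5 become class 1
yhat_classes = (yhat_probs > 0.5).astype("int32")

# reduce the 2D arrays of shape (n, 1) to 1D arrays of shape (n,)
yhat_probs = yhat_probs[:, 0]
yhat_classes = yhat_classes[:, 0]
print(yhat_probs.shape, yhat_classes)
```

With a real model, the first assignment would be replaced by a call such as model.predict(testX, verbose=0).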

We are now ready to calculate metrics for our deep learning neural network model. We can start by calculating the classification accuracy, precision, recall, and F1 scores.

Notice that calculating a metric is as simple as choosing the metric that interests us and calling the function passing in the true class values (testy) and the predicted class values (yhat_classes).
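As a small illustration, the four scores can be computed on hand-made arrays that stand in for testy and yhat_classes. The values below are made up for the sketch and are not results from the model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# stand-in true labels and predicted labels (3 TP, 1 FP, 1 FN, 3 TN)
testy = [0, 0, 1, 1, 1, 0, 1, 0]
yhat_classes = [0, 1, 1, 1, 0, 0, 1, 0]

# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(testy, yhat_classes)
print('Accuracy: %f' % accuracy)    # 0.75
# precision: tp / (tp + fp)
precision = precision_score(testy, yhat_classes)
print('Precision: %f' % precision)  # 0.75
# recall: tp / (tp + fn)
recall = recall_score(testy, yhat_classes)
print('Recall: %f' % recall)        # 0.75
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(testy, yhat_classes)
print('F1 score: %f' % f1)          # 0.75
```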

We can also calculate some additional metrics, such as the Cohen’s kappa, ROC AUC, and confusion matrix.

Notice that the ROC AUC requires the predicted class probabilities (yhat_probs) as an argument instead of the predicted classes (yhat_classes).
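The additional metrics can be sketched the same way. The arrays below are made-up stand-ins for testy, yhat_probs, and yhat_classes, where the classes are simply the probabilities thresholded at 0.5.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

# stand-in true labels, predicted probabilities, and predicted labels
testy = [0, 0, 1, 1, 1, 0, 1, 0]
yhat_probs = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]
yhat_classes = [0, 1, 1, 1, 0, 0, 1, 0]

# kappa is computed from the crisp class predictions
kappa = cohen_kappa_score(testy, yhat_classes)
print('Cohens kappa: %f' % kappa)  # 0.5
# ROC AUC is computed from the predicted probabilities
auc = roc_auc_score(testy, yhat_probs)
print('ROC AUC: %f' % auc)         # 0.9375
# confusion matrix: rows are true classes, columns are predicted classes
matrix = confusion_matrix(testy, yhat_classes)
print(matrix)
```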

Now that we know how to calculate metrics for a deep learning neural network using the scikit-learn API, we can tie all of these elements together into a complete example, listed below.

Running the example prepares the dataset, fits the model, then calculates and reports the metrics for the model evaluated on the test dataset.

Your specific results may vary given the stochastic nature of the training algorithm.

If you need help interpreting a given metric, perhaps start with the “Classification Metrics Guide” in the scikit-learn API documentation: Classification Metrics Guide

Also, check out the Wikipedia page for your metric; for example: Precision and recall, Wikipedia.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.





In this tutorial, you discovered how to calculate metrics to evaluate your deep learning neural network model with a step-by-step example.

Specifically, you learned:

  • How to use the scikit-learn metrics API to evaluate a deep learning model.
  • How to make both the class and probability predictions with a final model that the scikit-learn API requires.
  • How to calculate precision, recall, F1-score, ROC AUC, and more with the scikit-learn API for a model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


39 Responses to How to Calculate Precision, Recall, F1, and More for Deep Learning Models

  1. JG April 3, 2019 at 10:14 pm #

    Very useful scikit-learn library modules (API), which avoid having to construct and develop your own functions. Thanks!

    I would appreciate if you can add to this snippet (example) the appropriate code to plot (to visualize) the ROC Curves, confusion matrix, (to determine the best threshold probability to decide where to put the “marker” to decide when it is positive or negative or 0/1).

    Also, I understand that those metrics (F1, precision, recall, ROC curve) only apply to binary classification? But I know Cohen’s kappa and the confusion matrix also apply to multiclass problems. Thank you.

  2. scander90 May 2, 2019 at 7:27 pm #

    I used the code below to get the F1-score for the model:

    nn = MLPClassifier(activation='relu', alpha=0.01, hidden_layer_sizes=(20, 10))
    print("F1-Score by Neural Network, threshold =", threshold, ":", predict(nn, train, y_train, test, y_test))

    Now I want to get all the other metrics, the accuracy and prediction results, with a plot, but I don't know how to do that with the code above.

    • Jason Brownlee May 3, 2019 at 6:19 am #

      What problem are you having exactly?

      • scander90 May 4, 2019 at 1:49 pm #

        Thank you so much for your support.

        from sklearn.neural_network import MLPClassifier
        threshold = 200
        train, y_train, test, y_test = prep(data, threshold)
        nn = MLPClassifier(activation='relu', alpha=0.01, hidden_layer_sizes=(20, 10))
        print("F1-Score by Neural Network, threshold =", threshold, ":", predict(nn, train, y_train, test, y_test))

        I used the code above, which I got from your website, to get the F1-score of the model. Now I am looking to get the accuracy, precision, and recall for the same model.

  3. Thb DL May 2, 2019 at 7:53 pm #

    Hello, thank you very much for your website, it helps a lot !

    I have a problem related to this post; maybe you can help me 🙂

    I am trying to understand why I obtain different metrics using “model.evaluate” vs “model.predict” followed by computing the metrics…

    I work on semantic segmentation.

    I have an evaluation set of 24 images.

    I have a custom DICE INDEX metrics defined as :

    def dice_coef(y_true, y_pred):
        y_true_f = K.flatten(y_true)
        y_pred_f = K.flatten(y_pred)
        intersection = K.sum(y_true_f * y_pred_f)
        # smoothing term of 1 added to both the numerator and the denominator
        result = (2 * intersection + 1) / (K.sum(y_true_f) + K.sum(y_pred_f) + 1)
        return result

    When I use model.evaluate, I obtain a dice score of 0.9093835949897766.

    When I use model.predict and then compute the metrics, I obtain a dice score of 0.9092264051238695.

    To be more precise: I set a batch size of 24 in model.predict as well as in model.evaluate to be sure the problem is not caused by batch size. I do not know what happens when the batch size is larger (e.g. 32) than the number of samples in the evaluation set…

    Finally, to compute the metrics after model.predict, I run:

    dice_result = 0
    for y_i in range(len(y)):
    dice_result += tf.Session().run(tf.cast(dice_coef(y[y_i], preds[y_i]),
    dice_result /= (len(y))

    I thought the tf.float32 casting might be the cause of the difference?
    (Maybe “model.evaluate” computes everything with TensorFlow tensors and returns a float at the end, whereas I cast the tensor to float32 on every loop iteration?)

    Can you think of an explanation?

    Thank you for your help.

    Cheers !


    • Jason Brownlee May 3, 2019 at 6:20 am #

      I suspect the evaluate score is averaging across batches.

      Perhaps use predict, then calculate the score on all predictions.

      • Thb DL May 7, 2019 at 5:42 am #

        Thank you for your reply.

        I just have 24 images in my evaluation set, so if “model.evaluate” computes across batches, with a batch size of 24, it will compute the metric in one pass over the whole evaluation set. So it should normally give the same results as “model.predict” followed by the metric computation on the evaluation set?

        That’s why I do not understand my differences here.

        Have a good day.


        • Jason Brownlee May 7, 2019 at 6:21 am #

          I recommend calling predict, then calling the sklearn metric of choice with the results.

          • Thb DL May 9, 2019 at 6:46 pm #

            Ok 🙂

            If I finally decide not to use my own dice score, but rather to trust sklearn, is it possible to use this library with Keras during training?
            Indeed, at the end of training I get a graph showing the loss and the dice score over the epochs.
            I would like these graphs to be consistent with the final results.

            Thanks again for help!

            Have a good day


          • Jason Brownlee May 10, 2019 at 8:15 am #

            I would expect the graphs to be a fair summary of the training performance.

            For presenting an algorithm, I recommend using a final model to make predictions, and plot the results anew.

          • Thb DL May 10, 2019 at 12:06 am #

            Ok, I worked on this today.

            I fixed this problem. Just in case someone else has a similar problem:

            The cause was that when I resized my ground truth masks before feeding them to the network, I did not threshold after the resizing, so I got values other than 0 and 1 at the edges, and my custom dice score gave bad results.

            Now I apply the threshold just after the resizing and get the same results from all the functions I use!

            Also, be careful with type casting (float32 vs float64 vs int)!

            Anyway, thank you very much for your availability.

            Have a good day

          • Jason Brownlee May 10, 2019 at 8:18 am #

            Well done!

  4. Jianhong Cheng May 14, 2019 at 11:02 am #

    How to calculate Precision, Recall, F1, and AUC for multi-class classification Problem

    • Jason Brownlee May 14, 2019 at 2:29 pm #

      You can use the same approach, the scores are averaged across the classes.

      • Erica Rac July 17, 2019 at 5:34 am #

        Your lessons are extremely informative, Professor. I am trying to use this approach to calculate the F1 score for a multi-class classification problem but I keep receiving the error message:
        “ValueError: Classification metrics can’t handle a mix of multilabel-indicator and binary targets” I would very much appreciate if you please guide me to what I am doing wrong? Here is the relevant code:

        # generate and prepare the dataset
        def get_data():
            n_test = 280
            Xtrain, Xtest = X[:n_test, :], X[n_test:, :]
            ytrain, ytest = y[:n_test], y[n_test:]
            return Xtrain, ytrain, Xtest, ytest

        # define and fit the model
        def get_model(Xtrain, ytrain):
            model = Sequential()
            model.add(Embedding(max_words, embedding_dim, input_length=max_sequence_length))
            model.add(LSTM(150, dropout=0.2, recurrent_dropout=0.2))
            model.add(Dense(5, activation='softmax'))
            model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
            model.fit(Xtrain, ytrain, epochs=2, batch_size=15, callbacks=[EarlyStopping(monitor='loss')])
            return model

        # generate data
        X_train, y_train, X_test, y_test = get_data()

        # fit model
        model = get_model(X_train, y_train)

        # predict probabilities for test set
        yhat_probs = model.predict(X_test, verbose=0)

        # predict crisp classes for test set
        yhat_classes = model.predict_classes(X_test, verbose=0)

        # reduce to 1d array
        yhat_probs = yhat_probs.flatten()
        yhat_classes = yhat_classes.flatten()

        # accuracy: (tp + tn) / (p + n)
        accuracy = accuracy_score(y_test, yhat_classes)
        print('Accuracy: %f' % accuracy)
        # precision: tp / (tp + fp)
        precision = precision_score(y_test, yhat_classes)
        print('Precision: %f' % precision)
        # recall: tp / (tp + fn)
        recall = recall_score(y_test, yhat_classes)
        print('Recall: %f' % recall)
        # f1: 2 tp / (2 tp + fp + fn)
        f1 = f1_score(y_test, yhat_classes)
        print('F1 score: %f' % f1)

        # kappa
        kappa = cohen_kappa_score(y_test, yhat_classes)
        print('Cohens kappa: %f' % kappa)
        # ROC AUC
        auc = roc_auc_score(y_test, yhat_probs)
        print('ROC AUC: %f' % auc)
        # confusion matrix
        matrix = confusion_matrix(y_test, yhat_classes)

        • Jason Brownlee July 17, 2019 at 8:31 am #

          Perhaps check your data matches the expectation of the measures you intend to use?

          • Erica Rac July 17, 2019 at 11:51 am #

            I see my error in preprocessing. Thanks for the quick reply!

          • Jason Brownlee July 17, 2019 at 2:24 pm #

            Happy to hear that.

  5. Despina M May 17, 2019 at 4:03 am #

    Hello! Another great post of you! Thank you!

    I want to calculate Precision, Recall, F1 for every class not only the average. Is it possible?

    Thank you in advance

    • Jason Brownlee May 17, 2019 at 5:59 am #

      Yes, I believe the sklearn classification report will provide this information.

      I also suspect you can configure the sklearn functions for each metric to report per-class scores.

      • Despina M May 17, 2019 at 6:49 am #

        Thank you so much for the quick answer! I will try to calculate them.

  6. Despina M May 19, 2019 at 6:07 am #

    I used

    from sklearn.metrics import precision_recall_fscore_support

    precision_recall_fscore_support(y_test, y_pred, average=None)

    print(classification_report(y_test, y_pred, labels=[0, 1]))

    It works fine for me.

    Thanks again!

  7. Vani May 23, 2019 at 10:28 pm #

    Why does the accuracy calculated using “history.history['val_acc']” give different values compared to the accuracy calculated using “accuracy = accuracy_score(testy, yhat_classes)”?

    • Jason Brownlee May 24, 2019 at 7:51 am #

      It should be the same, e.g. calculate score at the end of each epoch.

      • Vani June 3, 2019 at 1:51 pm #

        thank you

  8. usama May 25, 2019 at 9:53 pm #

    hi jason, i need a code of RNN through which i can find out the classification and confusion matrix of a specific dataset.

  9. vani venk June 3, 2019 at 1:57 pm #

    I calculated accuracy, precision,recall and f1 using following formulas.

    accuracy = metrics.accuracy_score(true_classes, predicted_classes)
    precision=metrics.precision_score(true_classes, predicted_classes)
    recall=metrics.recall_score(true_classes, predicted_classes)
    f1=metrics.f1_score(true_classes, predicted_classes)

    The metrics stay at a very low value of around 49% to 52%, even after increasing the number of nodes and performing all kinds of tweaking.

                 precision  recall  f1-score  support
    nu                0.49    0.34      0.40     2814
    u                 0.50    0.65      0.56     2814
    avg / total       0.49    0.49      0.48     5628

    The confusion matrix shows very high values of FP and FN
    confusion= [[ 953 1861]
    [ 984 1830]]

    What can I do to improve the performance?

    • Vani June 3, 2019 at 2:02 pm #

      For the low values of accuracy, precision, recall and F1, the accuracy and loss plot is also weird.
      The accuracy of validation dataset remains higher than training dataset; similarly, the validation loss remains lower than that of training dataset; whereas the reverse is expected.
      How to overcome this problem?

      • Jason Brownlee June 3, 2019 at 2:35 pm #

        Better results on the test set than the training set may suggest that the test set is not representative of the problem, e.g. is too small.

    • Jason Brownlee June 3, 2019 at 2:35 pm #

      I offer some suggestions here:

  10. onyeka July 27, 2019 at 5:23 pm #

    ValueError: Error when checking input: expected dense_74_input to have shape (2,) but got array with shape (10,)

    i got this error and i dont know what to do next
