How To Implement Machine Learning Metrics From Scratch in Python

Last Updated on August 13, 2019

After you make predictions, you need to know if they are any good.

There are standard measures that we can use to summarize how good a set of predictions actually is.

Knowing how good a set of predictions is allows you to estimate how good a given machine learning model of your problem is.

In this tutorial, you will discover how to implement four standard prediction evaluation metrics from scratch in Python.

After reading this tutorial, you will know:

  • How to implement classification accuracy.
  • How to implement and interpret a confusion matrix.
  • How to implement mean absolute error for regression.
  • How to implement root mean squared error for regression.

Let’s get started.

  • Update Aug/2018: Tested and updated to work with Python 3.6.

How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python
Photo by Hernán Piñera, some rights reserved.


You must estimate the quality of a set of predictions when training a machine learning model.

Performance metrics like classification accuracy and root mean squared error can give you a clear, objective idea of how good a set of predictions is, and in turn how good the model that generated them is.

This is important as it allows you to tell the difference and select among:

  • Different transforms of the data used to train the same machine learning model.
  • Different machine learning models trained on the same data.
  • Different configurations for a machine learning model trained on the same data.

As such, performance metrics are a required building block in implementing machine learning algorithms from scratch.


This tutorial is divided into 4 parts:

  • 1. Classification Accuracy.
  • 2. Confusion Matrix.
  • 3. Mean Absolute Error.
  • 4. Root Mean Squared Error.

These steps will provide the foundations you need to handle evaluating predictions made by machine learning algorithms.

1. Classification Accuracy

A quick way to evaluate a set of predictions on a classification problem is by using accuracy.

Classification accuracy is the ratio of the number of correct predictions to the total number of predictions made.

It is often presented as a percentage between 0% for the worst possible accuracy and 100% for the best possible accuracy.

We can implement this in a function that takes the expected outcomes and the predictions as arguments.

Below is this function named accuracy_metric() that returns classification accuracy as a percentage. Notice that we use “==” to compare actual and predicted values for equality. This allows us to compare integers or strings, the two main data types that we may choose to use when loading classification data.
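A minimal sketch of accuracy_metric(), assuming both lists have the same length and are not empty. The length check and the variable names are illustrative choices rather than anything the description requires.

```python
# Calculate classification accuracy as a percentage
def accuracy_metric(actual, predicted):
    # Assumption: a length mismatch is treated as an error
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    correct = 0
    for expected, prediction in zip(actual, predicted):
        # "==" works for integer labels and string labels alike
        if expected == prediction:
            correct += 1
    return correct / float(len(actual)) * 100.0

# String labels: two of the three predictions match
string_accuracy = accuracy_metric(['cat', 'dog', 'dog'], ['cat', 'dog', 'cat'])
print(string_accuracy)
```

Because the comparison uses “==”, the same function can score the string labels above or integer class values without any changes.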

We can contrive a small dataset to test this function. Below is a set of 10 actual and predicted integer values. There are two mistakes in the set of predictions.

Below is a complete example with this dataset to test the accuracy_metric() function.
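The sketch below is one way such a test could look. The ten values are an assumption, chosen only so that exactly two predictions disagree with the actual values.

```python
# Calculate classification accuracy as a percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for expected, prediction in zip(actual, predicted):
        if expected == prediction:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Contrived labels (assumed values): positions 1 and 6 are the two mistakes
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_metric(actual, predicted)
print(accuracy)
```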

Running this example produces the expected accuracy of 80% or 8/10.

Accuracy is a good metric to use when you have a small number of class values, such as 2, also called a binary classification problem.

Accuracy starts to lose its meaning when you have more class values, and you may need a different perspective on the results, such as a confusion matrix.

2. Confusion Matrix

A confusion matrix provides a summary of all of the predictions made compared to the expected actual values.

The results are presented in a matrix with counts in each cell. The counts of actual class values are summarized horizontally, whereas the counts of predictions for each class value are presented vertically.

A perfect set of predictions is shown as a diagonal line from the top left to the bottom right of the matrix.

The value of a confusion matrix for classification problems is that you can clearly see which predictions were wrong and the type of mistake that was made.

Let’s create a function to calculate a confusion matrix.

We can start off by defining the function to calculate the confusion matrix given a list of actual class values and a list of predictions.

The function is listed below and is named confusion_matrix(). It first makes a list of all of the unique class values and assigns each class value a unique integer or index into the confusion matrix.

The confusion matrix is always square, with the number of class values indicating the number of rows and columns required.

Here, the first index into the matrix is the row for actual values and the second is the column for predicted values. After the square confusion matrix is created and initialized to zero counts in each cell, it is a matter of looping through all predictions and incrementing the count in each cell.

The function returns two objects. The first is the set of unique class values, so that they can be displayed when the confusion matrix is drawn. The second is the confusion matrix itself with the counts in each cell.
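A minimal sketch of confusion_matrix() following this description. Two details are assumptions made here: the unique class values are returned as a sorted list (so the row and column order is predictable), and labels that appear only in the predictions are also given a row and column. Sorting assumes the labels are mutually comparable, for example all integers or all strings.

```python
# Build a confusion matrix: rows are actual classes, columns are predicted classes
def confusion_matrix(actual, predicted):
    # Assumption: sorted list of every label seen in either input
    unique = sorted(set(actual) | set(predicted))
    # Map each class value to its index in the matrix
    lookup = {value: index for index, value in enumerate(unique)}
    # Square matrix initialized to zero counts
    size = len(unique)
    matrix = [[0 for _ in range(size)] for _ in range(size)]
    # Increment the cell for each (actual, predicted) pair
    for expected, prediction in zip(actual, predicted):
        matrix[lookup[expected]][lookup[prediction]] += 1
    return unique, matrix

labels, counts = confusion_matrix(['a', 'b', 'a'], ['a', 'a', 'a'])
print(labels)
print(counts)
```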

Let’s make this concrete with an example.

Below is another contrived dataset, this time with 3 mistakes.

We can calculate and print the confusion matrix for this dataset as follows:
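A sketch of this step, with the same assumptions as before: rows index actual values, columns index predictions, and the unique values come back as a sorted list. The ten labels are assumed values containing exactly three mistakes.

```python
# Build a confusion matrix: rows are actual classes, columns are predicted classes
def confusion_matrix(actual, predicted):
    unique = sorted(set(actual) | set(predicted))
    lookup = {value: index for index, value in enumerate(unique)}
    matrix = [[0] * len(unique) for _ in unique]
    for expected, prediction in zip(actual, predicted):
        matrix[lookup[expected]][lookup[prediction]] += 1
    return unique, matrix

# Contrived labels (assumed values) with 3 mistakes
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

unique, matrix = confusion_matrix(actual, predicted)
print(unique)
print(matrix)
```

With these values the sketch prints [0, 1] followed by [[3, 2], [1, 4]].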

Running the example produces the output below. The example first prints the list of unique values and then the confusion matrix.

It’s hard to interpret the results this way. It would help if we could display the matrix as intended with rows and columns.

Below is a function to correctly display the matrix.

The function is named print_confusion_matrix(). It names the columns as P for Predictions and the rows as A for Actual. Each column and row is named for the class value to which it corresponds.

The matrix is laid out with the expectation that each class label is a single character or single digit integer and that the counts are also single digit integers. You could extend it to handle large class labels or prediction counts as an exercise.
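One possible layout is sketched below. The exact characters used for the header and the rule line are assumptions, and format_confusion_matrix() is a hypothetical helper introduced here so that the text can be built separately from printing it.

```python
# Build the display text (hypothetical helper)
# Assumes single-character labels and single-digit counts
def format_confusion_matrix(unique, matrix):
    # Header row: predicted class labels across the top
    lines = ['(P)' + ' '.join(str(label) for label in unique)]
    # Rule line under the header, marking rows as actual classes
    lines.append('(A)' + '-' * (2 * len(unique) - 1))
    # One row per actual class: label, separator, then the counts
    for label, row in zip(unique, matrix):
        lines.append('%s| %s' % (label, ' '.join(str(count) for count in row)))
    return '\n'.join(lines)

# Print a confusion matrix with labeled rows and columns
def print_confusion_matrix(unique, matrix):
    print(format_confusion_matrix(unique, matrix))

# Hand-written matrix used only to show the layout
print_confusion_matrix(['a', 'b'], [[5, 1], [0, 4]])
```

Handling longer labels or larger counts would mean padding every cell to a common width, which is left out here to keep the sketch short.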

We can piece together all of the functions and display a human readable confusion matrix.
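A combined sketch under the same assumptions as the individual pieces: rows are actual values, columns are predictions, labels are sorted, the contrived labels are assumed values with three mistakes, and format_confusion_matrix() is a hypothetical helper.

```python
# Build a confusion matrix: rows are actual classes, columns are predicted classes
def confusion_matrix(actual, predicted):
    unique = sorted(set(actual) | set(predicted))
    lookup = {value: index for index, value in enumerate(unique)}
    matrix = [[0] * len(unique) for _ in unique]
    for expected, prediction in zip(actual, predicted):
        matrix[lookup[expected]][lookup[prediction]] += 1
    return unique, matrix

# Build the display text (hypothetical helper), single-character labels and counts only
def format_confusion_matrix(unique, matrix):
    lines = ['(P)' + ' '.join(str(label) for label in unique)]
    lines.append('(A)' + '-' * (2 * len(unique) - 1))
    for label, row in zip(unique, matrix):
        lines.append('%s| %s' % (label, ' '.join(str(count) for count in row)))
    return '\n'.join(lines)

# Print a confusion matrix with labeled rows and columns
def print_confusion_matrix(unique, matrix):
    print(format_confusion_matrix(unique, matrix))

# Contrived labels (assumed values) with 3 mistakes
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

unique, matrix = confusion_matrix(actual, predicted)
print_confusion_matrix(unique, matrix)
```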

Running the example produces the output below. We can see the class labels of 0 and 1 across the top and down the left side. Looking down the diagonal of the matrix from the top left to bottom right, we can see that 3 predictions of 0 were correct and 4 predictions of 1 were correct.

Looking in the other cells, we can see 2 + 1 or 3 prediction errors. We can see that 2 predictions were made as a 1 that were in fact a 0 class value. And we can see 1 prediction that was a 0 that was in fact a 1.

A confusion matrix is always a good idea to use in addition to classification accuracy to help interpret the predictions.

3. Mean Absolute Error

Regression problems are those where a real value is predicted.

An easy metric to consider is the error in the predicted values as compared to the expected values.

The Mean Absolute Error or MAE for short is a good first error metric to use.

It is calculated as the average of the absolute error values, where “absolute” means “made positive” so that they can be added together.

Below is a function named mae_metric() that implements this metric. As above, it expects a list of actual outcome values and a list of predictions. We use the built-in abs() Python function to calculate the absolute error values that are summed together.
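A minimal sketch of mae_metric(), assuming two equal-length, non-empty lists of numbers. The variable names are illustrative.

```python
# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for expected, prediction in zip(actual, predicted):
        # abs() makes each error positive so errors do not cancel out
        sum_error += abs(prediction - expected)
    return sum_error / float(len(actual))

# Errors are 0.5, 0.0 and 1.0, so the mean is 0.5
example_mae = mae_metric([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(example_mae)
```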

We can contrive a small regression dataset to test this function.

Only one prediction (0.5) is correct, whereas all other predictions are wrong by 0.01. Therefore, we would expect the mean absolute error (or the average positive error) for these predictions to be a little less than 0.01.

Below is an example that tests the mae_metric() function with the contrived dataset.
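A sketch of such a test. The five values are an assumption, chosen to match the description: the last prediction (0.5) is exact and each of the others is off by 0.01.

```python
# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for expected, prediction in zip(actual, predicted):
        sum_error += abs(prediction - expected)
    return sum_error / float(len(actual))

# Contrived regression data (assumed values)
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]

mae = mae_metric(actual, predicted)
print(mae)
```

Floating point rounding means the printed value may not be exactly 0.008, but it should agree to many decimal places.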

Running this example prints the output below. We can see that as expected, the MAE was about 0.008, a small value slightly lower than 0.01.

4. Root Mean Squared Error

Another popular way to calculate the error in a set of regression predictions is to use the Root Mean Squared Error.

Shortened as RMSE, the metric is closely related to Mean Squared Error or MSE, which drops the Root part from both the calculation and the name.

RMSE is calculated as the square root of the mean of the squared differences between actual outcomes and predictions.

Squaring each error forces the values to be positive, and the square root of the mean squared error returns the error metric back to the original units for comparison.

Below is an implementation of this in a function named rmse_metric(). It uses the sqrt() function from the math module and uses the ** operator to raise the error to the 2nd power.
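A minimal sketch of rmse_metric(), assuming two equal-length, non-empty lists of numbers. The variable names are illustrative.

```python
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for expected, prediction in zip(actual, predicted):
        # Squaring makes each error positive and weights large errors more heavily
        sum_error += (prediction - expected) ** 2
    mean_error = sum_error / float(len(actual))
    # The square root returns the score to the original units
    return sqrt(mean_error)

# Every prediction is off by 2, so the RMSE is 2.0
example_rmse = rmse_metric([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
print(example_rmse)
```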

We can test this metric on the same dataset used to test the calculation of Mean Absolute Error above.

Below is a complete example. Again, we would expect an error value to be generally close to 0.01.
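A sketch of the complete example. The five values are assumed: one exact prediction (0.5) and four that are each off by 0.01.

```python
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for expected, prediction in zip(actual, predicted):
        sum_error += (prediction - expected) ** 2
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Contrived regression data (assumed values)
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]

rmse = rmse_metric(actual, predicted)
print(rmse)
```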

Running the example, we see the results below. The result is slightly higher than the MAE, at about 0.0089.

RMSE values are always greater than or equal to MAE values, and the gap widens when a few predictions have much larger errors than the rest. This is a benefit of using RMSE over MAE in that it penalizes larger errors with worse scores.
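A small comparison of RMSE and MAE on the same errors can make the effect of squaring concrete. The data is contrived for illustration: both prediction sets have the same total absolute error, but in the second it is concentrated in a single prediction.

```python
from math import sqrt

# Mean absolute error
def mae_metric(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / float(len(actual))

# Root mean squared error
def rmse_metric(actual, predicted):
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / float(len(actual)))

actual = [0.0, 0.0, 0.0, 0.0]
even = [1.0, 1.0, 1.0, 1.0]      # every prediction is off by 1
uneven = [0.0, 0.0, 0.0, 4.0]    # one prediction is off by 4, the rest are exact

print(mae_metric(actual, even), rmse_metric(actual, even))
print(mae_metric(actual, uneven), rmse_metric(actual, uneven))
```

MAE is 1.0 for both sets, while RMSE rises from 1.0 to 2.0 when the error is concentrated in one prediction.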


You have only seen a small sample of the most widely used performance metrics.

There are many other performance metrics that you may require.

Below is a list of 5 additional performance metrics that you may wish to implement to extend this tutorial:

  • Precision for classification.
  • Recall for classification.
  • F1 for classification.
  • Area Under ROC Curve or AUC for classification.
  • Goodness of Fit or R^2 (R squared) for regression.

Did you implement any of these extensions?
Share your experiences in the comments below.


In this tutorial, you discovered how to implement algorithm prediction performance metrics from scratch in Python.

Specifically, you learned:

  • How to implement and interpret classification accuracy.
  • How to implement and interpret the confusion matrix for classification problems.
  • How to implement and interpret mean absolute error for regression.
  • How to implement and interpret root mean squared error for regression.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

17 Responses to How To Implement Machine Learning Metrics From Scratch in Python

  1. Joao Pires October 25, 2016 at 2:42 am

    I think accuracy metrics could give wrong idea of the results. Mostly because of the false positives.
    It’s easy to implement, but in my opinion maybe isn’t the best choice, despite the fact there are many developers which use it.

    • Jason Brownlee October 25, 2016 at 8:30 am

      I agree Joao, often for classification logloss, kappa or even F1 are better measures to use.

      Accuracy is a great place to start though, especially for beginners.

  2. Joao Pires October 25, 2016 at 2:49 am

    Another comment related to ML algorithms validation … the best option is to divide the data in three sets: test, train and validation, because only test and train could direct the learning system to cheat.
    Once more, there are few authors which use this method.

    • Jason Brownlee October 25, 2016 at 8:31 am

      Hi Joao, 3 sets is a good practice, if you have the data to spare.

  3. Faisal February 8, 2019 at 2:32 am

    Hi Jason,

    In the absence of any published result how to verify that the Mean Square Error (MSE) I’m getting is good. For example in one problem the MSE is 400 – 500 when the actual values are in the range of 953 and 1616. In another problem MSE is around 25-30 when the range of actual values are between 0 and 85. How I can tell I’m getting good MSE? I thought about normalization but again I believe it will be just scaled down.

    What about Pred measure which I saw in some papers? Is it going to help in finding the good MSE value?

  4. Faisal February 8, 2019 at 9:26 am

    Thanks for a very quick reply.

    Reading above and other articles I think I need to implement zero rule algorithm and can see my baseline.

    In an article of baseline timeseries forecasting, you get MSE = 17730 using naïve forecast which seems quite high. So how to know lower limit. Is 15000 good or 150 is good?

    What about naïve forecast for multivariate time series?

    • Jason Brownlee February 8, 2019 at 2:07 pm

      The lowest limit is zero error.

      Good is only measured as compared to the naive method. That is the best we can do. Push this limit down with more sophisticated naive forecasts.

      • Faisal February 9, 2019 at 1:10 am

        Thanks. I’ll try and will come back if there are some more questions.

  5. Bhaskar Tripathi May 23, 2019 at 3:39 pm

    How do we calculate Normalized Mean Square Error ? I could not find that in Scikit documentation? Could you please help on that ?

    • Jason Brownlee May 24, 2019 at 7:46 am

      I believe you are referring to root mean squared error or RMSE.

      You can calculate the MSE and then calculate the square root of the result.

  6. San February 13, 2020 at 5:48 pm

    why is enumerate in confusion matrix code is so meaningful? i tried to run without it and it gives same results

    • Jason Brownlee February 14, 2020 at 6:29 am

      Sorry, I don’t understand. Can you elaborate?

  7. Ferhat June 13, 2020 at 12:48 am

    Firstly, I’ve normalized the data between 0 and 1 since data values in the range of 2k and 3k. After this process, RMSE: 0.184907340558712. I’m wondering , Can i interpret this score like 18%. If I can’t, how should i interpret this score.

    • Jason Brownlee June 13, 2020 at 6:08 am


      But you can invert the transform on the prediction, then calculate the error which will give a value you can interpret in the context of the original variable/units.

  8. Sunny August 16, 2022 at 12:27 am

    How to calculate TPR FPR ROC without using sklearn?
