How To Implement Machine Learning Metrics From Scratch in Python

By Jason Brownlee on August 13, 2019 in Code Algorithms From Scratch

After you make predictions, you need to know if they are any good.

There are standard measures that we can use to summarize how good a set of predictions actually are.

Knowing how good a set of predictions is allows you to estimate how good a given machine learning model of your problem is.

In this tutorial, you will discover how to implement four standard prediction evaluation metrics from scratch in Python.

After reading this tutorial, you will know:

How to implement classification accuracy.
How to implement and interpret a confusion matrix.
How to implement mean absolute error for regression.
How to implement root mean squared error for regression.

Kick-start your project with my new book Machine Learning Algorithms From Scratch, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Aug/2018: Tested and updated to work with Python 3.6.

How To Implement Machine Learning Algorithm Performance Metrics From Scratch With Python
Photo by Hernán Piñera, some rights reserved.

Description

You must estimate the quality of a set of predictions when training a machine learning model.

Performance metrics like classification accuracy and root mean squared error can give you a clear objective idea of how good a set of predictions is, and in turn how good the model is that generated them.

This is important as it allows you to tell the difference and select among:

Different transforms of the data used to train the same machine learning model.
Different machine learning models trained on the same data.
Different configurations for a machine learning model trained on the same data.

As such, performance metrics are a required building block in implementing machine learning algorithms from scratch.

Tutorial

This tutorial is divided into 4 parts:

1. Classification Accuracy.
2. Confusion Matrix.
3. Mean Absolute Error.
4. Root Mean Squared Error.

These steps will provide the foundations you need to handle evaluating predictions made by machine learning algorithms.

1. Classification Accuracy

A quick way to evaluate a set of predictions on a classification problem is by using accuracy.

Classification accuracy is a ratio of the number of correct predictions out of all predictions that were made.

It is often presented as a percentage between 0% for the worst possible accuracy and 100% for the best possible accuracy.

accuracy = correct predictions / total predictions * 100

We can implement this in a function that takes the expected outcomes and the predictions as arguments.

Below is this function named accuracy_metric() that returns classification accuracy as a percentage. Notice that we use “==” to compare actual and predicted values for equality. This allows us to compare integers or strings, two main data types that we may choose to use when loading classification data.

# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0
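
Because the comparison uses “==”, the same function should work unchanged on string class labels. The sketch below repeats the function so that it runs on its own; the 'cat' and 'dog' labels are hypothetical values chosen for illustration and are not part of the tutorial's dataset.

```python
# Same calculation as accuracy_metric() above, repeated so the sketch is self-contained
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Hypothetical string class labels (assumption: not from the tutorial's data)
actual = ['cat', 'cat', 'dog', 'dog', 'dog']
predicted = ['cat', 'dog', 'dog', 'dog', 'cat']
accuracy = accuracy_metric(actual, predicted)
print(accuracy)
```

Three of the five labels match, so the result is 60%.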

We can contrive a small dataset to test this function. Below is a set of 10 actual and predicted integer values. There are two mistakes in the set of predictions.

actual          predicted
0		0
0		1
0		0
0		0
0		0
1		1
1		0
1		1
1		1
1		1

Below is a complete example with this dataset to test the accuracy_metric() function.

# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
	correct = 0
	for i in range(len(actual)):
		if actual[i] == predicted[i]:
			correct += 1
	return correct / float(len(actual)) * 100.0

# Test accuracy
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,0,0,0,1,0,1,1,1]
accuracy = accuracy_metric(actual, predicted)
print(accuracy)

Running this example produces the expected accuracy of 80% or 8/10.

80.0

Accuracy is a good metric to use when you have a small number of class values, such as 2, also called a binary classification problem.

Accuracy starts to lose its meaning when you have more class values, and you may need to review a different perspective on the results, such as a confusion matrix.

2. Confusion Matrix

A confusion matrix provides a summary of all of the predictions made compared to the expected actual values.

The results are presented in a matrix with counts in each cell. In the layout used in this tutorial, the actual class values run across the top as columns, whereas the predicted class values run down the side as rows.

A perfect set of predictions is shown as a diagonal line from the top left to the bottom right of the matrix.

The value of a confusion matrix for classification problems is that you can clearly see which predictions were wrong and the type of mistake that was made.

Let’s create a function to calculate a confusion matrix.

We can start off by defining the function to calculate the confusion matrix given a list of actual class values and a list of predictions.

The function is listed below and is named confusion_matrix(). It first makes a list of all of the unique class values and assigns each class value a unique integer or index into the confusion matrix.

The confusion matrix is always square, with the number of class values indicating the number of rows and columns required.

Here, the first index into the matrix is the row for predicted values and the second is the column for actual values. After the square confusion matrix is created and initialized to zero counts in each cell, it is a matter of looping through all predictions and incrementing the count in each cell.

The function returns two objects. The first is the set of unique class values, so that they can be displayed when the confusion matrix is drawn. The second is the confusion matrix itself with the counts in each cell.

# calculate a confusion matrix
def confusion_matrix(actual, predicted):
	unique = set(actual)
	matrix = [list() for x in range(len(unique))]
	for i in range(len(unique)):
		matrix[i] = [0 for x in range(len(unique))]
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for i in range(len(actual)):
		x = lookup[actual[i]]
		y = lookup[predicted[i]]
		matrix[y][x] += 1
	return unique, matrix
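
The function above builds its lookup from the actual values only, so a predicted class that never appears in the actual list would raise a KeyError, and the order of a set is not something to rely on. Below is a minimal sketch of one possible variant, not the tutorial's method. The name confusion_matrix_sorted() is hypothetical, the third class value 2 in the test data is made up, and the sketch assumes the class labels can be sorted together.

```python
# Sketch: confusion matrix that also covers classes seen only in the predictions
def confusion_matrix_sorted(actual, predicted):
    # Sort the union of labels so the row and column order is predictable
    labels = sorted(set(actual) | set(predicted))
    lookup = {value: i for i, value in enumerate(labels)}
    matrix = [[0 for _ in labels] for _ in labels]
    for a, p in zip(actual, predicted):
        # Same orientation as the code above: row = predicted, column = actual
        matrix[lookup[p]][lookup[a]] += 1
    return labels, matrix

# Hypothetical data: the last prediction is a class (2) that never occurs in actual
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 2]
labels, matrix = confusion_matrix_sorted(actual, predicted)
print(labels)
print(matrix)
```

The enumerate() call is what assigns each class value its row and column index, which matters as soon as the labels are strings or do not start at zero.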

Let’s make this concrete with an example.

Below is another contrived dataset, this time with 3 mistakes.

actual     	predicted
0		0
0		1
0		1
0		0
0		0
1		1
1		0
1		1
1		1
1		1

We can calculate and print the confusion matrix for this dataset as follows:

# Example of Calculating a Confusion Matrix

# calculate a confusion matrix
def confusion_matrix(actual, predicted):
	unique = set(actual)
	matrix = [list() for x in range(len(unique))]
	for i in range(len(unique)):
		matrix[i] = [0 for x in range(len(unique))]
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for i in range(len(actual)):
		x = lookup[actual[i]]
		y = lookup[predicted[i]]
		matrix[y][x] += 1
	return unique, matrix

# Test confusion matrix with integers
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
unique, matrix = confusion_matrix(actual, predicted)
print(unique)
print(matrix)

Running the example produces the output below. The example first prints the list of unique values and then the confusion matrix.

{0, 1}
[[3, 1], [2, 4]]

It’s hard to interpret the results this way. It would help if we could display the matrix as intended with rows and columns.

Below is a function to correctly display the matrix.

The function is named print_confusion_matrix(). It labels the header row of column names with A for Actual and the rows with P for Predictions. Each column and row is named for the class value to which it corresponds.

The matrix is laid out with the expectation that each class label is a single character or single digit integer and that the counts are also single digit integers. You could extend it to handle large class labels or prediction counts as an exercise.

# pretty print a confusion matrix
def print_confusion_matrix(unique, matrix):
	print('(A)' + ' '.join(str(x) for x in unique))
	print('(P)---')
	for i, x in enumerate(unique):
		print("%s| %s" % (x, ' '.join(str(x) for x in matrix[i])))
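
As one possible take on the exercise of handling longer class labels or larger counts, the sketch below pads every cell to the widest label or count. The name format_confusion_matrix() and the 'cat' and 'dog' data are hypothetical. It returns a string instead of printing directly, which is a design choice made here so the layout is easy to check.

```python
# Sketch: format a confusion matrix with cells padded to the widest label or count
def format_confusion_matrix(unique, matrix):
    labels = [str(x) for x in unique]
    cells = [[str(count) for count in row] for row in matrix]
    # Cell width = longest label or count that has to fit
    width = max(len(s) for s in labels + [c for row in cells for c in row])
    lines = []
    # Header: actual classes across the top, as in print_confusion_matrix()
    lines.append('(A)'.ljust(width + 2) + ' '.join(s.rjust(width) for s in labels))
    lines.append('(P)'.ljust(width + 2) + '-' * ((width + 1) * len(labels) - 1))
    # One row per predicted class
    for label, row in zip(labels, cells):
        lines.append(label.rjust(width) + '| ' + ' '.join(c.rjust(width) for c in row))
    return '\n'.join(lines)

# Hypothetical labels and counts that would not line up with single-character cells
print(format_confusion_matrix(['cat', 'dog'], [[120, 7], [15, 98]]))
```

With single-character labels and single-digit counts, the layout reduces to the same text that print_confusion_matrix() produces.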

We can piece together all of the functions and display a human readable confusion matrix.

# Example of Calculating and Displaying a Pretty Confusion Matrix

# calculate a confusion matrix
def confusion_matrix(actual, predicted):
	unique = set(actual)
	matrix = [list() for x in range(len(unique))]
	for i in range(len(unique)):
		matrix[i] = [0 for x in range(len(unique))]
	lookup = dict()
	for i, value in enumerate(unique):
		lookup[value] = i
	for i in range(len(actual)):
		x = lookup[actual[i]]
		y = lookup[predicted[i]]
		matrix[y][x] += 1
	return unique, matrix

# pretty print a confusion matrix
def print_confusion_matrix(unique, matrix):
	print('(A)' + ' '.join(str(x) for x in unique))
	print('(P)---')
	for i, x in enumerate(unique):
		print("%s| %s" % (x, ' '.join(str(x) for x in matrix[i])))

# Test confusion matrix with integers
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
unique, matrix = confusion_matrix(actual, predicted)
print_confusion_matrix(unique, matrix)

Running the example produces the output below. We can see the class labels of 0 and 1 across the top and down the left side. Looking down the diagonal of the matrix from the top left to bottom right, we can see that 3 predictions of 0 were correct and 4 predictions of 1 were correct.

Looking in the other cells, we can see 2 + 1 or 3 prediction errors. We can see that 2 predictions were made as a 1 that were in fact actually a 0 class value. And we can see 1 prediction that was a 0 that was in fact actually a 1.

(A)0 1
(P)---
0| 3 1
1| 2 4

A confusion matrix is always a good idea to use in addition to classification accuracy to help interpret the predictions.
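
The two views are directly connected: the diagonal of the confusion matrix holds the correct predictions, so classification accuracy can be recovered from it. A minimal sketch, using a hypothetical helper named accuracy_from_matrix() and the matrix from the example above:

```python
# Confusion matrix from the example above (rows and columns share the same class order)
matrix = [[3, 1], [2, 4]]

# Accuracy = sum of the diagonal (correct predictions) / sum of all cells
def accuracy_from_matrix(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / float(total) * 100.0

accuracy = accuracy_from_matrix(matrix)
print(accuracy)
```

Here 3 + 4 = 7 of the 10 predictions are on the diagonal, which gives 70%.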

3. Mean Absolute Error

Regression problems are those where a real value is predicted.

An easy metric to consider is the error in the predicted values as compared to the expected values.

The Mean Absolute Error or MAE for short is a good first error metric to use.

It is calculated as the average of the absolute error values, where “absolute” means “made positive” so that they can be added together.

MAE = sum( abs(predicted_i - actual_i) ) / total predictions

Below is a function named mae_metric() that implements this metric. As above, it expects a list of actual outcome values and a list of predictions. We use the built-in abs() Python function to calculate the absolute error values that are summed together.

# Calculate mean absolute error
def mae_metric(actual, predicted):
	sum_error = 0.0
	for i in range(len(actual)):
		sum_error += abs(predicted[i] - actual[i])
	return sum_error / float(len(actual))

We can contrive a small regression dataset to test this function.

actual 		predicted
0.1		0.11
0.2		0.19
0.3		0.29
0.4		0.41
0.5		0.5

Only one prediction (0.5) is correct, whereas all other predictions are wrong by 0.01. Therefore, we would expect the mean absolute error (or the average positive error) for these predictions to be a little less than 0.01.

Below is an example that tests the mae_metric() function with the contrived dataset.

# Calculate mean absolute error
def mae_metric(actual, predicted):
	sum_error = 0.0
	for i in range(len(actual)):
		sum_error += abs(predicted[i] - actual[i])
	return sum_error / float(len(actual))

# Test MAE
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
mae = mae_metric(actual, predicted)
print(mae)

Running this example prints the output below. We can see that as expected, the MAE was about 0.008, a small value slightly lower than 0.01.

0.007999999999999993
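
The trailing digits in this output come from floating point arithmetic, not from the metric itself: the exact value is (0.01 + 0.01 + 0.01 + 0.01 + 0) / 5 = 0.008. The sketch below shows the individual absolute errors and the mean, rounded for display. Rounding to six decimal places is an arbitrary choice made for this illustration.

```python
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]

# Absolute error for each prediction, then their mean
errors = [abs(p - a) for a, p in zip(actual, predicted)]
mae = sum(errors) / float(len(errors))

# Rounding is for display only (assumption: six decimal places is enough here)
rounded_errors = [round(e, 6) for e in errors]
print(rounded_errors)
print(round(mae, 6))
```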

4. Root Mean Squared Error

Another popular way to calculate the error in a set of regression predictions is to use the Root Mean Squared Error.

Shortened as RMSE, the metric is closely related to Mean Squared Error or MSE, which drops the Root part from both the calculation and the name.

RMSE is calculated as the square root of the mean of the squared differences between actual outcomes and predictions.

Squaring each error forces the values to be positive, and the square root of the mean squared error returns the error metric back to the original units for comparison.

RMSE = sqrt( sum( (predicted_i - actual_i)^2 ) / total predictions)

Below is an implementation of this in a function named rmse_metric(). It uses the sqrt() function from the math module and uses the ** operator to raise the error to the 2nd power.

from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
	sum_error = 0.0
	for i in range(len(actual)):
		prediction_error = predicted[i] - actual[i]
		sum_error += (prediction_error ** 2)
	mean_error = sum_error / float(len(actual))
	return sqrt(mean_error)
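
To make the relationship between MSE and RMSE concrete, the sketch below computes MSE first and then takes its square root. The helper names mse_metric() and rmse_from_mse() and the test values are hypothetical and are not part of the tutorial's code.

```python
from math import sqrt

# Hypothetical helper: mean squared error, expressed in squared units
def mse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += (predicted[i] - actual[i]) ** 2
    return sum_error / float(len(actual))

# RMSE is the square root of MSE, which returns the error to the original units
def rmse_from_mse(actual, predicted):
    return sqrt(mse_metric(actual, predicted))

# Hypothetical values with errors of 3, 4 and 5
actual = [10.0, 20.0, 30.0]
predicted = [13.0, 16.0, 35.0]
mse = mse_metric(actual, predicted)
rmse = rmse_from_mse(actual, predicted)
print(mse)
print(rmse)
```

The squared errors are 9, 16 and 25, so MSE is 50 / 3 in squared units, and RMSE is its square root, a little over 4 in the original units.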

We can test this metric on the same dataset used to test the calculation of Mean Absolute Error above.

Below is a complete example. Again, we would expect an error value to be generally close to 0.01.

from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
	sum_error = 0.0
	for i in range(len(actual)):
		prediction_error = predicted[i] - actual[i]
		sum_error += (prediction_error ** 2)
	mean_error = sum_error / float(len(actual))
	return sqrt(mean_error)

# Test RMSE
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
rmse = rmse_metric(actual, predicted)
print(rmse)

Running the example, we see the results below. The result is slightly higher at 0.0089.

RMSE values are never lower than MAE values on the same predictions, and the gap becomes more pronounced as individual prediction errors grow. This is a benefit of using RMSE over MAE in that it penalizes larger errors with worse scores.

0.00894427190999915
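
The sketch below illustrates how RMSE reacts to one large error. It compares the tutorial's predictions with a hypothetical second set that has the same total absolute error of 0.04, but concentrated in a single prediction. The list names even and skewed are made up for this example.

```python
from math import sqrt

# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += abs(predicted[i] - actual[i])
    return sum_error / float(len(actual))

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += (predicted[i] - actual[i]) ** 2
    return sqrt(sum_error / float(len(actual)))

actual = [0.1, 0.2, 0.3, 0.4, 0.5]
# The tutorial's predictions: four errors of 0.01
even = [0.11, 0.19, 0.29, 0.41, 0.5]
# Hypothetical predictions: one error of 0.04 and four exact values
skewed = [0.1, 0.2, 0.3, 0.4, 0.54]

print(mae_metric(actual, even), rmse_metric(actual, even))
print(mae_metric(actual, skewed), rmse_metric(actual, skewed))
```

MAE is about 0.008 for both sets, whereas RMSE roughly doubles for the second set, from about 0.0089 to about 0.0179.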

Extensions

You have only seen a small sample of the most widely used performance metrics.

There are many other performance metrics that you may require.

Below is a list of 5 additional performance metrics that you may wish to implement to extend this tutorial:

Precision for classification.
Recall for classification.
F1 for classification.
Area Under ROC Curve or AUC for classification.
Goodness of Fit or R^2 (R squared) for regression.
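
As a possible starting point for the first three extensions, here is a minimal sketch for a binary problem, written in the same style as the functions above. The function name precision_recall_f1(), the positive parameter, and the choice to return 0.0 when a denominator is zero are all assumptions made for this sketch.

```python
# Sketch: precision, recall and F1 for one class treated as the positive class
def precision_recall_f1(actual, predicted, positive=1):
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
    # Assumption: report 0.0 instead of dividing by zero
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Same contrived dataset used for the confusion matrix example
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```

For this dataset there are 4 true positives, 2 false positives and 1 false negative, so precision is 4/6, recall is 4/5 and F1 is 8/11.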

Did you implement any of these extensions?
Share your experiences in the comments below.

Review

In this tutorial, you discovered how to implement algorithm prediction performance metrics from scratch in Python.

Specifically, you learned:

How to implement and interpret classification accuracy.
How to implement and interpret the confusion matrix for classification problems.
How to implement and interpret mean absolute error for regression.
How to implement and interpret root mean squared error for regression.

Do you have any questions?
Ask your questions in the comments and I will do my best to answer.

17 Responses to How To Implement Machine Learning Metrics From Scratch in Python

Joao Pires October 25, 2016 at 2:42 am #

I think accuracy metrics could give wrong idea of the results. Mostly because of the false positives.
It’s easy to implement, but in my opinion maybe isn’t the best choice, despite the fact there are many developers which use it.

- Jason Brownlee October 25, 2016 at 8:30 am #
  
  I agree Joao, often for classification logloss, kappa or even F1 are better measures to use.
  
  Accuracy is a great place to start though, especially for beginners.
  
Joao Pires October 25, 2016 at 2:49 am #

Another comment related to ML algorithms validation … the best option is to divide the data in three sets: test, train and validation, because only test and train could direct the learning system to cheat.
Once more, there are few authors which use this method.

- Jason Brownlee October 25, 2016 at 8:31 am #
  
  Hi Joao, 3 sets is a good practice, if you have the data to spare.
  
Faisal February 8, 2019 at 2:32 am #

Hi Jason,

In the absence of any published result how to verify that the Mean Square Error (MSE) I’m getting is good. For example in one problem the MSE is 400 – 500 when the actual values are in the range of 953 and 1616. In another problem MSE is around 25-30 when the range of actual values are between 0 and 85. How I can tell I’m getting good MSE? I thought about normalization but again I believe it will be just scaled down.

What about Pred measure which I saw in some papers? Is it going to help in finding the good MSE value?

- Jason Brownlee February 8, 2019 at 7:54 am #
  
  Great question, I answer it here:
  https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance
  
Faisal February 8, 2019 at 9:26 am #

Thanks for a very quick reply.

Reading above and other articles I think I need to implement zero rule algorithm and can see my baseline.

In an article of baseline timeseries forecasting, you get MSE = 17730 using naïve forecast which seems quite high. So how to know lower limit. Is 15000 good or 150 is good?

What about naïve forecast for multivariate time series?

- Jason Brownlee February 8, 2019 at 2:07 pm #
  
  The lowest limit is zero error.
  
  Good is only measured as compared to the naive method. That is the best we can do. Push this limit down with more sophisticated naive forecasts.
  
  - Faisal February 9, 2019 at 1:10 am #
    
    Thanks. I’ll try and will come back if there are some more questions.
    
Bhaskar Tripathi May 23, 2019 at 3:39 pm #

How do we calculate Normalized Mean Square Error ? I could not find that in Scikit documentation? Could you please help on that ?

- Jason Brownlee May 24, 2019 at 7:46 am #
  
  I believe you are referring to root mean squared error or RMSE.
  
  You can calculate the MSE and then calculate the square root of the result.
  
San February 13, 2020 at 5:48 pm #

why is enumerate in confusion matrix code is so meaningful? i tried to run without it and it gives same results

- Jason Brownlee February 14, 2020 at 6:29 am #
  
  Sorry, I don’t understand. Can you elaborate?
  
Ferhat June 13, 2020 at 12:48 am #

Firstly, I’ve normalized the data between 0 and 1 since data values in the range of 2k and 3k. After this process, RMSE: 0.184907340558712. I’m wondering , Can i interpret this score like 18%. If I can’t, how should i interpret this score.

- Jason Brownlee June 13, 2020 at 6:08 am #
  
  No.
  
  But you can invert the transform on the prediction, then calculate the error which will give a value you can interpret in the context of the original variable/units.
  
Sunny August 16, 2022 at 12:27 am #

How to calculate TPR FPR ROC without using sklearn?

- James Carmichael August 16, 2022 at 9:44 am #
  
  Hi Sunny…The following discussion may be of interest to you:
  
  https://stackoverflow.com/questions/61321778/how-to-calculate-tpr-and-fpr-in-python-without-using-sklearn
  
