
After you make predictions, you need to know if they are any good.

There are standard measures that we can use to summarize how good a set of predictions actually is.

Knowing how good a set of predictions is allows you to estimate how good a given machine learning model of your problem is.

In this tutorial, you will discover how to implement four standard prediction evaluation metrics from scratch in Python.

After reading this tutorial, you will know:

- How to implement classification accuracy.
- How to implement and interpret a confusion matrix.
- How to implement mean absolute error for regression.
- How to implement root mean squared error for regression.

Let’s get started.

**Update Aug/2018**: Tested and updated to work with Python 3.6.

## Description

You must estimate the quality of a set of predictions when training a machine learning model.

Performance metrics like classification accuracy and root mean squared error can give you a clear objective idea of how good a set of predictions is, and in turn how good the model is that generated them.

This is important as it allows you to tell the difference and select among:

- Different transforms of the data used to train the same machine learning model.
- Different machine learning models trained on the same data.
- Different configurations for a machine learning model trained on the same data.

As such, performance metrics are a required building block in implementing machine learning algorithms from scratch.

## Tutorial

This tutorial is divided into 4 parts:

1. Classification Accuracy.
2. Confusion Matrix.
3. Mean Absolute Error.
4. Root Mean Squared Error.

These steps will provide the foundations you need to handle evaluating predictions made by machine learning algorithms.

### 1. Classification Accuracy

A quick way to evaluate a set of predictions on a classification problem is by using accuracy.

Classification accuracy is a ratio of the number of correct predictions out of all predictions that were made.

It is often presented as a percentage between 0% for the worst possible accuracy and 100% for the best possible accuracy.

```
accuracy = correct predictions / total predictions * 100
```

We can implement this in a function that takes the expected outcomes and the predictions as arguments.

Below is this function named **accuracy_metric()** that returns classification accuracy as a percentage. Notice that we use “==” to compare actual to predicted values for equality. This allows us to compare integers or strings, two main data types that we may choose to use when loading classification data.

```python
# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0
```

We can contrive a small dataset to test this function. Below is a set of 10 actual and predicted integer values. There are two mistakes in the set of predictions.

```
actual  predicted
0       0
0       1
0       0
0       0
0       0
1       1
1       0
1       1
1       1
1       1
```

Below is a complete example with this dataset to test the **accuracy_metric()** function.

```python
# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Test accuracy
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,0,0,0,1,0,1,1,1]
accuracy = accuracy_metric(actual, predicted)
print(accuracy)
```

Running this example produces the expected accuracy of 80% or 8/10.

```
80.0
```
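Because the comparison uses “==”, the same function should also work unchanged on string class labels. The short sketch below illustrates this; the labels are made up for illustration and are not part of the tutorial's dataset.

```python
# Calculate accuracy percentage between two lists
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Hypothetical string labels, invented for this illustration
actual = ['cat', 'cat', 'dog', 'dog', 'bird']
predicted = ['cat', 'dog', 'dog', 'dog', 'bird']

# Four of the five predictions match, so we expect about 80%
accuracy = accuracy_metric(actual, predicted)
print(accuracy)
```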

Accuracy is a good metric to use when you have a small number of class values, such as 2, also called a binary classification problem.

Accuracy starts to lose its meaning when you have more class values, and you may need a different perspective on the results, such as a confusion matrix.

### 2. Confusion Matrix

A confusion matrix provides a summary of all of the predictions made compared to the expected actual values.

The results are presented in a matrix with counts in each cell. In the layout used in this tutorial, the actual class values run across the top of the matrix as columns, whereas the predicted class values run down the side as rows.

A perfect set of predictions is shown as a diagonal line from the top left to the bottom right of the matrix.

The value of a confusion matrix for classification problems is that you can clearly see which predictions were wrong and the type of mistake that was made.

Let’s create a function to calculate a confusion matrix.

We can start off by defining the function to calculate the confusion matrix given a list of actual class values and a list of predictions.

The function is listed below and is named **confusion_matrix()**. It first makes a list of all of the unique class values and assigns each class value a unique integer or index into the confusion matrix.

The confusion matrix is always square, with the number of class values indicating the number of rows and columns required.

Here, the first index into the matrix is the row for predicted values and the second is the column for actual values. After the square confusion matrix is created and initialized to zero counts in each cell, it is a matter of looping through all predictions and incrementing the count in each cell.

The function returns two objects. The first is the set of unique class values, so that they can be displayed when the confusion matrix is drawn. The second is the confusion matrix itself with the counts in each cell.

```python
# calculate a confusion matrix
def confusion_matrix(actual, predicted):
    unique = set(actual)
    matrix = [list() for x in range(len(unique))]
    for i in range(len(unique)):
        matrix[i] = [0 for x in range(len(unique))]
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for i in range(len(actual)):
        x = lookup[actual[i]]
        y = lookup[predicted[i]]
        matrix[y][x] += 1
    return unique, matrix
```

Let’s make this concrete with an example.

Below is another contrived dataset, this time with 3 mistakes.

```
actual  predicted
0       0
0       1
0       1
0       0
0       0
1       1
1       0
1       1
1       1
1       1
```

We can calculate and print the confusion matrix for this dataset as follows:

```python
# Example of Calculating a Confusion Matrix

# calculate a confusion matrix
def confusion_matrix(actual, predicted):
    unique = set(actual)
    matrix = [list() for x in range(len(unique))]
    for i in range(len(unique)):
        matrix[i] = [0 for x in range(len(unique))]
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for i in range(len(actual)):
        x = lookup[actual[i]]
        y = lookup[predicted[i]]
        matrix[y][x] += 1
    return unique, matrix

# Test confusion matrix with integers
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
unique, matrix = confusion_matrix(actual, predicted)
print(unique)
print(matrix)
```

Running the example produces the output below. The example first prints the set of unique class values and then the confusion matrix as a list of rows.

```
{0, 1}
[[3, 1], [2, 4]]
```

It’s hard to interpret the results this way. It would help if we could display the matrix as intended with rows and columns.

Below is a function to correctly display the matrix.

The function is named **print_confusion_matrix()**. It labels the columns A for Actual and the rows P for Predictions. Each column and row is named for the class value to which it corresponds.

The matrix is laid out with the expectation that each class label is a single character or single digit integer and that the counts are also single digit integers. You could extend it to handle longer class labels or larger prediction counts as an exercise.

```python
# pretty print a confusion matrix
def print_confusion_matrix(unique, matrix):
    print('(A)' + ' '.join(str(x) for x in unique))
    print('(P)---')
    for i, x in enumerate(unique):
        print("%s| %s" % (x, ' '.join(str(x) for x in matrix[i])))
```
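One possible take on the exercise of supporting longer labels and larger counts is sketched below. The helper name `format_confusion_matrix()` and the example labels and counts are hypothetical, not part of the tutorial; the function pads every label and count to a common width and returns the lines rather than printing them, which makes the layout easy to check.

```python
# Hypothetical helper: lay out a confusion matrix with padded columns
def format_confusion_matrix(unique, matrix):
    labels = [str(x) for x in unique]
    cells = [[str(count) for count in row] for row in matrix]
    # Pad every label and count to the width of the widest one
    width = max(len(s) for s in labels + [c for row in cells for c in row])
    lines = []
    # Header: actual class labels across the top
    lines.append('(A)'.rjust(width + 2) + ' '.join(s.rjust(width) for s in labels))
    # Separator: marks the rows below as predictions
    lines.append('(P)'.ljust(width + 2 + len(labels) * (width + 1) - 1, '-'))
    # One row per predicted class
    for label, row in zip(labels, cells):
        lines.append(label.rjust(width) + '| ' + ' '.join(c.rjust(width) for c in row))
    return lines

# Made-up labels and counts, only to exercise the wider layout
for line in format_confusion_matrix(['cat', 'dog'], [[12, 3], [105, 40]]):
    print(line)
```

With single-character labels and single-digit counts, this sketch should reduce to the same layout as the simpler function.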

We can piece together all of the functions and display a human readable confusion matrix.

```python
# Example of Calculating and Displaying a Pretty Confusion Matrix

# calculate a confusion matrix
def confusion_matrix(actual, predicted):
    unique = set(actual)
    matrix = [list() for x in range(len(unique))]
    for i in range(len(unique)):
        matrix[i] = [0 for x in range(len(unique))]
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for i in range(len(actual)):
        x = lookup[actual[i]]
        y = lookup[predicted[i]]
        matrix[y][x] += 1
    return unique, matrix

# pretty print a confusion matrix
def print_confusion_matrix(unique, matrix):
    print('(A)' + ' '.join(str(x) for x in unique))
    print('(P)---')
    for i, x in enumerate(unique):
        print("%s| %s" % (x, ' '.join(str(x) for x in matrix[i])))

# Test confusion matrix with integers
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
unique, matrix = confusion_matrix(actual, predicted)
print_confusion_matrix(unique, matrix)
```

Running the example produces the output below. We can see the class labels of 0 and 1 across the top for actual values and down the side for predictions. Looking down the diagonal of the matrix from the top left to bottom right, we can see that 3 predictions of 0 were correct and 4 predictions of 1 were correct.

Looking in the other cells, we can see 2 + 1 or 3 prediction errors: 2 predictions of 1 for examples that were actually class 0, and 1 prediction of 0 for an example that was actually class 1.

```
(A)0 1
(P)---
0| 3 1
1| 2 4
```
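Since correct predictions sit on the diagonal and every prediction lands in exactly one cell, classification accuracy can be recovered from the matrix itself. The short sketch below assumes the matrix printed above and should give 7 correct out of 10, or roughly 70%.

```python
# Confusion matrix from the example above (rows: predicted, columns: actual)
matrix = [[3, 1], [2, 4]]

# Correct predictions are the counts on the diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))

# Every prediction is counted in exactly one cell
total = sum(sum(row) for row in matrix)

accuracy = correct / float(total) * 100.0
print(correct, total, accuracy)
```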

A confusion matrix is always a good idea to use in addition to classification accuracy to help interpret the predictions.
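One thing to watch with the function above is that it builds its class list from the actual values only, so a prediction of a class that never appears in the actual values would raise a `KeyError`, and the order of a set is not guaranteed for every label type. The variant below is a sketch of one way around this, assuming the labels can be sorted against each other; the name `confusion_matrix_all_labels()` and the test labels are hypothetical.

```python
# Variant that also handles labels seen only in the predictions
def confusion_matrix_all_labels(actual, predicted):
    # Sorted union of labels gives a stable, predictable ordering
    unique = sorted(set(actual) | set(predicted))
    lookup = {value: i for i, value in enumerate(unique)}
    matrix = [[0 for _ in unique] for _ in unique]
    for a, p in zip(actual, predicted):
        # First index is the predicted class (row), second is the actual class (column)
        matrix[lookup[p]][lookup[a]] += 1
    return unique, matrix

# Made-up labels: 'c' is predicted once but never occurs in actual
actual = ['a', 'a', 'b', 'b']
predicted = ['a', 'c', 'b', 'b']
unique, matrix = confusion_matrix_all_labels(actual, predicted)
print(unique)
print(matrix)
```

This keeps the same row and column orientation as the tutorial's function, so the result can be passed to the same display code.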

### 3. Mean Absolute Error

Regression problems are those where a real value is predicted.

An easy metric to consider is the error in the predicted values as compared to the expected values.

The Mean Absolute Error or MAE for short is a good first error metric to use.

It is calculated as the average of the absolute error values, where “absolute” means “made positive” so that they can be added together.

```
MAE = sum( abs(predicted_i - actual_i) ) / total predictions
```

Below is a function named **mae_metric()** that implements this metric. As above, it expects a list of actual outcome values and a list of predictions. We use the built-in **abs()** Python function to calculate the absolute error values that are summed together.

```python
# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += abs(predicted[i] - actual[i])
    return sum_error / float(len(actual))
```

We can contrive a small regression dataset to test this function.

```
actual  predicted
0.1     0.11
0.2     0.19
0.3     0.29
0.4     0.41
0.5     0.5
```

Only one prediction (0.5) is correct, whereas all other predictions are wrong by 0.01. Therefore, we would expect the mean absolute error (or the average positive error) for these predictions to be a little less than 0.01.

Below is an example that tests the **mae_metric()** function with the contrived dataset.

```python
# Calculate mean absolute error
def mae_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        sum_error += abs(predicted[i] - actual[i])
    return sum_error / float(len(actual))

# Test MAE
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
mae = mae_metric(actual, predicted)
print(mae)
```

Running this example prints the output below. We can see that as expected, the MAE was about 0.008, a small value slightly lower than 0.01.

```
0.007999999999999993
```
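The result is not exactly 0.008 because values such as 0.11 and 0.1 cannot be represented exactly in binary floating point, so each absolute error is only approximately 0.01. The sketch below breaks the calculation into its individual errors and rounds them for display; rounding to 4 decimal places is a choice made here for readability, not something the metric requires.

```python
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]

# Absolute error for each prediction: four of about 0.01 and one of 0.0
errors = [abs(p - a) for a, p in zip(actual, predicted)]
rounded_errors = [round(e, 4) for e in errors]
print(rounded_errors)

# Mean of the absolute errors: about 0.04 / 5
mae = sum(errors) / float(len(errors))
print(round(mae, 4))
```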

### 4. Root Mean Squared Error

Another popular way to calculate the error in a set of regression predictions is to use the Root Mean Squared Error.

Abbreviated as RMSE, the metric is closely related to the Mean Squared Error or MSE, which drops the square root from the calculation and the Root from the name.

RMSE is calculated as the square root of the mean of the squared differences between actual outcomes and predictions.

Squaring each error forces the values to be positive, and the square root of the mean squared error returns the error metric back to the original units for comparison.

```
RMSE = sqrt( sum( (predicted_i - actual_i)^2 ) / total predictions)
```

Below is an implementation of this in a function named **rmse_metric()**. It uses the **sqrt()** function from the math module and uses the ** operator to raise the error to the 2nd power.

```python
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)
```

We can test this metric on the same dataset used to test the calculation of Mean Absolute Error above.

Below is a complete example. Again, we would expect an error value to be generally close to 0.01.

```python
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += (prediction_error ** 2)
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

# Test RMSE
actual = [0.1, 0.2, 0.3, 0.4, 0.5]
predicted = [0.11, 0.19, 0.29, 0.41, 0.5]
rmse = rmse_metric(actual, predicted)
print(rmse)
```

Running the example, we see the results below. The result is slightly higher than the MAE, at about 0.0089.

RMSE values are always at least as large as MAE values on the same predictions, and the gap grows as the individual errors become larger and more uneven. This is a benefit of using RMSE over MAE in that it penalizes larger errors with worse scores.

```
0.00894427190999915
```
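To see how RMSE responds to a single large error, the sketch below compares two made-up sets of predictions that have the same MAE. The `even` and `uneven` lists are invented for illustration: the first spreads the error evenly across all predictions, the second concentrates it in one prediction. The metric functions are compact restatements of the ones in this tutorial.

```python
from math import sqrt

# Mean absolute error, as defined earlier in the tutorial
def mae_metric(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / float(len(actual))

# Root mean squared error, as defined earlier in the tutorial
def rmse_metric(actual, predicted):
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / float(len(actual)))

actual = [0.1, 0.2, 0.3, 0.4, 0.5]
# Hypothetical predictions: every one is off by about 0.02
even = [0.12, 0.22, 0.32, 0.42, 0.52]
# Hypothetical predictions: four are exact and one is off by about 0.1
uneven = [0.1, 0.2, 0.3, 0.4, 0.6]

# MAE is about 0.02 in both cases
print(mae_metric(actual, even), mae_metric(actual, uneven))
# RMSE is about 0.02 for the even errors but more than twice that for the uneven ones
print(rmse_metric(actual, even), rmse_metric(actual, uneven))
```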

## Extensions

You have only seen a small sample of the most widely used performance metrics.

There are many other performance metrics that you may require.

Below is a list of 5 additional performance metrics that you may wish to implement to extend this tutorial:

- Precision for classification.
- Recall for classification.
- F1 for classification.
- Area Under ROC Curve or AUC for classification.
- Goodness of Fit or R^2 (R squared) for regression.
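As a possible starting point for the first three extensions, the sketch below computes precision, recall and F1 for a binary problem in the same from-scratch style. The function name `precision_recall_f1()`, the `positive` parameter, and the choice to return 0.0 when a denominator is zero are all assumptions made for this illustration.

```python
# Hypothetical helper: precision, recall and F1 for one positive class
def precision_recall_f1(actual, predicted, positive=1):
    tp = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
    # Returning 0.0 on an empty denominator is a convention chosen for this sketch
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Same contrived data as the confusion matrix example
actual = [0,0,0,0,0,1,1,1,1,1]
predicted = [0,1,1,0,0,1,0,1,1,1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)
```

On this data there are 4 true positives, 2 false positives and 1 false negative, so precision should be 4/6, recall 4/5 and F1 8/11.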

**Did you implement any of these extensions?**

Share your experiences in the comments below.

## Review

In this tutorial, you discovered how to implement algorithm prediction performance metrics from scratch in Python.

Specifically, you learned:

- How to implement and interpret classification accuracy.
- How to implement and interpret the confusion matrix for classification problems.
- How to implement and interpret mean absolute error for regression.
- How to implement and interpret root mean squared error for regression.

**Do you have any questions?**

Ask your questions in the comments and I will do my best to answer.

I think accuracy metrics could give the wrong idea of the results. Mostly because of the false positives.

It’s easy to implement, but in my opinion maybe isn’t the best choice, despite the fact there are many developers who use it.

I agree Joao, often for classification logloss, kappa or even F1 are better measures to use.

Accuracy is a great place to start though, especially for beginners.

Another comment related to ML algorithms validation … the best option is to divide the data into three sets: test, train and validation, because only test and train could direct the learning system to cheat.

Once more, there are few authors who use this method.

Hi Joao, 3 sets is a good practice, if you have the data to spare.

Hi Jason,

In the absence of any published result, how can I verify that the Mean Square Error (MSE) I’m getting is good? For example, in one problem the MSE is 400 – 500 when the actual values are in the range of 953 to 1616. In another problem the MSE is around 25-30 when the range of actual values is between 0 and 85. How can I tell I’m getting a good MSE? I thought about normalization but again I believe it will be just scaled down.

What about Pred measure which I saw in some papers? Is it going to help in finding the good MSE value?

Great question, I answer it here:

https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance

Thanks for a very quick reply.

Reading the above and other articles, I think I need to implement the zero rule algorithm so I can see my baseline.

In an article on baseline time series forecasting, you get MSE = 17730 using a naïve forecast, which seems quite high. So how do I know the lower limit? Is 15000 good, or is 150 good?

What about naïve forecast for multivariate time series?

The lowest limit is zero error.

Good is only measured as compared to the naive method. That is the best we can do. Push this limit down with more sophisticated naive forecasts.

Thanks. I’ll try and will come back if there are some more questions.

How do we calculate Normalized Mean Square Error? I could not find that in the Scikit documentation. Could you please help with that?

I believe you are referring to root mean squared error or RMSE.

You can calculate the MSE and then calculate the square root of the result.