The metrics that you choose to evaluate your machine learning algorithms are very important.

Choice of metrics influences how the performance of machine learning algorithms is measured and compared. They influence how you weight the importance of different characteristics in the results and your ultimate choice of which algorithm to choose.

In this post, you will discover how to select and use different machine learning performance metrics in Python with scikit-learn.

Let’s get started.

**Update Jan/2017**: Updated to reflect changes to the scikit-learn API in version 0.18.

## About the Recipes

Various different machine learning evaluation metrics are demonstrated in this post using small code recipes in Python and scikit-learn.

Each recipe is designed to be standalone so that you can copy-and-paste it into your project and use it immediately.

Metrics are demonstrated for both classification and regression type machine learning problems.

- For classification metrics, the Pima Indians onset of diabetes dataset is used as demonstration. This is a binary classification problem where all of the input variables are numeric.
- For regression metrics, the Boston House Price dataset is used as demonstration. this is a regression problem where all of the input variables are also numeric.

In each recipe, the dataset is downloaded directly from the UCI Machine Learning repository.

All recipes evaluate the same algorithms, Logistic Regression for classification and Linear Regression for the regression problems. A 10-fold cross-validation test harness is used to demonstrate each metric, because this is the most likely scenario where you will be employing different algorithm evaluation metrics.

A caveat in these recipes is the cross_val_score function used to report the performance in each recipe.It does allow the use of different scoring metrics that will be discussed, but all scores are reported so that they can be sorted in ascending order (largest score is best).

Some evaluation metrics (like mean squared error) are naturally descending scores (the smallest score is best) and as such are reported as negative by the *cross_val_score()* function. This is important to note, because some scores will be reported as negative that by definition can never be negative.

You can learn more about machine learning algorithm performance metrics supported by scikit-learn on the page Model evaluation: quantifying the quality of predictions.

Let’s get on with the evaluation metrics.

### Need help with Machine Learning in Python?

Take my free 2-week email course and discover data prep, algorithms and more (with sample code).

Click to sign-up now and also get a free PDF Ebook version of the course.

## Classification Metrics

Classification problems are perhaps the most common type of machine learning problem and as such there are a myriad of metrics that can be used to evaluate predictions for these problems.

In this section we will review how to use the following metrics:

- Classification Accuracy.
- Logarithmic Loss.
- Area Under ROC Curve.
- Confusion Matrix.
- Classification Report.

### 1. Classification Accuracy

Classification accuracy is the number of correct predictions made as a ratio of all predictions made.

This is the most common evaluation metric for classification problems, it is also the most misused. It is really only suitable when there are an equal number of observations in each class (which is rarely the case) and that all predictions and prediction errors are equally important, which is often not the case.

Below is an example of calculating classification accuracy.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Classification Accuracy import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) array = dataframe.values X = array[:,0:8] Y = array[:,8] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LogisticRegression() scoring = 'accuracy' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("Accuracy: %.3f (%.3f)") % (results.mean(), results.std()) |

You can see that the ratio is reported. This can be converted into a percentage by multiplying the value by 100, giving an accuracy score of approximately 77% accurate.

1 |
Accuracy: 0.770 (0.048) |

### 2. Logarithmic Loss

Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class.

The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm. Predictions that are correct or incorrect are rewarded or punished proportionally to the confidence of the prediction.

You can learn more about logarithmic on the Loss functions for classification Wikipedia article.

Below is an example of calculating logloss for Logistic regression predictions on the Pima Indians onset of diabetes dataset.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Classification LogLoss import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) array = dataframe.values X = array[:,0:8] Y = array[:,8] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LogisticRegression() scoring = 'neg_log_loss' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("Logloss: %.3f (%.3f)") % (results.mean(), results.std()) |

Smaller logloss is better with 0 representing a perfect logloss. As mentioned above, the measure is inverted to be ascending when using the *cross_val_score()* function.

1 |
Logloss: -0.493 (0.047) |

### 3. Area Under ROC Curve

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.

The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictionsÂ perfectly. An area of 0.5 represents a model as good as random. Learn more about ROC here.

ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.

- Sensitivity is the true positive rate also called the recall. It is the number instances from the positive (first) class that actually predicted correctly.
- Specificity is also called the true negative rate. Is the number of instances from the negative class (second) class that were actually predicted correctly.

You can learn more about ROC on the Wikipedia page.

The example below provides a demonstration of calculating AUC.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Classification ROC AUC import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data" names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) array = dataframe.values X = array[:,0:8] Y = array[:,8] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LogisticRegression() scoring = 'roc_auc' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("AUC: %.3f (%.3f)") % (results.mean(), results.std()) |

You can see the the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in the predictions.

1 |
AUC: 0.824 (0.041) |

### 4. Confusion Matrix

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.

The table presents predictions on the x-axis and accuracy outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm.

For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction=0 and actual=0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual=1. And so on.

You can learn more about the Confusion Matrix on the Wikipedia article.

Below is an example of calculating a confusion matrixÂ for a set of prediction by a model on a test set.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# Cross Validation Classification Confusion Matrix import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression from sklearn.metrics import confusion_matrix names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) array = dataframe.values X = array[:,0:8] Y = array[:,8] test_size = 0.33 seed = 7 X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed) model = LogisticRegression() model.fit(X_train, Y_train) predicted = model.predict(X_test) matrix = confusion_matrix(Y_test, predicted) print(matrix) |

Although the array is printed without headings, you can see that the majority of the predictions fall on the diagonal line of the matrix (which are correct predictions).

1 2 |
[[141 21] [ 41 51]] |

### 5. Classification Report

Scikit-learn does provide a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures.

The *classification_report()* function displays the precision, recall, f1-score and support for each class.

The example below demonstrates the report on the binary classification problem.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# Cross Validation Classification Report import pandas from sklearn import model_selection from sklearn.linear_model import LogisticRegression from sklearn.metrics import classification_report names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] dataframe = pandas.read_csv(url, names=names) array = dataframe.values X = array[:,0:8] Y = array[:,8] test_size = 0.33 seed = 7 X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed) model = LogisticRegression() model.fit(X_train, Y_train) predicted = model.predict(X_test) report = classification_report(Y_test, predicted) print(report) |

You can see good prediction and recall for the algorithm.

1 2 3 4 5 6 |
precision recall f1-score support 0.0 0.77 0.87 0.82 162 1.0 0.71 0.55 0.62 92 avg / total 0.75 0.76 0.75 254 |

## Regression Metrics

In this section will review 3 of the most common metrics for evaluating predictions on regression machine learning problems:

- Mean Absolute Error.
- Mean Squared Error.
- R^2.

### 1. Mean Absolute Error

The Mean Absolute Error (or MAE) is the sum of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were.

The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).

You can learn more about Mean Absolute error on Wikipedia.

The example below demonstrates calculating mean absolute error on the Boston house price dataset.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Regression MAE import pandas from sklearn import model_selection from sklearn.linear_model import LinearRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data" names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] dataframe = pandas.read_csv(url, delim_whitespace=True, names=names) array = dataframe.values X = array[:,0:13] Y = array[:,13] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LinearRegression() scoring = 'neg_mean_absolute_error' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("MAE: %.3f (%.3f)") % (results.mean(), results.std()) |

A value of 0 indicates no error or perfect predictions. Like logloss, this metric is inverted by the *cross_val_score()* function.

1 |
MAE: -4.005 (2.084) |

### 2. Mean Squared Error

The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of error.

Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. This is called the Root Mean Squared Error (or RMSE).

You can learn more about Mean Squared Error on Wikipedia.

The example below provides a demonstration of calculating mean squared error.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Regression MSE import pandas from sklearn import model_selection from sklearn.linear_model import LinearRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data" names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] dataframe = pandas.read_csv(url, delim_whitespace=True, names=names) array = dataframe.values X = array[:,0:13] Y = array[:,13] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LinearRegression() scoring = 'neg_mean_squared_error' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("MSE: %.3f (%.3f)") % (results.mean(), results.std()) |

This metric too is inverted so that the results are increasing. Remember to take the absolute value before taking the square root if you are interested in calculating the RMSE.

1 |
MSE: -34.705 (45.574) |

### 3. R^2 Metric

The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination.

This is a value between 0 and 1 for no-fit and perfect fit respectively.

You can learn more about theÂ Coefficient of determination article on Wikipedia.

The example below provides a demonstration of calculating the mean R^2 for a set of predictions.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
# Cross Validation Regression R^2 import pandas from sklearn import model_selection from sklearn.linear_model import LinearRegression url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data" names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] dataframe = pandas.read_csv(url, delim_whitespace=True, names=names) array = dataframe.values X = array[:,0:13] Y = array[:,13] seed = 7 kfold = model_selection.KFold(n_splits=10, random_state=seed) model = LinearRegression() scoring = 'r2' results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring) print("R^2: %.3f (%.3f)") % (results.mean(), results.std()) |

You can see thatÂ the predictions have a poor fit to the actual values with a value close to zero and less than 0.5.

1 |
R^2: 0.203 (0.595) |

## Summary

In this post, you discovered metrics that you can use to evaluate your machine learning algorithms.

You learned about 3 classification metrics:

- Accuracy.
- Logarithmic Loss.
- Area Under ROC Curve.

Also 2 convenience methods for classification prediction results:

- Confusion Matrix.
- Classification Report.

And 3 regression metrics:

- Mean Absolute Error.
- Mean Squared Error.
- R^2.

Do you have any questions about metrics for evaluating machine learning algorithms or this post? Ask your question in the comments and I will do my best to answer it.

What do you mean by model_selection?

You can learn about the sklearn.model_selection API here:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection

You must have sklearn 0.18.0 or higher installed.

Hello Jason

Thanks for this tutorial but i have one question about computing auc.

I’m doing binary classification with imbalanced classes and then computing auc but i have one problem. Im using keras.

My method for computing auc looks like this:

1. Train model and save him – 1st python script

2. load model and model weiths – 2nd python script

3. load one image (loop) and save result to csv file -2nd python script

4. use roc_auc_score from sklearn

in 3rd point im loading image and then i’m using predict_proba for result. Results are always from 0-1 but should i use predict proba?.This method is from http://stackoverflow.com/questions/41032551/how-to-compute-receiving-operating-characteristic-roc-and-auc-in-keras

Eka solution.

Looks good, I would recommend predict_proba(), I expect it normalizes any softmax output to ensure the values add to one.

Jason,

Long time reader, first time writer. I am having trouble how to pick which model performance metric will be useful for a current project. Let me give you some background.

I have a classification model that I really want to maximize my Recall results. The reasoning is that, if I say something is 1 when it is not 1 I lose a lot of time/$, but when I say something is 0 and its is not 0 I don’t lose much time/$ at all. Ie. I want to reduce False Negatives. Also the distribution of the dependent variable in my training set is highly skewed toward 0s, less than 5% of all my dependent variables in the training set are 1s. Normally I would use an F1 score, AUC, VIF, Accuracy, MAE, MSE or many of the other classification model metrics that are discussed, but I am unsure what to use now. Currently I am using LogLoss as my model performance metric as I have found documentation that this is the correct metric to use in cases of a skewed dependent variable, as well a situations where I mainly care about Recall and don’t care much about Precision or visa versa. I received this information from people on the Kaggle forums.

Thank you for your expert opinion, I very much appreciate your help. If you don’t have time for such I question I will understand.

Hi Evy, thanks for being a long time reader.

I would suggest tuning your model and focusing on the recall statistic alone.

I would also suggest using models that make predictions as a probability and tune the threshold on the probability too to optimize the recall (ROC curves can help understand this).

I hope that helps as a start.

Hey Jason,

Thanks for the great articles, I just have a question about the MSE and its properties. When building a linear model, adding features should always lower the MSE in the training data, right?

It’s just, when I use the polynomial features method in SciKit, and fit a linear regression, the MSE does not necessarily fall, sometimes it rises, as I add features.

Is it because of some innate properties of the MSE metric, or is it simply because I have a bug in my code?

Adding features has no guarantee of reducing MSE as far as I know. Where did you get that from?

Hi Jason,

Thank you for this article. Very helpful! Now I am using Python SciKit Learn to train an imbalanced dataset. I am looking for a good metric embedded in Python SciKit Learn already that works for evaluating the performance of model in predicting imbalanced dataset. Do you have some recommendations or ideas? Alternatively, I knew a judging criterion, balanced error rate (BER), but I have not idea how to use it as a scoring parameter with Python?

Thank you much!

Cheng

Great question.

This post may give you some ideas:

http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

Hi Jason,

I still have some confusions about the metrics to evaluate regression problem. In cross_val_score of cross validation, the final results are the negative mean squared error and negative mean absolute error, so what does it mean? (It means the model performs poorly or that’s the good sign that the model can minimize the metrics?)

Additionally, I used some regression methods and they returned very good results such as R_squared = 0.9999 and very small MSE, MSA on the testing part. However the result of cross_val_score is 1.00 +- 00 for example, so it means the model is overfitting?

So in general, I suppose when we use cross_val_score to evaluate regression model, we should choose the model which has the smallest MSE and MSA, that’s true or not?

Thank you so much for your answer, that will help me alot

Good question.

Generally, the interpretation of the score is specific to the problem. A good score is really only relative to scores you can achieve with other methods.

Choosing a model depends on your application, but generally, you want to pick the simplest model that gives the best model skill.

Hi Jason,

I recently read some articles that were completely against using R^2 for evaluating non-linear models (such as in the case of ML algorithms). Given that it is still common practice to use it, whats your take on this?

Cheers

I recommend using a few metrics and interpret them in the context of your specific problem.

I do find R^2 useful.

how to choose the right metric for a machine learning problem ?

You need a metrics that best captures what you are looking to optimize on your specific problem.

Maybe you need to talk to domain experts. Maybe you need to try out a few metrics and present results to stakeholders. It could be an iterative process.

How CA depends on the value ‘random_state’?

See this post:

https://machinelearningmastery.com/randomness-in-machine-learning/

Jason,

What are differences between loss functions and evaluation metrics? Loss function = evaluation metric – regularization terms?

Kono

Great question.

A loss function is minimized when fitting a model.

A loss function score can be reported as a model skill, e.g. an evaluation metric, but does not have to be.

Regularization terms are modifications of a loss function to penalize complex models, e.g. to result in a simpler and often better/more skillful resulting model.

Does that help?

@Jason, thanks! very helpful!

You’re welcome!

Hi Jason,

I have the following question. Instead of using the MSE in the standard configuration, I want to use it with sample weights, where basically each datapoint would get a different weight (it is a separate column in the original dataframe, but clearly not a feature of the trained model). How would I incorporate those sample weight in the scoring function?

Great question, I believe the handling of weights will be algorithm specific.

For example, google points to this example for SVM:

http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html

Another awesome and helpful post in your blog. Thanks a million!

Thanks.

In the general case, I see a sensitivity and specificity tradeoff when the classes overlap [1].

– How can I find the optimal point where both values are high algorithmically using python?

– Would the classifier give the highest accuracy at this point assuming classes are balanced?

Thanking in advance

[1] https://www.youtube.com/watch?v=vtYDyGGeQyo

You might want to look into ROC curves and model calibration.

Hi Jason,

I would love to see a similar post on unsupervised learning algorithms metric.

From my side, I only knew adjusted rand score as one of the metric.

Thanks for the suggestion.