🎰

A simple reference for common ML metrics.

December 26, 2020

Introduction

There are several performance metrics that are essential to know for machine learning. At the root of the whole performance question is the fact that measuring classification performance is trickier than measuring regression performance. In regression, you can simply measure the distance between your prediction and the label; this doesn't translate to the classification setting. Furthermore, in classification, raw accuracy is often a trap, especially on class-imbalanced datasets (discussed further below). Given these considerations and the potential complexity of measuring model performance, we need better methods than straight accuracy by which to evaluate classifiers.

💡
It's easy to get bogged down in all the different metrics, definitions, and formulas that make up accuracy assessment in ML. My advice is to understand a few basic formulas, commit them to memory, and then use logic to derive any other interpretations of specific formulas.

Confusion Matrix

The fundamental way to analyze a classifier is to examine the confusion matrix. The confusion matrix looks as follows:

The medical confusion matrix
The sklearn confusion matrix. The differing alignment of the predicted and actual classes can be confusing.

The confusion matrix illustrates the difference between the ground truth and what our model predicts. It is built by comparing a set of predictions, generated from the input features by a machine learning model, to a set of ground-truth labels. Generally, it's advisable to start by examining the confusion matrix for the cross-validated training set; we hold out the test set until we have decent confidence in our trained model.
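Here's a minimal sketch of how that might look in sklearn. The classifier, the toy dataset from make_classification, and the names X_train and y_train are placeholder assumptions purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Toy, class-imbalanced data standing in for a real training set
X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

clf = LogisticRegression(max_iter=1000)

# A prediction for every training instance, each made by a model that never
# saw that instance during fitting (3-fold cross-validation here)
y_pred = cross_val_predict(clf, X_train, y_train, cv=3)

# sklearn convention: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_train, y_pred))
```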

From the confusion matrix, we can derive a number of different performance metrics. There are six in particular that every ML practitioner should know. They're summarized in the table below, and I will cover the definition and formula for each in turn.

Basic Machine Learning Metrics

Precision: accuracy of the positive predictions
Recall: rate at which true positives were correctly labeled
F-Score: harmonic mean of the precision and recall
AUROC: metric of discrimination between the positive and negative classes
Sensitivity: true positive rate (same as recall)
Specificity: true negative rate

Precision

💡
Metric 1: Precision, or the accuracy of the positive predictions.

This is a pretty intuitive metric. We always want to know how well we predicted; that's why we look at accuracy. With precision, we focus more specifically on how well we did with the positive predictions. It is helpful to note that precision is also referred to as the positive predictive value (not the true positive rate; that is recall). When we said something was X, how often was it actually X? Of the positive predictions we made, how many did we get right? This leads to the basic formula, which is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Unfortunately, precision can be misleading, particularly if the number of positive predictions we make is low. For example, if you made one correct positive prediction and no other positive predictions, your precision would be exactly 1, or optimal. Given this, we need another metric to balance our understanding of precision.
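To make that concrete, here's a tiny made-up example using sklearn's precision_score; the labels are invented purely to show the pitfall:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # only one positive prediction, and it happens to be right

# Precision = TP / (TP + FP) = 1 / (1 + 0)
print(precision_score(y_true, y_pred))  # 1.0, despite missing three of the four actual positives
```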

Recall

💡
Metric 2: Recall, or the rate at which true positives were correctly labeled.

This is slightly less intuitive, but I advise you not to overcomplicate it. If the problem with precision is that a classifier can look good while barely predicting the positive class at all, we can diagnose that by looking at how many of the actual positives we managed to catch. This understanding leads us to the formula, which is:

$$\text{Recall} = \frac{TP}{TP + FN}$$

This is NOT simply the inverse of precision; do not call it that. Rather, it is a complementary metric that catches what precision misses. Of all the instances that actually were X, how many did we correctly label as X?
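Running the same contrived example from the precision section through recall_score shows how recall exposes the problem:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # the "perfect precision" classifier from before

# Recall = TP / (TP + FN) = 1 / (1 + 3)
print(recall_score(y_true, y_pred))  # 0.25: only one of the four actual positives was found
```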

F-Score

Since precision and recall can be contrasting depictions of classification performance, it is convenient to merge them into a single metric. This metric is referred to as the F-score.

💡
Metric 3: F-score, or the harmonic mean of precision and recall

The advantage of the F-score lies in the harmonic mean's tendency to weight lower values more heavily. This ensures that the F-score is high only when both precision and recall are high. The reason becomes evident in the formula for the F-score:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
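Continuing the contrived example from the previous two sections, the harmonic mean drags the score down toward the weaker of the two metrics:

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]  # precision = 1.0, recall = 0.25

print(f1_score(y_true, y_pred))  # 0.4, well below the arithmetic mean of 0.625
```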

However, this strength can also be a downside. The F1-score favors classifiers whose precision and recall are similar, and that may not be what we want; sometimes we care far more about one than the other. As an example, let's look at a medical context in the case of cancer.

💡
Precision vs. Recall Example: Cancer is serious, and we do not want to miss it. Thus, we know false negatives are very costly. In this case, we want to maximize recall, at the expense of precision. Even if we end up with a number of false positives, under no circumstances do we want to give a clean bill of health to someone who actually has cancer (a false negative).

As this example shows, there is a natural tradeoff between precision and recall, known as the precision-recall tradeoff. To get a sense of it, it helps to create a precision-recall curve (PR curve), which plots precision on the y-axis against recall on the x-axis, or to plot precision and recall against the range of threshold scores (which we can obtain in sklearn via methods like decision_function).

An example of a precision-recall curve
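Here's a rough sketch of how such a curve could be produced, reusing the kind of toy setup from the confusion matrix example above; SGDClassifier is just a convenient choice because it exposes decision_function:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_curve

X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Cross-validated decision scores rather than hard class labels
y_scores = cross_val_predict(SGDClassifier(random_state=42), X_train, y_train,
                             cv=3, method="decision_function")

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```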

AUROC

Before we move on from the topic of plotting performance curves, let's discuss the receiver operating characteristic curve, another helpful visualization. The ROC curve is quite similar to the PR curve. It plots true positive rate against false positive rate.

An example ROC curve, plotting the true positive rate (y-axis) against the false positive rate (x-axis)

A common confusion for beginners is keeping track of all the different ways the ROC curve can be described as being plotted (e.g. as sensitivity vs. 1 - specificity). My personal advice is to focus on the simplest description, which is TPR vs. FPR. Some of the common alternatives are listed below:

  • recall vs. false positive rate
  • sensitivity vs. 1 - specificity
  • true positive rate vs. false positive rate

Ultimately, the ROC curve provides another way of understanding the performance of our classifier. Just like the PR curve, it highlights a tradeoff inherent to classification models: as we push the true positive rate higher (by lowering the decision threshold), we admit more false positives. To make it easy to compare classifiers, we use the area under the ROC curve, abbreviated AUROC. The best models have an AUROC close to 1, while a random classifier scores around 0.5.
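As a sketch, and assuming the same toy setup and cross-validated decision scores as in the PR curve example:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

X_train, y_train = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
y_scores = cross_val_predict(SGDClassifier(random_state=42), X_train, y_train,
                             cv=3, method="decision_function")

fpr, tpr, thresholds = roc_curve(y_train, y_scores)
print("AUROC:", roc_auc_score(y_train, y_scores))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")  # diagonal of a purely random classifier (AUROC = 0.5)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```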

When should we use the PR curve as opposed to the ROC curve? As a general rule, whenever there is a class imbalance or the positive class is rare, it is better to use the PR curve as a metric. The ROC curve in these instances tends to overstate how well the classifier does.
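One way to see this for yourself (a rough, made-up experiment, not a definitive benchmark) is to compare AUROC with the PR-based summary metric, average precision, on a dataset where the positive class is rare:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy dataset with a rare positive class (~2%)
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
scores = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                           cv=3, method="decision_function")

# AUROC will typically look flattering here, while average precision
# (the area under the PR curve) tells a harsher story.
print("AUROC:            ", roc_auc_score(y, scores))
print("Average precision:", average_precision_score(y, scores))
```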

Sensitivity (and Specificity)

Now that we understand the ROC and the AUROC, understanding sensitivity is pretty trivial. We mentioned that ROC is plotted frequently as TPR vs. FPR. In fact, the true positive rate is sensitivity.

$$\text{Sensitivity} = \text{TPR} = \frac{TP}{TP + FN}$$

Finally, specificity is the true negative rate. It is the negative-class counterpart of sensitivity: the proportion of actual negatives that we correctly labeled as negative. It is related to, but not simply the opposite of, sensitivity; in fact, 1 - specificity is the false positive rate, which is why the ROC curve is sometimes described as sensitivity vs. 1 - specificity.

$$\text{Specificity} = \text{TNR} = \frac{TN}{TN + FP}$$
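For completeness, here's a small sketch of pulling both directly out of a binary confusion matrix, using the same contrived labels as in the earlier examples:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

# For a binary problem, ravel() unpacks the matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate, identical to recall
specificity = tn / (tn + fp)  # true negative rate

print(sensitivity, specificity)  # 0.25 and 1.0 for this toy example
```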

Sources and Further Reading

  1. Aurélien Géron, Hands-On Machine Learning, Chapter 3
  2. https://medium.com/cascade-bio-blog/making-sense-of-real-world-data-roc-curves-and-when-to-use-them-90a17e6d1db
  3. https://www.fharrell.com/post/mlconfusion/
  4. https://alexgude.com/blog/machine-learning-metrics-interview/
  5. https://statinfer.com/204-4-2-calculating-sensitivity-and-specificity-in-python/