## Choosing the right metric for your data science/machine learning project: class imbalance

A few days ago, I read a fantastic article on Medium about how to deliver on ML projects. It got me thinking about all the mistakes I made when I was first learning machine learning. Back then, I was so fascinated by the many kinds of models available, and I learned my first hard lesson while training an XGBoost model on a highly imbalanced dataset with accuracy as the metric.

The lesson learned was that **choosing the right metric is far more critical than selecting the right algorithm. The metric you pick can determine the success or failure of a business problem.**

**Metrics in Imbalanced class classification**

Choosing the correct metric for an imbalanced classification problem is a crucial first step in building the models. In credit card fraud detection, the imbalance is extreme, since the vast majority of transactions are legitimate.

If you use accuracy as the metric, you will likely get a really high score with almost any model. Imagine that out of 100 transactions, 1 is fraudulent. A model that predicts every transaction to be legit has 99% accuracy, yet it is useless at best.
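As a quick illustration (a sketch assuming scikit-learn is installed, with made-up labels), the "always legit" model scores 99% accuracy while catching zero fraud:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 100 transactions: 99 legit (label 0) and 1 fraudulent (label 1)
y_true = np.array([0] * 99 + [1])

# A "model" that predicts every transaction to be legit
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.99, even though the model catches no fraud at all
```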

What kind of metrics could you use instead? There are metrics like the F-measure, ROC/AUC, recall, precision, kappa, and balanced accuracy, each of which measures a different aspect of the model. The scikit-learn website has more metrics to choose from.

**Different metrics**

Precision asks: of the cases predicted to be true, what percentage have a true label? Expressed in conditional probability, it is P( true label | predicted True ).

Recall asks: of the cases whose label is true, what percentage are predicted to be true? Expressed in conditional probability, it is P( predicted True | true label ). Recall is the same quantity as sensitivity in hypothesis testing.
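A small sketch with made-up fraud labels (fraud = 1 as the positive class) shows the two conditional probabilities in code:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = fraud (positive class), 0 = legit
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# Precision: of the 4 transactions flagged as fraud, 2 really are -> 0.5
precision = precision_score(y_true, y_pred)

# Recall: of the 4 actual frauds, 2 were flagged -> 0.5
recall = recall_score(y_true, y_pred)
print(precision, recall)
```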

The F1 score is the harmonic mean of precision and recall. F2 gives a higher weight to recall, while F0.5 gives a higher weight to precision. Why the harmonic mean instead of the arithmetic mean? This explanation comes from StackOverflow:

> Because the harmonic mean punishes extreme values more.
>
> Consider a *trivial* method (e.g. always returning class A). There are infinite data elements of class B and a single element of class A:
>
> Precision: 0.0
>
> Recall: 1.0
>
> When taking the arithmetic mean, it would have 50% correct. Despite being the *worst* possible outcome! With the harmonic mean, the F1-measure is 0.
>
> Arithmetic mean: 0.5
>
> Harmonic mean: 0.0

In other words, to have a high F1, you need *both* a high precision and a high recall.
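The effect is easy to reproduce. In the sketch below (the precision and recall values are made up), a nearly trivial model looks respectable under the arithmetic mean but not under the harmonic mean, and the general F-beta formula shows how F2 and F0.5 shift the weight:

```python
from statistics import harmonic_mean

precision, recall = 0.1, 1.0  # a nearly trivial "flag everything" model

arithmetic = (precision + recall) / 2               # 0.55: looks respectable
harmonic = harmonic_mean([precision, recall])       # ~0.18: punishes the low precision
f1 = 2 * precision * recall / (precision + recall)  # F1 is exactly the harmonic mean

def f_beta(p, r, beta):
    """General F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

f2 = f_beta(precision, recall, 2.0)   # rewards the perfect recall
f05 = f_beta(precision, recall, 0.5)  # dominated by the poor precision
print(arithmetic, f1, f2, f05)
```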

This is not the end of the story, since you also need to define which class is positive for precision and recall. In credit card fraud detection, you are more interested in the fraudulent cases than the legit ones, so the fraudulent transactions should be set as the positive class.

What is ROC/AUC? I have talked about how to understand ROC curves from the hypothesis-testing perspective in previous blogs. In short, the ROC curve plots the true positive rate (recall) against the false positive rate.

True Positive Rate = P( predicted True | True label ) = Recall

False Positive Rate = P( predicted True | False label )

AUC is short for Area Under the Curve. Since the ROC/AUC metric takes the negative class into consideration, it can be inflated by class imbalance; but if you care about correct predictions for both classes, ROC/AUC is a better choice than accuracy.
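A rough simulation (with made-up Gaussian scores, not real data) illustrates the inflation: on a dataset with 1% positives, the ROC/AUC looks comfortable, while the precision-oriented average-precision score is far less flattering because it feels every false alarm among the many negatives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# 990 legit transactions and 10 frauds, with partially overlapping model scores
neg_scores = rng.normal(0.0, 1.0, 990)
pos_scores = rng.normal(2.0, 1.0, 10)

y_true = np.array([0] * 990 + [1] * 10)
y_score = np.concatenate([neg_scores, pos_scores])

auc = roc_auc_score(y_true, y_score)           # high: the many easy negatives inflate it
ap = average_precision_score(y_true, y_score)  # lower: precision suffers from false alarms
print(auc, ap)
```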

Another replacement for ROC/AUC is balanced accuracy, which re-weights the accuracy for the negative and positive classes. It is an excellent complement to the ROC/AUC score.
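Continuing the earlier 99-to-1 example (a sketch assuming scikit-learn), balanced accuracy exposes the all-legit model that plain accuracy rewards:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 99 + [1]  # 99 legit, 1 fraud
y_pred = [0] * 100       # predict everything legit

acc = accuracy_score(y_true, y_pred)               # 0.99
bal_acc = balanced_accuracy_score(y_true, y_pred)  # 0.5: mean of per-class recalls (1.0 and 0.0)
print(acc, bal_acc)
```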

Cohen’s Kappa is another metric that balances the accuracy for imbalanced class problems. It is defined by

kappa = ( p_o − p_e ) / ( 1 − p_e )

where p_o is just the accuracy and p_e is the hypothetical probability of agreement due to chance. The Wikipedia page has some explanation of Kappa if you are not familiar with it, and the PSU online statistics classes are another source that explains the kappa metric.
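The definition can be checked directly against scikit-learn's `cohen_kappa_score` (the toy labels below are made up):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

n = len(y_true)
p_o = accuracy_score(y_true, y_pred)  # observed agreement (plain accuracy)

# Chance agreement: for each class, the product of the two marginal frequencies
p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in set(y_true))

kappa = (p_o - p_e) / (1 - p_e)
print(kappa, cohen_kappa_score(y_true, y_pred))  # the two values agree
```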

**Some Discussion: F-measure vs. ROC/AUC**

Some people on Kaggle have argued that the ROC curve should not be used for highly imbalanced classification problems, since the ROC score is inflated by the class imbalance. For the scripts presented there, the ROC/AUC score indeed does not make sense, but that is not always true.

This statement from Kaggle below describes when you should use ROC/AUC versus precision and recall:

“True negatives need to be meaningful for ROC to be a good choice of measure. In his example, if we’ve got 1,000 pictures of cats and dogs and our model determines whether the picture is a cat (target = 0) or a dog (target = 1), we probably care just as much about getting the cats right as the dogs, and so ROC is a good choice of metric.

If instead, we’ve got a collection of 1,000,000 pictures, and we build a model to try to identify the 1,000 dog pictures mixed in it, correctly identifying “not-dog” pictures is not quite as useful. Instead, it makes more sense to measure how often a picture is a dog when our model says it’s a dog (i.e., precision) and how many of the dogs in the picture set we found (i.e., recall). ”

**TL;DR: Use the F-measure if you do not care about the negative class. Use ROC/AUC, balanced accuracy, and Kappa if you care about the performance on both classes.**

For multi-class classification problems, there are two types of F-measure: micro-averaged and macro-averaged. The micro-averaged F-measure pools the counts from all the classes before computing precision and recall, while the macro-averaged F-measure computes precision and recall for each class separately and then averages the per-class scores.
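A small three-class sketch (labels made up) shows the difference via scikit-learn's `average` parameter; the macro score drops because the two minority classes count as much as the majority class:

```python
from sklearn.metrics import f1_score

# Imbalanced three-class toy labels: class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 0]

# Micro: pool TP/FP/FN across classes, then compute precision and recall once
micro = f1_score(y_true, y_pred, average="micro")  # dominated by the big class

# Macro: compute F1 per class, then take the unweighted mean
macro = f1_score(y_true, y_pred, average="macro")  # every class counts equally
print(micro, macro)
```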

For a complete review of all the metrics for classifications, see this paper.

If you read this article before Nov 13, there were two equation errors, for kappa and for Precision_micro and Recall_micro, which have now been corrected.