Choosing the right metric for your data science/machine learning project: class imbalance

A few days ago, I read a fantastic article on Medium about how to deliver on ML projects. It got me thinking about all the mistakes I made when I was first learning about machine learning. Back then, I was fascinated by the many different kinds of machine learning models available, and I learned my first lesson the hard way when I trained an XGBoost model on a highly imbalanced dataset using accuracy as the metric.

The lesson was that choosing the right metric is far more critical than selecting the right algorithm. The right metric can decide the success or failure of a business problem.

Metrics for imbalanced classification

Choosing the correct metric for an imbalanced classification problem is a crucial first step in building a model. Credit card fraud detection is where this problem is most profound, since the vast majority of transactions are legitimate.

If you use accuracy as the metric, you will likely get a really high score with any model. Imagine that out of 100 transactions, only 1 is fraudulent. A model that predicts every transaction to be legitimate has an accuracy of 99%, yet it is useless at best.
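
To make this concrete, here is a minimal sketch of that 100-transaction example, assuming scikit-learn is installed (the labels below are made up for illustration):

# The accuracy paradox: 1 fraudulent transaction out of 100
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1] + [0] * 99)    # 1 = fraud, 0 = legit
y_pred = np.zeros(100, dtype=int)    # a "model" that calls every transaction legit

print(accuracy_score(y_true, y_pred))  # 0.99, even though no fraud is ever caught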

What other metrics could you use? There are metrics like the F-measure, ROC/AUC, recall, precision, kappa, and balanced accuracy, each of which measures a different aspect of the model. The scikit-learn website lists even more metrics to choose from.

Different metrics

Precision asks: of the samples predicted to be positive, what percentage are truly positive? Expressed as a conditional probability, it is P( True label | Predicted True ).

Recall asks: of the samples whose true label is positive, what percentage are predicted to be positive? Expressed as a conditional probability, it is P( Predicted True | True label ). Recall is the same quantity as sensitivity in hypothesis testing.

The F1 score is the harmonic mean of precision and recall. F2 gives higher weight to recall, while F0.5 gives higher weight to precision. Why the harmonic mean instead of the arithmetic mean? The explanation below comes from StackOverflow.

Because the harmonic mean punishes extreme values more.

Consider a trivial method (e.g. always returning class A). There are infinite data elements of class B and a single element of class A:

Precision: 0.0
Recall: 1.0

When taking the arithmetic mean, it would have 50% correct. Despite being the worst possible outcome! With the harmonic mean, the F1-measure is 0.

Arithmetic mean: 0.5
Harmonic mean: 0.0

In other words, to have a high F1 score, you need both high precision and high recall.
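
Here is a quick sketch of these metrics in scikit-learn on a small, made-up imbalanced example (the toy labels and predictions are purely illustrative):

# Precision, recall, F1, and F-beta on a toy imbalanced example
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 1 = fraud (the positive class)
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # 2 frauds caught, 1 missed, 1 false alarm

print(precision_score(y_true, y_pred))        # 2/3: share of predicted frauds that are real
print(recall_score(y_true, y_pred))           # 2/3: share of real frauds that were caught
print(f1_score(y_true, y_pred))               # harmonic mean of precision and recall
print(fbeta_score(y_true, y_pred, beta=2))    # F2 weighs recall more heavily
print(fbeta_score(y_true, y_pred, beta=0.5))  # F0.5 weighs precision more heavily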

This is not the end of the story, since you still need to define which class is the positive one for precision and recall. For credit card fraud detection, you are more interested in the fraudulent cases than the legitimate ones, so the fraudulent transactions should be set as the positive class.

What is ROC/AUC? I have talked about how to understand ROC curves from the hypothesis testing perspective in previous blogs. In short, the ROC curve plots the True Positive Rate (recall) against the False Positive Rate.

True Positive Rate = P( Predicted True | True Label ) = Recall

False Positive Rate = P( Predicted True | False Label )

AUC is short for Area Under the Curve. Because the ROC/AUC metric takes the negative class into consideration, it can look inflated under class imbalance; but if you care about correct predictions for both classes, ROC/AUC is a better choice than accuracy.
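
Below is a rough sketch of computing ROC/AUC with scikit-learn; note that it takes predicted scores or probabilities rather than hard labels (the scores here are made up):

# ROC/AUC from predicted scores
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]  # model's predicted P(fraud)

print(roc_auc_score(y_true, y_score))        # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))                   # the (FPR, TPR) points that trace the curve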

Another alternative to ROC/AUC is balanced accuracy, which re-weights the accuracy of the negative and positive classes. It is an excellent complement to the ROC/AUC score.

Cohen’s Kappa is another metric that adjusts accuracy for imbalanced class problems. It is defined by

kappa = ( P_o - P_e ) / ( 1 - P_e )

where P_o is simply the observed accuracy and P_e is the hypothetical probability of agreement due to chance. The Wikipedia page here has some explanation of Kappa if you are not familiar with it. This PSU online class page is another source that explains the kappa metric.
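
As a quick sketch (again assuming scikit-learn), both balanced accuracy and Cohen's kappa expose the useless all-legit model from the 100-transaction example above:

# Balanced accuracy and Cohen's kappa on the all-legit model
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score

y_true = [1] + [0] * 99
y_pred = [0] * 100    # predict every transaction as legit

print(balanced_accuracy_score(y_true, y_pred))  # 0.5 -- no better than chance
print(cohen_kappa_score(y_true, y_pred))        # 0.0 -- no agreement beyond chance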

Some Discussion: F-measure vs. ROC/AUC

Some people on Kaggle have argued that the ROC curve should not be used in highly imbalanced classification problems, since the ROC curve is inflated by the class imbalance. For the scripts presented there, the AUC/ROC score indeed does not make sense, but that is not always true.

The statement from Kaggle below describes when you should use ROC/AUC versus precision and recall:

“True negatives need to be meaningful for ROC to be a good choice of measure. In his example, if we’ve got 1,000 pictures of cats and dogs and our model determines whether the picture is a cat (target = 0) or a dog (target = 1), we probably care just as much about getting the cats right as the dogs, and so ROC is a good choice of metric.

If instead, we’ve got a collection of 1,000,000 pictures, and we build a model to try to identify the 1,000 dog pictures mixed in it, correctly identifying “not-dog” pictures is not quite as useful. Instead, it makes more sense to measure how often a picture is a dog when our model says it’s a dog (i.e., precision) and how many of the dogs in the picture set we found (i.e., recall). ”

TL;DR: Use the F-measure if you do not care about the negative class, and use AUC/ROC, balanced accuracy, or Kappa if you care about the performance on both classes.

For multi-class classification problems, there are two types of F-measure: the micro-averaged and the macro-averaged F-measure. The micro-averaged F-measure calculates recall and precision from the pooled counts of all the classes, while the macro-averaged F-measure calculates recall and precision for each class and then averages them.

Precision_micro = ( TP_1 + … + TP_k ) / ( TP_1 + FP_1 + … + TP_k + FP_k )
Recall_micro = ( TP_1 + … + TP_k ) / ( TP_1 + FN_1 + … + TP_k + FN_k )
Precision_macro = ( Precision_1 + … + Precision_k ) / k
Recall_macro = ( Recall_1 + … + Recall_k ) / k
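
A small sketch of the difference in scikit-learn, on a made-up three-class example:

# Micro- vs. macro-averaged F1 for a multi-class problem
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

print(f1_score(y_true, y_pred, average='micro'))  # pools TP/FP/FN across all classes
print(f1_score(y_true, y_pred, average='macro'))  # averages the per-class F1 scores equally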

For a complete review of all the metrics for classification, see this paper.

 

 

If you read this article before Nov 13, there were two equation errors (for kappa, Precision_micro, and Recall_micro), which have now been corrected.


Set values in DataFrame with Boolean index in pandas

Have you ever come across the warning below when you want to set some values in a pandas DataFrame with a Boolean vector?


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1:

SettingWithCopyWarning: A value is trying to be set on a copy of
a slice from a DataFrame

See the caveats in the documentation:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

"""Entry point for launching an IPython kernel.

Does the warning appear when you run code similar to the lines below, setting values in the DataFrame by an index and a column name?


df.some_col_name[index] = "A value to set"

df.loc[index, "some_col_name"] = "A value to set"

The correct way to do this is to use .where and assign the result back:


df["some_col_name"] = df["some_col_name"].where(~index, other="A value to set")

If you use the .where syntax and get an error like


bad operand type for unary ~: 'float'

it is because the index is not Boolean; you need to convert the pandas Series into Boolean values using the code below.


index = index.astype("bool")

df["some_col_name"] = df["some_col_name"].where(~index, other="A value to set")
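
Putting it all together, here is a self-contained sketch (the column name and values are made up for illustration):

# Reproducing the fix end to end
import pandas as pd

df = pd.DataFrame({"some_col_name": ["x", "y", "z"]})
index = pd.Series([0.0, 1.0, 0.0])   # a 0/1 float mask, as often comes out of a model

index = index.astype("bool")         # without this cast, ~index raises the unary ~ error above
df["some_col_name"] = df["some_col_name"].where(~index, other="A value to set")
print(df)                            # only the row where the mask is True gets overwritten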

This is really annoying and very counter-intuitive if you are coming from R or Matlab (I suppose?).

Gonna add more pandas fixes to the blog as I learn along the way.

If you want to know more about SettingWithCopyWarning in pandas, you can check out this detailed blog post by Dataquest: https://www.dataquest.io/blog/settingwithcopywarning/

Quick summary: seaborn vs ggplot2

Visualization: The start and the end

Visualization is the initial, and an extremely crucial, part of any data analysis. It could be the exploratory data analysis at the beginning of predictive modeling or the end product of a monthly report.

There are several data visualization packages in Python and R. In R, we have the excellent ggplot2, which is based on the grammar of graphics. In Python, we have the amazing seaborn and matplotlib packages. All these packages approach plotting a little differently, but they all aim at plotting the same things. Once you know the graph you want to plot, it becomes easier to master them and to switch between them.

Of course, there are fancier plotting solutions available like d3 or plotly.

Another way to look at visualization:

If we compare data visualization to a Taylor expansion, one-variable visualizations are like the first-order terms, two-variable visualizations are like the second-order terms, and three-or-more-variable visualizations are like the higher-order terms.

Visualizing one variable

discrete data

We should use a barplot to count the number of instances in each category.

ggplot2: geom_bar

seaborn: sns.countplot

continuous data

We should use a histogram to accomplish this.

ggplot2: geom_histogram

seaborn: sns.distplot
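
Here is a minimal sketch of the one-variable plots above, using seaborn's bundled tips dataset (newer seaborn versions provide sns.histplot, where older versions used sns.distplot):

# One-variable plots in seaborn
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.countplot(x="day", data=tips)   # bar plot of counts for a discrete column
plt.show()

sns.histplot(tips["total_bill"])    # histogram for a continuous column
plt.show()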

Visualizing two variables

Two discrete data columns

Use a bar plot of counts, but color the bars by the second data column (I will talk about facets in the section on visualizing three or more variables).

ggplot2: geom_bar(aes(fill = name_of_another_data_column))

seaborn: sns.countplot(hue='name_of_another_data_column')

One discrete, one continuous data column

Use a boxplot for this.

ggplot2: geom_boxplot

seaborn: sns.boxplot

There are also swarmplot, stripplot, and violinplot for this type of job.

Two continuous data columns

We use a scatter plot for this.

ggplot2: geom_point

seaborn: sns.regplot, sns.jointplot(kind='scatter')
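
And here is a minimal sketch of the two-variable plots in seaborn, again on the tips dataset:

# Two-variable plots in seaborn
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.countplot(x="day", hue="sex", data=tips)     # two discrete columns: counts split by color
plt.show()

sns.boxplot(x="day", y="total_bill", data=tips)  # one discrete, one continuous column
plt.show()

sns.regplot(x="total_bill", y="tip", data=tips)  # two continuous columns: scatter plus a fit
plt.show()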

Visualizing three or more variables

Things are getting complicated here. There are generally two ways to accomplish this. The first method is to label things differently with different colors, sizes, shapes, etc. in one graph. The second method is to plot more graphs, with each graph visualizing some of the variables while keeping one variable constant.

Two discrete and one continuous data columns

This can be visualized by plotting one discrete and one continuous variable with a boxplot and using color or facets for the other discrete variable.

Two continuous and one discrete data columns

This can be visualized by plotting the two continuous variables with a scatter plot and using color or facets for the discrete variable.

In ggplot2:


library(ggplot2)
# the tips data frame ships with the reshape2 package: data(tips, package = "reshape2")
sp <- ggplot(data = tips, aes(x = total_bill, y = tip)) + geom_point()

# Divide by levels of "sex", in the vertical direction
sp + facet_grid(sex ~ .)

In seaborn, you could choose factorplot (renamed catplot in newer versions) or FacetGrid.


import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")  # the tips data also ships with seaborn
g = sns.FacetGrid(data=tips, row='sex')
g.map(sns.regplot, 'total_bill', 'tip')
plt.show()

 

Three continuous data columns

This needs a 3D scatter plot. It is not implemented in ggplot2 or seaborn; in Python, it needs matplotlib's mplot3d toolkit or another specialized package. See this documentation for Python.
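
For completeness, here is a rough sketch of a 3D scatter plot using matplotlib's mplot3d toolkit (the data are made-up random numbers):

# A basic 3D scatter plot with mplot3d
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib
import matplotlib.pyplot as plt
import numpy as np

x, y, z = np.random.rand(3, 50)          # three made-up continuous variables

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z)
plt.show()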

This presentation is a good example of how to plot more than two variables in R using ggplot2.

For advanced features like FacetGrid and factorplot in seaborn, see this blog for more examples.

This is a rather short summary and comparison of seaborn and ggplot2, and a discussion of how I view the data visualization process. I will add more examples in R/Python in the future.

Docker for data analysis: a hands-on tutorial (jupyter and rstudio-server)

Dependency, dependency, and dependency! Dependency is an evil being that used to haunt software development, and now it has come to data analysis. Out of the 14 million possibilities, there is one way to defeat it: Docker!

Depending on whether you will use rstudio-server or jupyter notebook, the following blog splits into two parts. One part is about how to run a Docker container for jupyter; the other is about how to run a container for rstudio-server, which works somewhat similarly.

Before we proceed, it is recommended that you create an account on DockerHub, a hosting service for Docker images. See the link here.

Jupyter notebook

After registering an account on DockerHub, you can search for candidate Docker images there. After a little searching and some trial and error, I found this container. It has R, Python 3, Julia, and most of the common data science packages pre-installed. See more details on its DockerHub page.

How do you use it? Simply type the command below into your terminal:


# --rm removes the container on exit; -p 8888:8888 publishes the notebook port to localhost
docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

After downloading the roughly 6GB Docker image, it will start the jupyter notebook, and you will see something like this.

[screenshots: terminal output from docker run, ending with the notebook login token]

On the second line of the second screenshot, you see a sentence “to login with a token: …..”. Simply click the http address, and you will see a jupyter notebook.

[screenshot: the jupyter notebook home page]

Then you have a jupyter notebook with support for R, Python 3, and Julia. Let us see if the most common Python packages are installed.

[screenshot: importing common Python packages in the notebook]

Let us test if most R packages are installed.

[screenshot: loading common R packages in the notebook]

It is worth noting that the data.table package is not installed.

Let me try to install my automl package, dplyr, and caret from GitHub.

[screenshot: installing packages from GitHub in the notebook]

It seems that there are some issues with installing from GitHub, but all of the CRAN installs work perfectly.

Multiple terminals

In order to download some files or make some changes to the Docker environment, you can open up another terminal with


# find the container name or ID with: docker ps
docker exec -it <container_name_or_id> bash

After you are done, you can use ctrl+p followed by ctrl+q to detach without interrupting the container.

Rstudio-server

For those of you who do not know Rstudio-server, it is like the jupyter notebook for R.

The image I found on the DockerHub is this.

To use it, you should first download it.


docker pull dceoy/rstudio-server

Then, you run the image with the command below:


# -v ${PWD}:/home/rstudio mounts the current directory into the container; -p 8787:8787 publishes the RStudio port
docker container run --rm -p 8787:8787 -v ${PWD}:/home/rstudio -w /home/rstudio dceoy/rstudio-server

Rstudio-server will not open a browser by itself; you need to type the address below into your browser.

http://127.0.0.1:8787/

Then, you will need to enter the username and password, both of which are rstudio.

Let us see if the common packages are installed.

[screenshot: loading common R packages in RStudio Server]

Then, let us try to install my automl package from GitHub.

[screenshots: installing the automl package from GitHub in RStudio Server]

It works perfectly. (I still have some dependency issues to resolve with some of the packages in my package, but the conflicts are printed here.)

 
