Quick summary: seaborn vs ggplot2

Visualization: The start and the end

Visualization is an initial and extremely crucial part of any data analysis. It could be the exploratory data analysis at the beginning of predictive modeling or the end product of a monthly report.

There are several data visualization packages in Python and R. In R, we have the excellent ggplot2, which is based on the grammar of graphics. In Python, we have the amazing seaborn and matplotlib packages. All of these packages approach plotting a little differently, but they are all aimed at producing the same kinds of graphs. Once you know which graph you want to plot, it becomes easier to master the packages and switch between them.

Of course, there are fancier plotting solutions available, like d3 or plotly.

Another way to look at visualization:

If we compare data visualization to a Taylor expansion, one-variable visualizations are like the first-order terms, two-variable visualizations are like the second-order terms, and visualizations of three or more variables are like the higher-order terms of the expansion.

Visualizing one variable

discrete data

We should use a barplot to count the number of instances in each category.

ggplot2: geom_bar

seaborn: sns.countplot
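For example, a minimal ggplot2 sketch (using the built-in diamonds data set as example data) would look something like this:


library(ggplot2)
# one discrete variable: count the diamonds in each cut category
ggplot(diamonds, aes(x = cut)) + geom_bar()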

continuous data

We should use a histogram to accomplish this.

ggplot2: geom_histogram

seaborn: sns.distplot
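With the same diamonds data, a minimal histogram sketch would be:


library(ggplot2)
# one continuous variable: the distribution of diamond prices
ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 30)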

Visualizing two variables

Two discrete data columns

Use a bar plot and map the second discrete column to color (I will talk about facets in the section on visualizing three or more variables).

ggplot2: geom_bar(aes(fill = name_of_another_data_column))

seaborn: sns.countplot(hue='name_of_another_data_column')
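For example, with the diamonds data again:


library(ggplot2)
# two discrete variables: cut counted within each clarity level
ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "dodge")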

One discrete and one continuous data column

Use a box plot for this.

ggplot2: geom_boxplot

seaborn: sns.boxplot

There are also swarmplot, stripplot, and violinplot for this type of job.
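A minimal box plot sketch with the diamonds data:


library(ggplot2)
# one discrete (cut) and one continuous (price) variable
ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot()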

Two continuous data columns

We use a scatter plot for this.

ggplot2: geom_point

seaborn: sns.regplot, sns.jointplot(kind='scatter')
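A minimal scatter plot sketch with the diamonds data:


library(ggplot2)
# two continuous variables: carat vs price
ggplot(diamonds, aes(x = carat, y = price)) + geom_point(alpha = 0.2)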

Visualizing three or more variables

Things get more complicated here. There are generally two ways to accomplish this. The first is to encode the extra variables within one graph using different colors, sizes, or shapes. The second is to plot several graphs (facets), each one holding one variable constant.

Two discrete and one continuous data columns

Visualize the discrete and continuous pair with a box plot, and use color or facets to encode the second discrete variable.
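A minimal sketch with the diamonds data, using color for the second discrete variable (the facet alternative is shown as a comment):


library(ggplot2)
# two discrete (cut, clarity) and one continuous (price) variable
ggplot(diamonds, aes(x = cut, y = price, fill = clarity)) + geom_boxplot()
# alternatively, facet on the second discrete variable:
# ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot() + facet_wrap(~ clarity)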

Two continuous and one discrete data column

Visualize the two continuous variables with a scatter plot, and use color or facets to encode the discrete variable.

In ggplot2:


library(ggplot2)
# the tips data set ships with the reshape2 package
data(tips, package = "reshape2")

sp <- ggplot(data = tips, aes(x = total_bill, y = tip)) + geom_point()

# Divide by levels of "sex", in the vertical direction
sp + facet_grid(sex ~ .)

In seaborn you could choose factorplot or FacetGrid.


import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')
g = sns.FacetGrid(data=tips, row='sex')
g.map(sns.regplot, 'total_bill', 'tip')

 

Three continuous data columns

This needs a 3D scatter plot, which is not implemented in ggplot2 or seaborn. In Python you can use matplotlib's mplot3d toolkit or other specialized packages; see the matplotlib documentation for details.

This presentation is a good example of how to visualize more than two variables in R using ggplot2.

For advanced features like FacetGrid and factorplot in seaborn, see this blog for more examples.

This is a rather short summary and comparison of seaborn and ggplot2, and a discussion of how I view the data visualization process. I will add more examples in R and Python in the future.


Docker for data analysis: a hands-on tutorial (jupyter and rstudio-server)

Dependency, dependency, and dependency! Dependency is an evil being that used to haunt software development, and now it has come to data analysis. Out of the 14 million possibilities, there is one way to defeat it. The answer is Docker!

Depending on whether you will use rstudio-server or jupyter notebook, this blog is split into two parts: one about how to run a Docker container for jupyter, and another about how to run a container for rstudio-server, which works in a somewhat similar way.

Before we proceed, it is recommended that you create an account on DockerHub, a hosting service for Docker images. See the link here.

Jupyter notebook

After registering an account on DockerHub, you can search for suitable Docker images there. After a little searching and trial, I found the jupyter/datascience-notebook image. It comes with R, Python 3, Julia, and most of the common data science packages pre-installed. See more details on its DockerHub page.

How to use it? Simply type the below command into your terminal:


docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

After downloading the roughly 6 GB Docker image, it starts the jupyter notebook server, and you will see something like this.

docker6

docker7.png

On the second line of the second picture, you see a sentence "to login with a token: ...". Simply click the http address and you will see a jupyter notebook.

jupyter

Then, you have the jupyter notebook with support for R, Python3, and Julia. Let us see if the most common packages are installed.

jupyter2

Let us test if most R packages are installed.

jupyter3

It is worth noting that the data.table package is not installed.

Let me try to install my automl package from GitHub, and the dplyr and caret packages from CRAN.

jupyter4.png

It seems that there are some issues with installing from GitHub, but all the CRAN installs work perfectly.

Multiple terminals.

In order to download some files or make some changes inside the Docker environment, you can open up another terminal in the running container (find its name or ID with docker ps) by:


docker exec -it <container_name_or_id> bash

After you are done, you can press Ctrl+p then Ctrl+q to leave without interrupting the container.

Rstudio-server

For those of you who do not know rstudio-server, it is like the jupyter notebook for R.

The image I found on DockerHub is dceoy/rstudio-server.

To use it, you should first pull the image:


docker pull dceoy/rstudio-server

Then, you run the image with the command below:


docker container run --rm -p 8787:8787 -v ${PWD}:/home/rstudio -w /home/rstudio dceoy/rstudio-server

Rstudio-server will not open a browser by itself; you need to type the address below into your browser.

http://127.0.0.1:8787/

Then, you will need to enter the username and password, both of which are rstudio.

Let us see if the common packages are installed.

rstudio

Then, let us try to install my automl package from Github.

rstudio2.png

rstudio3

It works perfectly. (I still have some dependency issues to resolve in my own package, but the conflicts are clearly printed.)

 


Make your R code faster: part 2

I know it is cliche to quote Knuth's famous saying "premature optimization is the root of all evil" in anything related to optimization, but it can never be emphasized enough. It means that you should optimize in a top-down fashion: start with a program that does what you want, then profile the code to find the slowest part and optimize that.

What Knuth did not point out is that there is a fine line between premature optimization and simply writing faster code. It helps to have some common sense about writing code that is fast, robust, and easy to maintain.

This series of blogs does some common-sense benchmarks to compare several frequently used functionals and loops in R, in a very limited way.

A general rule of thumb in both R and Python is that simpler and shorter code is faster.

Benchmark and profiling: the tools for measuring code performance

After you have some code that does what you want, you might want to run it on bigger data or run it thousands of times, and then you may find yourself waiting a long time for the program to finish. Time for optimization!

First step: profiling

The first step in any optimization is to find, by profiling, where the code spends the most time. In R, you have the profvis package, built on top of Rprof from base R. The usage is really simple.


library(profvis)

profvis({

# copy and paste your code here.

})

The code and output would look like this.

There is one small catch with profvis: if you define a function to do a calculation, you need to include that function definition inside the profvis({}) call in order to get its detailed memory and time usage. The output from profvis is pretty straightforward, but if you need more detailed examples and explanations, you can see the official documentation, which is really amazing.

https://rstudio.github.io/profvis/
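As a small illustration of this point, here is a sketch that defines a placeholder helper (col_sd, my own example) inside the profvis() call so that its memory and time usage are captured:


library(profvis)

profvis({
  # define the helper inside profvis() so its line-level timings are recorded
  col_sd <- function(df) vapply(df, sd, numeric(1))

  df <- as.data.frame(matrix(rnorm(1e7), ncol = 100))
  col_sd(df)
})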

Be aware that profvis may not be able to profile C/C++ code, some base R functions, or parallel computation.

Another reason I love the profvis package is that you can easily share the profiling results with others, either by saving the profvis object (p below) to an HTML file with


htmlwidgets::saveWidget(p, "profile.html")

or by publishing it on Rpubs.com like an R Markdown file.

Second step: benchmark

Now you know which part of the code takes up the majority of your computation time. Generally, there is huge room for speed improvement in R.

You should first check whether your code suffers from the two major pitfalls described in the first blog of this series:

 https://phdstatsphys.wordpress.com/2017/12/20/make-your-r-code-runs-faster/

Next, you should check whether the operation you need already has a predefined function in base R or in a package on CRAN/Bioconductor.

The third step is to rewrite the slow functions in C/C++ using Rcpp. The last resort is to do the calculation in parallel, if possible.
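For a flavor of the Rcpp route, a rough, unbenchmarked sketch of a column-mean kernel might look like the following (this is not the code used for the results in this post):


library(Rcpp)

cppFunction('
NumericVector col_means_cpp(NumericMatrix x) {
  int nr = x.nrow(), nc = x.ncol();
  NumericVector out(nc);
  for (int j = 0; j < nc; ++j) {
    double s = 0.0;
    for (int i = 0; i < nr; ++i) s += x(i, j);
    out[j] = s / nr;
  }
  return out;
}')

col_means_cpp(matrix(rnorm(1e4), ncol = 10))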

Then you will probably have several different ways to do the operations that take the longest time in your calculation. It is benchmark time!

Doing a benchmark in R is extremely simple: just load the microbenchmark package and benchmark.


library(microbenchmark)

microbenchmark(
  # enter your functions here, separated by commas like below
  your_function1(data, params),
  your_function2(data, params),
  # ...... more of your functions
  times = 100
)

Under the hood, the microbenchmark function executes each expression 100 times by default. You can change that with the times argument, e.g. times = 1000 or times = 10.

Another great feature of microbenchmark is that you can plot the execution times very easily. How easily? Let me show you.


library(microbenchmark)

benchmark_result=microbenchmark( your_function1(data,params), your_function2(data,params), times=1000)

ggplot2::autoplot(benchmark_result)

The resulting plot would look something like this.

benchmark1

Example use

Grab a cup of coffee or tea and let us do a benchmark.

We are going to start with a simple example: calculating the mean of each column in a data frame, assuming all columns are numeric.

First, we need to define the functions that calculate the column means and check that they actually give the correct results for a simple test case. Unit testing would be ideal, but I have not mastered that skill yet. If you know how to do that in R using the testthat package, that is the way to go; a minimal sketch is included after the code below.

Below is the R code for computing column means in several common ways, including the functional-programming family (lapply, apply, vapply, and map from the purrr package) and the loops for and foreach.

Benchmark for calculating column means for non-high dimensional data.


lapply_colmean = function(data){
  # data frames are lists of columns, so lapply iterates over columns
  unlist(lapply(data, mean))
}
apply_colmean = function(data){
  apply(data, 2, mean)
}
map_colmean = function(data){
  purrr::map_dbl(data, mean)
}
foreach_colmean = function(data){
  library(foreach)
  foreach(i = 1:dim(data)[2], .combine = c) %do% {mean(data[, i])}
}
for_colmean = function(data){
  # pre-allocate the result vector
  mean_vector = rep(NA_real_, ncol(data))
  for(i in 1:ncol(data)){
    mean_vector[i] <- mean(data[, i])
  }
  return(mean_vector)
}
vapply_colmean = function(data){
  vapply(data, mean, numeric(1))
}
# confirm that the functions return the correct result
test_dataframe = data.frame(1:10, 1:10, 1:10)
lapply_colmean(test_dataframe)
apply_colmean(test_dataframe)
map_colmean(test_dataframe)
foreach_colmean(test_dataframe)
for_colmean(test_dataframe)
vapply_colmean(test_dataframe)
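For reference, a minimal testthat version of the sanity check above might look like this (reusing the functions just defined):


library(testthat)

test_that("column-mean implementations agree on a simple case", {
  test_dataframe <- data.frame(a = 1:10, b = 1:10, c = 1:10)
  expected <- c(5.5, 5.5, 5.5)
  expect_equal(unname(lapply_colmean(test_dataframe)), expected)
  expect_equal(unname(vapply_colmean(test_dataframe)), expected)
  expect_equal(unname(apply_colmean(test_dataframe)), expected)
  expect_equal(unname(for_colmean(test_dataframe)), expected)
})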

Next, let us benchmark those functions on a 1000 × 100 data frame.


library(microbenchmark)
library(magrittr)
library(ggplot2)

row_num = 1000
col_num = 100

random_dataframe_colmean = matrix(rnorm(row_num * col_num), nrow = row_num, ncol = col_num) %>% as.data.frame()

microbenchmark(
  lapply_colmean(random_dataframe_colmean),
  vapply_colmean(random_dataframe_colmean),
  apply_colmean(random_dataframe_colmean),
  map_colmean(random_dataframe_colmean),
  foreach_colmean(random_dataframe_colmean),
  for_colmean(random_dataframe_colmean),
  colMeans(random_dataframe_colmean),
  times = 1000
) %>% autoplot

This is the result of the benchmark.

benchmark1

If the picture is not clear enough, you could see the complete code and result in my Rpubs.

 http://rpubs.com/edwardcooper/faster_r_codes_2

The conclusion from the benchmark: 

When the number of columns is smaller than the number of rows (non-high-dimensional data):
1) The speed ranking is vapply > lapply > colMeans > map > for > apply >> foreach.
2) The exceedingly slow speed of foreach is kind of surprising to me.
3) lapply is such a fast functional that it is even faster than the colMeans function, which is dedicated to calculating column means. It is even faster than the simple C I wrote. (Add Rcpp here later.)

Benchmark for calculating column means for high dimensional data.

I will just transpose the data to turn it into a high-dimensional data set (more columns than rows).

Let us see how each method performs when calculating column means for high-dimensional data.


library(microbenchmark)
library(magrittr)
library(ggplot2)

# transpose the previous data frame
random_dataframe_colmean_t = t(random_dataframe_colmean)

microbenchmark(
  lapply_colmean(random_dataframe_colmean_t),
  vapply_colmean(random_dataframe_colmean_t),
  apply_colmean(random_dataframe_colmean_t),
  map_colmean(random_dataframe_colmean_t),
  foreach_colmean(random_dataframe_colmean_t),
  for_colmean(random_dataframe_colmean_t),
  colMeans(random_dataframe_colmean_t),
  times = 1000
) %>% autoplot

The resulting plot is:

benchmark2

The takeaway from this is that:

When the number of columns is bigger than the number of rows (high-dimensional data):
1) The speed ranking is colMeans >> for ≈ apply > foreach > map > lapply ≈ vapply.
2) lapply is the slowest in this situation.

 

Let us profile the code to see where the time is spent in the foreach_colmean and lapply_colmean calculation.


library(profvis)
library(magrittr)
library(ggplot2)

row_num = 1000
col_num = 100

random_dataframe_colmean = matrix(rnorm(row_num * col_num), nrow = row_num, ncol = col_num) %>% as.data.frame()
random_dataframe_colmean_t = t(random_dataframe_colmean)

profvis({

  foreach_colmean = function(data){
    library(foreach)
    foreach(i = 1:dim(data)[2], .combine = c) %do% {mean(data[, i])}
  }
  foreach_colmean(random_dataframe_colmean_t)

}, interval = 0.05)

The profiling result is

profiling

 

The plot may not be clear enough; you can look at my Rpubs link for all the results and code: http://rpubs.com/edwardcooper/faster_r_codes_2

The reason foreach is slower at this data size is that it does a lot of error handling behind the scenes. So you should never use foreach in your innermost loops; put it in the outer layers, where its error handling and parallel execution pay off.

TL;DR

The performance of different algorithms changes with data dimension and data size.

If there is a function dedicated to the task, it is most likely the fastest option.

That is enough for one blog. In the next blog in this series, I am going to do a case study on speeding up an MSD (mean squared displacement) calculation for more than 25 thousand atoms over 5000 timesteps at 17 temperatures, from 40 hours (even using data.table) down to 2 hours. How did I do that? It is a combination of choosing the right library and parallel programming.

The full code and results are in  http://rpubs.com/edwardcooper/faster_r_codes_2.

I forgot to attach the session info at the end of the Markdown file. So here it is.

rsession

 

Make your R code faster : part 1

If you are coming from a C or Java background, you will definitely complain that R is slow. If you are new to programming, maybe you have heard people say that R code is slow.

In this blog, I hope to show you how to make your R code faster, since R has a very different language design compared to C/C++, Java, or even Python.

Disclaimer: The content of this blog is a combination of my two years of experience using R, the famous book The R Inferno, Advanced R by Hadley Wickham, and the online DataCamp class on writing efficient R code. Special thanks to Hadley for making his Advanced R book free online.

Common pitfalls to avoid

Never grow your data in R.

It is perhaps the number one thing you should consider when you are writing any R code.

Suppose we want to create a vector from 1 to 1000. How do we do that in R?

There are four possible ways:

* Add a number to the sequence using a for-loop.
* Pre-allocate a vector of size 1000 and change the values using a for-loop.
* Using seq function in base R.
* Use the colon ":" operator.

Let us compare the computation time for these four methods.

First, we will define the functions to do these calculations.

# Add a number to the sequence using a for loop.
growing=function(n){
  x=NULL
  for(i in 1:n){
  x=append(x,i)
  }
  return(x)
}

# pre-allocate a vector
preallocate=function(n){
  x=vector(mode="integer", length=n)
  for(i in 1:n){
    x[i]=i
  }
  return(x)
}

# Using the seq function in base R.
seq(1, 1000, by = 1)

# Use the colon ":" operator.
1:1000

Next, fire up the microbenchmark package to benchmark them.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(1000),
  preallocate(1000),
  seq(1,1000),
  1:1000,
  times=1000L
)%>%print()

b1

It is easy to see that the colon operator is the clear winner here.

The difference would be more significant if we are building a longer vector.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(10000),
  preallocate(10000),
  seq(1,10000),
  1:10000,
  times=1000L
)%>%print()

b2

Allocating new memory is computationally expensive, and the growing function re-allocates (and copies) the vector on every iteration. That is why the first method is so slow.

Pre-allocating memory avoids the expensive repeated allocation, but the preallocate function still makes roughly a thousand times more function calls than the seq and colon approaches. That is why the second method is also slow.

The difference between seq and the colon operator came as a surprise to me. Most likely it is because ":" is a primitive implemented directly in C, while seq() is a regular S3 generic that does argument checking and method dispatch before reaching the compiled code. Let me know in the comments below if you have a better explanation.

The difference between the preallocate function and seq is the difference between a non-vectorized and a vectorized approach.

Vectorize your calculation.

R is faster when you use a vectorized form of calculation. Luckily, most R functions do calculations in a vectorized fashion.

Suppose we want to multiply each element of a vector by 2. Let us define a function non_vector_mutiply2 that takes an input vector and multiplies every item by 2 in a loop.

non_vector_mutiply2=function(x){
  x_multi=vector(mode="integer", length=length(x))
  for(i in 1:length(x)){
    x_multi[i]=x[i]*2
  }
  return(x_multi)
}

non_vector_mutiply2(c(1,2))

 

Let us benchmark the above function against the vectorized "*" operator.

library(microbenchmark)
# use the fastest methods in generating a sequence
vector=1:1000
microbenchmark(
  non_vector_mutiply2(vector),
  vector*2,
  times=1000L
)

b3

The vectorized * operator is the clear winner here.

Another example of vectorized vs non-vectorized function performance.

Suppose we want to get the cumulative sum of a sequence. Let us benchmark the vectorized R function cumsum and a non-vectorized function.

non_vector_cumsum=function(x){
  x_multi=vector(mode="integer", length=length(x))
  x_multi[1]=x[1]
  for(i in 1:(length(x)-1) ){
    x_multi[i+1]=x_multi[i]+x[i+1]
  }
  return(x_multi)
}

non_vector_cumsum(c(2,4,8))

library(microbenchmark)
vector=1:1000
microbenchmark(
  non_vector_cumsum(vector),
  cumsum(vector),
  times = 100L
)

b4

Vectorized R function cumsum is the clear winner here.

R implements many vectorized calculations in optimized C and FORTRAN code. Thus, it is always a good idea to go with a function pre-defined in R. When in doubt, wrap your calculations in functions and benchmark them with the microbenchmark package.

If you ever want to profile your code in R, you can use the profvis package. You just need to wrap all the calculations you want to profile inside a profvis({}) call:


library(profvis)

profvis({

# put all your calculations here.

})

That concludes the two simple methods to make your R code faster.

 

You could see the full code and blog on my Rpubs (http://rpubs.com/edwardcooper/faster_r_codes).


Doing statistics in parallel with R

In this blog, I will talk about how I use R to parallelize stationarity hypothesis testing on time series data.

Main part: How to parallelize statistics calculations in R.

If your statistics calculation (hypothesis testing, parameter estimation, or stochastic process simulation, for example) takes a long time, you should parallelize it. How do you define "long"? I would say that if your calculation takes longer than 3 minutes (about the time for a cup of tea), then it needs a speed boost, which takes only a few simple steps in R. The same could be done in Python with Dask.

More often than not, the long calculation is a for-loop or a set of independent calculations, like cross-validation.

These are the steps to take to speed up your statistics calculation:

Step 1: replace your loop with a functional from the apply family (lapply, sapply, vapply, tapply) or from the map family in the purrr package.

If it is still longer than 3 mins, then take step 2.

Step 2: replace your loop with the foreach function so that the calculation can be parallelized.

The foreach function is very easy to learn, but it takes some time (an hour or so) to read the tutorial and figure out how to register a parallel backend with the doParallel package; a minimal sketch is shown below.
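For reference, here is a minimal sketch of registering a doParallel backend and running a foreach loop in parallel; the 4 workers and the toy column-mean task on mtcars are placeholder choices:


library(foreach)
library(doParallel)

cl <- makeCluster(4)      # use your number of physical cores here
registerDoParallel(cl)

col_means <- foreach(i = 1:ncol(mtcars), .combine = c) %dopar% {
  mean(mtcars[, i])
}

stopCluster(cl)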

Step 3: if you do not want to learn foreach and doParallel but still want to do statistics in parallel, I wrote a wrapper function on top of foreach and doParallel called map_pc, which is basically a parallelized version of the apply/map functions.

map_pc.png

You could easily import this function into R like this:

source("https://raw.githubusercontent.com/edwardcooper/lammps/master/map_pc.R")

Here is an example of using map_pc to calculate the sum of each column.

First, we make a data frame with 10 columns and 10 million rows drawn from a Gaussian distribution with mean zero and variance one.

random_data=data.frame(matrix(rnorm(1e8),ncol=10,nrow=1e7))

Second, we apply the map_pc function to calculate the sum of each column.

cumsum_result = map_pc(data = random_data, sum)

Finally, we can examine the result with the summary function.

summary(cumsum_result)

Here is a brief explanation of the arguments of the map_pc function.

The first argument is the data; the second is the function to apply to the data column-wise. The third argument is nthread, the number of threads to use. The fourth argument is the cluster type: choose FORK on Linux or Mac, and PSOCK if you are on Windows (I did not test it on Windows).

Here are some examples of how I use map_pc to run the stationarity test in parallel: https://github.com/edwardcooper/lammps/blob/master/st_extract_cal.R

This is another reason why you should always write any calculations into functions as I mentioned in a previous blog ( https://phdstatsphys.wordpress.com/2017/09/05/benefits-of-writing-functions/ ).

On a side note, I do know about the mclapply and clusterApply families of functions (https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/clusterApply.html), but writing parallel calculations myself is a lot of fun. Plus, if there is any problem with my code, I can easily fix it. There is also the benefit of choosing your favorite parallel backend.

How to further improve R code performance

Now you have tried the methods mentioned above, but the calculation time is still longer than 3 minutes. Are there any other options? Of course; R is a rather flexible language. If you run into memory limitations, you can use the bigmemory, ff, and many more packages to work with data that is too big to be loaded into memory.

If your statistics calculation is slow because the underlying code is poorly written, you can write your own functions in C/C++ and call them from R; see the tutorial from Hadley Wickham here: http://adv-r.had.co.nz/Rcpp.html.

If your data comes from several sources or lives in a database, you can try the sergeant package to easily load data from various sources. It uses Apache Drill as the backend.

Another thing about parallel computing is knowing when to turn off hyperthreading, which I have already discussed in detail in another blog (https://phdstatsphys.wordpress.com/2017/09/12/hyperthreading-faster-or-slower/). The take-home message is to set the number of cores to use equal to the number of physical cores you have. That is why I set the default value of nthread in my map_pc function to 4, which is the number of physical cores on a typical desktop.
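To check how many physical cores your machine has, something like this should work (counting physical cores is not supported on every platform):


library(parallel)
detectCores(logical = FALSE)  # physical cores only
detectCores()                 # logical cores, including hyperthreads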

These are the times I recorded for running the stationarity test with different numbers of cores.

Screenshot from 2017-09-22 17-10-38

You can see clearly that it runs fastest when nthread is set to 4 in the map_pc function.

Of course, the next step to further increase the speed is to use a computing cluster like AWS.
