Make your R code faster: part 2

I know it is cliché to quote Knuth’s famous saying “premature optimization is the root of all evil” in anything related to optimization, but it can never be emphasized enough. It means that when you write code you should optimize in a top-down fashion: start with a program that does what you want, then profile it to find the slowest part and optimize that.

What Knuth did not point out is that there is a fine line between premature optimization and simply writing faster code. It helps to have some common sense about writing code that is fast, robust, and easy to maintain.

This series of blog posts does some common-sense benchmarks, comparing several frequently used functionals and loops in R in a very limited way.

A general rule of thumb in both R and Python is that simpler and shorter code is usually faster.
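As a quick toy illustration of that rule (this example is mine, not part of the benchmarks below), compare an explicit loop with the one-line vectorized call:


# toy example: the short, vectorized call is the fast one
x = rnorm(1e6)
loop_sum = function(v) {
  total = 0
  for (i in seq_along(v)) total = total + v[i]
  total
}
system.time(loop_sum(x))  # explicit loop: noticeably slower
system.time(sum(x))       # one short call to vectorized C code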

Benchmark and Profiling: the tools for measuring code performance

After you have some code that does what you want, you might want to run it on bigger data or run it thousands of times, and you may find yourself waiting a long time for the program to finish. Time for optimization!

First step: profiling

The first step in any optimization is to find where the code spends the most time by profiling. In R, you have the profvis package, built on top of Rprof from base R. The usage is really simple.


library(profvis)

profvis({
  # copy and paste your code here.
})

The code and output would look like this.

There is one small tricky thing with profvis: if you define a function to do a calculation, you need to include that function definition inside the profvis({}) call in order to get the detailed memory and time usage. The output from profvis is pretty straightforward, but if you need more detailed examples and explanations, the official documentation is really good:

https://rstudio.github.io/profvis/
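For instance, here is a minimal sketch with a made-up toy function (not code from this post's benchmarks), where the function definition sits inside the profvis({}) call so its lines show up in the profile:


library(profvis)

profvis({
  # define the function inside the profvis({}) call
  slow_colmean = function(data) {
    out = numeric(ncol(data))
    for (i in seq_len(ncol(data))) out[i] = mean(data[, i])
    out
  }
  df = as.data.frame(matrix(rnorm(5e6), ncol = 500))
  slow_colmean(df)
})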

Be aware that profvis may not be able to profile C/C++ code, some base R functions, or parallel computation.

Another reason I love the profvis package is that you can easily share the profiling results with others, either by sending them the output file with


htmlwidgets::saveWidget(p, "profile.html")  # p is the object returned by profvis()

or by publishing it on Rpubs.com like an R Markdown file.
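Putting the pieces together, here is a minimal sketch of that workflow (the toy workload inside profvis is my own assumption): capture the profvis output in an object, then save it as a self-contained HTML file you can send to others.


library(profvis)

p = profvis({
  x = matrix(rnorm(5e6), ncol = 100)  # assumed toy workload
  apply(x, 2, mean)
})

# write a self-contained HTML file that anyone can open in a browser
htmlwidgets::saveWidget(p, "profile.html")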

Second step: benchmark

Now you know which part of the code occupies the majority of your computation time. Generally, there is huge room for speed improvement in R.

You should first check whether your code suffers from the two major problems in R that I described in the first blog post in this series:

 https://phdstatsphys.wordpress.com/2017/12/20/make-your-r-code-runs-faster/

Next, you should check whether the operation you need already has a predefined function in base R or in a package on CRAN/Bioconductor.

The third step is to rewrite your functions in C/C++ using Rcpp. The last resort is to do the calculation in parallel, if possible.
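To give a flavour of the Rcpp route, here is a minimal sketch (rcpp_colmean is a made-up example, not the C code I mention later in this post): a column mean written in C++ and compiled on the fly from R.


library(Rcpp)

# compile a small C++ column-mean function on the fly
cppFunction('
NumericVector rcpp_colmean(NumericMatrix x) {
  int nc = x.ncol(), nr = x.nrow();
  NumericVector out(nc);
  for (int j = 0; j < nc; j++) {
    double total = 0;
    for (int i = 0; i < nr; i++) total += x(i, j);
    out[j] = total / nr;
  }
  return out;
}
')

rcpp_colmean(matrix(rnorm(20), nrow = 5))  # returns 4 column means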

By then, you will probably have several different ways to do the operation that takes the longest time in your calculation. It is benchmark time!

Doing a benchmark in R is extremely simple. Just load the microbenchmark package and benchmark.


library(microbenchmark)

microbenchmark(
  # enter your functions here, separated by commas like below.
  your_function1(data, params),
  your_function2(data, params),
  # ...... more of your functions.
  times = 100
)

Under the hood, the microbenchmark function executes each expression 100 times by default. You can change that with the times argument, for example times = 1000 or times = 10.

Another great feature of microbenchmark is that you could plot the execution time very easily. How easily? Let me show you.


library(microbenchmark)

benchmark_result = microbenchmark(your_function1(data, params),
                                  your_function2(data, params),
                                  times = 1000)
ggplot2::autoplot(benchmark_result)

The resulting plot would look something like this.

[Figure: benchmark1]

Example use

Grab a cup of coffee or tea and let us do a benchmark.

We are going to start with a simple example: calculate the mean value of each column in a data frame, assuming all columns are numeric.

First, we need to define the functions that calculate the column means and check that they actually give the correct results for a simple test case. Unit testing would be ideal, but I have not mastered that skill yet. If you know how to do that in R with the testthat package, that is the way to go (see the sketch after the code below).

Below is the R code for computing column means in several common ways, including the functional-programming family (lapply, apply, vapply, and map_dbl from the purrr package) and loops (for and foreach).

Benchmark for calculating column means of non-high-dimensional data.


lapply_colmean = function(data) {
  # a data frame is a list of its columns, so lapply(data, mean) gives column means
  unlist(lapply(data, mean))
}
apply_colmean = function(data) {
  apply(data, 2, mean)
}
map_colmean = function(data) {
  purrr::map_dbl(data, mean)
}
foreach_colmean = function(data) {
  library(foreach)
  foreach(i = 1:dim(data)[2], .combine = c) %do% { mean(data[, i]) }
}
for_colmean = function(data) {
  # pre-allocate the result vector
  mean_vector = rep(NA_real_, ncol(data))
  for (i in 1:ncol(data)) {
    mean_vector[i] <- mean(data[, i])
  }
  return(mean_vector)
}
vapply_colmean = function(data) {
  vapply(data, mean, numeric(1))
}
# confirm the functions return the correct result on a small test case.
test_dataframe = data.frame(1:10, 1:10, 1:10)
lapply_colmean(test_dataframe)
apply_colmean(test_dataframe)
map_colmean(test_dataframe)
foreach_colmean(test_dataframe)
for_colmean(test_dataframe)
vapply_colmean(test_dataframe)
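As mentioned above, a more formal check could use testthat. Here is a minimal sketch of what that might look like (my own suggestion, assuming the functions defined above), comparing every implementation against colMeans on the small test data frame:


library(testthat)

expected = unname(colMeans(test_dataframe))
test_that("all column-mean implementations agree with colMeans", {
  expect_equal(unname(lapply_colmean(test_dataframe)), expected)
  expect_equal(unname(apply_colmean(test_dataframe)), expected)
  expect_equal(unname(map_colmean(test_dataframe)), expected)
  expect_equal(foreach_colmean(test_dataframe), expected)
  expect_equal(unname(for_colmean(test_dataframe)), expected)
  expect_equal(unname(vapply_colmean(test_dataframe)), expected)
})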

Next, let us benchmark those functions on a 1000 by 100 data frame.


library(microbenchmark)
library(magrittr)
library(ggplot2)

row_num = 1000
col_num = 100
random_dataframe_colmean = matrix(rnorm(row_num * col_num), nrow = row_num, ncol = col_num) %>% as.data.frame()

microbenchmark(
  lapply_colmean(random_dataframe_colmean),
  vapply_colmean(random_dataframe_colmean),
  apply_colmean(random_dataframe_colmean),
  map_colmean(random_dataframe_colmean),
  foreach_colmean(random_dataframe_colmean),
  for_colmean(random_dataframe_colmean),
  colMeans(random_dataframe_colmean),
  times = 1000
) %>% autoplot

This is the result of the benchmark.

[Figure: benchmark1]

If the picture is not clear enough, you can see the complete code and results on my Rpubs:

 http://rpubs.com/edwardcooper/faster_r_codes_2

The conclusion from the benchmark: 

When the number of columns is smaller than the number of rows (non-high-dimensional data):
1) The speed ranking is vapply > lapply > colMeans > map > for > apply >> foreach.
2) The exceedingly slow speed of foreach is kind of surprising to me.
3) lapply is a remarkably fast functional: it is faster than colMeans, which is dedicated to calculating column means. It is even faster than the simple C I wrote. (Add Rcpp here later.)

Benchmark for calculating column means of high-dimensional data.

I will simply transpose the previous data to get a high-dimensional data set (more columns than rows).

Let us see how each method performs when calculating column means for high-dimensional data.


library(microbenchmark)
library(magrittr)
library(ggplot2)

# transpose the previous data frame; note that t() returns a matrix.
random_dataframe_colmean_t = t(random_dataframe_colmean)

microbenchmark(
  lapply_colmean(random_dataframe_colmean_t),
  vapply_colmean(random_dataframe_colmean_t),
  apply_colmean(random_dataframe_colmean_t),
  map_colmean(random_dataframe_colmean_t),
  foreach_colmean(random_dataframe_colmean_t),
  for_colmean(random_dataframe_colmean_t),
  colMeans(random_dataframe_colmean_t),
  times = 1000
) %>% autoplot

The resulting plot is:

[Figure: benchmark2]

The takeaway from this is that:

When the number of columns is bigger than the number of rows (high-dimensional data):
1) The speed ranking is colMeans >> for ≈ apply > foreach > map > lapply ≈ vapply.
2) lapply is the slowest in this situation.


Let us profile the code to see where the time is spent in the foreach_colmean and lapply_colmean calculations.


library(profvis)
library(magrittr)

row_num = 1000
col_num = 100
random_dataframe_colmean = matrix(rnorm(row_num * col_num), nrow = row_num, ncol = col_num) %>% as.data.frame()
# recreate the transposed, high-dimensional data used above.
random_dataframe_colmean_t = t(random_dataframe_colmean)

profvis({
  foreach_colmean = function(data) {
    library(foreach)
    foreach(i = 1:dim(data)[2], .combine = c) %do% { mean(data[, i]) }
  }
  foreach_colmean(random_dataframe_colmean_t)
}, interval = 0.05)

The profiling result is:

[Figure: profiling]


If the plot is not clear enough, you can look at my Rpubs link for all the results and code: http://rpubs.com/edwardcooper/faster_r_codes_2

The reason foreach is slower at this data size is that it does a lot of error handling behind the scenes. So you should never use foreach in your innermost loop; put it on the outer layers, where its error handling and parallel capabilities pay off, as in the sketch below.
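Here is a rough sketch of what I mean (the data_list, the two workers, and the doParallel back end are assumptions for illustration only): foreach sits on the outer loop, where its .errorhandling option and a parallel back end actually help, while the fast inner work is left to colMeans.


library(foreach)
library(doParallel)

cl = makeCluster(2)        # assumed: 2 worker processes
registerDoParallel(cl)

# assumed: a list of data frames, one per outer iteration
data_list = replicate(4,
                      as.data.frame(matrix(rnorm(1000 * 100), nrow = 1000)),
                      simplify = FALSE)

# foreach on the outer layer: errors in one iteration are dropped, the rest continue
results = foreach(df = data_list, .combine = rbind,
                  .errorhandling = "remove") %dopar% {
  colMeans(df)  # the inner work stays vectorized
}

stopCluster(cl)
dim(results)   # 4 rows (one per data frame) by 100 column means

With %dopar%, each outer iteration runs on its own worker, so the per-iteration overhead of foreach is amortized over a much larger chunk of work.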

TL;DR

The performance of different algorithms changes with data dimension and data size.

If there is a function dedicated to doing something, it is most likely the fastest method.

That is enough for one blog post. In the next post in this series, I am going to do a case study on speeding up an MSD (mean squared distance) calculation for more than 25 thousand atoms over 5000 timesteps at 17 temperatures, from 40 hours (even using data.table) down to 2 hours. How did I do that? It is a combination of choosing the right library and parallel programming.

The full code and results are at http://rpubs.com/edwardcooper/faster_r_codes_2.

I forgot to attach the session info at the end of the R Markdown file, so here it is.

[Figure: rsession]

