Make your R code faster: part 1

If you are coming from a C or Java background, you will probably complain that R is slow. If you are new to programming, you may have heard people say that R code is slow.

In this blog post, I hope to show you how to make your R code faster. R has a very different language design from C/C++, Java or even Python, so the techniques for speeding it up are different too.

Disclaimer: the content of this blog post combines my two years of experience using R, the famous book The R Inferno, Advanced R by Hadley Wickham, and the DataCamp course on writing efficient R code. Special thanks to Hadley for making his Advanced R book free online.

Common pitfalls to avoid

Never grow your data in R.

It is perhaps the number one thing you should consider when you are writing any R code.

Suppose we want to create a vector from 1 to 1000. How do we do that in R?

There are four possible ways:

* Grow a vector by appending one number at a time inside a for loop.
* Pre-allocate a vector of length 1000 and fill in the values with a for loop.
* Use the seq function in base R.
* Use the colon ":" operator.

Let us compare the computation time for these four methods.

First, we will define the functions to do these calculations.

# Grow a vector by appending one number at a time inside a for loop.
growing=function(n){
  x=NULL
  for(i in 1:n){
    x=append(x,i)  # each append() copies x into a new, larger vector
  }
  return(x)
}

# pre-allocate a vector
preallocate=function(n){
  x=vector(mode="integer", length=n)
  for(i in 1:n){
    x[i]=i
  }
  return(x)
}

# Use the seq function in base R.
seq(1,1000,by=1)

# Use the colon ":" operator.
1:1000

Next, fire up the microbenchmark package to benchmark them.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(1000),
  preallocate(1000),
  seq(1,1000),
  1:1000,
  times=1000L
)%>%print()

[microbenchmark output for n = 1000]

It is easy to see that the colon operator is the clear winner here.

The difference becomes even more significant when we build a longer vector.

library(microbenchmark)
library(magrittr)
microbenchmark(
  growing(10000),
  preallocate(10000),
  seq(1,10000),
  1:10000,
  times=1000L
)%>%print()

[microbenchmark output for n = 10000]

Allocating memory in R is expensive. The growing function asks for a new, larger block of memory on every iteration and copies the entire vector into it, which is why the first method is so slow.

Pre-allocating the memory avoids that repeated allocate-and-copy cycle. But the preallocate function still makes a thousand R-level function calls and assignments, one per element, while seq and the colon operator each make a single call into compiled code. That is why the second method is also slow.
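
Because every append() copies the entire vector built so far, the total work grows quadratically with n. A quick sanity check with base R's system.time (exact timings will vary by machine, but the ratio should be roughly 4x):

system.time(growing(10000))
system.time(growing(20000))  # roughly 4x the time of growing(10000): quadratic growth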

The size of the difference between seq and the colon operator came as a surprise to me. My best guess is that seq is a regular R function that has to dispatch on and validate its arguments at the R level before reaching compiled code, while the colon operator is a primitive implemented directly in C with almost no overhead.
Let me know in the comments below if you know a better explanation.
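
If the R-level overhead of seq is the culprit, then the internal base R variants seq.int and seq_len, which skip most of that overhead, should land much closer to the colon operator. A quick sketch to test this hypothesis:

library(microbenchmark)
microbenchmark(
  seq(1,1000),     # generic: dispatches to seq.default and validates arguments in R
  seq.int(1,1000), # internal version, skips most of the R-level work
  seq_len(1000),   # primitive with minimal overhead
  1:1000,          # primitive colon operator
  times=1000L
)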

The gap between the preallocate function and seq is the difference between non-vectorized and vectorized code, which brings us to the next point.

Vectorize your calculation.

R is faster when you write calculations in vectorized form. Luckily, most base R functions already operate on whole vectors at once.

Suppose we want to multiply each element of a vector by 2. Let us define a function non_vector_multiply2 that takes an input vector and multiplies every element by 2.

non_vector_multiply2=function(x){
  x_multi=vector(mode="integer", length=length(x))
  for(i in 1:length(x)){
    x_multi[i]=x[i]*2
  }
  return(x_multi)
}

non_vector_multiply2(c(1,2))
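
The vectorized equivalent needs no loop at all: the "*" operator recycles the scalar 2 across the whole vector in compiled code.

c(1,2)*2  # returns 2 4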


Let us benchmark the above function against the vectorized "*" operator.

library(microbenchmark)
# use the fastest method of generating a sequence
v=1:1000
microbenchmark(
  non_vector_multiply2(v),
  v*2,
  times=1000L
)

[microbenchmark output: non_vector_multiply2 vs. *]

The vectorized "*" operator is the clear winner here.

Here is another example of vectorized vs. non-vectorized performance.

Suppose we want the cumulative sum of a sequence. Let us benchmark the vectorized R function cumsum against a hand-written, non-vectorized version.

non_vector_cumsum=function(x){
  x_cumsum=vector(mode="numeric", length=length(x))
  x_cumsum[1]=x[1]
  for(i in seq_len(length(x)-1)){  # seq_len() handles the length-1 edge case safely
    x_cumsum[i+1]=x_cumsum[i]+x[i+1]
  }
  return(x_cumsum)
}

non_vector_cumsum(c(2,4,8))
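
A quick check that our loop agrees with the built-in function:

cumsum(c(2,4,8))  # 2 6 14, the same as non_vector_cumsum(c(2,4,8))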

library(microbenchmark)
v=1:1000
microbenchmark(
  non_vector_cumsum(v),
  cumsum(v),
  times = 100L
)

[microbenchmark output: non_vector_cumsum vs. cumsum]

The vectorized R function cumsum is the clear winner here.

R implements many vectorized calculations in optimized C and FORTRAN code under the hood. Thus, it is almost always a good idea to reach for a pre-defined R function before writing your own loop. When in doubt, wrap each candidate calculation in a function and benchmark them with the microbenchmark package, as sketched below.
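
For example, a minimal benchmarking workflow might look like this (sum_loop and sum_builtin are hypothetical names for two candidate implementations of the same calculation):

library(microbenchmark)
# Candidate 1: a hand-written loop
sum_loop=function(x){
  total=0
  for(v in x) total=total+v
  return(total)
}
# Candidate 2: the pre-defined, vectorized base R function
sum_builtin=function(x) sum(x)

x=runif(10000)
microbenchmark(sum_loop(x), sum_builtin(x), times=100L)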

If you ever want to profile your code in R, you can use the profvis package. You just need to wrap every calculation you want to profile inside a profvis() call:

library(profvis)
profvis({

# put all your calculations here.

})
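
For instance, profiling the slow growing pattern from earlier should highlight the append() line as the hot spot (a small sketch; profvis opens an interactive flame graph in RStudio or your browser):

library(profvis)
profvis({
  x=NULL
  for(i in 1:10000){
    x=append(x,i)  # this line should dominate the profile
  }
})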

That concludes these two simple ways to make your R code faster: never grow your data, and vectorize your calculations.


You can see the full code and blog post on my RPubs (http://rpubs.com/edwardcooper/faster_r_codes).


Reproducible data analysis guide

I am a big supporter of reproducible research, as I have wasted a lot of time trying to reproduce others' work. It is a painful experience.

The British Ecological Society recently published a guide to reproducible research.

Even though the original target audience is people doing ecological research, the principles apply to any discipline. If you do any data analysis work, in academia or industry, I think you would benefit from reading it.

The article mainly talks about the use of R and Python in reproducible research, but its guidelines apply to analysis done in any language.


Enough with my mumbo jumbo; take a look for yourself.


http://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf