In this blog, I will talk about how I use R to parallelize stationarity hypothesis testing on time series data.
Main part: How to parallelize statistics calculations in R.
If your statistics calculation, like hypothesis testing, parameter estimation, or stochastic process simulation, takes a long time, you should parallelize it. How do you define "long"? I personally think that if a calculation takes longer than 3 minutes (about the time for a cup of tea), then it needs a speed boost, which only takes some simple steps in R. The same could be done in Python with Dask.
More often than not, the long calculation is a for-loop, or several independent calculations like cross-validation.
These are the steps to take to speed up your statistics calculation:
Step 1, replace your loop with a functional from the apply family like lapply, sapply, vapply, or tapply, or a functional from the map family in the purrr package.
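As a minimal sketch of what this replacement looks like (the data here is made up for illustration), here is a for-loop computing column means next to its sapply equivalent:

```r
set.seed(1)
# A small example data frame: 10 columns of standard-normal draws
df <- as.data.frame(matrix(rnorm(1000), ncol = 10))

# Loop version: manual bookkeeping of indices and the result vector
means_loop <- numeric(ncol(df))
for (i in seq_along(df)) means_loop[i] <- mean(df[[i]])

# Functional version: one call, no bookkeeping
means_apply <- sapply(df, mean)

# purrr equivalent: means_purrr <- purrr::map_dbl(df, mean)

all.equal(unname(means_apply), means_loop)  # TRUE
```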
If it still takes longer than 3 minutes, take step 2.
Step 2, replace your loop with the foreach function, so that the calculation can be parallelized.
The foreach function is very easy to learn, but it takes some time, an hour or so, to read the tutorial and figure out how to work with the doParallel package to register a parallel backend.
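A minimal sketch of the foreach plus doParallel pattern, computing column sums in parallel (the worker count and data are just examples):

```r
library(foreach)
library(doParallel)

# Register a parallel backend; 2 workers here as an example
cl <- makeCluster(2)
registerDoParallel(cl)

set.seed(1)
df <- as.data.frame(matrix(rnorm(1000), ncol = 10))

# Each iteration computes one column's sum; .combine = c flattens the results
col_sums <- foreach(i = seq_along(df), .combine = c) %dopar% sum(df[[i]])

# Always release the workers when done
stopCluster(cl)
```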
Step 3, if you do not want to learn foreach and doParallel but still want to do statistics in parallel, I wrote a wrapper function on top of the foreach and doParallel packages called map_pc, which is basically a parallelized version of an apply/map function.
You could easily import this function into R like this:
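A hedged sketch of the import, assuming the function has been downloaded from the GitHub repo linked below and saved locally as map_pc.R (the exact file name and location are assumptions, not specified in the post):

```r
# Assuming map_pc.R has been downloaded from the author's GitHub repo;
# the local file name is an assumption.
source("map_pc.R")
```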
Here is an example of using map_pc to calculate the sum of each column.
First, we make a data frame with 10 columns and 10 million rows, drawn from a Gaussian distribution with mean 0 and variance 1.
Second, we apply the map_pc function to calculate the sum of each column.
Finally, we can examine the result with the summary function.
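The three steps above can be sketched as follows. Since the real map_pc lives in the linked repo, the map_pc_sketch below is my own minimal stand-in built on foreach and doParallel, following the argument order described next (data, function, nthread, type); it is an assumption, not the author's actual code, and a smaller row count is used here so the sketch runs quickly:

```r
library(foreach)
library(doParallel)

# A minimal stand-in for a map_pc-style wrapper (the real function lives in
# the author's repo); arguments follow the post's description:
# data, a function, nthread, and the cluster type.
map_pc_sketch <- function(data, fn, nthread = 4, type = "PSOCK") {
  cl <- makeCluster(nthread, type = type)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))
  # Apply fn to each column in parallel and collect results into a vector
  foreach(i = seq_along(data), .combine = c) %dopar% fn(data[[i]])
}

# Step 1: a data frame of standard-normal draws
# (the post uses 10 columns x 10 million rows; fewer rows here for speed)
set.seed(42)
df <- as.data.frame(matrix(rnorm(1e5 * 10), ncol = 10))

# Step 2: the sum of each column, computed in parallel
col_sums <- map_pc_sketch(df, sum, nthread = 2, type = "PSOCK")

# Step 3: examine the result
summary(col_sums)
```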
Here is a brief explanation of the arguments of the map_pc function.
The first argument is the data. The second argument is the function to apply to the data column-wise. The third argument, nthread, is the number of threads to use. The fourth argument is the type: choose FORK on Linux or Mac, and PSOCK if you are on Windows (I did not test it on Windows).
Here are some examples of how I use map_pc to run stationarity tests in parallel: https://github.com/edwardcooper/lammps/blob/master/st_extract_cal.R
This is another reason why you should always wrap your calculations in functions, as I mentioned in a previous blog ( https://phdstatsphys.wordpress.com/2017/09/05/benefits-of-writing-functions/ ).
On a side note, I do know of the mclapply and clusterApply families of functions (https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/clusterApply.html), but writing parallel calculations myself is a lot of fun for me. Plus, if there is any problem with my code, I can easily fix it. There is also the benefit of choosing your favorite parallel backend.
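For completeness, a short sketch of those built-in alternatives from the parallel package (the data and worker count are examples):

```r
library(parallel)

set.seed(1)
df <- as.data.frame(matrix(rnorm(1000), ncol = 10))

# clusterApply family: works on all platforms via a PSOCK cluster
cl <- makeCluster(2)
sums_ps <- parSapply(cl, df, sum)
stopCluster(cl)

# mclapply: fork-based; use mc.cores = 1 on Windows, where forking
# is not supported
sums_mc <- mclapply(df, sum, mc.cores = 2)
```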
How to further improve R code performance
Now suppose you have tried the methods I mentioned above, but the calculation still takes longer than 3 minutes. Are there any other options? Of course, R is a rather flexible language. If you run into memory limitations, you could use the bigmemory, ff, and many other packages to work with data that is too big to be loaded into memory.
If your statistics calculation is slow because the code itself is poorly written, then you could write your own functions in C/C++ and use them in R; see the tutorial from Hadley Wickham here: http://adv-r.had.co.nz/Rcpp.html.
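As a tiny illustration of that approach (the function here is made up for the example), Rcpp's cppFunction compiles a C++ function and makes it callable from R:

```r
library(Rcpp)

# Compile a small C++ function and expose it to R; a cumulative sum is used
# purely as an illustration of moving a slow inner loop to C++
cppFunction('
NumericVector cumsum_cpp(NumericVector x) {
  NumericVector out(x.size());
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i];
    out[i] = total;
  }
  return out;
}')

all.equal(cumsum_cpp(c(1, 2, 3)), c(1, 3, 6))  # TRUE
```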
If your data comes from several sources or lives in a database, then you could try the sergeant package to easily load data from various sources. It utilizes Apache Drill as the backend.
Another thing to know when doing parallel computing is when to turn off hyperthreading, which I have already discussed in detail in another blog (https://phdstatsphys.wordpress.com/2017/09/12/hyperthreading-faster-or-slower/). The take-home message is to set the number of cores used to the number of physical cores you have. That is why I set the default value of nthread in my map_pc function to 4, which is the number of physical cores on a typical desktop.
These are the times I recorded for running the stationarity test with different numbers of cores.
You can see clearly that the map_pc function runs fastest when nthread is set to 4.
Of course, the next step to further increase the speed is to use a computing cluster like AWS.