Fix error: unable to find variable “optimismBoot”

Error in Parallel model training in caret

Today is September 30, 2017. The error was reported on Github: https://github.com/topepo/caret/issues/706.

If you have recently tried to train a machine learning model with the train function from the caret package in R, you have likely run into an error like this:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "optimismBoot"

The problem lies within the caret package and appears when you work with a parallel backend.

The solution is to reinstall the caret package from GitHub, where Max Kuhn has fixed the problem. The fix is not on CRAN yet.

If you do not already have the devtools package, install it first.

install.packages("devtools")

If you already have devtools, just install caret from GitHub.

devtools::install_github('topepo/caret/pkg/caret')

That should fix the problem.

Fix for GAM model prediction in R

You need to load the mgcv library before making any prediction with a GAM model in R.

library(mgcv)
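As a minimal sketch of the point above, the following fits a small GAM on simulated data and then predicts on new values with mgcv attached. The toy data, the variable names and the model formula are assumptions made for illustration, not the model from my own work.

```r
# mgcv provides gam() and the predict method for fitted GAMs
library(mgcv)

set.seed(1)
# Simulated data: a noisy sine curve (illustrative only)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)

# Fit a GAM with a smooth term in x
fit <- gam(y ~ s(x), data = toy)

# Predict on new data; with mgcv loaded, predict() dispatches to predict.gam
new_x <- data.frame(x = c(0.25, 0.5, 0.75))
pred <- predict(fit, newdata = new_x)
print(pred)
```

If the model was saved in one R session and loaded in another, the library(mgcv) call is the part that has to be repeated before calling predict.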

 

Latex on Ubuntu

If you have not installed LaTeX on your Linux distribution, the preferred way is to install texlive so that knitr can compile your code and plots into a PDF.

sudo apt-get install texlive-full

I got the solution from here: http://milq.github.io/install-latex-ubuntu-debian/

Doing statistics in parallel with R

In this blog post, I will talk about how I use R to parallelize stationarity hypothesis testing on time series data.

Main part: How to parallelize statistics calculations in R.

If a statistics calculation such as hypothesis testing, parameter estimation or stochastic process simulation takes a long time, you should parallelize it. How do you define “long”? Personally, I think that if a calculation takes longer than 3 mins (about the time for a cup of tea), it needs a speed boost, which takes only a few simple steps in R. The same can be done in Python with Dask.

More often than not, the long calculation is a for-loop or a set of independent calculations such as cross-validation.

These are the steps to take to speed up your statistics calculation:

Step 1: replace your loop with a functional from the apply family, like lapply, sapply, vapply or tapply, or with a functional from the map family in the purrr package.
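Here is a small sketch of step 1, using only base R. The toy data frame and the choice of mean as the statistic are assumptions for illustration; the point is only the shape of the rewrite from a loop to a functional.

```r
# Toy data: 5 numeric columns (illustrative only)
set.seed(42)
df <- data.frame(matrix(rnorm(5000), ncol = 5))

# Loop version: preallocate, then fill one element per iteration
loop_means <- numeric(ncol(df))
for (i in seq_along(df)) {
  loop_means[i] <- mean(df[[i]])
}

# Functional version: vapply applies mean() to every column and
# checks that each result is a single number
apply_means <- vapply(df, mean, FUN.VALUE = numeric(1))

# Both approaches give the same numbers
print(all.equal(loop_means, unname(apply_means)))
```

Once the body of the loop is a function that takes one element and returns one result, moving to a parallel version in step 2 is mostly a matter of swapping the functional.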

If it is still longer than 3 mins, then take step 2.

Step 2: replace your loop with the foreach function, so that the calculation can be parallelized.

The foreach function is very easy to learn, but it takes about an hour to read the tutorial and figure out how to register a parallel backend with the doParallel package.
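A minimal sketch of step 2 could look like the following. It assumes the foreach and doParallel packages are installed; the four toy series and the use of two workers are illustrative choices, not recommendations.

```r
library(foreach)
library(doParallel)

# Start 2 worker processes and register them as the parallel backend
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# Four toy series (illustrative only)
set.seed(1)
x <- lapply(1:4, function(i) rnorm(1000))

# Each iteration runs on a worker; .combine = c collects the results
# into one vector
means <- foreach(series = x, .combine = c) %dopar% mean(series)

# Always release the workers when done
parallel::stopCluster(cl)
print(means)
```

Replacing %dopar% with %do% runs the same loop sequentially, which is handy for debugging.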

Step 3: if you do not want to learn foreach and doParallel but still want to do statistics in parallel, I wrote a wrapper function on top of foreach and doParallel called map_pc, which is basically a parallelized version of the apply/map functions.

[Image: map_pc.png]

You could easily import this function into R like this:

source("https://raw.githubusercontent.com/edwardcooper/lammps/master/map_pc.R")

Here is an example that uses map_pc to calculate the sum of each column.

First, we make a data frame with 10 columns and 10 million rows, filled with draws from a Gaussian distribution with mean 0 and variance 1.

random_data=data.frame(matrix(rnorm(1e8),ncol=10,nrow=1e7))

Second, we apply the map_pc function to calculate the sum of each column.

cumsum_result=map_pc(data=random_data,sum)

Finally, we can examine the result with the summary function.

summary(cumsum_result)

Here is a brief explanation of the arguments of the map_pc function.

The first argument is data. The second argument is the function to apply to the data column-wise. The third argument is nthread, the number of threads to use. The fourth argument is type: choose FORK on Linux or Mac, and PSOCK if you are on Windows (I did not test it on Windows).
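To make the four arguments concrete, here is a hypothetical sketch of how such a wrapper could be built on foreach and doParallel. The name map_pc_sketch and its body are assumptions for illustration only; the real map_pc in my repository may be implemented differently.

```r
library(foreach)
library(doParallel)

# Hypothetical re-implementation for illustration; the real map_pc may differ
map_pc_sketch <- function(data, fun, nthread = 4, type = "FORK") {
  cl <- parallel::makeCluster(nthread, type = type)
  # Stop the workers and fall back to sequential mode on exit
  on.exit({
    parallel::stopCluster(cl)
    foreach::registerDoSEQ()
  }, add = TRUE)
  doParallel::registerDoParallel(cl)
  # Apply fun to each column on the workers and combine into a vector
  foreach(i = seq_along(data), .combine = c) %dopar% fun(data[[i]])
}

# Toy data frame (illustrative only)
toy <- data.frame(a = 1:3, b = 4:6)
# PSOCK works on every platform; FORK is Linux/macOS only
col_sums <- map_pc_sketch(toy, sum, nthread = 2, type = "PSOCK")
print(col_sums)
```

The sketch uses two PSOCK workers in the example call so that it runs on any operating system; on Linux or Mac the FORK default avoids copying the data to each worker.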

Here are some examples of how I use map_pc to run stationarity tests in parallel: https://github.com/edwardcooper/lammps/blob/master/st_extract_cal.R

This is another reason why you should always write any calculations into functions as I mentioned in a previous blog ( https://phdstatsphys.wordpress.com/2017/09/05/benefits-of-writing-functions/ ).

On a side note, I do know of the mclapply and clusterApply families of functions (https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/clusterApply.html), but writing parallel calculations myself is a lot of fun for me. Plus, if there is any problem with my code, I can easily fix it. There is also the benefit of choosing your favorite parallel backend.

How to further improve R code performance

Now you have tried the methods I mentioned above, but the calculation time is still longer than 3 mins. Are there any other options? Of course, R is a rather flexible language.

If you run into memory limitations, you can use bigmemory, ff and many other packages to work with data that is too big to be loaded into memory.

If your statistics calculation is slow because the source code is poorly written, you can write your own functions in C/C++ and use them in R; see the tutorial from Hadley Wickham here: http://adv-r.had.co.nz/Rcpp.html.

If your data comes from several sources or sits in a database, you can try the sergeant package to easily load data from various sources. It uses Apache Drill as the backend.

Another part of doing parallel computing is knowing when to turn off hyperthreading, which has already been discussed in detail in another blog post (https://phdstatsphys.wordpress.com/2017/09/12/hyperthreading-faster-or-slower/). The take-home message is to set the number of cores to use to the number of physical cores you have. That is why I set the default value of nthread in my map_pc function to 4, which is the number of physical cores on a typical desktop.
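If you do not know how many physical cores your machine has, the parallel package can try to find out. The sketch below is a hedged example: detectCores(logical = FALSE) is not honored on every operating system and can return NA, so the fallback logic and the variable name n_threads are my own assumptions.

```r
library(parallel)

# Logical cores include hyperthreads; physical cores do not
logical_cores <- detectCores(logical = TRUE)
physical_cores <- detectCores(logical = FALSE)

# Fall back to the logical count, then to 1, if detection fails
n_threads <- physical_cores
if (is.na(n_threads)) n_threads <- logical_cores
if (is.na(n_threads)) n_threads <- 1L
print(n_threads)
```

The resulting n_threads is the kind of value you would pass as nthread instead of relying on a hard-coded default.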

This is the time I recorded when using different numbers of cores to run the stationarity test.

[Image: screenshot of the timing results, 2017-09-22]

You could see clearly that when I set the nthread to be 4 in the map_pc function, it runs the fastest.

Of course, the next step to further increase the speed is to use a computing cluster, for example on AWS.

Hyperthreading: faster or slower?

So what is hyperthreading? It is an Intel technology that is built into almost all modern CPUs. It splits one real core into two logical cores so that one real core can handle “twice” the workload in parallel. (I am not familiar with AMD CPUs.)

Notice that I put quotation marks around “twice”.

If splitting one core into two logical cores gave me twice the performance, then why not split one core into 4, 8 or millions of logical cores to gain a huge performance boost? You can see that this is definitely not possible.

Will it always make my computer faster? The answer is a definite no. When will it make my computer faster? When you use your computer for work that is not computation intensive, such as browsing the internet, writing blogs, watching Netflix and all kinds of day-to-day use.

Yes, your computer actually works a little faster with hyperthreading for light day-to-day use. You feel like you have won a million bucks.

Hyperthreading helps when the CPU can process data faster than it can read the data from RAM. In other words, it makes your computer faster at doing a huge number of different small computations.

If you are doing a small number of long computations of the same kind, hyperthreading might make it slower.

That is just more confusing, right? Let me explain those two scenarios in more detail.

Suppose that your computer is a one-handed gorilla that can lift a maximum of 3.5GHz of things, but since it has only one hand, it can lift only one thing at a time. And the time required to switch from one lift to another is about the same.

What hyperthreading does is split the power of one hand into two hands, so that each hand can lift 1.75GHz.

Now if you want to lift 100 things of 1GHz each, it would be faster if you have two hands.

But if you want to lift 4 things of 25GHz each, it would take longer with two hands. That is because you need to coordinate the two hands to lift those things, which also reduces your power in a way. In terms of computer cores, it means the two logical cores need to communicate to carry out the parallel computation correctly.

The actual situation is more complicated than this example, but it captures the gist of hyperthreading.

Suppose you want to train a machine learning model that takes longer than 10 mins; it would be faster if you turned off hyperthreading. Likewise, if you run parallel programs that are not I/O intensive, you are better off turning hyperthreading off.

Benefits of writing functions

As mentioned in Hadley’s book (http://r4ds.had.co.nz/iteration.html), writing things into functions has a lot of perks for scientific computing.

First, it is easier for you or others to read.

Second, it is easier to change when requirements change.

Third, you are less likely to make a mistake.

This can be summarised as: functions are easy to maintain. Speaking of easy maintenance, it is best to use Git to record the history of your code, but I digress.

There are other benefits of writing things in functions:

  1.  Name conflicts between variables or datasets are less likely.
  2.  It is much easier to parallelize your code when you write things into functions.
  3.  It is easier to benchmark and optimize.
  4.  The list goes on...

Add your thoughts on functional programming in the comments below, and I will add them to the list.

In R, you could pass a function to another function. That is an extremely powerful idea, and it’s one of the behaviors that make R a functional programming language.

See the example from Hadley’s book (http://r4ds.had.co.nz/iteration.html#for-loops-vs.functionals).

This is a quick example.

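# Apply fun to every column of df and return the results as a vector
col_summary <- function(df, fun) {
  # Preallocate one numeric slot per column
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}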
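As a usage sketch, the same col_summary can compute different summaries just by passing a different function. The function is repeated so that the block runs on its own; the toy data frame is an assumption for illustration.

```r
# col_summary repeated here so the block runs on its own
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

# Toy data frame (illustrative only)
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 60))

# The same function computes different summaries depending on `fun`
col_means <- col_summary(df, mean)
col_medians <- col_summary(df, median)
print(col_means)
print(col_medians)
```

Because the statistic is an argument, adding a new summary means writing no new loop at all.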

Version control: first hands-on for newcomer

Maybe you have read my previous blog post on version control and are convinced that you need version control, or you just want to give it a try for fun, but you did not find the link I gave you very helpful, or you found it too long to read.

I strongly suggest you try this interactive guide from Github: https://try.github.io/levels/1/challenges/1

I also found this YouTube video to be quite awesome for beginners: https://www.youtube.com/watch?v=SWYqp7iY_Tc

Or maybe the so-called “tutorials” you have read were too complicated for you. I am here to give you the most basic usage of Git. If you get started with it in a simple way first, a lot of things will feel more natural later, and you can learn more advanced usage just by googling.

So if you have a computer with a command line available, we can get started (this probably does not work well on Windows).

The first thing you should do is to register an account on Github (https://github.com/join?source=header-home).

After registering and logging into GitHub, you should get to a page similar to this one.

[Image: screenshot of the GitHub page after logging in]

Notice the + button at the top right corner? Use that to create a new repository.  Add a README.md file and a license to set up a new project.

Enter the new repository and you will see something like this.

[Image: screenshot of the new repository page]

Click the Clone or download button, then copy the link (For my project here, the link is https://github.com/edwardcooper/lammps.git ).

If you have problems registering a GitHub account, you can just use my link https://github.com/edwardcooper/lammps.git to test the most basic functionality of Git below.

0) You need to set up the git environment.

$ git config --global user.name "Your_name"

$ git config --global user.email "Your_email"

$ git config --global color.ui "auto"

$ git config --global core.editor "nano -w"

1) In the command line on your computer, type the following (replace the link below with your own link)

$ git clone https://github.com/edwardcooper/lammps.git

2) Make some changes to any of the files.  Then, you could use

$ git status

to see whether anything has changed.

3) To record the changes you will need to add that file to the staging area. It is like adding something to the shopping cart.

$ git add file_name_you_have_changed

4) Then you need to record that change permanently in the history of the project. It is like checking out in online shopping.

$ git commit -m "Add commit message."

The -m flag attaches a commit message to the commit.

Now you have made a change and recorded it, but the change has not yet reached the GitHub repository.

To upload the change to GitHub, type

$ git push

and enter your username and password to upload the change to Github.

That is the most basic use of Git with Github. Hope you find it helpful. Leave a comment below if you have any questions or find any errors on my blog.

If you are not satisfied with my tutorial anyway, you could use this interactive tutorial: https://try.github.io/levels/1/challenges/1.

This is another great tutorial written by a computational scientist: http://kbroman.org/github_tutorial/

The official tutorial from Github: https://guides.github.com/activities/hello-world/.

If you are a Mac or Windows user, there are excellent instructions on Codecademy on how to install Git: https://www.codecademy.com/articles/git-setup.

There is one more tutorial if you are still stuck on installing Git: http://guides.beanstalkapp.com/version-control/git-on-windows.html.

 

Let me know if you have any more excellent tutorials for git or Github.

ROC curve with Type I and II error: the short story

OK, I lied. This is not going to be a short version, since you cannot tell the full story of the ROC curve without explaining Type I and Type II errors clearly. But if you are only interested in what the ROC curve means, it will be a short story.

What exactly is a ROC curve? ROC is short for Receiver Operating Characteristic. A pretty strange name, right? It sure does not sound like it has anything to do with statistics.

The ROC curve got its name from its use in World War II. Here is a little history. After the attack on Pearl Harbor in Hawaii, the U.S. Army began to research how to improve the ability of radar receiver operators to detect and distinguish Japanese aircraft.

The ROC curve, which got its name from the radar receiver operators, was then used to compare the capability of different methods to detect Japanese aircraft.

 

Enough with the history. What exactly is a ROC curve? How does it work to compare different models?

The x-axis of the ROC curve is the false positive rate, and the y-axis is the true positive rate. It looks like this.

You must be super confused, as I was when I first learned about it. How does it relate to sensitivity, Type II error and Type I error, which we discussed in the previous blog post?

The short answer: the x-axis is the Type I error rate and the y-axis is the power of the statistical test (sensitivity). If we change the Type I error rate in the hypothesis test, the sensitivity changes along with it, as the ROC curve shows.
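To see this relationship in numbers, here is a minimal base R sketch that builds a ROC curve by hand. The two simulated score distributions and their means are assumptions for illustration; "positive" here plays the role of rejecting the null hypothesis.

```r
set.seed(123)
# Simulated classifier scores: negatives center at 0, positives at 1.5
neg <- rnorm(500, mean = 0)
pos <- rnorm(500, mean = 1.5)

# Sweep the decision threshold from high to low
thresholds <- sort(c(neg, pos, Inf), decreasing = TRUE)

# False positive rate = Type I error rate
fpr <- vapply(thresholds, function(t) mean(neg >= t), numeric(1))
# True positive rate = power (sensitivity)
tpr <- vapply(thresholds, function(t) mean(pos >= t), numeric(1))

# Plot the ROC curve with the chance diagonal for reference
plot(fpr, tpr, type = "l",
     xlab = "False positive rate (Type I error rate)",
     ylab = "True positive rate (power)")
abline(0, 1, lty = 2)
```

Lowering the threshold moves you to the right along the curve: you accept a higher Type I error rate and gain power in return.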

If you are only interested in what the ROC curve represents, you can stop here. But if you want to know how and why the Type I error relates to the power of the statistical test (sensitivity), then continue as I explain it below.

Why should anyone use version control?

If you are from a computer science background, then you probably use version control like you breathe every day.

But outside the IT industry, it is not common practice to use version control. Even among computational scientists, version control is not used by everyone.

I will show you in this blog post that version control is not only easy to start with but also very useful, especially if you write anything on a computer.

What does version control do exactly? In layman’s terms, it stores the different versions of a file that you saved in the version control system, so that you can go back to any of them.

If something from the above image ever happened to you, then version control will be your best friend.

Git is the default software for version control, and GitHub is the current industry standard for hosting whatever you write online.

But there is a catch! Whatever you host on GitHub is publicly available, meaning that everyone with internet access can see what you write, unless you purchase a private repository from GitHub.

What if you do not want to pay for it? If you have an .edu email address, you can have unlimited private repositories on Bitbucket, which is another popular hosting site.

If you are convinced about using Git, then you could proceed with me. If you find version control of little use to you, then I would not waste your time any longer. Good day and come back for my next blog. Or leave a comment below.

(Added side note: if you are a Mac or Linux user, everything should be fine. If you are a Windows user, things might be tricky, since I have not used Git on Windows before. If you are interested in using Git on Windows, leave a comment below and I will write about it in my next blog post.)

Start with this tutorial and you could start using Git in a matter of minutes. https://zonca.github.io/git-novice/

If you have some experience with Git, try this exercise.  https://github.com/zonca/conversion_tofix

I will try to explain the steps in a clearer and simpler way in the next blog post.