Docker for data analysis: a hands-on tutorial (jupyter and rstudio-server)

Dependency, dependency, and dependency! Dependency is an evil being that used to haunt software development; now it has come for data analysis. Out of the 14 million possibilities, there is one way to defeat it. The answer is Docker!

This post is split into two parts, depending on whether you use Jupyter notebook or RStudio Server. The first part shows how to run a Docker container for Jupyter; the second shows how to run one for RStudio Server, which works in much the same way.

Before we proceed, I recommend creating an account on DockerHub (https://hub.docker.com), a hosting service for Docker images.

Jupyter notebook

After registering an account on DockerHub, you can search for Docker images there. After a little searching and trial, I found the jupyter/datascience-notebook image. It comes with R, Python3, Julia, and most of the common data science packages pre-installed. See its DockerHub page for more details.

How to use it? Simply type the command below into your terminal:


docker run -it --rm -p 8888:8888 jupyter/datascience-notebook

After downloading the Docker image (about 6GB), Docker starts the Jupyter notebook server, and you will see something like this.

[Screenshot: docker6]

[Screenshot: docker7.png]

On the second line of the second picture, you will see a sentence beginning “to login with a token: …..”. Click the http address there to open the Jupyter notebook in your browser.
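If the address is not clickable in your terminal, or the container runs in the background, the same address can be pulled out of the container log. This is only a minimal sketch: the helper name extract_token_url is made up for this example, and it assumes the log prints the address in the http://...?token=... form.

```shell
# Print the first address containing "?token=" found on standard input.
extract_token_url() {
  grep -o 'http://[^ ]*?token=[A-Za-z0-9]*' | head -n 1
}

# Example with a made-up log line (a real token is much longer):
printf '%s\n' '    http://127.0.0.1:8888/?token=abc123' | extract_token_url
# prints http://127.0.0.1:8888/?token=abc123
```

With a running container, the input would come from docker logs, for example docker logs my-notebook 2>&1 | extract_token_url, where my-notebook is a hypothetical container name. The 2>&1 is there because Jupyter writes its log to standard error.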

[Screenshot: jupyter]

Now you have a Jupyter notebook with support for R, Python3, and Julia. Let us see if the most common packages are installed.

[Screenshot: jupyter2]

Let us test whether the most common R packages are installed.

[Screenshot: jupyter3]

It is worth noting that the data.table package is not installed.

Let me try to install my automl package, along with dplyr and caret, from GitHub.

[Screenshot: jupyter4.png]

It seems that there are some issues with installing from GitHub, but installing packages from CRAN works perfectly.

Multiple terminals

To download files or make changes inside the running container, open another terminal and run the command below, replacing <container-id> with the ID or name that docker ps shows for your container:


docker exec -it <container-id> bash

When you are done, press Ctrl+P followed by Ctrl+Q to detach from the container without stopping it. Typing exit also closes this extra shell and leaves the notebook running.
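Finding the container ID by eye works, but it can also be scripted. The sketch below is one possible way, not the only one: the helper names pick_container_id and open_notebook_shell are made up, and it assumes the container was started from the jupyter/datascience-notebook image exactly as named in the run command above.

```shell
# Read "ID IMAGE" lines on standard input and print the ID of the first
# line whose image name equals the first argument.
pick_container_id() {
  awk -v image="$1" '$2 == image { print $1; exit }'
}

# Open a shell in the running notebook container.
# Defined here but not called; it needs a running Docker daemon.
open_notebook_shell() {
  cid="$(docker ps --format '{{.ID}} {{.Image}}' | pick_container_id jupyter/datascience-notebook)"
  docker exec -it "$cid" bash
}
```

Calling open_notebook_shell in a second terminal should then drop you into bash inside the notebook container, provided exactly one such container is running.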

Rstudio-server

For those of you who do not know RStudio Server, it is like the Jupyter notebook for R: an R IDE that runs in your browser.

The image I found on DockerHub is dceoy/rstudio-server.

To use it, first download it:


docker pull dceoy/rstudio-server

Then, you run the image with the command below:


docker container run --rm -p 8787:8787 -v "${PWD}":/home/rstudio -w /home/rstudio dceoy/rstudio-server

RStudio Server will not open in your browser by itself; you need to type the address below into your browser.

http://127.0.0.1:8787/
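If the page does not load right away, the server may still be starting. Below is a small sketch of a wait loop; the helper name retry is made up for this example, and the commented usage line assumes curl is installed.

```shell
# Run a command up to $1 times, one second apart, until it succeeds.
retry() {
  attempts="$1"
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Only pause if another attempt is coming.
    [ "$i" -lt "$attempts" ] && sleep 1
    i=$((i + 1))
  done
  return 1
}

# Example: wait up to 30 attempts for RStudio Server to answer (needs curl):
# retry 30 curl -fsS -o /dev/null http://127.0.0.1:8787/
```

Once the retry call returns successfully, the address above should respond in the browser as well.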

Then, you will need to enter the username and password, both of which are rstudio.

Let us see if the common packages are installed.

[Screenshot: rstudio]

Then, let us try to install my automl package from GitHub.

[Screenshot: rstudio2.png]

[Screenshot: rstudio3]

It works perfectly. (My package still has some dependency conflicts to resolve, and they are printed here.)

 


Docker installation on Ubuntu 16.04 LTS

What is Docker, and why use it?

What is Docker? It is a container platform: a lightweight tool for packaging and shipping software together with its dependencies.

Why do you need Docker for data science? It simplifies installation across multiple machines, which makes your results reproducible and portable. See this great article for more details: https://towardsdatascience.com/how-docker-can-help-you-become-a-more-effective-data-scientist-7fc048ef91d5

If you use virtual machines (VMs) a lot and really hate how many resources they consume, Docker is a great alternative with much less overhead.

Why this blog?

I ran into some confusion when I tried to install Docker for the first time, even after reading the official documentation, which covered the enterprise edition. What I wanted was the community edition.

On Windows, you can only install Docker on Windows 10 Pro or Enterprise edition. That means that if you have Windows 10 Home, which comes with most PCs, you cannot install it. I tested this myself as of May 3, 2018. I do not have a Mac; if you do, let me know how it works there.

Ubuntu definitely works. I tested it on Ubuntu 16.04 LTS, on both a laptop and a desktop.

How to install it?

First of all, the installation guide from DigitalOcean did not work for me.

The solution that worked comes from AskUbuntu.

To summarise it for myself and for others who read my blog, the commands are listed below.

The process is split into three steps:

1. Set up the Docker repository.

2. Install Docker Community Edition.

3. Verify the installation.

 


# Step 1: set up the Docker repository
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

# Step 2: install Docker Community Edition
sudo apt-get update
sudo apt-get install docker-ce

# Step 3: verify the installation
sudo docker run hello-world
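Separately from the hello-world test, it can help to check that the docker command itself landed on your PATH. This is a minimal sketch; the helper name have_cmd is made up for this example.

```shell
# Return success if the named command is on the PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd docker; then
  # Print the installed client version.
  docker --version
else
  echo "docker is not on the PATH; check the install steps above"
fi
```

If the version prints but the hello-world test fails, the problem is more likely with the Docker service or permissions than with the installation of the package.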

 

The second step, installing docker-ce, required a download of 188MB as of May 3rd, 2018. The installation took me less than a minute.

 

It has been a while since my last post, as I have been busy finding a job and graduating. I will publish a new post soon about parallel computation in R.

Plus, I am actively developing a new R package called automl that makes it possible to train hundreds of machine learning models in a few lines of code. Check it out on my GitHub for more: https://github.com/edwardcooper/automl

Fix error: module ‘html5lib.treebuilders’ has no attribute ‘_base’

I encountered this error when I tried to import bs4 in python3. The problem is not in bs4 but in its dependency library html5lib.

python2 (didn’t test myself)

There are some fixes recommended for python2:

pip install --upgrade beautifulsoup4
pip install --upgrade html5lib

The above solution comes from SO (https://stackoverflow.com/questions/38447738/beautifulsoup-html5lib-module-object-has-no-attribute-base).

python3 (tested myself)

Solution 1:
The above solution did not work for me at all: I want to work with python3, and changing pip to pip3 did not solve the problem.

I did find that downgrading the html5lib library to a previous version solved the problem.

sudo pip3 install html5lib==0.9999999

This is suggested in a GitHub issue (https://github.com/coursera-dl/coursera-dl/issues/554) and on Launchpad (https://bugs.launchpad.net/beautifulsoup/+bug/1603299).

Solution 2:
If you are not happy with a temporary patch that downgrades html5lib to an older version, then the obvious solution is to uninstall the bs4 and html5lib libraries and install them again, since this bug had already been fixed as of Thanksgiving 2017.

sudo apt remove python3-bs4 python3-html5lib
sudo apt install python3-bs4 python3-html5lib
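To confirm which versions you ended up with after either solution, a small check like the one below may help. It is only a sketch: the helper name package_version is made up, and it assumes Python 3.8 or newer, because it relies on importlib.metadata. On an older python3 it will report "missing" even for installed packages, in which case pip3 show html5lib gives the same information.

```shell
# Print the installed version of a Python package, or "missing" if the
# package (or python3 itself) cannot be found.
package_version() {
  python3 -c 'import sys, importlib.metadata as m; print(m.version(sys.argv[1]))' "$1" 2>/dev/null || echo "missing"
}

package_version html5lib
package_version beautifulsoup4
```

If html5lib shows 0.9999999 you are on the downgraded version from Solution 1; a newer number means the reinstall from Solution 2 took effect.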

See you all next time. Let me know in the comments if the above solutions solved your problem.

Fix error: unable to find variable “optimismBoot”

Error in Parallel model training in caret

Today is September 30, 2017. The error was reported on Github: https://github.com/topepo/caret/issues/706.

If you have recently tried to train any machine learning model with the train function from the caret package in R, then it is likely that you have run into an error like this:

Error in e$fun(obj, substitute(ex), parent.frame(), e$data) :
unable to find variable "optimismBoot"

The problem lies within the caret package and shows up when you use a parallel backend.

The solution is to reinstall the caret package from GitHub, where Max Kuhn has fixed the problem. The fix is not on CRAN yet.

If you do not have the devtools package yet, install it first.

install.packages("devtools")

If you already have devtools, then just install caret from GitHub.

devtools::install_github('topepo/caret/pkg/caret')

That should fix the problem.

Fix for GAM model prediction in R

You need to load the mgcv library before making any predictions with a GAM model in R.

library(mgcv)

 

Latex on Ubuntu

If you have not installed LaTeX on your Linux distro, the preferred way is to install texlive, so that knitr can compile your code and plots into PDF.

sudo apt-get install texlive-full

I got the solution from here: http://milq.github.io/install-latex-ubuntu-debian/