Summary of the machine learning workflow


A complete machine learning scenario starts with a problem that you want to solve and ends with a mechanism by which you can make predictions about new instances of similar data.


Prerequisites for problems that can be solved by machine learning

 1. The problem can be solved by finding patterns in data.

 2. You have access to a large set of existing data that includes the attribute (called the target, or label, in machine learning) that you want to be able to infer from the other attributes (called features).


If the above requirements are met, the steps for solving the machine learning problem are as follows.


Define your problem, find relevant data, and define your performance metrics.

Defining the problem means deciding what your target variable is and which features are necessary for the problem. In the Zillow competition, for example, the target variable is the price, and all other variables, such as the size of the rooms and the number of bedrooms, are features.
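To make the target/feature split concrete, here is a minimal sketch in Python. The column names (`price`, `sqft`, `bedrooms`) are hypothetical stand-ins for a Zillow-style dataset, not from the competition itself.

```python
# Hypothetical rows in a Zillow-style dataset: "price" is the target,
# everything else is a feature.
rows = [
    {"price": 350_000, "sqft": 1500, "bedrooms": 3},
    {"price": 500_000, "sqft": 2200, "bedrooms": 4},
]

# Separate the target variable from the feature columns.
target = [row["price"] for row in rows]
features = [{k: v for k, v in row.items() if k != "price"} for row in rows]

print(target)       # the values the model should learn to predict
print(features[0])  # the attributes used to predict them
```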


No matter how powerful the machine learning algorithm is, you cannot build a model without data. As Andrew Ng said in his speech at Stanford Business School, the major barrier for small companies is data. The data can come from an open source, a paid service, or even legal website scraping (more on the legal issues with website scraping in later blogs). Of course, many tech giants like Google or Facebook collect data from the usage of their services. I would say this is one of the most important and most often overlooked skills.


Once you have the data and define the target variable, you need to define a loss function to minimize or maximize for the problem.
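As an illustration of "define a loss function to minimize", here is root mean squared error, a common choice for regression problems like price prediction (the post does not name a specific loss, so RMSE is my example, not the author's):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, 0 means perfect predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse([3.0, 5.0], [3.0, 5.0]))  # 0.0 for perfect predictions
print(rmse([3.0, 5.0], [4.0, 6.0]))  # 1.0
```

Training then amounts to searching for the model parameters that minimize this value on the data.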


Data exploration and cleaning.

This step is crucial but less obvious to many beginners. The quality of the data determines the upper limit of your model's accuracy. No matter how fancy your algorithm is, your model cannot perform well if the data is corrupted.


But real-world datasets have strange values, missing values, and even simply wrong ones, so we will need to explore the dataset and correct them.
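One common cleaning step is imputing missing values. A minimal sketch (median imputation is my example choice; the post does not prescribe a specific method):

```python
def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]

# The None in the second slot is replaced by the median of [1.0, 3.0, 100.0].
print(impute_median([1.0, None, 3.0, 100.0]))  # [1.0, 3.0, 3.0, 100.0]
```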

Baseline model and feature selection

Next, we will need a baseline model to set a lower bound for the performance metrics. I usually use random forest and xgboost, since they are fast, effective, and handle missing values well.

The criteria for choosing a baseline model:

  1. Stable, good-enough performance, and fast to get results
  2. Ideally, missing values (NA) are allowed

The second point is quite important, since we will definitely need the baseline model to test whether the data cleaning process makes predictions better or worse.
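The post works in R with caret; as a rough Python equivalent, here is a baseline sketch assuming scikit-learn is available. The tiny dataset is synthetic, standing in for real housing data:

```python
# Baseline sketch with scikit-learn's RandomForestRegressor
# (the post uses random forest / xgboost in R).
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data: [sqft, bedrooms] -> price in thousands.
X = [[1500, 3], [2200, 4], [900, 2], [1800, 3], [2600, 5], [1200, 2]]
y = [350, 500, 200, 400, 600, 260]

baseline = RandomForestRegressor(n_estimators=50, random_state=0)
baseline.fit(X, y)

# Training-set error: only a sanity check, not an honest performance estimate.
preds = baseline.predict(X)
print(mean_squared_error(y, preds))
```

Rerunning this after each cleaning step and comparing the metric tells you whether the cleaning helped.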


A common practice, aside from using cross-validation to examine your model performance, is to split the dataset into a training set and a validation set: train the model on the training set and evaluate it on the validation set.


Once we have the baseline model, we will need to look at the importance of each variable in predicting the target variable and select some of them for the next step.
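In Python, tree-based baselines expose this directly; a sketch assuming scikit-learn, with hypothetical feature names:

```python
from sklearn.ensemble import RandomForestRegressor

# Same synthetic stand-in data as before.
X = [[1500, 3], [2200, 4], [900, 2], [1800, 3], [2600, 5], [1200, 2]]
y = [350, 500, 200, 400, 600, 260]
feature_names = ["sqft", "bedrooms"]  # hypothetical names

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher means the feature
# contributed more to the tree splits.
for name, imp in zip(feature_names, model.feature_importances_):
    print(name, round(float(imp), 3))
```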


There is also the practice of creating or dropping features based on domain knowledge or just intuition, and then checking the importance of the new set of features with the baseline model.

For dimensionality reduction, PCA is often the method of choice.
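PCA can be sketched in a few lines with NumPy: center the data, take the SVD, and project onto the leading right singular vectors. The data here is random, just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features

# Center each feature, then factor with the SVD.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top 2 principal components.
n_components = 2
X_reduced = X_centered @ Vt[:n_components].T  # shape (100, 2)

# Fraction of total variance retained by those components.
explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
print(X_reduced.shape, round(float(explained), 3))
```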


Machine learning algorithm spot check

We will usually need to use a sampling method to deal with class imbalance in the data.
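The simplest such method is random oversampling of the minority class; a self-contained sketch (the post does not name a specific technique, so this is my illustrative choice):

```python
import random

def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are the same size."""
    rnd = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target_size = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rnd.choice(members) for _ in range(target_size - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target_size)
    return out_rows, out_labels

# 4 majority-class rows vs 1 minority-class row -> balanced to 4 and 4.
rows, labels = oversample([1, 2, 3, 4, 5], [0, 0, 0, 0, 1])
print(labels.count(0), labels.count(1))  # 4 4
```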


The next step is to find the few best-performing learning algorithms using the features selected in the previous step.
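A spot check is just a loop over candidate models scored the same way; a sketch assuming scikit-learn, on the same synthetic stand-in data:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = [[1500, 3], [2200, 4], [900, 2], [1800, 3], [2600, 5], [1200, 2]]
y = [350, 500, 200, 400, 600, 260]

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # Negated MSE so that higher is better, as scikit-learn expects.
    cv = cross_val_score(model, X, y, cv=2, scoring="neg_mean_squared_error")
    scores[name] = cv.mean()
    print(name, scores[name])
```

The few models with the best cross-validated score move on to tuning.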


Then, we will need to fine-tune those models with a hyperparameter search.

I often find that using an adaptive resampling method increases model performance a little, at some cost in computation time. Check how to do that in R.
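As a Python counterpart to hyperparameter search (the post refers to R), here is a small grid search sketch assuming scikit-learn; the grid values are arbitrary examples:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X = [[1500, 3], [2200, 4], [900, 2], [1800, 3], [2600, 5], [1200, 2]]
y = [350, 500, 200, 400, 600, 260]

# Try every combination of these hyperparameter values with cross-validation.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=2,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```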


Build an ensemble model and train it on the entire dataset.

The last step is to find the right stacking, bagging, or boosting method to build your own ensemble model for the problem.
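The simplest ensemble is a plain average of several models' predictions; a minimal sketch of that idea (stacking and boosting are more elaborate, but the blending step looks like this):

```python
def average_ensemble(predictions_per_model):
    """Average the predictions of several models, the simplest blending scheme."""
    n_models = len(predictions_per_model)
    return [sum(preds) / n_models for preds in zip(*predictions_per_model)]

# Two models' predictions for the same two instances.
model_a = [300.0, 480.0]
model_b = [340.0, 520.0]
print(average_ensemble([model_a, model_b]))  # [320.0, 500.0]
```

Averaging reduces variance when the component models make uncorrelated errors, which is the basic reason ensembles win competitions.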
I will add more details in the future.

Stay tuned.


Derive equations in backpropagation

Deriving equations! My favorite thing in life. Yes, I am just that nerdy.



Where does the equation come from?

I was reading a blog series on introductions to deep learning by Michael Nielsen. Here is his book in case you are interested.

On a side note, if you have ever worked in quantum information or taken any courses on it, you will definitely have used or heard of his book. It is a great book on quantum information and computation.


Why should I care?

Backpropagation is a method to quickly calculate the partial derivatives needed to update the weights and biases when using gradient descent to find the minimum of the loss function.

In layman's terms, backpropagation is the algorithm that makes training large neural networks possible on a modern computer.

Yet the math behind the backpropagation algorithm is simple enough to be easily understood by anyone with some knowledge of the chain rule and matrix multiplication.

Setting up parameters and symbols.

1) The first is the Hadamard product (elementwise product), which is defined as

(s \odot t)_j = s_j t_j .

For example,

\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \times 3 \\ 2 \times 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix} ,

which is the example given in the book I mentioned above.

2) C is the loss function, which is defined as

C = \frac{1}{2n} \sum\limits_{x} \| y(x) - a^L(x) \|^2 ,

which, for a single training example, could also be expressed as

C = \frac{1}{2} \sum\limits_{j} (y_j - a^L_j)^2 ,

as defined in the book.


3) The weighted input z^l is defined as

z^l = w^l a^{l-1} + b^l , \quad a^l = \sigma(z^l) ,

and the error \delta^l_j is defined as

\delta^l_j \equiv \frac{\partial C}{\partial z^l_j} .
Let’s derive!


First equation:

BP1 states that \delta^L = \nabla_a C \odot \sigma'(z^L) . Let us change BP1 into a verbose, component-wise form:

\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) .

Recall that  \delta^L_j \equiv \frac{\partial C}{\partial z^L_j }  , thus we have


 \delta^L_j \equiv \frac{\partial C}{\partial z^L_j } = \sum\limits_{k} \frac{\partial C}{\partial a^L_k } \times  \frac{\partial a^L_k}{\partial z^L_j}= \frac{\partial C}{\partial a^L_j } \times  \frac{\partial a^L_j}{\partial z^L_j}  .

The last equal sign is valid since a^L_j depends only on z^L_j , as defined by the equation

a^L_j = \sigma(z^L_j) .

Therefore, \delta^L_j \equiv \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \sigma'(z^L_j) .

Thus, we have proved the first equation.



Second Equation:

BP2 states that \delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) .

By the same chain-rule argument as for the first equation, we have

\delta^l_j = \frac{\partial C}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} .

Next, we need to further break down the first term.

\frac{\partial C}{\partial a^l_j } =\sum\limits_{k} \frac{\partial C}{\partial z^{l+1}_k } \times \frac{\partial z^{l+1}_k }{\partial a^l_j}=\sum\limits_{k} \delta^{l+1}_k \times \frac{\partial z^{l+1}_k }{\partial a^l_j}\\=\sum\limits_{k} \delta^{l+1}_k \times w^{l+1}_{kj}=\left((w^{l+1})^T \delta^{l+1}\right)_j .

Putting everything together, we have

\delta^l=\left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l) .

Thus, we have proved the second equation. My derivation is a little different from the one in the book, but it is essentially the same.

Third Equation (not derived in the book):

BP3 states that \frac{\partial C}{\partial b^l_j} = \delta^l_j . This one is relatively easy.

\frac{\partial C}{\partial b^l_j}= \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial b^l_j}

Since \frac{\partial z^l_j}{\partial b^l_j} = 1 from the definition z^l_j = \sum\limits_{k} w^l_{jk} a^{l-1}_k + b^l_j .

Thus we have \frac{\partial C}{\partial b^l_j}= \frac{\partial C}{\partial z^l_j} \times 1=\delta^l_j .

Fourth Equation (derivation not included in the book):

BP4 states that \frac{\partial C}{\partial w^l_{jk}} = \delta^l_j a^{l-1}_k . Applying the chain rule as usual,

\frac{\partial C}{\partial w^l_{jk}}=\frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial w^l_{jk}}=\delta^l_j \times a^{l-1}_k  .

Thus, we have proved the last equation.
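A quick numerical sanity check of the derivation, assuming NumPy: compute the BP1/BP4 gradient analytically on a tiny one-layer sigmoid network with quadratic cost, and compare it against a finite-difference gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))   # weights w^L
b = rng.normal(size=2)        # biases b^L
a_prev = rng.normal(size=3)   # activations a^{L-1}
y = np.array([0.0, 1.0])      # target

def cost(w):
    a = sigmoid(w @ a_prev + b)
    return 0.5 * np.sum((y - a) ** 2)

# Analytic gradient from BP1 and BP4.
z = w @ a_prev + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)     # BP1: (a^L - y) ⊙ sigma'(z^L), since sigma' = a(1-a)
grad_w = np.outer(delta, a_prev)  # BP4: dC/dw^L_{jk} = delta^L_j * a^{L-1}_k

# Central finite-difference gradient for comparison.
eps = 1e-6
num_grad = np.zeros_like(w)
for j in range(w.shape[0]):
    for k in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[j, k] += eps
        wm[j, k] -= eps
        num_grad[j, k] = (cost(wp) - cost(wm)) / (2 * eps)

# The two gradients should agree to high precision.
print(np.max(np.abs(grad_w - num_grad)))
```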


I will have to admit that those equations do not render as beautifully as I expected in WordPress. I will add more notes on the derivations missing from the book using other methods.











Memo: Fix for R version on Ubuntu 16.04 LTS (tidyverse, xgboost)

If you use

sudo apt-get install r-base

in Ubuntu 16.04 LTS, you will install R version 3.2.3-4 (as of June 19, 2017).

This will be problematic if you are going to use the famous xgboost package for machine learning. When you try to install xgboost directly, or use it through the caret package, you will get an error saying that the "xgboost" package is not supported for R version 3.2.3.

So you will need to do the following in the shell. I have verified this method; the method listed on the CRAN website does not solve the problem. The solution came from another website.

sudo apt-key adv --keyserver --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo add-apt-repository 'deb [arch=amd64,i386] xenial/'
sudo apt-get update
sudo apt-get install r-base-core

I just put it here to summarize the code for future reference.

Another thing to keep in mind is that if you use

sudo -i R

to install packages, then all users can use the packages you installed.


Added notes on installing the tidyverse package.

If you install tidyverse on a fresh Ubuntu 16.04 LTS install, there are going to be some errors. You will need to follow the instructions shown in the R console and install the missing system packages from the shell.

Or you could just enter the following command in the Terminal to install necessary packages in Ubuntu before installing tidyverse package in R.

sudo apt-get install libxml2-dev libcurl4-openssl-dev libssl-dev

See the official rstudio-server website for updated installation instructions.

The shell commands to install rstudio-server:

sudo apt-get install gdebi-core
sudo gdebi rstudio-server-1.0.153-amd64.deb

The benefit of installing rstudio-server is that you can access RStudio through a web browser, by navigating to the server's IP address on port 8787 (the RStudio Server default).

The username is just your username on the server, and the password is the password you use on the server. Use

hostname -I 

to check the ip address assigned to your server.

Dropbox Headless Install via command line

cd ~ && wget -O - "" | tar xzf -

You will need to verify your Dropbox account through the GUI.

Sometimes you need to install a package together with all its dependencies:

sudo gdebi your-downloaded-package.deb

I got the solution here: