How to write production level code

All I am writing here is my personal summary for this blog on medium, and I also reshape the order in which you should improve on.

Why production level code?

Ability to write a production-level code is one of the sought-after skills for a data scientist role— either posted explicitly or not.

The production level code has several features.

  1. Modular code:  Decompose the code into low-level, medium-level and high-level functions. Larger functions are harder to debug.
  2. Version control: use git with branch.
  3. Readability: comments/docstring and self-explanatory function and variable names.
  4. Unit testing: the unitest module in Python.
  5. Logging: Records only actionable information such as critical failures during runtime and structured data such as intermediate results that will be later used by the code itself.
  6. Code optimization: better space and time complexity.
  7. Compatibility with ecosystem

 

For a new graduate without much experience, you definitely should write your code into smaller functions, make the code more readable for future maintenance (add doc-string and more comments), add the unit test of the functions/class and put it on Github.

The best way for me to practice all that was to rewrite one or more of pet projects into a small python package, web application (Flask in Python), restful API hosted on AWS.

It will practice your coding skills and better visibility when applying for jobs. 

Here is another python guide targeted for data scientists, which I think is very practical and awesome.

Advertisements

Kmeans algorithm implemented in Python

For the unsupervised learning algorithm, the Kmeans algorithm is one of the most iconic unsupervised learning methods. In this blog, I will try to explain the Kmeans algorithm and how to implement it in Python.

  1. The first step is to generate some random data points to be the cluster centers based on the number of clusters to separate the data into.
  2. The second step is to assign data points to different clusters based on a distance metric. (You could choose the Euclidean distance, Manhattan distance, Mahalanobis distance, or cosine similarity based on the detail of the project. I will talk about the different distance/similarity metrics in a future blog.)
  3. The third step is to calculate the new cluster center based on the new cluster assignment. The most simple way to calculate the new cluster center is by taking the mean of the data in the new clusters.
  4. Iterate step 2 and 3 until the changes in the center of the clusters are smaller than a specific value.

The GIF below summarized all the steps stated above.​_

Side note:

  1. Because of the random initial cluster centers, the Kmeans algorithm will lead to a slightly different result. Then, it is customary to run Kmeans several times to determine if the algorithm converges to a similar solution.
  2. Due to the fact that the Kmeans algorithms use some type of distance metric, it is crucial to transforming the data to have zero mean and unit variance or in the {0,1] range.
  3. The final clusters can be very sensitive to the selection of initial centroids and the fact that the algorithm can produce empty clusters, which can produce some expected behavior. There is a specialized algorithm called kmeans++ optimized for the initialization of the cluster centers.
  4. The final selection of the Kmeans algorithm and the number of clusters in Kmeans can be done by choosing the algorithm that has the lowest within-cluster-variance and many other metrics that can be chosen from. See this blog if you want to know more.

 

This jupyter notebook in my Github describes all the code and the details of how I implemented the Kmeans and apply it to an image to do color vector quantization.

 

I have also refactored all the code above into a class in python with the fit, predict, transform method as implemented in sklearn and some other helper functions for ease of use. It is also in my GitHub. I have also added some examples on how to use the Kmeans algorithm I implemented.

 

Let me know if you have any questions.

If you have just started to learn python, this python tutorial by Corey is the most awesome I have ever seen. He explains everything so well and covers a lot of practical topics.