Word2vec: how to train and update it

In this blog, I will briefly talk about what word2vec is, how to train your own word2vec model, how to load Google’s pre-trained word2vec model, and how to update Google’s pre-trained model with the gensim package in Python.

What is word2vec?

If you have ever been involved in building a text classifier, you have probably heard of word2vec. Word2vec was created by a team of researchers led by Tomáš Mikolov at Google. It is an unsupervised learning algorithm that works by training a two-layer neural network to predict the context words of a given word. To understand more about how word2vec works under the hood, you can refer to the YouTube video by Stanford University.

The end result of word2vec is that it can convert words into vectors.


There are two algorithms for generating the encoding: the continuous bag-of-words (CBOW) approach and skip-gram. They are depicted in the graph below.

[Figure: word2vec diagrams (CBOW and skip-gram)]

How to train a word2vec?

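from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
             ['if', 'you', 'have', 'think', 'about', 'it']]
# gensim 3.x API; in gensim 4.0+ the size argument is named vector_size.
model = Word2Vec(sentences, size=10, window=5, min_count=1)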

Training takes more options than shown here. Of the ones used above, the size option sets the dimension of the word vectors, the window option is the maximum number of words before and after the current word that are used as context, and min_count is the minimum number of times a word must appear to be included in the training.
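To make the window option concrete, here is a minimal sketch of how (center word, context word) training pairs can be generated for skip-gram. The helper name skipgram_pairs is made up for illustration and is not part of gensim; real implementations add further details such as subsampling and negative sampling.

```python
def skipgram_pairs(sentence, window):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        # Context = up to `window` words on each side of the center word.
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


pairs = skipgram_pairs(['yes', 'this', 'is', 'the'], window=1)
print(pairs)
```

With window=1, each word is paired only with its immediate neighbors; a larger window produces more pairs per word.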

Save model and load the model

After you finish training the model, you can save it as follows:

model.save(your_file_name_for_saving_in_the_disk)

If you want to reload the model into your workspace, you can load it as follows:

model = Word2Vec.load(your_file_name_for_saving_in_the_disk)

For more details on how to use the gensim package for word2vec models, see the tutorial by the gensim author.

How to update the word2vec model?

from gensim.models import Word2Vec

old_sentences = [["bad", "robots"], ["good", "human"]]
new_sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
                 ['if', 'you', 'have', 'think', 'about', 'it']]

# Train the initial model on the old sentences and inspect its vocabulary.
old_model = Word2Vec(old_sentences, size=10, window=5, min_count=1, workers=2)
print(old_model.wv.vocab)
old_model.save("old_model")

# Reload the model, add the new words to its vocabulary, and continue training.
new_model = Word2Vec.load("old_model")
new_model.build_vocab(new_sentences, update=True)
new_model.train(new_sentences, total_examples=2, epochs=1)
print(new_model.wv.vocab)

Here are the results of the code above.

[Screenshot: vocabulary of the old and new models]

The Google pre-trained word2vec model

Google has published a pre-trained word2vec model. It is trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. Here is the download link for Google’s pre-trained 300-dimensional word vectors: GoogleNews-vectors-negative300.bin.gz. The binary file (GoogleNews-vectors-negative300.bin) is 3.4 gigabytes when unzipped. For more information about the word2vec model published by Google, see the link here.

After downloading it, you can load it as follows (if it is in the same directory as the .py file or Jupyter notebook):


from gensim.models import KeyedVectors
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

“Transfer learning” on Google pre-trained word2vec

Update your word2vec with Google’s pre-trained model

It is a powerful pre-trained model, but there is one downside: you cannot continue the training, since it lacks the hidden weights, vocabulary frequencies, and the binary tree. Therefore, it is not possible right now to do transfer learning on Google’s pre-trained model.

You might have a customized word2vec model that you trained yourself, and you may be worried that its vectors for some common words are not good enough.

To solve this problem, you can replace the word vectors in your model with the vectors from Google’s word2vec model using a method called intersect_word2vec_format.


your_word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0,binary=True)

See the documentation here for more details on this new method.

The method described above is not exactly transfer learning but it is quite similar.

How to really do transfer learning using Google’s pre-trained model for your customized dataset?

It is often argued that there is little to no benefit in continuing the training of a Word2Vec model.

Imagine that you have a small customized dataset, so the word2vec model you trained on it is not good enough, but you are also worried that the pre-trained vectors do not really make sense for some of the common words in your domain. I argue that this is a pretty common situation, and that it is the same reason transfer learning with convolutional neural networks is so popular.

Here is the code to accomplish that.


from gensim.models import Word2Vec

sentences = [["bad", "robots"], ["good", "human"],
             ['yes', 'this', 'is', 'the', 'word2vec', 'model']]

# The size option must be 300 to match Google's pre-trained model.
word2vec_model = Word2Vec(size=300, window=5, min_count=1, workers=2)
word2vec_model.build_vocab(sentences)

# Copy in Google's vectors for the words that appear both in the pre-trained
# model and in the sentences defined above.
# lockf must be set to 1.0 to allow continued training.
word2vec_model.intersect_word2vec_format(
    './word2vec/GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True)

# Continue training with your own data.
word2vec_model.train(sentences, total_examples=3, epochs=5)

Here is the result of the code above.

[Screenshot: transfer learning on Google’s word2vec]

 

 

A lean approach to Data science phone Interview

This is my curated checklist for preparing for a phone interview and an onsite interview.

Practice “Tell me about yourself/your background/…..”

Pre-Phone interview

  1. Company website dive
  2. Prepare questions you have
  3. Review basic Stats
  4. Review Machine learning basics (Supervised and unsupervised)
  5. Review Python basics

Company website/app dive

Look at the company website to learn how they work, how they grow their business, how they compare to their competitors, and so on. Most importantly, why do they need data science? The goal is to get familiar with the business and to think about how the data side contributes to it.

 

Prepare questions you have

After looking at the website or app, you might have some ideas about how they could benefit from predictive modeling or A/B testing. Some of the questions you want to ask may be answered before you get the chance to ask them.

  1. How is the data science team structured? What are the team’s current size and background? What is the expected growth of the team over the next year? In what other areas of the company do you think the data science team can make a big impact?
  2. Where do you think the data science team can make the biggest contribution right now? What are the biggest challenges for the team right now? How are you dealing with those challenges?
  3. What are some possible projects I will work on if I am hired? What is the onboarding process? What is the tech stack for the data science team? How does the data science team work with other teams? Is there a central data science team or a decentralized team for each department?
  4. Who do you think are your competitors, and what is your estimated respective market share? What’s your advantage over your competitors?
  5. Some company or industry-specific questions.

Personal favorite questions: Tell me more about the interaction you have with the data security team, data engineering team, and software engineering team. What is the strategy for anonymizing data for predictive modeling? How do you deal with the exponentially increasing data size?

You will see whether the company’s biggest challenges come from these areas. Sometimes it is good to ask related but not identical questions to gauge whether the interviewer’s answers are consistent.

Review basic stats

Go over some hypothesis tests by hand and review the assumptions behind the basic ones. You will be surprised by how much you have forgotten.

  1. Central limit theorem: The mean of an iid sample from a population with finite variance approximately follows a normal distribution if the sample size is big enough.
  2. Bootstrapping: Sample with replacement. How do you use bootstrapping to estimate a confidence interval? How do you use bootstrapping to test model robustness (the variance part of the bias-variance trade-off)?
  3. t-test, Welch’s t-test, Chi-squared test, Wilcoxon test. How to calculate the test statistic and what are the assumptions behind all these tests?
  4. How to do power analysis? Try to do one simple power analysis for a t-test and a chi-squared test by hand or using Python. Even if you are confident about it, do it once without any reference material.
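As a quick refresher on the bootstrapping item, here is a minimal sketch of a percentile bootstrap confidence interval for the mean, using only the standard library. The function name bootstrap_ci, the sample values, and the default of 2000 resamples are illustrative assumptions; the percentile method is only one of several ways to build a bootstrap interval.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample draws n points WITH replacement from the original sample.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2, 2.7, 2.6]  # made-up data
low, high = bootstrap_ci(sample)
print(low, high)
```

The same resampling loop works for other statistics (median, a model's validation score, and so on) by passing a different stat function.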

Review Machine learning basics

  1. Linear regression: Loss function (with/without regularization), Analytic solutions and a gradient descent approach.
  2. Logistic regression: Loss function (with/without regularization), gradient descent solution.
  3. Decision trees. Gini index vs entropy vs hinge loss. What is a random forest? How does it work? Random forest vs decision tree.
  4. Kmeans and PCA are the most commonly asked unsupervised learning methods.
  5. Different metrics for machine learning: F-measure (F1, F2, F0.5), recall, precision, kappa, ROC curve, AUC (area under the ROC curve), balanced accuracy.

Review Python basics

  1. List and its methods. append, extend, insert, slice and indexing.
  2. Dictionary and its methods.
  3. for loop
  4. String methods
  5. Sorting list
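A short warm-up script covering the items above could look like the following sketch. The variable names and values are arbitrary examples.

```python
# List methods, slicing, and indexing.
nums = [3, 1, 2]
nums.append(5)        # [3, 1, 2, 5]
nums.extend([4, 0])   # [3, 1, 2, 5, 4, 0]
nums.insert(0, 9)     # [9, 3, 1, 2, 5, 4, 0]
middle = nums[1:4]    # [3, 1, 2]
last = nums[-1]       # 0

# Dictionary, for loop, and a string method: count words in a sentence.
counts = {}
for word in "a b a c b a".split():
    counts[word] = counts.get(word, 0) + 1

# Sorting: sorted() returns a new list, list.sort() sorts in place.
by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
ascending = sorted(nums)
nums.sort(reverse=True)
```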

In data science phone screens and onsites, I was never asked about linked lists, graphs, or trees. I would say do not worry if you do not know them.

Post-Phone interview

Send a follow-up email to the interviewer or HR. Then you wait.

How much time do you need for prep? 4 hours

website + questions = 1 hr

stats = 1 hr

machine learning = 1 hr

python basics = 1 hr

Bonus Tips:

There will always be unexpected questions. Be calm, be a nice human being, be positive and joyful. I wish you the best of luck. I know all of you will find your dream job soon.

More detailed prep material to come soon.

 

How to write production level code

What I am writing here is my personal summary of this blog on Medium. I have also reordered the items by the order in which you should work on them.

Why production level code?

The ability to write production-level code is one of the most sought-after skills for a data scientist role, whether or not it is stated explicitly in the job posting.

Production-level code has several features.

  1. Modular code:  Decompose the code into low-level, medium-level and high-level functions. Larger functions are harder to debug.
  2. Version control: use git with branch.
  3. Readability: comments/docstring and self-explanatory function and variable names.
  4. Unit testing: the unittest module in Python.
  5. Logging: Records only actionable information such as critical failures during runtime and structured data such as intermediate results that will be later used by the code itself.
  6. Code optimization: better space and time complexity.
  7. Compatibility with ecosystem
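As a small illustration of a few of these features together (a short documented function, logging of an actionable event, and a unit test), here is a minimal sketch. The function safe_ratio and its behavior are made up for the example.

```python
import logging
import unittest

logger = logging.getLogger(__name__)


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        # Log only the actionable case.
        logger.warning("zero denominator; returning 0.0")
        return 0.0
    return numerator / denominator


class SafeRatioTest(unittest.TestCase):
    def test_regular_division(self):
        self.assertAlmostEqual(safe_ratio(1, 4), 0.25)

    def test_zero_denominator(self):
        self.assertEqual(safe_ratio(1, 0), 0.0)


# Run the tests programmatically so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SafeRatioTest)
result = unittest.TextTestRunner().run(suite)
```

In a real package, the tests would normally live in their own module and be run from the command line with python -m unittest.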

 

As a new graduate without much experience, you should definitely break your code into smaller functions, make it more readable for future maintenance (add docstrings and more comments), add unit tests for your functions and classes, and put it on GitHub.

The best way for me to practice all that was to rewrite one or more of my pet projects into a small Python package, a web application (Flask in Python), or a RESTful API hosted on AWS.

It will sharpen your coding skills and give you better visibility when applying for jobs.

Here is another python guide targeted for data scientists, which I think is very practical and awesome.

Kmeans algorithm implemented in Python

Kmeans is one of the most iconic unsupervised learning algorithms. In this blog, I will try to explain the Kmeans algorithm and how to implement it in Python.

  1. The first step is to generate some random data points to be the cluster centers based on the number of clusters to separate the data into.
  2. The second step is to assign data points to different clusters based on a distance metric. (You could choose the Euclidean distance, Manhattan distance, Mahalanobis distance, or cosine similarity based on the detail of the project. I will talk about the different distance/similarity metrics in a future blog.)
  3. The third step is to calculate the new cluster center based on the new cluster assignment. The simplest way to calculate the new cluster center is by taking the mean of the data in the new clusters.
  4. Iterate steps 2 and 3 until the changes in the cluster centers are smaller than a specific value.
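The four steps above can be sketched in plain Python as follows. This is a simplified illustration and not the implementation from my notebook: it assumes Euclidean distance, picks the initial centers from the data points themselves, and keeps the old center when a cluster ends up empty. The function name kmeans and the toy points are made up for the example.

```python
import math
import random


def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its cluster.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster: keep old center
        # Step 4: stop once no center moved more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, clusters = kmeans(points, k=2)
print(centers)
```

On this toy data, the two centers settle on the means of the two well-separated groups of three points.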

The GIF below summarizes all the steps stated above.

Side note:

  1. Because of the random initial cluster centers, the Kmeans algorithm can give slightly different results from run to run. It is therefore customary to run Kmeans several times to check whether the algorithm converges to a similar solution.
  2. Because the Kmeans algorithm relies on a distance metric, it is crucial to transform the data to have zero mean and unit variance, or to lie in the [0, 1] range.
  3. The final clusters can be very sensitive to the selection of the initial centroids, and the algorithm can produce empty clusters, which can lead to unexpected behavior. There is a specialized algorithm called kmeans++ that is optimized for the initialization of the cluster centers.
  4. The final Kmeans solution and the number of clusters can be selected by comparing the within-cluster variance (lower is better for a fixed number of clusters) and many other metrics. See this blog if you want to know more.

 

This Jupyter notebook on my GitHub contains all the code and the details of how I implemented Kmeans and applied it to an image to do color vector quantization.

 

I have also refactored all the code above into a Python class with fit, predict, and transform methods, as in sklearn, plus some other helper functions for ease of use. It is also on my GitHub, along with some examples of how to use the Kmeans algorithm I implemented.

 

Let me know if you have any questions.

If you have just started to learn Python, this Python tutorial by Corey is the most awesome one I have ever seen. He explains everything so well and covers a lot of practical topics.

 

An ML business case question

I have been helping people with ML case interviews recently and came up with some questions involving ML/Stats/Coding in one case. This ML case problem aims to wrap many interview questions I had in the past. Here it goes.

 

Suppose you are the head data product manager at Google. You are assigned the task of implementing a feature in Gmail to filter spam emails. What kind of data do you need from Google’s database to achieve that?

Suppose the IP address is extremely important, but it has so many categories that one-hot encoding will blow up the memory. What could you do about it?

If the outcome class is very imbalanced, what could you do to handle it? When would you use F2 over F1 in this scenario? What metric do you use to select the best model? And how do you test the performance of your model after Gmail is released to the market?

 

 

 

 

 

 

 

 

 

 

 

My original answers to the questions above.

1) IP address, MAC address, the subject of an email (no email content, for privacy), etc.
2) Binary encoding or numerical encoding could help, but do categorical encoding if possible.
3) Use sampling inside the cross-validation, such as up-sampling, down-sampling, SMOTE, ROSE and other variants of SMOTE, and Tomek links. SMOTE with Tomek links works really well.
4) When you want to emphasize recall over precision, you use F2 over F1.
5) Explain F1, AUC, Kappa, sensitivity, etc.
6) More of an open-ended question.
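To illustrate the binary encoding idea for a high-cardinality feature such as the IP address, here is a minimal sketch. It assigns each distinct category an integer index and writes that index in binary, so n categories need only about log2(n) columns instead of n one-hot columns. The function name binary_encode and the sample IP addresses are made up for the example.

```python
def binary_encode(values):
    """Map each distinct category to a fixed-width list of 0/1 bits."""
    categories = sorted(set(values))
    # Number of bits needed to represent the largest index.
    width = max(1, (len(categories) - 1).bit_length())
    return {
        cat: [(i >> b) & 1 for b in reversed(range(width))]
        for i, cat in enumerate(categories)
    }


ips = ["10.0.0.1", "10.0.0.2", "192.168.1.7", "10.0.0.1", "172.16.0.9"]
codes = binary_encode(ips)
for ip in ips:
    print(ip, codes[ip])
```

Here four distinct addresses fit into two binary columns, whereas one-hot encoding would need four.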

 

 

Some possible follow-up questions.

1) How do you get the labels for whether an email is spam or not? How do you define the multi-user spam consensus quantitatively? Suppose that some security experts mark some emails as phishing emails. How could you utilize these labels for your model?
2) If reducing the false positives is important to you, do you still want to use F1 as a metric? Should you choose another F-measure? Should it be F2 or F0.5?
3) If you use F-measure, it would mean that you do not really care about predicting a non-spam email correctly, assuming spam email is a positive class. Do you think that makes sense for a product like Gmail? If not, what kind of metric should you choose?
4) Suppose that the logistic regression is the first model you used for spam filtering, but you suspect that there is collinearity in features that is affecting your model stability (high variance). What could you do to combat that problem? If you are using L2/L1 regularization, how does the loss function change?
5) You could use PCA or L1/L2 regularization to deal with the problem mentioned above. How do you use historical data to compare their performance and decide which method to use?
6) How do you conduct an online A/B test to compare the methods? What kind of metrics do you want to measure?
7) Suppose your boss told you that they care most about 10 metrics. Is it going to be a problem if you want to test the difference in 10 metrics? How do you deal with that problem?
8) Suppose that you had heard ideas about random forest but there are no implementations of such an algorithm yet. Now one of your colleagues has given you a function called gini_split. How do you implement the random forest algorithm?

Choosing the right metric for your data science/machine learning project: class imbalance

A few days ago, I read a fantastic article on Medium about how to deliver on ML projects. It got me thinking about all the mistakes I made when I was first learning about machine learning. Back then, I was so fascinated by the different kinds of machine learning models available and learned my first lesson when I was training an XGBoost model on a highly imbalanced dataset using accuracy as the metric.

The lesson learned was that choosing the right metric is much more critical than selecting the right algorithm. Selecting the right metric is vital to the success or failure of a business problem.

Metrics in Imbalanced class classification

Choosing the correct metric for an imbalanced classification problem is a crucial first step in building the models. For credit card fraud detection, this problem is especially pronounced, since the vast majority of transactions are legit.

If you use accuracy as the metric, you will likely get a really high accuracy with any model. Imagine that out of 100 transactions, 1 is fraudulent. The accuracy of a model that predicts all transactions to be legit is 99%, yet this model is useless at best.
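The 100-transaction example can be checked in a few lines of Python. The labels and the always-legit "model" below are the hypothetical ones from the paragraph above.

```python
labels = [1] + [0] * 99      # 1 = fraudulent, 0 = legit
predictions = [0] * 100      # a "model" that always predicts legit

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
frauds_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = frauds_caught / sum(labels)

print(accuracy, recall)
```

The accuracy is high, but the recall on the fraudulent class is zero, because the single fraudulent transaction is never caught.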

What kind of metrics could you use? There are metrics like F-measure, ROC/AUC, recall, kappa, precision, balanced accuracy, etc., which each measure a different aspect of the model. The sklearn website has more metrics to choose from.

Different metrics

The precision is the fraction of predicted positives that are truly positive, expressed as the conditional probability P( True label | predicted True ).

The recall is the fraction of truly positive examples that are predicted to be positive, expressed as the conditional probability P( predicted True | True label ). The recall is the same as the sensitivity, and it corresponds to the power in hypothesis testing.

The F1 score is the harmonic mean between precision and recall. F2 gives higher weight to recall while F0.5 gives higher weight to precision. Why harmonic mean instead of the arithmetic mean? The explanation comes from StackOverflow.

Because the harmonic mean punishes extreme values more.

Consider a trivial method (e.g. always returning class A). There are infinite data elements of class B and a single element of class A:

Precision: 0.0
Recall: 1.0

When taking the arithmetic mean, it would have 50% correct. Despite being the worst possible outcome! With the harmonic mean, the F1-measure is 0.

Arithmetic mean: 0.5
Harmonic mean: 0.0

In other words, to have a high F1, you need both a high precision and a high recall.
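The F-measures described above can be computed with the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R). Here is a minimal sketch. The function name f_beta and the example precision and recall values are chosen for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# The trivial classifier from the quote: precision 0, recall 1.
print(f_beta(0.0, 1.0))

# With recall higher than precision, F2 > F1 > F0.5.
f2 = f_beta(0.4, 0.8, beta=2.0)
f1 = f_beta(0.4, 0.8, beta=1.0)
f05 = f_beta(0.4, 0.8, beta=0.5)
print(f2, f1, f05)
```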

This is not the end of the story, since you need to define which class is the positive class for the precision and recall. In credit card fraud detection, you are more interested in the fraudulent cases than the legit ones, so the fraudulent transaction should be set as the positive class.

What is ROC/AUC? I have talked about how to understand ROC curves from the hypothesis testing perspective in previous blogs.  In short, ROC plots the True Positive rate(Recall) vs. False Positive rate.

True Positive Rate = P(Predicted True| True Label) = Recall

False Positive rate = P(Predicted True | False Label)

AUC is short for Area Under the Curve. Since the ROC/AUC metric takes the negative class into consideration, it can be inflated by class imbalance, but if you care about correct predictions for both classes, then ROC is a better choice than accuracy.

Another alternative to ROC/AUC is the balanced accuracy, which re-weights the accuracy of the negative and positive classes (it is the average of the recall on each class). This is an excellent complement to the ROC/AUC score.
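To make these rates concrete, here is a minimal sketch that computes them from the four confusion-matrix counts at a single decision threshold (one point on the ROC curve). The function name rates and the counts are made up for the example.

```python
def rates(tp, fp, tn, fn):
    """Basic rates from confusion-matrix counts (positive class = fraud)."""
    tpr = tp / (tp + fn)   # True Positive Rate = recall
    fpr = fp / (fp + tn)   # False Positive Rate
    tnr = tn / (tn + fp)   # recall on the negative class
    return {
        "tpr": tpr,
        "fpr": fpr,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "balanced_accuracy": (tpr + tnr) / 2,
    }


# 1000 transactions, 10 of them fraudulent; the model catches half of them.
m = rates(tp=5, fp=10, tn=980, fn=5)
print(m)
```

On these counts the plain accuracy is 0.985, while the balanced accuracy is much lower because only half of the positive class is recovered.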

Cohen’s Kappa is another metric that adjusts the accuracy for imbalanced class problems. It is defined by

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where P_o is just the accuracy and P_e is the hypothetical probability of agreement due to chance. The Wikipedia page here has some explanation of Kappa if you are not familiar with it. This is another source that explains the kappa metric from PSU online classes.
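For a binary problem, P_o and P_e can be computed directly from the confusion-matrix counts. The following is a minimal sketch with a made-up function name cohen_kappa. Note that kappa is undefined when P_e equals 1, a case this sketch does not handle.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa for a binary confusion matrix (undefined if p_e == 1)."""
    n = tp + fp + tn + fn
    p_o = (tp + tn) / n
    # Chance agreement: for each class, P(predicted class) * P(true class).
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_pos + p_neg
    return (p_o - p_e) / (1 - p_e)


# The always-legit model on 100 transactions with 1 fraud: 99% accuracy.
print(cohen_kappa(tp=0, fp=0, tn=99, fn=1))

# A model that catches half of 10 frauds among 1000 transactions.
kappa = cohen_kappa(tp=5, fp=10, tn=980, fn=5)
print(kappa)
```

The always-legit model gets a kappa of zero despite its 99% accuracy, since it agrees with the labels no more often than chance would.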

Some Discussion: F-measure vs. ROC/AUC

Some people on Kaggle have argued that the ROC curve should not be used in highly imbalanced classification problems, since the ROC curve is inflated by the class imbalance. For the scripts presented there, the AUC/ROC score does not make sense, but that is not always the case.

The statement from Kaggle below describes when you should use ROC/AUC vs. the F-measure:

“True negatives need to be meaningful for ROC to be a good choice of measure. In his example, if we’ve got 1,000 pictures of cats and dogs and our model determines whether the picture is a cat (target = 0) or a dog (target = 1), we probably care just as much about getting the cats right as the dogs, and so ROC is a good choice of metric.

If instead, we’ve got a collection of 1,000,000 pictures, and we build a model to try to identify the 1,000 dog pictures mixed in it, correctly identifying “not-dog” pictures is not quite as useful. Instead, it makes more sense to measure how often a picture is a dog when our model says it’s a dog (i.e., precision) and how many of the dogs in the picture set we found (i.e., recall). ”

TL;DR: Use the F-measure if you do not care about the negative class, and use AUC/ROC, balanced accuracy, or Kappa if you care about the performance on both classes.

For multi-class classification problems, there are two types of F-measure: the micro-averaged and the macro-averaged F-measures. The micro-averaged F-measure calculates the recall and precision by pooling the data from all the classes, while the macro-averaged F-measure calculates the recall and precision for each class and then averages them.

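$$\text{Precision}_{\text{micro}} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}$$

$$\text{Precision}_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i}, \qquad \text{Recall}_{\text{macro}} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FN_i}$$

where TP_i, FP_i, and FN_i are the true positives, false positives, and false negatives for class i, and K is the number of classes.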
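The difference between the two averaging schemes is easy to see in code. This sketch takes per-class one-vs-rest counts of true positives, false positives, and false negatives. The function name and the counts are invented for illustration.

```python
def micro_macro(per_class):
    """per_class maps class name -> (tp, fp, fn), counted one-vs-rest."""
    tp_sum = sum(tp for tp, _, _ in per_class.values())
    fp_sum = sum(fp for _, fp, _ in per_class.values())
    fn_sum = sum(fn for _, _, fn in per_class.values())
    k = len(per_class)
    return {
        # Micro: pool the counts of all classes, then compute the ratio.
        "precision_micro": tp_sum / (tp_sum + fp_sum),
        "recall_micro": tp_sum / (tp_sum + fn_sum),
        # Macro: compute the ratio per class, then average.
        "precision_macro": sum(tp / (tp + fp) for tp, fp, _ in per_class.values()) / k,
        "recall_macro": sum(tp / (tp + fn) for tp, _, fn in per_class.values()) / k,
    }


counts = {"cat": (90, 10, 5), "dog": (8, 2, 10), "bird": (2, 3, 0)}
scores = micro_macro(counts)
print(scores)
```

The micro averages are dominated by the large "cat" class, while the macro averages give each class the same weight, so the weak precision on "bird" pulls the macro precision down.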

For a complete review of all the metrics for classifications, see this paper.

 

 

If you read this article before Nov 13, note that there were errors in the equations for kappa, Precision_micro, and Recall_micro, which are now corrected.

My Data Science interviews

For more DS/ML interview questions, see this blog post.

Wondering how to efficiently prep for the upcoming DS phone interview? See this blog https://phdstatsphys.wordpress.com/2018/12/17/a-lean-approach-to-data-science-phone-interview/


What is data science today was considered boring data processing back in the 1990s. WENUS, Weekly Estimated Net Usage Systems.

If you want to know what a data scientist is, check out this article on Medium https://medium.com/@m.sugang/what-ive-learned-as-a-data-scientist-edb998ac11ec

This is a personal diary on my interview journey as I look for data scientist jobs. Buckle up and bookmark it since it is going to be a long one.

Some reminders to myself:

  1. Prepare and use the product if possible.
  2. Practice behavioral questions the day before or on that day.
  3. Always be positive.
  4. Write thank-you letters to recruiters and interviewers if possible.
  5. Stick to things you know and be confident.
  6. For whiteboard coding, always run through a test case, including some edge cases.

Some interview prep questions for ML/DS:

Background Questions:
1. Walk me through your resume (or some variant thereof)
2. Explain your past Ph.D./scientific work
3. Which techniques (that are relevant for Data Science) did you use in that scientific work?
4. What languages are you familiar with?

Technical Questions (In order of frequency of appearance):

1. Regression (the short answer here is to know everything) (Always)
– p-values, interpretation of coefficients, interaction coefficients
– know how factor variables are dummy coded, coefficient interpretation
– linear regression assumptions: a linear relationship, the normality of residuals, no autocorrelation,  no multicollinearity, homoscedasticity

– residual analysis, adding non-linear terms
– difference between L1 and L2 penalization, when you would use each one: L2 performs better in most scenarios and leads to small but never exactly zero coefficients; L1 drives some coefficients to exactly zero, which reduces the dimension of the feature space.
– how to deal with highly correlated variables: PCA/regularization.
– walk me through how you would approach building a regression model to predict Y given you have data X
– how do you choose the right number of predictors
– logistic regression: how do you implement stochastic gradient descent for logistic regression with L2 regularization.
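For the last item, here is a minimal sketch of stochastic gradient descent for logistic regression with an L2 penalty. Per sample, the loss is the log loss plus (lam / 2) * ||w||^2, so the gradient with respect to w is (p - y) * x + lam * w. The function names, the hyperparameters, and the toy data are illustrative assumptions, and leaving the bias unregularized is a common choice but not the only one.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def sgd_logistic_l2(X, y, lr=0.1, lam=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            err = predict_proba(X[i], w, b) - y[i]  # d(log loss)/d(logit)
            # Gradient step on one sample; the L2 term shrinks w.
            w = [wj - lr * (err * xj + lam * wj) for wj, xj in zip(w, X[i])]
            b -= lr * err  # the bias is not regularized here
    return w, b


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic_l2(X, y)
print(w, b, predict_proba([0.0], w, b), predict_proba([3.0], w, b))
```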

2. General predictive modeling (Always)
– train/test/validate sets, cross-validation, parameter selection, predictor selection, etc.

3. Random Forest (Always)
– how a tree is grown
– purpose of growing more trees in a forest
– purpose of selecting a subset of variables at each split
– variable importance
– knowing XGBoost vs random forest. How to split for the XGBoost?

4. Matrix Factorization (Rare)
– PCA, when do you use it? How are the components ordered? de-correlate features, reduce the dimension in the high dimension/sparse data. Ordered by the variance.
– I’ve never been asked about factor analysis, but it is great to know and makes other related questions easier

5. K-means clustering (often)
– What is the loss function? When does it converge? RMSE
– know the algorithm iteration steps
– how do you choose the right number of clusters? domain knowledge or use DBSCAN.
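– is the optimization convex? No. The K-means objective is not convex, so the algorithm only converges to a local minimum that depends on the initial centers.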

6. Standard Stats (Often)
– t-test assumptions: sample from the normal distribution, sample independence, homoscedasticity

– z-scores

– chi-square test: the residual is normally distributed, needs a large sample for the chi-squared approximation to hold, independence.

– know covariance and correlation equations

– when splitting the data randomly by user, you need to take into consideration that some users may interact with each other. For example, if you are testing a new pricing algorithm at Uber, you should not split users in the same city/neighborhood. If you are testing a new ranking algorithm for encouraging users to answer questions, then you should not expose the same questions to the different algorithms: after a question has received a lot of high-quality answers, users are less likely to add their own.

7. Time Series (Rare)
– ARIMA model, PACF vs ACF, how to choose the AR and MA terms.
– unit root test, normality test.

I’ve also never been asked to write any SQL queries in interviews yet.

The interview experience:

I am going to write about my interview experience in chronological order, with the good and the not-so-good things.

Phone interview/JPMorgan/Machine learning Engineer intern/rejected

What is overfitting? How to overcome it?

If you have some data that are partially labeled, will that improve your supervised learning model?

What is bias-variance trade-off?

Write out the loss function for logistic regression. How to add L1/L2 regularization?

How to apply the Stochastic Gradient Descent (SGD) to logistic regression?

Given a list that contains some animals like [“bear”, “bird”, “bird”, “chicken”, “mouse”, “mouse”], find out the animal that appears most frequently in the list.
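One possible answer to the coding question, sketched with collections.Counter. Note that the example list has a tie between two animals, so this version returns every animal that reaches the top count rather than just one.

```python
from collections import Counter

animals = ["bear", "bird", "bird", "chicken", "mouse", "mouse"]

counts = Counter(animals)
top = max(counts.values())
# Keep all animals tied for the highest count, in first-seen order.
most_frequent = [animal for animal, c in counts.items() if c == top]

print(most_frequent)
```

Asking the interviewer how ties should be handled is a good clarifying question here.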

Phone interview/Hidimensional/a referral firm/rejected

What is the global cost function for the random forest? Some more questions about basic ML algorithms like Logistic regression.

Good feedback about my interview skills.

Website: https://hidimensional.com/

Zoom interview/Insight data science/A data science boot-camp for Ph.D./Offer

Tell me about your Ph.D. What are your conclusions?

Prepare a small ML project. Expect questions about why you chose this algorithm vs. another.

https://www.insightdatascience.com/

If more people are interested in my experience, I will write it on another blog.

Onsite interview/Wayfair/Senior data scientist/rejected

The onsite interview consists of 4 interviews. The first two are technical, the third one introduces you to the team structure, and the fourth one is also technical. They focus a lot on applying ML algorithms to business cases; expect 2-3 interviewers to ask about how to use ML in a business case.

All my interviewers were changed at the last minute, along with some other last-minute changes. Two interviewers prepared the same coding question.

Coding: Suppose you have a basket of things like 5 sofas, 6 beds, and 7 lamps {“sofa”: 5, “bed”: 6, “lamp”:7}, write a function to randomly pick an object.
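One way to read this coding question is "pick one of the 18 objects uniformly at random", which means each item type is chosen with probability proportional to its count. Under that assumption, a minimal sketch looks like this. The function name pick_item is made up.

```python
import random


def pick_item(basket, rng=random):
    """Pick one item type with probability proportional to its count."""
    total = sum(basket.values())
    r = rng.randrange(total)  # uniform integer in [0, total)
    for item, count in basket.items():
        if r < count:
            return item
        r -= count  # move on to the next item's slice of the range


basket = {"sofa": 5, "bed": 6, "lamp": 7}
print(pick_item(basket))
```

The standard library call random.choices(list(basket), weights=list(basket.values())) does the same thing in one line, but interviewers usually want to see the cumulative-count logic written out.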

Business case: How do you come up with an algorithm to predict the auction price of an advertisement for a blue sofa on Google search? What metrics could you use to measure if the new algorithm improved? What if there are only a few data points for some new items so that your ML algorithm does not have enough data?

HackerRank/Goldman Sachs/Decision data scientist/rejected

All multiple choice math questions.

  1. Calculate the volume swept out by y = x^2 when you rotate the curve around the y-axis.
  2. Calculate the length of the curve y = x^2 + 1 on [0,1].
  3. The limit of sqrt(6+sqrt(6+sqrt(6+…..))), where sqrt means square root.
  4. A combinatorics question. Suppose you have three open-ended strings. Each round, you choose two open ends and knot them together. After three rounds, what is the probability that the end result is a circular ring?
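Questions 3 and 4 are easy to sanity-check in Python. For the nested radical, the limit x satisfies x = sqrt(6 + x), so iterating that map should converge to the positive root of x^2 - x - 6 = 0. For the strings, this sketch assumes the two ends are picked uniformly at random among the free ends, and that "a circular ring" means one ring containing all three strings. The function name single_ring_probability is made up.

```python
import math
from fractions import Fraction

# Question 3: iterate x -> sqrt(6 + x) until it settles.
x = 0.0
for _ in range(50):
    x = math.sqrt(6 + x)
print(x)


# Question 4: with s strings left there are 2s free ends. After picking one
# end, 1 of the other 2s - 1 ends belongs to the same string, and tying those
# two together would close a smaller ring too early.
def single_ring_probability(strings):
    prob = Fraction(1)
    for s in range(strings, 1, -1):
        prob *= Fraction(2 * s - 2, 2 * s - 1)
    return prob


p_ring = single_ring_probability(3)
print(p_ring)
```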

Skype interview/Edgestream/Research scientist/Rejected 

You are asked to present your Ph.D. work in a 20-minute presentation over screen share on Skype. Then, you are given a chance to ask them questions. No technical questions.

Phone interview/Gamalon/Data Scientist/Rejected

They introduced the role a little bit: machine learning prototype work plus customer-facing work. 1. What is it that you wish to learn in the next two years? 2. Tell me about yourself.

Phone interview/JPMorgan/Roar team data scientist/Rejected

Just an hour-long interview asking about your background, what you have done in the past, and your knowledge level of CS/Python/TensorFlow/generators. No technical questions. One brain-teaser: How many times does 2 appear in the sequence of 1 to 1 million?
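The brain-teaser can be read in two ways: how many times the digit 2 is written in total, or how many of the numbers contain a 2 at least once. A brute-force check of both readings takes only a couple of lines (the variable names are made up):

```python
# Reading 1: total number of times the digit 2 is written.
digit_count = sum(str(n).count("2") for n in range(1, 1_000_001))

# Reading 2: how many numbers contain the digit 2 at least once.
numbers_with_2 = sum("2" in str(n) for n in range(1, 1_000_001))

print(digit_count, numbers_with_2)
```

By hand, the first reading follows from treating 0 to 999,999 as six-digit strings: each of the 6 positions shows each digit 100,000 times, giving 600,000. The second is 1,000,000 minus the 9^6 = 531,441 numbers that avoid the digit 2.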

Phone interview/Quora/Data Scientist/proceed to onsite interview

They asked how you would rank the questions and answers in Quora’s feed. How do you use machine learning to build a model to do that? What is recall vs. precision? After you develop the model, how do you use A/B testing to show that your algorithm works better? How to do power analysis? What is power? A comprehensive test of ML and A/B testing.

Onsite interview/Quora/Data Scientist/rejected

The onsite runs from 10:15 am to 3:30 pm. The first 15 minutes are an office tour.

The first interview is data practical, which consists of two questions. The first one is a question about data manipulation and the second one is about doing an A/B test on the data.

The second interview is about A/B tests and product intuition. They asked about the assumptions behind different tests. What if the data is not normally distributed? Why would the data be normally distributed? How do you design an algorithm to decide when you should send out the email? Think about the features you are going to use and watch out for correlated variables. How do you deal with correlated variables? Regularization vs. tree-based models.

The lunch is mostly about selling themselves to you, letting you ask questions about the team and the company.

The third interview is about different kinds of metrics and how to do the random splitting for A/B testing. There are a lot of detailed questions about why you would use this metric, why not that one, and how you would fix the approach if it is problematic.

The fourth interview is a coding problem. Given a large integer stored as a string, how do you come up with a function to do multiplication?
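
One way to approach this kind of problem is the schoolbook method, multiplying digit by digit and carrying to the left. The sketch below is my own after-the-fact version, not what I wrote in the interview, and it assumes both inputs are non-negative integers written as decimal strings.

```python
def multiply_strings(a, b):
    """Multiply two non-negative integers given as decimal strings."""
    if a == "0" or b == "0":
        return "0"
    # result[i + j + 1] collects the product of a[i] and b[j]; carries move left.
    result = [0] * (len(a) + len(b))
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            total = int(a[i]) * int(b[j]) + result[i + j + 1]
            result[i + j + 1] = total % 10
            result[i + j] += total // 10
    return "".join(map(str, result)).lstrip("0")


print(multiply_strings("123", "456"))
```

The double loop costs O(len(a) × len(b)) digit multiplications, which is usually what the interviewer expects before any discussion of faster methods.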

The fifth interview is behavior-based. What is the biggest mistake you have made and what have you learned from it? Tell me about a time you broke a rule. Tell me about one of the projects you have worked on.

After the interview, I felt that I did not do so well and I am not a good fit for the position. If you have an extensive background in stats, then you should consider Quora’s data science team, which is separate from the ML team.

I got rejected within two business days. I love the product and the team, and they moved really fast in the process, which I really appreciate. The entire onsite interview process was rather seamless.

Technical Assessment + phone screen with HR/CarMax/Senior data scientist/No response after two months. 

The technical assessment has nothing to do with data science. Most of the questions are just reading graphs and doing some percentage calculations under a time limit. I did not get to finish all the questions. I felt that the questions were appropriate for a business analyst.

The phone screen is just a casual conversation with HR about the company and the team. I heard that the next round is a business case problem over the phone, which suggests to me that they are looking for a business analyst rather than a typical data scientist. My friend told me that there are no ML/stats/coding questions at all.

 

Some more prep questions from Glassdoor:

1. How do you choose between two ranking algorithms?
Possible signals: clicks from email, session time, upvotes, follows, answer requests, answers written.

2. What do you want to improve about Quora? What features do you want to include? What features do you like about Quora?

3. There are X answers to a certain question on Quora. How do you create a model that takes user viewing history to rank the questions? How computationally intensive is this model?

4. You’re drawing from a random variable that is normally distributed X ~ N(0,1), once per day. What is the expected number of days that it takes to draw a value that’s higher than 2?

5. How would you improve the “Related Questions” suggestion process?

6. Write a function to return the best split for a decision tree regressor using the L2 loss.

7. Why do you want to join Quora?

8. Pandas.

9. growth analysis

10. Print the element of a matrix in a zig-zag manner.
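
Questions 4 and 6 in this list are the two that benefit most from writing something down, so here is a rough sketch of both. The function names are mine. For question 4, each day is an independent trial with success probability p = P(X > 2), so the waiting time is geometric with mean 1/p, about 44 days. For question 6, I assume a single numeric feature and score every candidate threshold by the summed squared error of the two sides.

```python
from statistics import NormalDist


def expected_days_until_exceed(threshold):
    """Mean of a geometric waiting time with success probability P(X > threshold)."""
    p = 1 - NormalDist().cdf(threshold)
    return 1 / p


def best_split(xs, ys):
    """Return (threshold, loss) minimizing the summed squared error of a split."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_sum = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    best = (None, float("inf"))
    left_sum = left_sq = 0.0
    for i in range(1, n):
        x_prev, y_prev = pairs[i - 1]
        left_sum += y_prev
        left_sq += y_prev * y_prev
        if pairs[i][0] == x_prev:
            continue  # cannot split between equal feature values
        right_sum = total_sum - left_sum
        right_sq = total_sq - left_sq
        # Squared error around the mean: sum(y^2) - (sum y)^2 / count, per side.
        loss = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if loss < best[1]:
            best = ((x_prev + pairs[i][0]) / 2, loss)
    return best


print(expected_days_until_exceed(2))
print(best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5]))
```

Keeping running sums means each candidate threshold is scored in constant time after one sort, instead of recomputing both means from scratch.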

Phone screen+data challenge/Watchtower.ai/first data scientist/proceed to onsite interview

I got referred by a friend. The phone screen was mainly about my background and what I am looking for in a data science position. They (the two founders) talked about the company they are trying to build. They asked me if I had time to do a data challenge, and even offered me an alternative project if I did not have the time. They told me that they are looking for their first data scientist right now.

I did the project over the weekend and sent them my GitHub repo with a short slide deck on Monday. Then I talked with them over Zoom about the project and got invited to their office in SF. I could provide my GitHub repo later if anyone is interested.

The company is relatively small, so everything moves really fast, which is something I really appreciate and love.

Onsite/Watchtower.ai/data scientist/Rejected

The interview starts at 11:59 am. The entire interview is about implementing the kmeans algorithm and applying it to image color vector quantization. See more details here on another blog. You first talk about kmeans and how it works, and write it in Python for the first 1.5 hrs. Then you have lunch with the team. After lunch, you continue to implement the kmeans algorithm and apply it to an image. I did not get to finish it in time since I did not know the proper way to deal with initialization, and a bad initialization sometimes leads to all the data being assigned to one single cluster. I got some more time to work on it after the interview. At the end of the onsite interview, you get some time to ask them more questions and they ask you some more questions.

I have written about Kmeans and how I implemented it here.
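
For readers who just want the gist of the initialization issue, below is a small pure-Python sketch. It is not the code from the interview or from my other post. It uses k-means++ seeding, which spreads the initial centers out, and it re-seeds any cluster that ends up empty. All function names are my own, and it assumes the data has at least k distinct points, each given as a tuple of numbers.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: sample new centers proportionally to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm with k-means++ seeding; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                # Re-seed an empty cluster instead of letting it disappear.
                centers[j] = rng.choice(points)
                continue
            centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)
```

For color quantization, the points would be the (R, G, B) values of the pixels and each pixel would be replaced by its cluster center.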

After about 10 days (over the Thanksgiving weekend), they gave me a call about their concerns about both my skills and visa status. The phone call was really nice since they pointed out where I could improve. I will need to get better at writing code faster and writing code that is production ready.

Phone Screen + data challenge/Upstart/data scientist/Rejected

The HR screen was mainly about the team at Upstart and the work they do. Then I talked about my background and where my strengths are. They really care about whether you have written a lot of code before, since the HR asked me how many lines of code I have written.

After the phone screen with the HR, I was quickly given a data challenge over the weekend. The problem was a simple survival analysis with the Kaplan-Meier estimator. But I made two mistakes and did not get to proceed with the interview.
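
For anyone who has not seen it, the Kaplan-Meier estimator multiplies, at each event time, the fraction of at-risk subjects that survive past that time. Here is a minimal sketch with made-up data. It is not the Upstart challenge, and the function and argument names are my own.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event or until censoring.
    observed: True if the event happened, False if the subject was censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
    return curve


print(kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False]))
```

Censored subjects never lower the curve themselves, but they shrink the risk set, which makes later events count for more.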

I will write about the lessons I learned from this interview in the future.

Phone screen/Vinya intelligence/data scientist/Rejected.

I was first sent some materials to look at before the phone screen. The phone screen was with the CEO. He talked in great detail about Vinya intelligence, and then I talked about my background and asked a lot of questions about the company and how they expect it to grow.

I am currently waiting for them to schedule the onsite interview since Christmas is coming.

After Christmas, I sent a follow-up email but was told that there were problems with their funding and they were not hiring right now.

Phone screen/Root insurance/data scientist/Rejected

First, I talked about what I knew about Root insurance, and the VP then talked about Root insurance in more detail. Then I asked some questions about the team, projects, etc.

Then it moved on to technical questions, which totally surprised me since the HR did not mention anything about technical questions. But I felt that I should have expected it, since I was scheduled for a 1hr phone call with the VP.

He first asked me about the central limit theorem and how it can be useful, then how to estimate confidence intervals and how to use bootstrapping for estimation.
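
As a refresher on the bootstrap part, here is a minimal sketch of a percentile bootstrap confidence interval. You resample the data with replacement many times, recompute the statistic each time, and read the interval off the sorted estimates. The function name, the number of resamples and the simulated data are all my own choices.

```python
import random
from statistics import mean


def bootstrap_ci(data, stat=mean, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * n_resamples)]
    upper = estimates[round((1 - tail) * n_resamples) - 1]
    return lower, upper


rng = random.Random(1)
sample = [rng.gauss(10, 2) for _ in range(200)]
print(bootstrap_ci(sample))
```

The central limit theorem gives a similar interval for the mean in closed form (mean ± 1.96 standard errors), while the bootstrap also works for statistics like the median where no simple formula exists.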

Then he asked how to derive the analytical solution for linear regression, going from the L2 loss to the final solution. Then we moved on to logistic regression vs. linear regression: what is the loss function, and why use logistic regression for 0/1 labeled data? Lastly, we talked about what you would do if the data is imbalanced and what kind of metrics you should use. I have also written about this topic recently on my blog.
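
The linear regression derivation can be sketched as follows. The notation is my own: X is the design matrix with one row per data point, y is the target vector, w is the weight vector, and X^T X is assumed to be invertible.

```latex
\begin{aligned}
L(w) &= \lVert y - Xw \rVert_2^2 = (y - Xw)^\top (y - Xw) \\
\nabla_w L(w) &= -2\, X^\top (y - Xw) = 0 \\
X^\top X \, w &= X^\top y \\
w &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

Logistic regression instead minimizes the log loss, which has no closed-form solution and is usually minimized with gradient-based methods.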

The next step is to do the data challenge, and I am still waiting for a response from them.

I got the rejection letter about a week later.

An important lesson I learned from this phone interview was to refresh the things you think you know before every interview. I will have a checklist/blog on this in the near future. 

Phone screen/CVS supply chain team/data scientist/Rejected

A friend at the Insight Data Science Bootcamp referred me to this job. The phone interview was conducted by two people. I was pretty nervous and stuttered during the entire process, especially in the beginning. Both of the interviewers had been through the Insight data science program like me, and they tried to calm me down.

The question was about how to design a program for an elevator. How do you design the interface for the buttons in the building and the buttons in the elevator? How do you store a sequence of requests from the building? How do you determine the sequence of actions based on the requests available now? What kind of object in Python should you use for the interface and main program? (A Python class.)

You will need to explain your logic in an abstract way. No whiteboard coding problem.
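
Even though no code was required, it helped me afterwards to write the idea down. The class below is only a sketch of one possible design, not the answer the interviewers were looking for. It stores pending floors in a set and keeps moving in the current direction until nothing is left ahead, then reverses. The class and method names are my own.

```python
class Elevator:
    """Serve floor requests by sweeping in the current direction."""

    def __init__(self, floor=1):
        self.floor = floor
        self.direction = 1     # +1 going up, -1 going down
        self.requests = set()  # floors requested from hall or cabin buttons

    def press(self, floor):
        """Register a button press, either in the hallway or inside the cabin."""
        self.requests.add(floor)

    def next_stop(self):
        """Pick the nearest request ahead; reverse only when nothing is ahead."""
        if not self.requests:
            return None
        ahead = [f for f in self.requests if (f - self.floor) * self.direction >= 0]
        if not ahead:
            self.direction *= -1
            ahead = list(self.requests)
        return min(ahead, key=lambda f: abs(f - self.floor))

    def run(self):
        """Visit all pending requests and return the order of stops."""
        stops = []
        while self.requests:
            self.floor = self.next_stop()
            self.requests.discard(self.floor)
            stops.append(self.floor)
        return stops


elevator = Elevator(floor=5)
for requested in (7, 3, 9, 1):
    elevator.press(requested)
print(elevator.run())
```

Starting at floor 5 and going up, the stops come out as 7, 9, 3, 1, so the cabin finishes the upward sweep before turning around.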

Phone screen/Wayfair/data scientist Ph.D./Rejected

Applied to Wayfair again after 6 months. I got a phone interview a week after the online application. The interviewer was 10 minutes late due to some confusion.

He first introduced himself as someone from the Direct Mail team and described the Data Science team structure at Wayfair. Then I introduced myself as usual.

Then he went on to ask me why Wayfair wants to match two or more different devices to the same person (a more complete picture of the user). Then he asked me to give an example of the usage (better ad tracking and marketing). This is a brief introduction to the topic if you are not familiar with it. https://clearcode.cc/blog/deterministic-probabilistic-matching/

Later we moved on to the more technical part of the interview. He asked me what features and approach I would use to solve that problem. I proposed to get some labeled data first by considering whether the devices have opened the same email from Wayfair, whether they have logged into the same account on different computers, etc. After obtaining some labeled data, I discussed some device attributes and user behaviors that can be used in the algorithm. These two articles discuss the attributes and process in cross-device matching in more detail. If you are in a hurry, these two articles are all you need.

https://blogs.gartner.com/martin-kihn/how-cross-device-identity-matching-works-part-1/

https://blogs.gartner.com/martin-kihn/how-cross-device-identity-matching-works-part-2/

Then I proposed to use the DBSCAN algorithm (an unsupervised learning method) to determine if the devices belong to the same user. I would use the labeled data to measure the performance of DBSCAN and tune its two major hyperparameters: the minimum number of points and the radius.

For more on DBSCAN, see here https://towardsdatascience.com/how-dbscan-works-and-why-should-i-use-it-443b4a191c80.

Then he went on to ask me how to apply supervised machine learning to this problem. What is your target value? I said that I would treat each user as a class, which can lead to millions of classes for Wayfair. He suggested that we reduce the problem to whether two devices belong to the same person. The problem is then transformed into a binary classification problem. Even now, I still do not fully understand how to train the model with the data in that setup.
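
My best guess at what he meant is sketched below: every pair of devices in the labeled subset becomes one training row, with features that compare the two devices and a 0/1 label saying whether they share an owner. The field names ('user', 'ip', 'city') and the two features are hypothetical placeholders of mine, not anything Wayfair described.

```python
from itertools import combinations


def build_pair_dataset(devices):
    """Turn per-device records into (features, label) rows, one per device pair.

    devices: list of dicts with hypothetical keys 'user' (known owner in the
    labeled subset), 'ip' and 'city'.
    """
    rows = []
    for a, b in combinations(devices, 2):
        features = [
            int(a["ip"] == b["ip"]),      # shared IP address
            int(a["city"] == b["city"]),  # same coarse location
        ]
        label = int(a["user"] == b["user"])  # 1 if both devices have one owner
        rows.append((features, label))
    return rows


devices = [
    {"user": "u1", "ip": "10.0.0.1", "city": "Boston"},
    {"user": "u1", "ip": "10.0.0.1", "city": "Boston"},
    {"user": "u2", "ip": "10.0.0.9", "city": "Boston"},
]
for features, label in build_pair_dataset(devices):
    print(features, label)
```

Any binary classifier, logistic regression for example, can then be trained on these rows, and at prediction time it scores a new pair of devices instead of choosing among millions of user classes. The number of pairs grows quadratically, so in practice you would probably only generate pairs that already share some attribute.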

In the end, I asked some questions about team growth and career development at Wayfair. The interview was kind of rushed at the end because he was 10 minutes late.

 

Some preparation questions from Glassdoor:

Introduce yourself first.
Describe a project that you are most proud of.

Questions:
1. How would you evaluate the effectiveness of various ads?
2. (Most frequent on Glassdoor) Match users using different devices to browse items; the devices are phones or laptops.
3. Recommend goods to the customer.
4. Given the available data, walk me through how you would value a marketing campaign from a data science perspective.
5. Wayfair decides to not offer phone customer service to half of their online customers. Why would Wayfair decide to do that?
6. Match furniture in a scene given query pictures plus computer vision basics.
7. Data science business case (e.g. click through, customer retention, etc.)
8. What are the potential weaknesses of that pricing strategy?
9. Sale tag case.
10. How do you quantify the overall effect of SALES flag on the Wayfair website? What model to use?
11. Some question about website data analysis, a question on how to evaluate the influence of sale tags on the website.
What’s the difference between random forest and gradient boosting?
The dependent and independent variables. Which algorithms? Why this algorithm? Is there any unbalanced data? What kind of metrics do you use to evaluate the algorithm?
What kind of job will make you excited to go to work every morning?

Phone screen/Jasmine22/second data scientist/Rejected

The position was referred by a headhunter in HK. The phone screen was conducted over a Google meeting. The interviewer was several minutes late. He just said “introduce yourself” after he joined the meeting, with no introduction of himself or the team, nor did he apologize for his delay. It appeared very rude and disrespectful to me as a candidate. Other than that, I believe that he put the laptop on his legs and kept shaking his leg during the interview. It was just the worst and most disrespectful interview I have ever had so far. This is the end of my complaint.

He started the interview with “introduce yourself” (first red flag). After I talked about my experiences, he went on to ask about one of my recent projects with word2vec. I tried to explain what word2vec is and how I obtained my data, but he seemed a bit confused. Then he asked me about the supervised and unsupervised machine learning models I know. I tried to explain some of the algorithms, but he suddenly interrupted me when I was talking about kmeans (second red flag). He then asked me how you make sense of the clusters. I tried to explain to him that kmeans only optimizes the within-cluster sum of squares and that you have to make sense of the cluster formation afterward.

He seemed confused about what I was saying, so I asked him for some background on the question. He told me that he was working on clustering companies by growth potential (big company but slow growth, small company but high growth, etc.). After explaining a little more about how kmeans works, I came to the conclusion that he, as that company’s first data scientist, had no understanding of how kmeans works (another red flag). I offered him a solution, which is to look into the profiles of the companies after the clustering process. I then asked him about company growth, possible projects, team structure, etc. Another red flag here was that the company was trying to tackle a supply chain problem and a refinancing problem for small to medium companies at the same time.

I got the rejection after a few days saying that I lacked work experience, which they knew before I came into the interview. But I would totally understand if they had found a candidate with more work experience.

Phone screen + data challenge/ RiskIQ/data scientist (research oriented)/Rejected

I got this interview through Insight Data Science. The first call was with the VP of engineering. It was a fun interview. I got to ask many questions regarding the role and the company. I also talked about my experience and the kind of problems I have solved and am solving. The call ended early with me moving to the next round of the interview, the data challenge.

The situation went downhill from there. When they sent over the data challenge after my reminder email, my name was misspelled. I thought to myself that it could happen to anyone, including me, with that damned autocorrect. But this was probably the first red flag.

The second red flag was the data challenge itself. The data they sent over needed some degree of cleaning, which is quite normal. They want to know whether you can deal with it. But a large part of the data was hashed and the questions were too open-ended. I have to emphasize that the role itself is somewhat research-oriented, so I understand the nature of the open-ended questions. But the goal of the data analysis was so unclear that you did not know where to start exploring at all.

These are the questions that were asked:

Suppose you want to understand more about these hosts and their dependent requests.  What can you derive from this data?  Some ideas:

* what does the distribution of dependent requests across hosts look like?  What other basic statistics might be interesting to describe this data?
* can you classify hosts into reasonable “types” based on these features?  Conceivably, www.google.com will look different from www.nytimes.com, but perhaps www.google.com and www.bing.com are similar?
* are there interesting correlations between dependent requests?
* any other interesting question you might want to answer…

The HR emphasized that I should spend no more than 3hr on this, but it took me about 3hr just to clean all the data and format it in a way that is easy for further analysis. This is not counting the computation that took overnight to run. Maybe I am not that fluent in analyzing web traffic data, so I ended up spending the entire weekend and part of Monday to finish it.

I have to admit that part of the reason for my ranting is that they rejected me. But another part of me began to question the nature of such open-ended data challenges as free exploratory work, especially with no technical questions in the phone interview.

 

The HR went radio-silent for two weeks after I sent in the data challenge and then sent me a standard rejection letter. When I asked for some feedback, they did not reply at all. That is when I became suspicious, reflected on the entire interview process, and wrote this down.

Leave a comment below if you are interested to know what the data challenge is. I can share my GitHub repo with you.

Two phone interviews + data challenge/Peloton/product analyst/Just submitted data challenge

I was referred by a friend to this position. I applied in April and got a reply in early May. The first call was a casual phone call with HR about the role, the company, the team, and the next steps. The HR was really nice and mentioned their really awesome policy on visa sponsorship and green cards. That was the most pleasant call I had had in a long while.

The next step was with a senior analyst on the team. She was a really nice person and a pleasure to talk to. She was very polite, respectful and patient with my questions regarding the team, the projects, etc. The interview started with her introducing the team, the company, and the ecosystem they are trying to build. She then gave me some context on the product and asked me what metrics I would measure to understand user engagement. The second question was about how to use user data to understand whether the user was using the heart rate monitor, or whether the heart rate monitor was working at all.

Then I got to ask questions about the team and the onboarding process. The onboarding process is two projects: one gets you familiar with the data pipeline they have, and the second is a more open-ended research problem. (I ask about onboarding projects because they are a good gauge of the work they expect you to do and of whether they are really hiring.)

The next step was a data challenge with an A/B testing problem. I also had a point of contact in case any questions arose. I had a question about the experiment time on Saturday night at 11pm, and the contact replied within 10 mins. That is just amazing communication!

Now I am waiting for the verdict for my data challenge. I have put my results on my GitHub here https://github.com/edwardcooper/peloton.

 

Set values in DataFrame with Boolean index in pandas

If you set values in a pandas data frame with a Boolean vector, you will likely run into the warning described below.

Have you ever come across this warning when you tried to set some values in a pandas data frame?


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

Does the code that triggers the warning look like the lines below, setting values in a pandas data frame with an index and a column name?


df.some_col_name[index]="A value to set"

df.loc[index,some_col_name]="A value to set"

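The first line is chained indexing, which may write to a temporary copy instead of df. The second form, df.loc, is the one pandas recommends, and it only warns when df is itself a slice of another data frame (take an explicit .copy() of the slice first in that case). Another way that avoids the warning is to build the new column with where and assign it back:


df["some_col_name"]=df.some_col_name.where(~index,other="A value to set")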

If you use the where syntax and get an error like


bad operand type for unary ~: 'float'

it is because the index is not Boolean. You need to convert the pandas series into Boolean values first, using the code below.


index=index.astype("bool")

df["some_col_name"]=df.some_col_name.where(~index,other="A value to set")
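
Here is a small self-contained sketch of two patterns that set values without the warning. The data frame and the column names score and grade are made up for illustration. Also keep in mind that astype("bool") turns missing values into True, so if your mask contains NaN you may want to call fillna(False) on it first.

```python
import pandas as pd

df = pd.DataFrame({"score": [1.0, 5.0, 3.0], "grade": ["low", "low", "low"]})

# A Boolean mask built from a comparison on the same data frame.
mask = df["score"] > 2

# Option 1: build the new column with where() and assign it back to df.
df["grade"] = df["grade"].where(~mask, other="high")

# Option 2: take an explicit copy of a subset, then use .loc on that copy.
subset = df[df["score"] > 0].copy()
subset.loc[subset["score"] > 4, "grade"] = "top"

print(df["grade"].tolist())
print(subset["grade"].tolist())
```

The explicit .copy() tells pandas that subset is its own data frame, so writing to it is unambiguous and df stays untouched.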

This is really annoying, very counter-intuitive and stupid if you are coming from R or Matlab (I suppose?).

I am going to add more pandas fixes to the blog as I learn along the way.

If you want to know more about SettingWithCopyWarning in pandas, you could check out this detailed blog by Dataquest. https://www.dataquest.io/blog/settingwithcopywarning/

Quick summary: seaborn vs ggplot2

Visualization: The start and the end

Visualization is both the first step and an extremely crucial part of any data analysis. It could be the exploratory data analysis at the beginning of predictive modeling or the end product for a monthly report.

There are several data visualization packages in Python and R. In R, we have the excellent ggplot2, which is based on the grammar of graphics. In Python, we have the amazing seaborn and matplotlib packages. All these packages approach the problem of plotting a little bit differently, but they all aim at plotting the same things. Once you know the graph you want to plot, it is easier to master them and switch between them.

Of course, there are more fancy plotting solutions available like d3 or plotly.

Another way to look at visualization:

If we compare data visualization to a Taylor expansion, one-variable visualizations are like the first order expansions, two-variable visualizations are like the second order expansions, three-or-more-variable visualizations are like the higher order terms in Taylor expansion.

Visualizing one variable

discrete data

We should use a barplot to count the number of instances in each category.

ggplot2: geom_bar

seaborn: sns.countplot

continuous data

We should use a histogram to accomplish this.

ggplot2: geom_histogram

seaborn: sns.distplot (deprecated since seaborn 0.11; use sns.histplot or sns.displot in newer versions)

Visualizing two variables

Two discrete data columns

Use a bar plot of one column and label the other data column with colors (I will talk about facets in visualizing 3 or more variables.)

ggplot2: geom_bar(aes(fill=name_of_another_data_column))

seaborn: sns.countplot(hue='name_of_another_data_column')

One discrete, one continuous data columns

Use boxplot for this.

ggplot2: geom_boxplot

seaborn: sns.boxplot

There are also swarmplot, stripplot, violinplot for this type of job.

Two continuous data columns

We use scatter plot for this.

ggplot2: geom_point

seaborn: sns.regplot, sns.jointplot(kind='scatter')

Visualizing three or more variables.

Things are getting complicated here. There are generally two ways to accomplish this. The first method is to label things with different colors, sizes, shapes, etc. in one graph. The second method is to plot more graphs, with each graph visualizing some of the variables while keeping one variable constant.

Two discrete and one continuous data columns

Plot the continuous variable against one discrete variable with a boxplot, and use color or facets for the other discrete variable.

Two continuous and one discrete data columns

Plot the two continuous variables with a scatterplot, and use color or facets for the discrete variable.

In ggplot2:


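library(ggplot2)
data(tips, package = "reshape2")  # the tips data set ships with the reshape2 package

sp <- ggplot(data=tips, aes(x=total_bill, y=tip)) + geom_point()

# Divide by levels of "sex", in the vertical direction
sp + facet_grid(sex ~ .)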

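In seaborn you could choose factorplot (renamed catplot in seaborn 0.9) or FacetGrid.


import matplotlib.pyplot as plt
import seaborn as sns

tips=sns.load_dataset('tips')  # example data set, downloaded by seaborn on first use
g=sns.FacetGrid(data=tips,row='sex')
g.map(sns.regplot,'total_bill','tip')
plt.show()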

 

Three continuous data columns

This needs a 3D scatterplot. It is not implemented in ggplot2 or seaborn, but matplotlib ships the mplot3d toolkit for it. See this documentation for python.
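
As a rough illustration, matplotlib's bundled mplot3d toolkit can draw a 3D scatterplot in a few lines. The data below is random and only there to make the sketch runnable, and the Agg backend is my choice so that it also works without a display.

```python
import io
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers "3d" on older matplotlib

rng = random.Random(0)
x = [rng.random() for _ in range(50)]
y = [rng.random() for _ in range(50)]
z = [a + b for a, b in zip(x, y)]

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")  # 3D axes from the mplot3d toolkit
ax.scatter(x, y, z)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # write the image to memory instead of a file
```

In a notebook you would drop the Agg line and the buffer and simply call plt.show().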

This presentation is a good example of how to do more than 2 variables in R using ggplot2.

For advanced features like FacetGrid and factorplot in seaborn, see this blog for more examples.

This is a rather short summary and comparison of seaborn and ggplot2, and a discussion of how I view the data visualization process. I will add more examples in R/Python in the future.