Word2vec: how to train and update it

In this blog post, I will briefly cover what word2vec is, how to train your own word2vec model, how to load Google's pre-trained word2vec model, and how to update the pre-trained model, all with the gensim package in Python.

What is word2vec?

If you have ever been involved in building a text classifier, you have probably heard of word2vec. Word2vec was created by a team of researchers led by Tomáš Mikolov at Google. It is an unsupervised learning algorithm that trains a two-layer neural network to predict a word's context words. To understand more about how word2vec works under the hood, you can refer to the YouTube video by Stanford University.

The end result of word2vec is that it can convert words into vectors.


There are two algorithms for generating the encoding: the continuous bag-of-words (CBOW) approach and skip-gram. They are depicted in the diagram below.


How to train a word2vec?

from gensim.models import Word2Vec

sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
             ['if', 'you', 'have', 'think', 'about', 'it']]
# sg=0 (the default) selects CBOW; sg=1 would select skip-gram
# note: in gensim >= 4.0 the `size` parameter was renamed `vector_size`
model = Word2Vec(sentences, size=10, window=5, min_count=1)

There are more options for the training: the size option determines the dimensionality of the word vectors, the window option is the number of words before and after the target word used as context during training, and min_count is the minimum number of times a word needs to appear in the corpus to be included in the training.

Save model and load the model

After you have finished training the model, you can save it to disk with model.save():


If you want to reload the model into your workspace later, you can load it back with Word2Vec.load():


For more details on how to use the gensim package for word2vec models, see the tutorial by the gensim author.

How to update the word2vec model?

from gensim.models import Word2Vec

old_sentences = [["bad", "robots"], ["good", "human"]]
new_sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
                 ['if', 'you', 'have', 'think', 'about', 'it']]
old_model = Word2Vec(old_sentences, size=10, window=5, min_count=1, workers=2)
old_model.save("old_model")  # save first so it can be reloaded below
new_model = Word2Vec.load("old_model")
new_model.build_vocab(new_sentences, update=True)  # add the new words to the vocabulary
new_model.train(new_sentences, total_examples=2, epochs=1)


The google pre-trained word2vec model

Google has published a pre-trained word2vec model. It was trained on part of the Google News dataset (about 100 billion words) and contains 300-dimensional vectors for 3 million words and phrases. Here is the download link for Google's pre-trained 300-dimensional word vectors: GoogleNews-vectors-negative300.bin.gz. The binary file (GoogleNews-vectors-negative300.bin) is 3.4 GB when unzipped. For more information about the word2vec model published by Google, see the link here.

After downloading it, you can load it as follows (assuming it is in the same directory as your .py file or Jupyter notebook):

from gensim.models import KeyedVectors
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

“Transfer learning” on Google pre-trained word2vec

Update your word2vec with Google’s pre-trained model

It is a powerful pre-trained model, but there is one downside: you cannot continue training it, since the published file lacks the hidden weights, vocabulary frequencies, and binary tree. Therefore, it is not directly possible to do transfer learning on Google's pre-trained model.

You might have a customized word2vec model you trained yourself, and you might worry that the vectors for some common words are not good enough.

To solve the above problem, you can replace the word vectors in your model with the vectors from Google's word2vec model using a method called intersect_word2vec_format.

your_word2vec_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', lockf=1.0, binary=True)

See the documentation here for more details on this method.

The method described above is not exactly transfer learning but it is quite similar.

How to really do transfer learning using Google’s pre-trained model for your customized dataset?

It is often argued that there is little to no benefit in continuing the training of a word2vec model.

Imagine a situation where you have a small customized dataset, so the word2vec model you trained is not good enough, but you are also worried that the pre-trained vectors do not really make sense for some common words in your domain. I argue that this is a pretty common situation, and it is the same reason transfer learning on convolutional neural networks is so popular.

Here is the code to accomplish that.

from gensim.models import Word2Vec

sentences = [["bad", "robots"], ["good", "human"],
             ['yes', 'this', 'is', 'the', 'word2vec', 'model']]

# size needs to be set to 300 to match Google's pre-trained model
word2vec_model = Word2Vec(size=300, window=5, min_count=1, workers=2)

# build the vocabulary from your own sentences first;
# intersect_word2vec_format only fills in vectors for words already in the vocabulary
word2vec_model.build_vocab(sentences)

# assign Google's vectors to the words that appear both in Google's
# pre-trained model and in your sentences defined above;
# lockf needs to be set to 1.0 to allow continued training
word2vec_model.intersect_word2vec_format('./word2vec/GoogleNews-vectors-negative300.bin',
                                         lockf=1.0, binary=True)

# continue training with your own data
word2vec_model.train(sentences, total_examples=3, epochs=5)






A lean approach to the data science phone interview

This is my curated checklist for before the phone interview and the onsite interview.

Practice “Tell me about yourself/your background/…..”

Pre-Phone interview

  1. Company website dive
  2. Prepare questions you have
  3. Review basic Stats
  4. Review Machine learning basics (Supervised and unsupervised)
  5. Review Python basics

Company website/app dive

Look at the company website to see how they work, how they grow their business, and how they compare to competitors. Most importantly, why do they need data science? This process is to get yourself familiar with the business and to think about how the data part contributes to it.


Prepare questions you have

After looking at the website or app, you might have some ideas about how they could benefit from predictive modeling or A/B testing. Some of the questions you want to ask may already be answered before you ask them.

  1. How is the data science team structured? What is the team's current size and what are the team members' backgrounds? What is the expected growth of the team over the next year? In what other areas of the company do you think the data science team can make a big impact?
  2. Where do you think data science team can make the biggest contribution right now? What are the biggest challenges for the team right now? How are you dealing with those challenges?
  3. What are some possible projects I will work on if I am hired? What is the onboarding process? What is the tech stack for the data science team? How does the data science team work with other teams? Is there a central data science team, or a decentralized team in each department?
  4. Who do you think are your competitors, and what is your estimated respective market share? What’s your advantage over your competitors?
  5. Some company or industry-specific questions.

Personal favorite questions: Tell me more about the interaction you have with the data security team, data engineering team, and software engineering team. What is the strategy for anonymizing data for predictive modeling? How do you deal with the exponentially increasing data size?

You will find out whether the company's biggest challenges come from these areas and whether the answers are consistent. Sometimes it is good to ask related but not identical questions to gauge whether the interviewer's answers are consistent.

Review basic stats

Go over some hypothesis tests by hand and review the assumptions behind some basic hypothesis tests. You will be surprised by how much you have forgotten.

  1. Central limit theorem: the mean of an i.i.d. sample from the population is approximately normally distributed if the sample size is big enough.
  2. Bootstrapping: sampling with replacement. How do you use bootstrapping to estimate a confidence interval? How do you use bootstrapping to test model robustness (the variance part of the bias-variance trade-off)?
  3. t-test, Welch’s t-test, Chi-squared test, Wilcoxon test. How to calculate the test statistic and what are the assumptions behind all these tests?
  4. How do you do a power analysis? Try to do one simple power analysis for a t-test and a chi-squared test by hand or using Python, even if you feel confident about it. Just do it once without any material.
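As a warm-up for item 4, the power of a two-sample t-test can be approximated by hand with the normal distribution. A minimal sketch in plain Python (effect size d = 0.5 with n = 64 per group is the textbook example that gives roughly 80% power at α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(d, n, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test with
    effect size d (Cohen's d) and n subjects per group, using
    the normal approximation to the t distribution."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96 for alpha=0.05
    noncentrality = d * sqrt(n / 2)               # shift of the test statistic under H1
    return NormalDist().cdf(noncentrality - z_crit)

print(round(two_sample_power(d=0.5, n=64), 2))  # ≈ 0.81, close to the classic 80% target
```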

Review Machine learning basics

  1. Linear regression: Loss function (with/without regularization), Analytic solutions and a gradient descent approach.
  2. Logistic regression: Loss function (with/without regularization), gradient descent solution.
  3. Decision trees. Gini index vs entropy vs hinge loss. What is a random forest? How does it work? Random forest vs decision tree.
  4. K-means and PCA are the most commonly asked unsupervised learning methods.
  5. Different metrics for machine learning: F-measure (F1, F2, F0.5), recall, precision, kappa, ROC curve, AUC (area under the ROC curve), balanced accuracy.
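For item 1, it is worth being able to write the gradient descent update for simple linear regression from scratch. A minimal sketch on made-up data generated exactly from y = 2x + 1:

```python
# fit y = w*x + b by gradient descent on mean squared error
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # exactly y = 2x + 1

w, b = 0.0, 0.0
lr = 0.02
n = len(xs)
for _ in range(10000):
    # gradients of MSE = mean((w*x + b - y)^2) with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # 2.0 1.0
```

Being able to derive grad_w and grad_b on a whiteboard, and to explain how regularization would change them, covers a large share of the regression questions.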

Review Python basics

  1. List and its methods: append, extend, insert, slicing, and indexing.
  2. Dictionary and its methods.
  3. for loop
  4. String methods
  5. Sorting list
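A quick self-check covering the list, dictionary, and sorting items above (the values are arbitrary):

```python
# list methods, slicing, and indexing
nums = [3, 1]
nums.append(4)          # [3, 1, 4]
nums.extend([1, 5])     # [3, 1, 4, 1, 5]
nums.insert(0, 9)       # [9, 3, 1, 4, 1, 5]
print(nums[1:4])        # [3, 1, 4]
print(nums[-1])         # 5

# dictionary methods
counts = {"a": 1, "b": 2}
print(counts.get("c", 0))        # 0 (default for a missing key)
counts["c"] = counts.get("c", 0) + 1

# sorting a list with a key
words = ["banana", "fig", "apple"]
print(sorted(words, key=len))    # ['fig', 'apple', 'banana']
```

If any line surprises you, that is the part to review.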

For the data science phone screens or onsites I have done, I was never asked about linked lists, graphs, or trees. I would say do not worry about them if you do not know them.

Post-Phone interview

Send a follow-up email to the interviewer or the HR. Then you wait.

How much time do you need for prep? About 4 hours:

website + questions = 1 hr

stats = 1 hr

machine learning = 1 hr

python basics = 1 hr

Bonus Tips:

There will always be unexpected questions. Be calm, be a nice human being, be positive and joyful. I wish you the best of luck. I know all of you will find your dream job soon.

More detailed prep material to come soon.