In this blog, I will briefly talk about what word2vec is, how to train your own word2vec model, how to load Google's pre-trained word2vec model, and how to update it with the gensim package in Python.
What is word2vec?
If you have ever been involved in building a text classifier, you have probably heard of word2vec. Word2vec was created by a team of researchers led by Tomáš Mikolov at Google. It is an unsupervised learning algorithm that works by predicting a word's context words (or a word from its context) with a shallow two-layer neural network. To understand more about how word2vec works under the hood, you can refer to the YouTube video by Stanford University.
The end result of word2vec is that it can convert words into vectors.
There are two algorithms to generate the embeddings: the continuous bag-of-words (CBOW) approach, which predicts a word from its surrounding context, and skip-gram, which predicts the context words from a given word. They are depicted in the graph below.
How to train a word2vec?
from gensim.models import Word2Vec

sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
             ['if', 'you', 'have', 'think', 'about', 'it']]
model = Word2Vec(sentences, size=10, window=5, min_count=1)
There are more options for the training: the size option determines the dimensionality of the word vectors, the window option is the maximum number of words before and after the target word that are used as context during training, and min_count is the minimum number of times a word needs to appear in the corpus to be included in the vocabulary.
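As a quick illustration, here is a minimal sketch of how you might inspect the trained vectors and switch between the two algorithms with gensim's sg option (the word looked up below is just an example from the toy sentences above):

# look up the learned vector for a word in the vocabulary
vec = model.wv['word2vec']
print(vec.shape)  # (10,) since we set size = 10 above

# sg=1 trains with skip-gram instead of the default CBOW (sg=0)
skipgram_model = Word2Vec(sentences, size=10, window=5, min_count=1, sg=1)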
Save and load the model
After you have finished training the model, you can save it as follows:
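# save the trained model to disk (the filename here is arbitrary)
model.save("word2vec.model")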
If you want to reload the model into workspace again, you can load it as follows:
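# load the saved model back into the workspace
model = Word2Vec.load("word2vec.model")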
For more details on how to use the gensim package for word2vec models, see the tutorial by the gensim author.
How to update the word2vec model?
from gensim.models import Word2Vec

old_sentences = [["bad", "robots"], ["good", "human"]]
new_sentences = [['yes', 'this', 'is', 'the', 'word2vec', 'model'],
                 ['if', 'you', 'have', 'think', 'about', 'it']]

# train a model on the old sentences and save it
old_model = Word2Vec(old_sentences, size=10, window=5, min_count=1, workers=2)
old_model.wv.vocab
old_model.save("old_model")

# load the saved model, add the new words to its vocabulary, and keep training
new_model = Word2Vec.load("old_model")
new_model.build_vocab(new_sentences, update=True)
new_model.train(new_sentences, total_examples=2, epochs=1)
new_model.wv.vocab
These are the results of the above code.
The Google pre-trained word2vec model
Google has published a pre-trained word2vec model. It is trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. Here is the download link for Google's pre-trained 300-dimensional word vectors: GoogleNews-vectors-negative300.bin.gz. The binary file (GoogleNews-vectors-negative300.bin) is 3.4 GB when unzipped. For more information about the word2vec model published by Google, see the link here.
After downloading it, you can load it as follows (assuming the file is in the same directory as your Python script or Jupyter notebook):
from gensim.models import KeyedVectors

filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
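Once loaded, you can query the vectors like any other gensim KeyedVectors object. A quick sanity check might look like this (the word "king" is just an example):

# each word maps to a 300-dimensional vector
print(model['king'].shape)  # (300,)

# find the words whose vectors are closest to a given word's vector
print(model.most_similar('king', topn=3))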
“Transfer learning” on Google pre-trained word2vec
Update your word2vec with Google’s pre-trained model
It is a powerful pre-trained model, but there is one downside: you cannot continue training it, since the published file lacks the hidden weights, vocabulary frequencies, and the binary tree. Therefore, it is not possible right now to do transfer learning on Google's pre-trained model directly.
Separately, you might have a customized word2vec model that you trained yourself, and you might worry that the vectors it learned for some common words are not good enough.
To solve the above problem, you can replace the word vectors in your model with the vectors from Google's word2vec model using a method called intersect_word2vec_format.
See the documentation here for more details on this new method.
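Here is a minimal sketch of that replacement (the sentences and file path are placeholders; note that size must be 300 to match the dimensionality of Google's vectors):

from gensim.models import Word2Vec

sentences = [["bad", "robots"], ["good", "human"]]

# size must be 300 to match Google's pre-trained vectors
my_model = Word2Vec(sentences, size=300, window=5, min_count=1)

# overwrite the vectors of words that also appear in Google's model;
# with the default lockf=0.0, the imported vectors stay fixed in any later training
my_model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)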
The method described above is not exactly transfer learning but it is quite similar.
How to really do transfer learning using Google’s pre-trained model for your customized dataset?
It is often argued that there is little to no benefit in continuing to train a word2vec model.
Imagine a situation where you have a small customized dataset, so the word2vec model you trained on it is not good enough, but you also worry that the pre-trained vectors do not really make sense for some common words in your domain. I argue that this is a pretty common situation, and it is the same reason transfer learning on convolutional neural networks is so popular.
Here is the code to accomplish that.
from gensim.models import Word2Vec

sentences = [["bad", "robots"], ["good", "human"],
             ['yes', 'this', 'is', 'the', 'word2vec', 'model']]

# size needs to be set to 300 to match Google's pre-trained model
word2vec_model = Word2Vec(size=300, window=5, min_count=1, workers=2)
word2vec_model.build_vocab(sentences)

# assign Google's pre-trained vectors to the words that appear both in
# Google's model and in your sentences defined above;
# lockf needs to be set to 1.0 to allow continued training
word2vec_model.intersect_word2vec_format('./word2vec/GoogleNews-vectors-negative300.bin',
                                         lockf=1.0, binary=True)

# continue training with your own data
word2vec_model.train(sentences, total_examples=3, epochs=5)
Here is the result of the above code.