
For gensim (1.0.1) Doc2Vec, I am trying to load the Google pre-trained word vectors instead of using Doc2Vec.build_vocab:

import gensim
from gensim.models.doc2vec import Doc2Vec

wordVec_google = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model0 = Doc2Vec(size=300, alpha=0.05, min_alpha=0.05, window=8, min_count=5, workers=4, dm=0, hs=1)
model0.wv = wordVec_google  # replace the model's word-vectors with the pretrained set
# some other code
model0.build_vocab(sentences=allEmails, max_vocab_size=20000)

but this model0 object cannot then be trained further with "labeled docs", and cannot infer vectors for documents.

Does anyone know how to use Doc2Vec with the Google pretrained word vectors?
I tried this post: http://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ but loading the file into a gensim.models.Word2Vec object does not work; perhaps it was written for a different gensim version.

user3527917
  • a similar question has several answers on how to load pretrained vectors into the Doc2Vec model: https://stackoverflow.com/questions/27470670/how-to-use-gensim-doc2vec-with-pre-trained-word-vectors?rq=1 – Anushka--x Jul 09 '18 at 16:11

1 Answer


The GoogleNews vectors are just raw vectors - not a full Word2Vec model.

Also, the gensim Doc2Vec class does not have general support for loading pretrained word-vectors. The Doc2Vec algorithm doesn't need pre-trained word-vectors: only some modes even use word-vectors, and where they do, those word-vectors are trained simultaneously with the doc-vectors, as needed.

Specifically, the mode your code is using, dm=0, is the 'Paragraph Vectors' PV-DBOW mode, which does not use word-vectors at all. So even if there were a function to load them, they'd be loaded and then ignored during training and inference. (You would need to use PV-DM, dm=1, or add skip-gram word-training to PV-DBOW, dm=0, dbow_words=1, for such reused vectors to have any relevance to your training.)
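
For example, a minimal sketch of the two modes in which word-vectors actually participate (parameter names as in gensim 1.0.x; size became vector_size in later releases):

from gensim.models.doc2vec import Doc2Vec

# PV-DM (dm=1): word-vectors and doc-vectors are trained together
model_dm = Doc2Vec(size=300, window=8, min_count=5, workers=4, dm=1)

# PV-DBOW with interleaved skip-gram word-training (dm=0, dbow_words=1)
model_dbow = Doc2Vec(size=300, window=8, min_count=5, workers=4, dm=0, dbow_words=1)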

Why do you think you want/need to use pre-trained vectors? (Especially a set of 3 million word-vectors, from another kind of data, when a later step suggests you only care about a vocabulary of 20,000 words?)

If for some reason you feel sure you want to initialize Doc2Vec with word-vectors from elsewhere, and use a training mode where that would have some effect, you can look into the intersect_word2vec_format() method that gensim Doc2Vec inherits from Word2Vec.

That method specifically needs to be called after build_vocab() has already learned the corpus-specific vocabulary, and it only brings in the words from the outside source that are locally relevant. It's at best an advanced, experimental feature – see its source code, doc-comments, and discussion on the gensim list to understand its side-effects and limitations.
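
A minimal sketch of that workflow, assuming a hypothetical raw_texts list of email strings (and noting that later gensim versions also require an explicit epochs argument to train()):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# hypothetical corpus: one TaggedDocument per email
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(raw_texts)]

model = Doc2Vec(size=300, window=8, min_count=5, workers=4, dm=1)

# learn the corpus-specific vocabulary first
model.build_vocab(documents)

# then merge in GoogleNews vectors for words already in that vocabulary;
# lockf=1.0 lets the imported vectors continue to adjust during training
model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                binary=True, lockf=1.0)

model.train(documents, total_examples=model.corpus_count)

doc_vector = model.infer_vector('some new email text'.split())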

gojomo