
I have a word2vec model that I have already trained. I have serialized it as a CSV file:

word,  v0,     v1,     ..., vN
house, 0.1234, 0.4567, ..., 0.3461
car,   0.456,  0.677,  ..., 0.3461

What I'd like to know is how I can load that word vector model in gensim and use that to train a paragraph or doc2vec model.

This Doc2Vec tutorial says I can load a model in the form of a "# C text format", but I have no idea what that actually means. What is "C text format" in the first place? But more importantly:

  • How can I load my word2vec model and use it for doc2vec training?
  • How do I build the vocabulary from my word2vec model?

Stefan Falk
  • Someone asked a similar question here: https://stackoverflow.com/questions/27470670/how-to-use-gensim-doc2vec-with-pre-trained-word-vectors?rq=1 – Anushka--x Jul 09 '18 at 15:52

1 Answer


Doc2Vec does not need word-vectors as an input: it will create any word-vectors that are needed during its own training. (And some modes, like pure DBOW – dm=0, dbow_words=0 – don't use or train word-vectors at all.)

Seeding a Doc2Vec model with word-vectors might help or hurt; there's not much theory or published results to offer guidance. There's an experimental method on Word2Vec, intersect_word2vec_format(), that can merge word2vec-c-format vectors into a model with an existing vocabulary, but you'd need to review the source to really understand its assumptions:

https://github.com/RaRe-Technologies/gensim/blob/51753b95415bbc344ea6af671818277464905ea2/gensim/models/word2vec.py#L1140
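
As for what "C text format" means: it is the plain-text layout written by the original word2vec C tool. The first line holds the vocabulary size and the vector dimensionality, and every following line holds a word and its values, all separated by single spaces. Below is a minimal sketch of converting a CSV like the one in the question into that layout. The function name `csv_to_word2vec_text` and the three-dimensional sample rows are made up for illustration, and the sketch assumes the CSV has a header row and that no word contains a space.

```python
import csv
import io


def csv_to_word2vec_text(csv_file, out_file):
    """Convert 'word,v0,...,vN' CSV rows into word2vec C text format.

    Hypothetical helper: assumes a header row and words without spaces.
    """
    reader = csv.reader(csv_file, skipinitialspace=True)
    next(reader)  # skip the 'word, v0, ..., vN' header row
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        raise ValueError("no vectors found in CSV input")

    dim = len(rows[0]) - 1
    for row in rows:
        if len(row) - 1 != dim:
            raise ValueError(f"inconsistent vector length for {row[0]!r}")

    # First line of the C text format: "<vocab_size> <vector_size>"
    out_file.write(f"{len(rows)} {dim}\n")
    # Remaining lines: the word followed by its values, space-separated
    for word, *values in rows:
        out_file.write(" ".join([word, *values]) + "\n")
    return len(rows), dim


# Illustrative sample only; real vectors would have many more columns.
sample = io.StringIO(
    "word,  v0,     v1,     v2\n"
    "house, 0.1234, 0.4567, 0.3461\n"
    "car,   0.456,  0.677,  0.3461\n"
)
converted = io.StringIO()
shape = csv_to_word2vec_text(sample, converted)
print(converted.getvalue())
```

Written to a real file instead of a `StringIO`, the result should be readable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`, and it is the same layout that `intersect_word2vec_format()` expects. Where that method lives and which arguments it takes has changed between gensim versions, so check the documentation for the version you have installed.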

gojomo
  • I cannot prove this statement, but I think the document vectors work better if one provides pre-trained word vectors. I only tested this by commenting out the intersect part and comparing the results. But thanks for providing an answer :) – Stefan Falk Jul 29 '16 at 09:52
  • Work better on what task, with how much data, with which pre-trained vectors? – gojomo Jul 29 '16 at 16:41