
I don't understand how word vectors are involved at all in the training process with gensim's doc2vec in DBOW mode (dm=0). I know that word-vector training is disabled by default (dbow_words=0). But what happens when we set dbow_words to 1?
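
For concreteness, a minimal sketch of the two configurations I mean, using gensim's toy `common_texts` corpus (the parameter values are arbitrary):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

# Toy corpus: tag each document with its integer index
docs = [TaggedDocument(words, tags=[i]) for i, words in enumerate(common_texts)]

# Plain PV-DBOW: dm=0, dbow_words=0 (the default)
model_plain = Doc2Vec(docs, vector_size=50, window=5, min_count=1, epochs=40,
                      dm=0, dbow_words=0)

# The mode I'm asking about: dm=0 with dbow_words=1
model_mixed = Doc2Vec(docs, vector_size=50, window=5, min_count=1, epochs=40,
                      dm=0, dbow_words=1)
```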

In my understanding of DBOW, the context words are predicted directly from the paragraph vectors. So the only parameters of the model are the N p-dimensional paragraph vectors plus the parameters of the classifier.

But multiple sources hint that it is possible in DBOW mode to co-train word and doc vectors.

So, how is this done? Any clarification would be much appreciated!

Note: for DM, the paragraph vectors are averaged/concatenated with the word vectors to predict the target words. In that case, it's clear that word vectors are trained simultaneously with document vectors. And there are N*p + M*q + classifier parameters (where M is the vocabulary size and q the word-vector dimension).
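
To make that parameter count concrete, here is a back-of-the-envelope calculation with purely made-up sizes, assuming negative sampling (where the "classifier" amounts to one output vector per vocabulary word) and gensim's averaging mode (so p equals q):

```python
# Hypothetical sizes, for illustration only
N, p = 100_000, 300   # number of documents, doc-vector dimension
M, q = 50_000, 300    # vocabulary size, word-vector dimension

doc_vector_params = N * p    # paragraph-vector table
word_vector_params = M * q   # word-vector (context) table, used in DM
classifier_params = M * q    # output layer: one q-dim vector per vocab word (negative sampling)

total = doc_vector_params + word_vector_params + classifier_params
print(f"{total:,} trainable parameters")   # 60,000,000 for these made-up sizes
```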

Antoine

1 Answer


If you set dbow_words=1, then skip-gram word-vector training is added to the training loop, interleaved with the normal PV-DBOW training.

So, for a given target word in a text, first the candidate doc-vector is used (alone) to try to predict that word, with backpropagation adjustments then made to the model & doc-vector. Then, each of the surrounding words is used, one at a time in skip-gram fashion, to try to predict that same target word – with follow-up adjustments made.

Then, the next target word in the text gets the same PV-DBOW plus skip-gram treatment, and so on, and so on.
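
In other words, the order of micro-examples looks roughly like the sketch below. This is only a Python-style illustration of the interleaving, not gensim's actual optimized implementation; `train_pair` is a hypothetical stand-in for the real gradient-update step:

```python
import random

def train_document_dbow_words(doc_tag, words, window, train_pair):
    """Sketch of the per-document training order when dm=0, dbow_words=1."""
    for pos, target in enumerate(words):
        # 1) PV-DBOW step: the doc-vector alone tries to predict the target word
        train_pair(input_key=doc_tag, predicted_word=target)

        # 2) Skip-gram steps: each nearby word, one at a time, tries to
        #    predict that same target word
        reduced = random.randint(1, window)   # gensim uses a randomly shrunk window
        start, end = max(0, pos - reduced), min(len(words), pos + reduced + 1)
        for ctx in range(start, end):
            if ctx != pos:
                train_pair(input_key=words[ctx], predicted_word=target)
```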

As some logical consequences of this:

  • training takes longer than plain PV-DBOW, by roughly a factor equal to the `window` parameter

  • word-vectors overall wind up getting more total training attention than doc-vectors, again by roughly a factor equal to the `window` parameter
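
Using the two toy models from the snippet in the question (`model_plain` and `model_mixed`), the practical difference shows up in the word-vectors: both models allocate them, but only the `dbow_words=1` model gives them skip-gram training, while doc-vectors are trained in both modes:

```python
# Word-vectors exist in both models, but only model_mixed actually trained them
print(model_plain.wv.most_similar('computer'))   # neighbours of an untrained, random vector
print(model_mixed.wv.most_similar('computer'))   # neighbours of a skip-gram-trained vector

# Doc-vectors are trained in both modes, and can be looked up or inferred either way
print(model_plain.docvecs[0])                    # model_plain.dv[0] in gensim 4.x
print(model_mixed.infer_vector(['human', 'computer', 'interface']))
```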

gojomo
  • many thanks for the fast and helpful answer! (1) I understand that in this setting, word and doc vectors are indeed trained at the same time, but they don't interact. Hence, in terms of quality there is probably no improvement vs. training word and doc vectors separately? (2) I conclude that when `dm=0` and `dbow_words=0`, word vectors are still created but never used/trained. Do you know by any chance how to get rid of them to reduce model size on disk and RAM? – Antoine Apr 10 '19 at 08:23
  • elaborating on (1): I probably misunderstood something, but doesn't your explanation that word and doc vectors are trained simultaneously but without interacting contradict the results presented in this [paper](https://www.aclweb.org/anthology/W16-1609) (section 5) that pre-training word vectors improves the quality of the dbow doc vectors? If there is no leak between the two tasks, this shouldn't change anything, no? – Antoine Apr 10 '19 at 12:51
  • 1
    There's no supported way to discard the allocated, untrained word-vectors in the `dbow_words=0` case. If you're done with both training and inference (which is also a kind of training), and *only* need to access trained-up doc-vectors, you could possibly `del` the associated `d2v_model.wv` property - but that *might* prevent other `save()`/`load()` operations from working, I'm not sure. – gojomo Apr 10 '19 at 17:39
  • 1
    In `dbow_words=1` mode, word-vectors are trained with some (context_word->target_word) pairs, then doc-vectors are trained with (doc_tag->target_word) pairs), then that's repeated in interleaved fashion. So no individual micro-training-examples involves both. But that's also the case between many words, in normal word training - but the words still wind up in useful relative positions. That's because all training examples share the same hidden->output layer of the neural network. Thus, the contrasting examples are each changing some shared parameters, and *indirectly* affect each other. – gojomo Apr 10 '19 at 17:43
  • 1
    Whether adding `dbow_words` helps or hurts will be very specific to your data, goals, and meta-parameters. Whether seeding a `Doc2Vec` model with pre-trained word-vectors helps – an option for which there is no official `gensim` support – will depend on how well that pre-trained vocabulary suits your documents, and the model mode. And in `dbow_words=0` mode, pre-loaded word-vectors *can't* have any effect, direct or indirect, on the doc-vectors - to the extent that paper suggests that, it is confused. (I also make this point at: https://groups.google.com/d/msg/gensim/4-pd0iA_xW4/UzpuvBOPAwAJ ) – gojomo Apr 10 '19 at 17:48
  • 1
    You can find more of my concerns about the specific claims/tests/gaps of that paper in some discussion at a project github issue – starting at https://github.com/RaRe-Technologies/gensim/issues/1270#issuecomment-293418459 – and in other discussion-group links from that issue. – gojomo Apr 10 '19 at 17:52
  • thank you so much for your time providing such detailed explanations and useful links, it is greatly appreciated. Indeed, you're right about the indirect influence thing. I was not considering the fact that the projection->output matrix is shared both by word and doc vectors. Thanks again! – Antoine Apr 11 '19 at 08:34
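
For reference, a minimal sketch of the size-reduction idea from the comments: rather than `del`-ing parts of the model, copy the trained doc-vectors out into plain numpy arrays once training/inference is finished and persist only those (continuing from the toy `model_plain` and `docs` above; `model.docvecs` is the gensim 3.x spelling, `model.dv` in 4.x):

```python
import pickle
import numpy as np

# Copy each trained doc-vector out of the model into an ordinary dict
doc_tags = range(len(docs))      # the tags used at training time (here: 0..N-1)
doc_vectors = {tag: np.array(model_plain.docvecs[tag]) for tag in doc_tags}

# Persist only the doc-vectors; the full model, word-vectors included, can then be discarded
with open('doc_vectors.pkl', 'wb') as f:
    pickle.dump(doc_vectors, f)
```

This gives up later `infer_vector()` calls, which need the full model, but avoids carrying the allocated-but-untrained word-vectors on disk and in RAM.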