Question
I have a ready-to-go word2vec model that I already trained. I have serialized it as a CSV file:
word, v0, v1, ..., vN
house, 0.1234, 0.4567, ..., 0.3461
car, 0.456, 0.677, ..., 0.3461
What I'd like to know is how I can load that word vector model in gensim and use it to train a paragraph (doc2vec) model.
This Doc2Vec tutorial says I can load a model in "C text format", but I have no idea what that actually means. What is "C text format" in the first place? More importantly:
- How do I load my word2vec model and use it for doc2vec training?
- How do I build the vocabulary from my word2vec model?
Recommended answer
Doc2Vec does not need word-vectors as an input: it creates any word-vectors it needs during its own training. (Some modes, such as pure DBOW with dm=0, dbow_words=0, don't use or train word-vectors at all.)
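To illustrate the point, here is a minimal sketch of training Doc2Vec directly on tokenized documents, with no pretrained vectors involved. The mode names in `MODES` and the helper `train_doc2vec` are made up for this example and are not part of gensim; the parameter names (`vector_size`, `epochs`) assume gensim 4.x, and older releases may differ.

```python
# Sketch only: MODES and train_doc2vec are illustrative names, not gensim API.
# Actually training requires gensim to be installed (pip install gensim).

# Doc2Vec constructor flags for three common modes.
MODES = {
    "pv_dm": {"dm": 1},                                # word-vectors trained alongside doc-vectors
    "pv_dbow": {"dm": 0, "dbow_words": 0},             # pure DBOW: word-vectors not trained
    "pv_dbow_with_words": {"dm": 0, "dbow_words": 1},  # DBOW plus skip-gram word training
}


def train_doc2vec(token_lists, mode="pv_dm", vector_size=50, epochs=20):
    """Train Doc2Vec from scratch on a list of token lists."""
    # Imported lazily so the mode table above is usable without gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document gets its integer position as its tag.
    corpus = [TaggedDocument(words=tokens, tags=[i])
              for i, tokens in enumerate(token_lists)]
    model = Doc2Vec(vector_size=vector_size, min_count=1,
                    epochs=epochs, **MODES[mode])
    model.build_vocab(corpus)  # vocabulary is built from the documents themselves
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Something like `model = train_doc2vec([["the", "house", "is", "big"], ["a", "red", "car"]])` followed by `model.infer_vector(["big", "car"])` would then give a vector for unseen text. A real corpus needs to be far larger than this toy input for the vectors to be meaningful.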
Seeding a Doc2Vec model with word-vectors might help or hurt; there isn't much theory or published work to offer guidance. There is an experimental method on Word2Vec, intersect_word2vec_format(), that can merge word2vec-C-format vectors into a model with an existing vocabulary, but you'd need to review the source to really understand its assumptions.
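As for what "C text format" means: it is the plain-text layout written by the original word2vec C tool. The first line holds the vocabulary size and the vector dimensionality, and every following line holds one word and its values, all separated by single spaces. A minimal sketch of converting the CSV from the question into that layout, using only the standard library, might look like this. It assumes the first CSV row is a header (`word, v0, ...`) and that no word contains a space; the function name is made up for this example.

```python
import csv


def csv_to_word2vec_text(csv_path, out_path):
    """Convert a `word, v0, ..., vN` CSV (with header row) to word2vec C text format.

    Returns (vocab_size, vector_size).
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # assumption: first row is the header, so skip it
        rows = [[cell.strip() for cell in row] for row in reader if row]

    if not rows:
        raise ValueError("no word vectors found in " + csv_path)

    dim = len(rows[0]) - 1  # every column after the word is a vector component
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"{len(rows)} {dim}\n")  # header line: vocab size, dimensionality
        for word, *values in rows:
            out.write(" ".join([word, *values]) + "\n")
    return len(rows), dim
```

The resulting file should be readable with gensim's `KeyedVectors.load_word2vec_format(out_path, binary=False)`. Where `intersect_word2vec_format()` lives depends on the gensim version (on the model itself in older releases, on `model.wv` in 4.x, as far as I know), so check the documentation for the version you have installed.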