Question
Just reading through the doc2vec commands on the gensim page. I am curious about the command "intersect_word2vec_format".
My understanding of this command is that it lets me inject vector values from a pretrained word2vec model into my doc2vec model, and then train my doc2vec model using the pretrained word2vec values rather than generating the word-vector values from my document corpus. The result is that I get a more accurate doc2vec model, because I am using pretrained w2v values that were generated from a much larger corpus of data than my relatively small document corpus.
Is my understanding of this command correct or not even close? ;-)
Answer
Yes, intersect_word2vec_format() will let you bring vectors from an external file into a model that's already had its own vocabulary initialized (as if by build_vocab()). That is, it will only load the vectors for words that are already in the local vocabulary.
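For illustration, here's a minimal sketch of that flow. It assumes a gensim 3.x-era API in which Doc2Vec exposes intersect_word2vec_format() directly (the method's availability and location have shifted across gensim releases), a toy corpus, and a placeholder vectors file 'external-vectors.bin':

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny stand-in corpus; in practice this is your own document collection.
documents = [
    TaggedDocument(words=['machine', 'learning', 'is', 'fun'], tags=['doc0']),
    TaggedDocument(words=['gensim', 'makes', 'topic', 'modeling', 'easy'], tags=['doc1']),
]

# vector_size must match the dimensionality of the external vectors;
# min_count=1 only so the toy corpus survives vocabulary trimming.
model = Doc2Vec(vector_size=300, dm=1, epochs=20, min_count=1)

# The model's own vocabulary must be initialized first ...
model.build_vocab(documents)

# ... then the external vectors are merged in. Only words already present
# in the local vocabulary get their vectors replaced; everything else
# keeps its random initialization. The file path is a placeholder for any
# word2vec-format file.
model.intersect_word2vec_format('external-vectors.bin', binary=True)
```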
Additionally, by default it will lock those loaded vectors against any further adjustment during subsequent training, though other words in the pre-existing vocabulary may continue to update. (You can change this behavior by supplying a lockf=1.0 value instead of the default 0.0.)
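Continuing the sketch above (same assumed API and placeholder file), the lockf argument is what controls whether the imported vectors keep training:

```python
# lockf=0.0 (the default) freezes the imported vectors during training;
# lockf=1.0 lets them receive full training updates, like any word that
# started from random initialization.
model.intersect_word2vec_format('external-vectors.bin', binary=True, lockf=1.0)

model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)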
However, this is best considered an experimental function; what benefits it might offer, if any, will depend on lots of things specific to your setup.
The PV-DBOW Doc2Vec mode, corresponding to the dm=0 parameter, is often a top performer in speed and doc-vector quality, and doesn't use or train word-vectors at all – so any pre-loaded vectors won't have any effect.
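As an illustration of that point, a plain PV-DBOW model (sketched below under the same assumptions as above) never consults word-vectors during training, so an intersect step would be a no-op in effect:

```python
# PV-DBOW (dm=0): doc-vectors predict words directly; word-vectors are
# neither consulted nor updated (unless dbow_words=1 is also set), so any
# pre-loaded word-vectors would simply sit unused.
dbow_model = Doc2Vec(vector_size=300, dm=0, epochs=20, min_count=1)
dbow_model.build_vocab(documents)
dbow_model.train(documents, total_examples=dbow_model.corpus_count,
                 epochs=dbow_model.epochs)
```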
The PV-DM mode, enabled by the default dm=1 setting, trains any word-vectors it needs simultaneously with doc-vector training. (That is, there's no separate phase where word-vectors are created first, and thus for the same number of iter passes, PV-DM training takes the same amount of time whether word-vectors start with default random values or are pre-loaded from elsewhere.) Pre-seeding the model with word-vectors from elsewhere might help or hurt final quality – it's likely to depend on the specifics of your corpus, meta-parameters, and goals, and on whether those external vectors represent word-meanings in sync with the current corpus and goals.
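Putting it together for PV-DM, here is a sketch of the pre-seeded pipeline end to end, under the same assumptions (gensim 3.x-era API, toy corpus, placeholder vectors file):

```python
# PV-DM (dm=1, the default): word-vectors and doc-vectors train together,
# so pre-seeded word-vectors can nudge the final doc-vectors, for better
# or worse. Training time is the same whether vectors start random or
# pre-loaded, since there is no separate word-vector phase.
pvdm_model = Doc2Vec(vector_size=300, dm=1, epochs=20, min_count=1)
pvdm_model.build_vocab(documents)
pvdm_model.intersect_word2vec_format('external-vectors.bin', binary=True)
pvdm_model.train(documents, total_examples=pvdm_model.corpus_count,
                 epochs=pvdm_model.epochs)

# The trained model is then used as usual, e.g. to infer a vector
# for an unseen document.
new_vector = pvdm_model.infer_vector(['machine', 'learning', 'with', 'gensim'])
```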