This article describes how to incrementally train a word2vec model on new vocabulary; it may be a useful reference for anyone facing the same problem.

Problem Description

I have a dataset of over 40 GB. My tokenizer process gets killed due to limited memory, so I am trying to split the dataset. How can I train a word2vec model incrementally, that is, how can I use separate datasets to train one word2vec model?

My current word2vec code is:

import gensim

# The constructor already trains on `documents`; train() below runs extra epochs.
model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=1, workers=10)
model.train(documents, total_examples=len(documents), epochs=epochs)
model.save("./word2vec150d/word2vec_{}.model".format(epochs))

Any help would be appreciated!

Recommended Answer

I found the solution: use PathLineSentences. It is very fast. Incrementally training a word2vec model cannot learn new vocabulary, but PathLineSentences can, because it builds the vocabulary over the entire split corpus in one pass.
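For context, here is a minimal sketch of the incremental route the answer rules out; first_batch and second_batch are hypothetical lists of pre-tokenized sentences. The vocabulary is frozen after the initial build, and later train() calls silently skip any token not already in it, which is why this approach cannot pick up new vocabulary:

from gensim.models import Word2Vec

# Hypothetical batches of pre-tokenized sentences
first_batch = [["hello", "world"], ["split", "dataset"]]
second_batch = [["entirely", "new", "tokens"]]

# The vocabulary is built from first_batch only; tokens in second_batch
# that are missing from it are silently skipped during train().
model = Word2Vec(first_batch, size=100, window=5, min_count=1, workers=4)
model.train(second_batch, total_examples=len(second_batch), epochs=5)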

import multiprocessing
from gensim.models.word2vec import Word2Vec, PathLineSentences

# Stream every file under input_dir as one corpus (read in alphabetical order)
model = Word2Vec(PathLineSentences(input_dir), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
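Note that PathLineSentences expects a directory of plain-text files, each containing one sentence per line with tokens separated by whitespace; since it reads all the files as a single corpus, the vocabulary is built over every split at once. (In gensim 4.x, the size and iter parameters were renamed to vector_size and epochs.)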

For a single file, use LineSentence.

import multiprocessing
from gensim.models.word2vec import Word2Vec, LineSentence

# LineSentence streams one whitespace-tokenized sentence per line from a file
model = Word2Vec(LineSentence(file), size=100, window=5, min_count=5,
                 workers=multiprocessing.cpu_count() * 2, iter=20, sg=1)
...
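Since the original bottleneck was the tokenizer running out of memory, it may also help to stream the tokenizer output to disk one line at a time, producing exactly the one-sentence-per-line files the iterators above expect. This is a rough sketch under assumed names: tokenize, raw_files, and out_dir are hypothetical.

import os

def write_corpus(raw_files, out_dir, tokenize):
    # Convert each raw file into a whitespace-tokenized, one-sentence-per-line
    # text file that LineSentence / PathLineSentences can read.
    os.makedirs(out_dir, exist_ok=True)
    for i, path in enumerate(raw_files):
        with open(path, encoding="utf-8") as src, \
             open(os.path.join(out_dir, "part_{:04d}.txt".format(i)), "w",
                  encoding="utf-8") as dst:
            for line in src:               # never loads the whole 40 GB at once
                dst.write(" ".join(tokenize(line)) + "\n")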
