Question
I am currently using the Word2Vec model trained on the Google News corpus (from here). Since it was trained only on news up to 2013, I need to update the vectors and also add new words to the vocabulary based on news published after 2013.
Suppose I have a new corpus of news from after 2013. Can I retrain, fine-tune, or update the Google News Word2Vec model? Can it be done using Gensim? Can it be done using FastText?
Recommended answer
You can take a look at: https://github.com/facebookresearch/fastText/pull/423
It does exactly what you want. Here is what the link says:
Training the classification model or word vector model incrementally.
-incr stands for incremental training.
When training word embeddings, one could train from scratch on all the data each time, or train just on the new data. For classification, one could train from scratch using pre-trained word embeddings on all the data, or on only the new data, without changing the word embeddings.
Incremental training means that, having finished training a model on the data we had before, we continue training that same model on the newer data we get, rather than starting from scratch.
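To make the idea of incremental training concrete, here is a toy sketch in plain Python. It is not the code from the fastText pull request and not Gensim's implementation; the class TinySkipGram, its methods, and the two small corpora are hypothetical names invented for illustration. It assumes a simplified skip-gram model with negative sampling, and shows the two steps that matter: new words get fresh rows appended to the vocabulary, while existing words keep their learned vectors and are updated further on the new data.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-10.0, min(10.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinySkipGram:
    """Toy skip-gram with negative sampling (illustrative, not production code)."""

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vocab = {}   # word -> row index
        self.w_in = []    # input (word) vectors, one row per word
        self.w_out = []   # output (context) vectors, one row per word

    def add_words(self, sentences):
        """Append rows for unseen words; rows of known words are left untouched."""
        for sentence in sentences:
            for word in sentence:
                if word not in self.vocab:
                    self.vocab[word] = len(self.w_in)
                    self.w_in.append(
                        [self.rng.uniform(-0.5, 0.5) / self.dim for _ in range(self.dim)]
                    )
                    self.w_out.append([0.0] * self.dim)

    def _update(self, center, context, label, lr):
        # One SGD step on the logistic loss for a (center, context) pair.
        v, u = self.w_in[center], self.w_out[context]
        score = sigmoid(sum(a * b for a, b in zip(v, u)))
        g = lr * (label - score)
        for i in range(self.dim):
            v_i = v[i]          # keep the old value for the update of u
            v[i] += g * u[i]
            u[i] += g * v_i

    def train(self, sentences, epochs=5, window=2, negatives=2, lr=0.05):
        """Continue training from the current weights; never resets them."""
        self.add_words(sentences)
        ids = list(range(len(self.w_in)))
        for _epoch in range(epochs):
            for sentence in sentences:
                idx = [self.vocab[w] for w in sentence]
                for pos, center in enumerate(idx):
                    lo, hi = max(0, pos - window), min(len(idx), pos + window + 1)
                    for ctx_pos in range(lo, hi):
                        if ctx_pos == pos:
                            continue
                        self._update(center, idx[ctx_pos], 1.0, lr)
                        # Negatives are drawn uniformly; a real implementation
                        # would sample by word frequency.
                        for _neg in range(negatives):
                            self._update(center, self.rng.choice(ids), 0.0, lr)


# Hypothetical "old" and "new" corpora, already tokenized.
old_corpus = [["stocks", "fell", "on", "friday"], ["markets", "rose", "on", "monday"]]
new_corpus = [["brexit", "vote", "shook", "markets"], ["stocks", "fell", "after", "brexit"]]

model = TinySkipGram()
model.train(old_corpus)                      # initial training
old_vocab = set(model.vocab)
before = list(model.w_in[model.vocab["stocks"]])

model.train(new_corpus)                      # incremental step: same model, new data
after = model.w_in[model.vocab["stocks"]]
```

After the second call, the vocabulary contains both the old words and the new ones such as "brexit", and the vector for "stocks" has moved from where the first round of training left it. Training only on the new data can let old vectors drift toward the new corpus, so in practice it is worth checking the updated vectors on a task you care about.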