Problem description
I am using the pre-trained Google News vectors to obtain word vectors through the Gensim library in Python:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of the training review sentences into vectors:
# read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
# clean the sentences
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
During this word2vec step I get many errors for words in my corpus that are not in the model's vocabulary. The question is how I can retrain an already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) in order to get word vectors for those missing words.
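Before deciding how to retrain, it can help to see which corpus words are actually missing. The following is a minimal sketch and not part of the original post: `find_missing_words` is a hypothetical helper, the toy data is invented, and `vocab` stands for any container that supports `in` (a plain set here; Gensim's loaded vectors also support membership tests).

```python
from collections import Counter


def find_missing_words(tokenized_sentences, vocab):
    """Count corpus tokens that are absent from `vocab`.

    `vocab` is any container supporting `in` (a hypothetical stand-in
    for the pre-trained model's vocabulary).
    """
    missing = Counter()
    for tokens in tokenized_sentences:
        for token in tokens:
            if token not in vocab:
                missing[token] += 1
    return missing


# Toy data standing in for the cleaned restaurant reviews.
sentences = [["great", "biryani", "and", "rice"], ["biryani", "was", "cold"]]
pretrained_vocab = {"great", "and", "rice", "was", "cold"}

missing = find_missing_words(sentences, pretrained_vocab)
print(missing.most_common())
```

Sorting the missing words by frequency shows whether the gaps are a handful of domain terms or a large share of the corpus, which bears on whether retraining is worth the effort.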
Here is what I have tried. First, I trained a new model from the training sentences I had:
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 10 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print("Training model...")
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)
model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])
It worked, but the problem is that my dataset is very small and I have limited resources for training a large model.
The second approach I am considering is to extend an already trained model such as GoogleNews-vectors-negative300.bin:
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)
Is this possible, and is it a good approach? Please help me out.
Recommended answer
This is how I solved the issue technically:
Prepare the data input with the sentence iterable from Radim Rehurek's tutorial: https://rare-technologies.com/word2vec-tutorial/
sentences = MySentences('newcorpus')
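`MySentences` is not defined in the answer. A minimal sketch of such an iterable, modeled on the one in the linked tutorial, might look as follows. It assumes that `newcorpus` is a flat directory of plain-text files with one sentence per line and whitespace-separated tokens; those details are assumptions, not something the answer states.

```python
import os


class MySentences:
    """Stream tokenized sentences from every file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # A fresh pass over the files starts on each iteration, so the
        # corpus can be read several times (vocabulary scan, then training).
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding="utf-8") as infile:
                for line in infile:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens
```

Because `__iter__` restarts from the first file each time, the object can be iterated repeatedly, which a one-shot generator cannot.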
Set up the model:
model = gensim.models.Word2Vec(sentences)
Intersect the vocabulary with the Google word vectors:
model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)
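To make the effect of the intersection step concrete, here is a conceptual sketch in plain Python. It is not Gensim's implementation; `intersect_vectors` and the dict-based toy data are hypothetical. It only illustrates the behavior described in the Gensim documentation: words present in both vocabularies adopt the pre-trained vector, words found only in the local vocabulary keep their initial vector, and no words are added from the pre-trained file.

```python
def intersect_vectors(local_vectors, pretrained_vectors):
    """Return a copy of `local_vectors` in which shared words take the
    pre-trained vector. Words found only in `pretrained_vectors` are
    ignored, so the vocabulary never grows."""
    merged = {}
    for word, vector in local_vectors.items():
        if word in pretrained_vectors:
            merged[word] = list(pretrained_vectors[word])
        else:
            merged[word] = list(vector)
    return merged


# Toy 2-dimensional vectors; the Google News vectors have 300 dimensions.
local = {"food": [0.1, 0.2], "biryani": [0.3, 0.4]}
pretrained = {"food": [0.9, 0.8], "newspaper": [0.5, 0.5]}

merged = intersect_vectors(local, pretrained)
print(merged)
```

In this toy case `biryani`, which the pre-trained file lacks, keeps its local starting vector and can only be learned from the new corpus, while `newspaper` is not imported. According to the Gensim documentation, `lockf=1.0` leaves the imported vectors trainable in the subsequent training call, whereas the default of 0.0 would freeze them.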
Finally, run training to update the model:
model.train(sentences)
A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus.