Problem description
I need to count the frequency of each word in a word2vec trained model. I want output that looks like this:
term count
apple 123004
country 4432180
runs 620102
...
Is it possible to do that? How would I get that data out of word2vec?
Recommended answer
Which word2vec implementation are you using?
In the popular gensim library, once a Word2Vec model has its vocabulary established (either by running its full training, or after build_vocab() has been called), the model's wv property contains a KeyedVectors-type object. In gensim releases before 4.0, that object has a vocab property, which is a dict of Vocab-type objects, each with a count property holding the word's frequency in the scanned corpus.
So you could get roughly what you seek with something like:
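from gensim.models import Word2Vec

w2v_model = Word2Vec(your_corpus, ...)
# gensim < 4.0: wv.vocab maps each word to a Vocab object with a count attribute
for word in w2v_model.wv.vocab:
    print((word, w2v_model.wv.vocab[word].count))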
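In gensim 4.0 and later the vocab dict is no longer available on KeyedVectors, so the loop above would need adjusting. The following is a minimal sketch of the same idea for those versions, assuming the object passed in exposes index_to_key and get_vecattr(key, "count") as gensim 4.x KeyedVectors does. The helper names word_counts and print_counts are made up for illustration and are not part of gensim.

```python
def word_counts(wv):
    """Return (word, count) pairs from a gensim 4.x KeyedVectors-like object.

    Assumes `wv` exposes `index_to_key` (list of words) and
    `get_vecattr(key, "count")`, as KeyedVectors does in gensim 4.0+.
    """
    return [(word, wv.get_vecattr(word, "count")) for word in wv.index_to_key]


def print_counts(wv):
    """Print a header followed by one "term count" line per vocabulary word."""
    print("term count")
    for word, count in word_counts(wv):
        print(word, count)
```

With a trained model this would be called as print_counts(w2v_model.wv). The counts are only meaningful if the vocabulary was built from a corpus in this session or restored from a full saved model.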
Plain sets of word-vectors (such as those loaded via gensim's load_word2vec_format() method) won't have accurate counts, but by convention they are usually ordered internally from most frequent to least frequent.
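When only such an ordered vocabulary is available, the position of a word in the stored order can serve as a rough proxy for relative frequency. The sketch below assumes the ordered word list comes from something like the index_to_key attribute of a loaded KeyedVectors object (gensim 4.x; older releases call it index2word). The word list shown is a hypothetical stand-in, and frequency_ranks is an illustrative helper, not a gensim function.

```python
def frequency_ranks(ordered_words):
    """Map each word to its 0-based position in the stored order.

    A lower rank suggests a more frequent word, but only if the file
    follows the most-frequent-first convention.
    """
    return {word: rank for rank, word in enumerate(ordered_words)}


# Hypothetical stand-in for vectors.index_to_key from a loaded vector file.
ordered_words = ["the", "of", "country", "runs", "apple"]
ranks = frequency_ranks(ordered_words)

# Compare two words by rank instead of by exact count.
more_frequent = min(["apple", "country"], key=ranks.__getitem__)
print(more_frequent)
```

This gives an ordering only; it cannot recover the actual counts, and it is misleading for files that were not written in frequency order.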