How to count word frequencies in a trained Word2Vec model

Question

I need to count the frequency of each word in word2vec's training model. I want to have output that looks like this:

term    count
apple   123004
country 4432180
runs    620102
...

Is it possible to do that? How would I get that data out of word2vec?

Recommended answer

Which word2vec implementation are you using?

In the popular gensim library, once a Word2Vec model has its vocabulary established (either by completing full training or by calling build_vocab()), the model's wv property holds a KeyedVectors object. That object has a vocab property, a dict of Vocab objects, each of which has a count property holding the word's frequency in the scanned corpus. This describes gensim releases before 4.0; from 4.0 onward the vocab dict was removed and the same count is read with wv.get_vecattr(word, 'count').

So you could get roughly what you seek with something like:

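from gensim.models import Word2Vec

# Uses wv.vocab, which is available in gensim releases before 4.0
w2v_model = Word2Vec(your_corpus, ...)
for word, vocab_entry in w2v_model.wv.vocab.items():
    print((word, vocab_entry.count))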
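To get the term/count table from the question regardless of gensim version, a small helper along these lines might work. It is only a sketch: word_counts and print_table are hypothetical names introduced here, and the code assumes that gensim 4.x exposes key_to_index and get_vecattr(word, "count") on the KeyedVectors object, while older releases expose the vocab dict described above.

```python
def word_counts(wv):
    """Return (word, count) pairs, most frequent first.

    `wv` is assumed to be a gensim KeyedVectors-like object
    (for example, w2v_model.wv) whose vocabulary has been built.
    """
    if hasattr(wv, "key_to_index"):
        # Assumed gensim 4.x interface: per-word attributes via get_vecattr
        pairs = [(word, wv.get_vecattr(word, "count")) for word in wv.key_to_index]
    else:
        # Assumed pre-4.0 interface: dict of Vocab objects with a count attribute
        pairs = [(word, entry.count) for word, entry in wv.vocab.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def print_table(pairs):
    """Print the pairs as a tab-separated term/count table."""
    print("term\tcount")
    for word, count in pairs:
        print(f"{word}\t{count}")
```

With a trained model this would be called as print_table(word_counts(w2v_model.wv)). The counts cover only words that survived vocabulary trimming (for example via min_count), so rare words may be missing from the table.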

Plain sets of word-vectors (such as those loaded via gensim's load_word2vec_format() method) won't have accurate counts, but are by convention usually internally ordered from most-frequent to least-frequent.
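When only such a plain vector file is available, the stored order can at least be turned into a frequency rank. The sketch below assumes the file follows the most-frequent-first convention, which is not guaranteed; frequency_ranks is a hypothetical helper name, and the word list would come from kv.index_to_key in gensim 4.x or kv.index2word in older releases.

```python
def frequency_ranks(ordered_words):
    """Map each word to its 1-based position in the stored order.

    Assumes `ordered_words` is sorted from most to least frequent,
    which holds only by convention for word2vec-format files.
    """
    return {word: rank for rank, word in enumerate(ordered_words, start=1)}
```

This yields relative ranks only, not counts: a word with rank 1 is presumed to be the most frequent, but nothing about how often it actually occurred can be recovered this way.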

07-24 16:11