Why does the tf-idf model in gensim drop terms and counts after I transform the corpus?
My code:
from gensim import models

# Let's say you have a corpus made up of 4 documents.
# Each document is a list of (term_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]
corpus = [doc0, doc1, doc2, doc3]

# Train a tfidf model using the corpus
tfidf = models.TfidfModel(corpus)

# Now if you print the corpus, it still remains as the flat frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus into tfidf, re-initialize the corpus
# according to the model to get the normalized frequencies.
corpus = tfidf[corpus]
for d in corpus:
    print(d)
Output:
[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]
[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]
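Best answer
IDF is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. In your case every document contains term 0, so the IDF of term 0 is log(4/4) = log(1), which equals 0. So in your document-term matrix, the column for term 0 is all zeros. gensim returns sparse vectors and leaves out zero-weight entries, which is why term 0 disappears from the output and why doc1, which contains only term 0, comes back empty.
A term that appears in every document carries no weight, because it tells you nothing that distinguishes one document from another.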
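To see the arithmetic, here is a minimal sketch that recomputes the weights by hand without gensim. The helper names (document_frequencies, tfidf_transform) are made up for this illustration, and the base-2 logarithm, unit-length normalization, and small eps cutoff are assumptions meant to mirror what I understand gensim's defaults to be, not a copy of its implementation.

```python
import math

# Same bag-of-words corpus as in the question: (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def document_frequencies(corpus):
    """Count, for each term id, how many documents contain it."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_transform(corpus, eps=1e-12):
    """Hypothetical helper: tf * idf, L2-normalized, zero weights dropped."""
    n_docs = len(corpus)
    df = document_frequencies(corpus)
    # Assumption: idf = log2(total documents / documents containing the term).
    idf = {t: math.log2(n_docs / d) for t, d in df.items()}

    transformed = []
    for doc in corpus:
        weighted = [(t, tf * idf[t]) for t, tf in doc]
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        if norm > 0:
            weighted = [(t, w / norm) for t, w in weighted]
        # Keep only entries whose weight is meaningfully non-zero.
        transformed.append([(t, w) for t, w in weighted if abs(w) > eps])
    return transformed, idf


result, idf = tfidf_transform(corpus)
print("idf:", idf)
for doc in result:
    print(doc)
```

Term 0 appears in 4 of 4 documents, so its idf is log2(4/4) = 0 and every weight for it is 0 no matter how large the count is (including the count of 3 in doc3). Term 1 appears in 3 of 4 documents, so its idf is positive; after normalization it is the only surviving entry in each vector, which is why its weight comes out as 1.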
Source: Stack Overflow, https://stackoverflow.com/questions/15036048/