Understanding the LDA-transformed corpus in Gensim

Question

I tried to compare the contents of the BOW corpus with LDA[BOW corpus], that is, the corpus transformed by an LDA model trained on it with, say, 35 topics. I got the following output:

DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)]
LDA 1 : [(29, 0.80571428571428572)]
DOC 2 : [(1522, 1), (5364, 1), (6202, 1), (6661, 1), (6983, 1)]
LDA 2 : [(29, 0.83809523809523812)]
DOC 3 : [(3079, 1), (3395, 1), (4874, 1)]
LDA 3 : [(34, 0.75714285714285712)]
DOC 4 : [(1482, 1), (2806, 1), (3988, 1)]
LDA 4 : [(22, 0.50714288283121989), (32, 0.25714283145449457)]
DOC 5 : [(440, 1), (533, 1), (1264, 1), (2433, 1), (3012, 1), (3902, 1), (4037, 1), (4502, 1), (5027, 1), (5723, 1)]
LDA 5 : [(12, 0.075870715371114297), (30, 0.088821329943986921), (31, 0.75219107156801579)]
DOC 6 : [(705, 1), (3156, 1), (3284, 1), (3555, 1), (3920, 1), (4306, 1), (4581, 1), (4900, 1), (5224, 1), (6156, 1)]
LDA 6 : [(6, 0.63896110435842401), (20, 0.18441557445724915), (28, 0.09350643806744402)]
DOC 7 : [(470, 1), (1434, 1), (1741, 1), (3654, 1), (4261, 1)]
LDA 7 : [(5, 0.17142855723258577), (13, 0.17142856888458904), (19, 0.50476192150187316)]
DOC 8 : [(2227, 1), (2290, 1), (2549, 1), (5102, 1), (7651, 1)]
LDA 8 : [(12, 0.16776844589094803), (19, 0.13980868559963203), (22, 0.1728575716782704), (28, 0.37194624921210206)]

Here, DOC N is a document from the BOW corpus and LDA N is the transformation of DOC N by that LDA model.

Am I correct in understanding the output for each transformed document "LDA N" as the topics that document N belongs to? On that reading, some documents (4, 5, 6, 7 and 8) belong to more than one topic. DOC 8, for example, belongs to topics 12, 19, 22 and 28 with the respective probabilities.

Could you please explain the output of LDA N and correct my understanding of it? I ask especially because, in another thread, the creator of Gensim himself mentioned that a document belongs to ONE topic.

Recommended answer

Your understanding of the LDA output from gensim is correct. What you need to remember, though, is that LDA[corpus] only outputs the topics whose probability exceeds a certain threshold (set when you initialise the model).
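To illustrate why LDA 1 lists a single topic even though the model has 35, here is a minimal sketch in plain Python that mimics the thresholding without calling gensim. As far as I know, in recent gensim versions the threshold is the minimum_probability argument of LdaModel with a default of 0.01; treat both the name and the default as assumptions for your version. The even spread of the remaining probability mass is also an assumption made for the example, and apply_threshold is a hypothetical helper, not a gensim function.

```python
# Sketch only: mimics the thresholding in plain Python, without gensim.
NUM_TOPICS = 35  # number of topics mentioned in the question


def apply_threshold(full_dist, minimum_probability=0.01):
    """Keep only (topic_id, probability) pairs at or above the threshold."""
    return [(t, p) for t, p in enumerate(full_dist) if p >= minimum_probability]


# Assumption: topic 29 holds the mass reported for LDA 1, and the remainder
# is spread evenly over the other 34 topics.
top_topic, top_prob = 29, 0.80571428571428572
rest = (1.0 - top_prob) / (NUM_TOPICS - 1)
full_dist = [top_prob if t == top_topic else rest for t in range(NUM_TOPICS)]

kept = apply_threshold(full_dist)
print(kept)  # only topic 29 survives; every other topic is below 0.01
```

Under these assumptions each of the other 34 topics carries about 0.0057, which is why the printed probabilities for a document do not add up to 1.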

Whether a document belongs to ONE topic is a decision you need to make on your own. LDA gives you a distribution over the topics for each document you feed into it (*). You then need to decide whether a document having, for instance, 50% of a topic is enough for that document to belong to said topic.
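One way to make that decision explicit is sketched below. It takes the list of (topic_id, probability) pairs that LDA[corpus] returns for one document and assigns the document to its most probable topic only if that topic reaches a chosen share. The function name dominant_topic and the min_share cut-off of 0.5 are hypothetical choices for illustration, not part of gensim.

```python
def dominant_topic(doc_topics, min_share=0.5):
    """Return the most probable topic id if it reaches min_share, else None."""
    if not doc_topics:
        return None
    topic_id, prob = max(doc_topics, key=lambda pair: pair[1])
    return topic_id if prob >= min_share else None


# Values copied from the output in the question.
lda_4 = [(22, 0.50714288283121989), (32, 0.25714283145449457)]
lda_8 = [(12, 0.16776844589094803), (19, 0.13980868559963203),
         (22, 0.1728575716782704), (28, 0.37194624921210206)]

print(dominant_topic(lda_4))  # 22
print(dominant_topic(lda_8))  # None
```

With a 50% cut-off, DOC 4 is assigned to topic 22, while DOC 8 is left unassigned because its strongest topic (28) only reaches about 37%. A lower cut-off, or simply always taking the maximum, would be equally defensible.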

(*) Again, keep in mind that LDA[corpus] only shows you the topics that exceed the threshold, not the whole distribution. You can also access the whole distribution using:

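# gamma: unnormalised topic weights, one row per document
theta, _ = lda.inference(corpus)
# normalise each row so it sums to 1 (a proper topic distribution)
theta /= theta.sum(axis=1)[:, None]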
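The row normalisation above can be checked on a small array without training a model. The sketch below uses a made-up 2 x 3 matrix in place of the output of lda.inference; the values are hypothetical and only serve to show what the broadcasting does.

```python
import numpy as np

# Hypothetical stand-in for the first return value of lda.inference:
# 2 documents x 3 topics, unnormalised weights (made-up numbers).
gamma = np.array([[4.0, 0.5, 0.5],
                  [1.0, 1.0, 2.0]])

# sum(axis=1) gives one total per document; [:, None] turns it into a
# column so that each row is divided by its own total.
theta = gamma / gamma.sum(axis=1)[:, None]
row_sums = theta.sum(axis=1)

print(theta)
print(row_sums)
```

Each row of theta is then a full distribution over all topics for one document, including the small entries that LDA[corpus] leaves out.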
