Problem description
I am trying to find the optimal number of topics for an LDA model in Gensim. One method I found is to calculate the log likelihood of each model and compare them against each other, e.g. as discussed in "The input parameters for using latent Dirichlet allocation".
Hence I looked into calculating the log likelihood of an LDA model with Gensim and came across the following post: "How do you estimate α parameter of a latent dirichlet allocation model?"
That post basically states that the update_alpha() method implements the method described in Huang, Jonathan, "Maximum likelihood estimation of Dirichlet distribution parameters". Still, I don't know how to obtain this parameter using the library without changing the code.
How can I obtain the log likelihood from an LDA model with Gensim?
Is there a better way to obtain the optimal number of topics with Gensim?
Recommended answer
Although I cannot comment on Gensim in particular, I can weigh in with some general advice for optimising your topics.
As you stated, using log likelihood is one method. Another option is to hold out a set of documents from the model generation process, infer topics over them once the model is complete, and check whether the results make sense.
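For Gensim specifically, the held-out comparison might look like the sketch below. It assumes that `LdaModel.log_perplexity`, which returns a per-word likelihood bound (a variational lower bound, not the exact log likelihood), is an acceptable score. The helper names `score_topic_counts` and `best_num_topics` and the training settings are illustrative and not part of the original answer.

```python
def score_topic_counts(train_corpus, heldout_corpus, dictionary, candidates,
                       passes=10, seed=0):
    """Train one LDA model per candidate topic count and score it on held-out docs.

    Both corpora are lists of bag-of-words documents built with the same
    gensim Dictionary. Returns {num_topics: per-word likelihood bound}.
    """
    # Imported here so best_num_topics() below works without gensim installed.
    from gensim.models import LdaModel

    scores = {}
    for num_topics in candidates:
        lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                       num_topics=num_topics, passes=passes, random_state=seed)
        # log_perplexity returns a per-word variational bound; higher is better.
        scores[num_topics] = lda.log_perplexity(heldout_corpus)
    return scores


def best_num_topics(scores):
    """Return the topic count with the highest held-out score."""
    return max(scores, key=scores.get)
```

With a tokenised corpus, one would build a `gensim.corpora.Dictionary`, convert each document with `dictionary.doc2bow`, split the result into a training part and a held-out part, and call `best_num_topics(score_topic_counts(train, heldout, dictionary, [5, 10, 20, 40]))`. The candidate list here is arbitrary. The bound tends to flatten rather than peak sharply, so it is worth looking at the whole score curve and at the topics themselves instead of trusting the maximum alone.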
A completely different method you could try is a hierarchical Dirichlet process (HDP), which can determine the number of topics in the corpus dynamically, without it being specified in advance.
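Gensim ships an `HdpModel` class, so this idea can be tried directly. The sketch below is one possible way to read a topic count off a fitted model; it is a heuristic, not an established procedure. Gensim's implementation works with a fixed truncation level, so the model always carries more topic slots than the corpus uses, and the sketch assumes that counting the topics which receive a noticeable weight in at least one document is a fair estimate. The helper names and the `min_weight` threshold are hypothetical.

```python
from collections import Counter


def count_active_topics(doc_topics, min_weight=0.05):
    """Count topics that carry at least min_weight in at least one document.

    doc_topics is an iterable of per-document lists of (topic_id, weight) pairs.
    """
    usage = Counter()
    for topics in doc_topics:
        for topic_id, weight in topics:
            if weight >= min_weight:
                usage[topic_id] += 1
    return len(usage)


def hdp_topic_estimate(corpus, dictionary, min_weight=0.05, seed=0):
    """Fit an HDP model and estimate how many topics the corpus actually uses."""
    # Imported here so count_active_topics() works without gensim installed.
    from gensim.models import HdpModel

    hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=seed)
    # hdp[bow] infers the (topic_id, weight) mixture for one document.
    return count_active_topics((hdp[bow] for bow in corpus), min_weight)
```

The estimate depends on `min_weight`, so it is best treated as a starting point for the LDA topic count and checked against held-out likelihood or manual inspection.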
There are many papers on how best to specify parameters and evaluate your topic model; depending on your experience level, these may or may not be suitable for you:
Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.
Evaluation Methods for Topic Models, Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D.
Also, here is the paper about the hierarchical Dirichlet process:
Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.