What is the best way to obtain the optimal number of topics for an LDA model using Gensim?

Question

I am trying to obtain the optimal number of topics for an LDA model in Gensim. One method I found is to calculate the log likelihood of each model and compare them against each other, as described, for example, in "The input parameters for using latent Dirichlet allocation".

I therefore looked into calculating the log likelihood of an LDA model with Gensim and came across the post "How do you estimate α parameter of a latent dirichlet allocation model?", which basically states that the update_alpha() method implements the method described in Huang, Jonathan, "Maximum Likelihood Estimation of Dirichlet Distribution Parameters". Still, I don't know how to obtain this parameter using the library without changing the code.

How can I obtain the log likelihood from an LDA model with Gensim?

Is there a better way to obtain the optimal number of topics with Gensim?

Recommended answer

Although I cannot comment on Gensim in particular, I can weigh in with some general advice for optimising your topics.

As you stated, using log likelihood is one method. Another option is to hold out a set of documents from the model generation process, infer topics over them once the model is complete, and check whether the result makes sense.
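In Gensim terms, the held-out idea could be sketched roughly as follows. This is an illustration under assumptions, not a recommendation from the answer: the helper names heldout_bounds and pick_num_topics are hypothetical, the inputs are assumed to be pre-tokenised documents (lists of strings), and the score used is the per-word likelihood bound that LdaModel.log_perplexity returns, which is a variational bound rather than the exact log likelihood.

```python
def heldout_bounds(train_texts, heldout_texts, topic_counts, passes=10, seed=0):
    """Return {num_topics: per-word likelihood bound on the held-out documents}."""
    # Imported inside the function so that pick_num_topics below
    # stays usable even where gensim is not installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Build the vocabulary from the training documents only.
    dictionary = Dictionary(train_texts)
    train_corpus = [dictionary.doc2bow(doc) for doc in train_texts]
    # doc2bow ignores words that never occurred in the training documents.
    heldout_corpus = [dictionary.doc2bow(doc) for doc in heldout_texts]

    scores = {}
    for k in topic_counts:
        lda = LdaModel(
            corpus=train_corpus,
            id2word=dictionary,
            num_topics=k,
            passes=passes,
            random_state=seed,
        )
        # Per-word bound on the held-out documents (higher is better).
        scores[k] = lda.log_perplexity(heldout_corpus)
    return scores


def pick_num_topics(scores):
    """Return the topic count with the highest (least negative) bound."""
    return max(scores, key=scores.get)
```

A call such as pick_num_topics(heldout_bounds(train, heldout, [5, 10, 20])) would then return the candidate with the best held-out score. A good score does not guarantee interpretable topics, so inspecting the inferred topics by hand, as suggested above, remains worthwhile.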

A completely different method you could try is a hierarchical Dirichlet process (HDP), which can find the number of topics in the corpus dynamically, without it being specified in advance.
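Gensim ships an HdpModel class, so a rough sketch of this approach might look like the code below. Two caveats, both assumptions of this sketch rather than anything stated in the answer: the model works with a fixed upper limit on the number of topics (its truncation level), so it does not report a single "number of topics" directly, and counting the topics that receive at least min_weight in some document is only a heuristic. The helper names hdp_doc_topics and count_used_topics are hypothetical.

```python
def hdp_doc_topics(texts, seed=0):
    """Fit an HDP model and return each document's (topic_id, weight) pairs."""
    # Imported inside the function so that count_used_topics below
    # stays usable even where gensim is not installed.
    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    hdp = HdpModel(corpus=corpus, id2word=dictionary, random_state=seed)
    # Indexing the model with a bag-of-words vector infers its topic mixture.
    return [hdp[bow] for bow in corpus]


def count_used_topics(doc_topics, min_weight=0.05):
    """Count distinct topics that reach min_weight in at least one document."""
    used = {
        topic_id
        for doc in doc_topics
        for topic_id, weight in doc
        if weight >= min_weight
    }
    return len(used)
```

With texts given as lists of tokens, count_used_topics(hdp_doc_topics(texts)) gives a ballpark figure that depends on the chosen threshold, so it is better treated as a starting point for a narrower LDA search than as a final answer.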

There are many papers on how best to specify parameters and evaluate your topic model; depending on your experience level, these may or may not be useful to you:

Rethinking LDA: Why Priors Matter, Wallach, H.M., Mimno, D. and McCallum, A.

Evaluation Methods for Topic Models, Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D.

Also, here is the paper about the hierarchical Dirichlet process:

Hierarchical Dirichlet Processes, Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M.
