Question
I am hoping to assign each document to one topic using LDA. I realise that what you get from LDA is a distribution over topics. However, as you can see from the last line below, I assign each document to its most probable topic.
My question is this: I have to run lda[corpus] a second time in order to get these topics. Is there some other built-in gensim function that will give me these topic-assignment vectors directly? Especially since the LDA algorithm has already passed through the documents, it might have saved these topic assignments?
# Imports assumed by the snippet (gensim's Porter stemmer and stop-word list)
from gensim import corpora, models
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import STOPWORDS

stem = PorterStemmer().stem

# Get the Dictionary and BoW of the corpus after some stemming/cleansing
texts = [[stem(word) for word in document.split() if word not in STOPWORDS]
         for document in cleanDF.text.values]
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.9)
corpus = [dictionary.doc2bow(text) for text in texts]

# The actual LDA model
lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=30,
                          chunksize=10000, passes=10, workers=4)

# Assign each document to its most probable topic
lda_topic_assignment = [max(p, key=lambda item: item[1]) for p in lda[corpus]]
Answer
There is no other built-in Gensim function that will give you the topic-assignment vectors directly.
Your observation is valid: the LDA algorithm has already passed through the documents. However, the implementation works by updating the model in chunks (based on the value of the chunksize parameter), so it does not keep the per-document topic assignments for the entire corpus in memory.
Hence you have to use lda[corpus] again, or use the method lda.get_document_topics().