Problem description
I want to compute the cosine similarity between LDA topics. The gensim function gensim.matutils.cossim can do this, but I don't know which parameters (vectors) I should pass to it.
Here is the code:
import numpy as np
import lda
from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer(min_df=4, max_features=10000, stop_words='english')
cvz = cvectorizer.fit_transform(tweet_texts_processed)
n_topics = 8
n_iter = 500
lda_model = lda.LDA(n_topics=n_topics, n_iter=n_iter)
X_topics = lda_model.fit_transform(cvz)
n_top_words = 6
topic_summaries = []
topic_word = lda_model.topic_word_ # get the topic words
vocab = cvectorizer.get_feature_names()
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
doc_topic = lda_model.doc_topic_
lda_keys = []
for i, tweet in enumerate(tweets):
    lda_keys += [X_topics[i].argmax()]
import gensim
from gensim import corpora, models, similarities
#Cosine Similarity between LDA topics
sim = gensim.matutils.cossim(LDA_topic[1], LDA_topic[2])
Recommended answer
You can use the topic-word distribution vectors. Both topic vectors need to have the same dimension, and each must be a list of tuples in which the first element is an int and the second is a float.
vec1 (list of (int, float))
The first element is the word_id, which you can find in the model's id2word variable. If you have two models, you need to take the union of their dictionaries. Your vectors must look like this:
[(1, 0.541223), (2, 0.44123)]
Then you can compare them.
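As a rough illustration of the steps above, here is a minimal sketch that uses only the standard library. The helper names (to_sparse, remap_ids, cosine) and the small topic-word matrix are made up for this example; in the question's code the rows would come from lda_model.topic_word_. The hand-written cosine function stands in for gensim.matutils.cossim, which accepts the same list-of-(int, float) format, so it could be swapped in once the vectors are built.

```python
import math

def to_sparse(topic_row):
    """Turn a dense topic-word row into a list of (word_id, weight) tuples."""
    return [(int(i), float(w)) for i, w in enumerate(topic_row) if w != 0]

def remap_ids(sparse_vec, vocab, shared_ids):
    """Re-index a sparse vector from one model's vocabulary to shared word ids."""
    return [(shared_ids[vocab[i]], w) for i, w in sparse_vec]

def cosine(vec1, vec2):
    """Cosine similarity of two sparse (word_id, weight) vectors."""
    d1, d2 = dict(vec1), dict(vec2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical stand-in for lda_model.topic_word_ (3 topics x 4 words).
topic_word = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
topics = [to_sparse(row) for row in topic_word]
sim_01 = cosine(topics[0], topics[1])  # topics with similar word weights
sim_02 = cosine(topics[0], topics[2])  # topics with different word weights
print(sim_01, sim_02)

# Two models with different vocabularies: build a shared word -> id mapping
# (the union of both dictionaries) before comparing their topics.
vocab_a = ["cat", "dog"]
vocab_b = ["dog", "fish"]
shared_ids = {w: i for i, w in enumerate(sorted(set(vocab_a) | set(vocab_b)))}
topic_a = remap_ids([(0, 0.6), (1, 0.4)], vocab_a, shared_ids)
topic_b = remap_ids([(0, 0.5), (1, 0.5)], vocab_b, shared_ids)
sim_ab = cosine(topic_a, topic_b)
print(sim_ab)
```

Within a single model every topic row already shares the same word ids, so the remapping step is only needed when the topics come from models trained with different vocabularies.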