This article explains how to get the per-document topic probability table from an LDA model fitted with the text2vec package. The question and answer below should be a useful reference.
Problem description
The LDA topic modeling in the text2vec package is amazing. It is indeed much faster than the topicmodels package.
However, I don't know how to get the probability that each document belongs to each topic, as in the example below:
     V1           V2           V3           V4
1    0.001025237  7.89E-05     7.89E-05     7.89E-05
2    0.002906977  0.002906977  0.014534884  0.002906977
3    0.003164557  0.003164557  0.003164557  0.003164557
4    7.21E-05     7.21E-05     0.000360334  7.21E-05
5    0.000804433  8.94E-05     8.94E-05     8.94E-05
6    5.63E-05     5.63E-05     5.63E-05     5.63E-05
7    0.001984127  0.001984127  0.001984127  0.001984127
8    0.003515625  0.000390625  0.000390625  0.000390625
9    0.000748503  0.000748503  0.003742515  0.003742515
10   0.000141723  0.00297619   0.000141723  0.000708617
This is the code for the text2vec LDA:
library(text2vec)
# mmseg4j() is a Chinese word segmenter, assumed to be provided by a
# separately loaded segmentation package
ss2 <- as.character(stressor5$weibo)
seg2 <- mmseg4j(ss2)
# Create vocabulary. Terms will be unigrams (simple words).
it_test = itoken(seg2, progressbar = FALSE)
vocab2 <- create_vocabulary(it_test)
pruned_vocab2 = prune_vocabulary(vocab2,
                                 term_count_min = 10,
                                 doc_proportion_max = 0.5,
                                 doc_proportion_min = 0.001)
vectorizer2 <- vocab_vectorizer(pruned_vocab2)
dtm_test = create_dtm(it_test, vectorizer2)
# pass the same pruned vocabulary that was used to build the DTM
lda_model = LDA$new(n_topics = 1000, vocabulary = pruned_vocab2,
                    doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr = lda_model$fit_transform(dtm_test, n_iter = 1000,
                                          convergence_tol = 0.01,
                                          check_convergence_every_n = 10)
Recommended answer
doc_topic_distr is a matrix that contains, for each document, the number of times its words were assigned to each topic. So you just need to normalize each row by the number of words in the document (you can also add doc_topic_prior before normalization).
library(text2vec)
data("movie_review")
tokens = movie_review$review %>%
  tolower %>%
  word_tokenizer
# turn off progressbar because it won't look nice in rmd
it = itoken(tokens, ids = movie_review$id, progressbar = FALSE)
v = create_vocabulary(it) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = "lda_c")
doc_topic_prior = 0.1
lda_model =
  LDA$new(n_topics = 10, vocabulary = v,
          doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)
doc_topic_distr =
  lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01,
                          check_convergence_every_n = 10)
head(doc_topic_distr)
#        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# 5814_8   16   18    0   34    0   16   49    0   20    23
# 2381_9    4    0    6   20    0    0    6    6    0    28
# 7759_3   21   39    7    0    3   47    0   25   21    17
# 3630_4   18    7   22   14   19    0   18    0    2    35
# 9495_8    4    0   13   17   13   78    3    2   28    25
# 8196_8    0    0    0   11    0    8    0    8    8     0
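These entries are raw assignment counts, not probabilities. As a quick sanity check (a small sketch using the objects defined above), each row of doc_topic_distr should sum to the number of in-vocabulary tokens in that document:
# row sums equal each document's (in-vocabulary) token count,
# confirming these are counts rather than probabilities
head(rowSums(doc_topic_distr))
Normalizing each row then turns the counts into probabilities: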
doc_topic_prob = normalize(doc_topic_distr, norm = "l1")
# or add the doc_topic_prior first and then normalize:
# doc_topic_prob = normalize(doc_topic_distr + doc_topic_prior, norm = "l1")
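If you prefer not to depend on text2vec's normalize() helper, the same L1 normalization can be written in base R. A minimal sketch (doc_topic_prob2 is a hypothetical name; note that a document with no in-vocabulary tokens would produce an NaN row):
# divide each row by its row sum; R recycles the length-nrow vector
# down the columns, so entry [i, j] is divided by the sum of row i
doc_topic_prob2 = doc_topic_distr / rowSums(doc_topic_distr)
# every row should now sum to 1 (up to floating point)
range(rowSums(doc_topic_prob2))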
This concludes the article on how to get the topic probability table from text2vec LDA. We hope the recommended answer above is helpful.