Problem description
I am using Python's gensim to train a Latent Dirichlet Allocation (LDA) model from a small corpus of 231 sentences. However, each time I repeat the process, it generates different topics.
Why do the same LDA parameters and corpus generate different topics every time?
How do I stabilize the topic generation?
I'm using this corpus (http://pastebin.com/WptkKVF0) and this list of stopwords (http://pastebin.com/LL7dqLcj), and here's my code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
from collections import defaultdict
import codecs, os, glob, math

stopwords = [i.strip() for i in codecs.open('stopmild','r','utf8').readlines() if i[0] != "#" and i != ""]

def generateTopics(corpus, dictionary):
    # Build LDA model using the above corpus
    lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
    corpus_lda = lda[corpus]

    # Group topics with similar words together.
    tops = set(lda.show_topics(50))
    top_clusters = []
    for l in tops:
        top = []
        for t in l.split(" + "):
            top.append((t.split("*")[0], t.split("*")[1]))
        top_clusters.append(top)

    # Generate word only topics
    top_wordonly = []
    for i in top_clusters:
        top_wordonly.append(":".join([j[1] for j in i]))

    return lda, corpus_lda, top_clusters, top_wordonly

#######################################################################

# Read textfile, build dictionary and bag-of-words corpus
documents = []
for line in codecs.open("./europarl-mini2/map/coach.en-es.all","r","utf8"):
    lemma = line.split("\t")[3]
    documents.append(lemma)
texts = [[word for word in document.lower().split() if word not in stopwords]
         for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda, corpus_lda, topic_clusters, topic_wordonly = generateTopics(corpus, dictionary)

for i in topic_wordonly:
    print i
Recommended answer
The same parameters and corpus produce different topics because LDA uses randomness in both the training and inference steps.
To stabilize the topics, reset the numpy.random seed to the same value, using numpy.random.seed, every time a model is trained or inference is performed:
import numpy as np
SOME_FIXED_SEED = 42
# before training/inference:
np.random.seed(SOME_FIXED_SEED)
(This is ugly, and it makes Gensim results hard to reproduce; consider submitting a patch. I've already opened an issue.)
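To see why reseeding helps, here is a minimal sketch that does not use gensim at all. `toy_topic_init` is a hypothetical stand-in for whatever random initialization a model performs. It is assumed here to draw from NumPy's global random generator and is not gensim's actual code.

```python
import numpy as np

SOME_FIXED_SEED = 42


def toy_topic_init(num_topics, num_terms):
    """Hypothetical stand-in for a model's random initialization.

    Draws a (num_topics x num_terms) matrix from NumPy's global RNG.
    """
    return np.random.gamma(100.0, 1.0 / 100.0, (num_topics, num_terms))


# Without reseeding, two consecutive runs draw different values.
unseeded_a = toy_topic_init(3, 5)
unseeded_b = toy_topic_init(3, 5)

# Resetting the seed before each run makes the draws identical.
np.random.seed(SOME_FIXED_SEED)
seeded_a = toy_topic_init(3, 5)
np.random.seed(SOME_FIXED_SEED)
seeded_b = toy_topic_init(3, 5)

print("unseeded runs match:", np.array_equal(unseeded_a, unseeded_b))
print("seeded runs match:  ", np.array_equal(seeded_a, seeded_b))
```

The same pattern applies to the training code in the question: call `np.random.seed(SOME_FIXED_SEED)` immediately before each `ldamodel.LdaModel(...)` call, and again before each inference call. If your gensim version supports it, passing `random_state` to `LdaModel` is a cleaner way to get the same effect; check the documentation for your installed version.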