Problem description
I tried doing text clustering with LDA, but it isn't giving me distinct clusters. Below is my code:
#Import libraries
from gensim import corpora, models
import pandas as pd
from gensim.parsing.preprocessing import STOPWORDS
from itertools import chain
#stop words
stoplist = list(STOPWORDS)
new = ['education','certification','certificate','certified']
stoplist.extend(new)
stoplist.sort()
#read data
dat = pd.read_csv(r'D:\data_800k.csv', encoding='latin-1').Certi.tolist()
#remove stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in dat]
#dictionary
dictionary = corpora.Dictionary(texts)
#corpus
corpus = [dictionary.doc2bow(text) for text in texts]
#train model
lda = models.LdaMulticore(corpus, id2word=dictionary, num_topics=25, workers=4,minimum_probability=0)
#print topics
lda.print_topics(num_topics=25, num_words=7)
#get corpus
lda_corpus = lda[corpus]
#calculate cutoff score
scores = list(chain(*[[score for topic_id, score in topic]
                      for topic in lda_corpus]))
#threshold
threshold = sum(scores)/len(scores)
threshold
#output: 0.039999999971137644
#cluster1
cluster1 = [j for i,j in zip(lda_corpus,dat) if i[0][1] > threshold]
#cluster2
cluster2 = [j for i,j in zip(lda_corpus,dat) if i[1][1] > threshold]
The problem is that there are overlapping elements: documents in cluster1 also tend to show up in cluster2, and so on.
I also tried raising the threshold manually to 0.5, but it gives me the same issue.
Recommended answer
That is to be expected.
Neither documents nor words are usually uniquely assignable to a single cluster.
If you were to label some data manually, you would also quickly find documents that cannot clearly be assigned to one topic or the other. So it is good if the algorithm does not pretend there is a clean, unique assignment.
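If you nevertheless need disjoint clusters (for reporting, for example), a common workaround is to harden the soft assignment by placing each document only in its single most probable topic. Below is a minimal sketch that reuses the lda, corpus and dat objects from the question; the hard_clusters name is just for illustration.

from collections import defaultdict

#assign each document to its most probable topic (argmax),
#which by construction yields non-overlapping clusters
hard_clusters = defaultdict(list)
for doc_bow, doc_text in zip(corpus, dat):
    topic_dist = lda.get_document_topics(doc_bow, minimum_probability=0)
    best_topic, best_prob = max(topic_dist, key=lambda item: item[1])
    hard_clusters[best_topic].append(doc_text)

#size of each hard cluster
print({topic: len(docs) for topic, docs in hard_clusters.items()})

Keep in mind this only hides the ambiguity: a document whose two largest topic probabilities are, say, 0.31 and 0.29 still belongs almost equally to both clusters, which is exactly the point made above.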