

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from gensim import corpora, models, similarities
from nltk.corpus import stopwords
import codecs

documents = []
with codecs.open("Master_File_for_Docs.txt", encoding='utf-8', mode="r") as fid:
    for line in fid:
        documents.append(line)

# Build the stopword list from NLTK's English stopwords
stoplist = []
x = stopwords.words('english')
for word in x:
    stoplist.append(word)

# Removes stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
#corpus_lda = lda[corpus]
#for doc in corpus_lda:
 #   print(doc)


I am running Gensim for topic modeling and trying to get the code above to work. I know the code works because a friend ran it successfully on a Mac, but when I run it on a Windows computer it gives me a



The logging that I set up on the second line also doesn't appear on my Windows computer. Is there something in Windows that I need to fix for gensim to work?



The MemoryError appears because Gensim tries to keep all of the data it needs in memory while analyzing it. The options are few:

  • Use a server with more memory (AWS machine, anything more powerful than your PC)
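  • Try a 64-bit Python interpreter
  • Try reducing the size parameter. Note that it is passed when the model is created (it is a constructor argument of models such as Word2Vec), not to model.save(). This leaves fewer features representing your words. For the LdaModel in the question, num_topics plays a comparable role
  • Try increasing the min_count parameter, which is also set when the model is created rather than in model.save(). The model will then consider only words that appear at least min_count times. For a dictionary-based model such as LDA, dictionary.filter_extremes(no_below=...) has a similar effect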


Be careful, though: the last two suggestions will change the characteristics of your model.
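To make the interpreter check and the min_count idea concrete, here is a minimal sketch that uses only the standard library. The helper names interpreter_bits and prune_rare_words and the toy documents are made up for illustration; they are not part of gensim, and the filtering only mimics what a min_count-style threshold does.

```python
import struct
from collections import Counter


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running Python."""
    return struct.calcsize("P") * 8


def prune_rare_words(texts, min_count):
    """Drop every token that occurs fewer than min_count times across all texts.

    Hypothetical helper: it imitates the effect of a min_count threshold on
    already tokenized documents (a list of lists of strings).
    """
    counts = Counter(token for text in texts for token in text)
    return [[token for token in text if counts[token] >= min_count]
            for text in texts]


if __name__ == "__main__":
    # Toy tokenized documents, invented for this example.
    texts = [
        ["topic", "model", "memory"],
        ["topic", "model", "windows"],
        ["topic", "error"],
    ]
    print("Python is %d-bit" % interpreter_bits())
    print(prune_rare_words(texts, min_count=2))
```

If the first line reports 32-bit, the process can only address a few gigabytes regardless of how much RAM the machine has, which would be consistent with the same script working on another computer. With the dictionary from the question, calling dictionary.filter_extremes(no_below=...) before building the corpus should shrink the vocabulary in a similar way, at the cost of discarding rare words.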

