I am training a classifier on a large dataset in nltk (15 data files, each with 5 * 10^5 features), and I keep hitting this error:
Traceback (most recent call last):
  File "term_classify.py", line 51, in <module>
    classifier = obj.run_classifier(cltype)
  File "/root/Desktop/karim/software/nlp/nltk/publish/lists/classifier_function.py", line 146, in run_classifier
    classifier = NaiveBayesClassifier.train(train_set)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/naivebayes.py", line 210, in train
    count = feature_freqdist[label, fname].N()
MemoryError
Code:
from nltk.classify import NaiveBayesClassifier

def run_classifier(self, cltype):
    # map each category name to the sense label used during training
    texts = {}
    texts['act'] = 'act'
    texts['art'] = 'art'
    texts['animal'] = 'anim'
    texts['country'] = 'country'
    texts['company'] = 'comp'
    train_set = []
    # `features` and `sense` are defined elsewhere in the class (the snippet is partial)
    train_set = train_set + [(self.get_feature(word), sense) for word in features]
    # len(train_set) == 545668; better if we could push 100000 at a time
    classifier = NaiveBayesClassifier.train(train_set)
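One partial mitigation (a sketch, not part of the original post): NaiveBayesClassifier.train makes a single pass over its input, so feeding it a generator avoids holding all 545668 (featureset, label) tuples in memory at once. The classifier's internal frequency tables still grow with the number of features, so this alone may not cure the MemoryError:

    # inside run_classifier, replacing the list comprehension above
    def labeled_featuresets():
        # same `features` and `sense` as in the original snippet
        for word in features:
            yield (self.get_feature(word), sense)

    classifier = NaiveBayesClassifier.train(labeled_featuresets())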
Is there any way to train the classifier in batches, or otherwise reduce the load, without affecting the results?
Best answer
You can always switch to scikit-learn's MultinomialNB, a Naive Bayes implementation that supports batch learning (partial fit):
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB
partial_fit(X, y, classes=None, sample_weight=None)

Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. It is especially useful when the whole dataset is too big to fit in memory at once. The method has some performance overhead, so it is best to call partial_fit on chunks of data that are as large as possible (as long as they fit the memory budget) to hide the overhead.
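A minimal out-of-core sketch along those lines (the FeatureHasher, the 2**20 hash space, and the batch size of 100000 are illustrative assumptions, not from the original answer; train_set is the list of (feature_dict, label) pairs from the question). FeatureHasher is stateless, so each chunk can be vectorized independently, and alternate_sign=False (available in recent scikit-learn) keeps the hashed counts non-negative, as MultinomialNB requires:

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.naive_bayes import MultinomialNB

    hasher = FeatureHasher(n_features=2 ** 20, input_type='dict', alternate_sign=False)
    clf = MultinomialNB()
    classes = ['act', 'art', 'anim', 'country', 'comp']  # every label, known up front

    def batches(pairs, size=100000):
        # yield successive fixed-size slices of (feature_dict, label) pairs
        for i in range(0, len(pairs), size):
            yield pairs[i:i + size]

    for batch in batches(train_set):
        feats, labels = zip(*batch)
        X = hasher.transform(feats)  # sparse matrix; no fitted vocabulary needed
        # classes is required on the first call; passing it every time is harmless
        clf.partial_fit(X, list(labels), classes=classes)

Because the hashing trick never builds a vocabulary, results can differ slightly from a dictionary-based vectorizer; raising n_features reduces the chance of hash collisions.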
A similar question about training a classifier as a batch in Python can be found on Stack Overflow: https://stackoverflow.com/questions/26907220/