I am training a classifier with NLTK on a large dataset (15 data files with 5 * 10^5 features each),

and I ran into this error:

Traceback (most recent call last):
  File "term_classify.py", line 51, in <module>
    classifier = obj.run_classifier(cltype)
  File "/root/Desktop/karim/software/nlp/nltk/publish/lists/classifier_function.py", line 146, in run_classifier
    classifier = NaiveBayesClassifier.train(train_set)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/naivebayes.py", line 210, in train
    count = feature_freqdist[label, fname].N()
MemoryError


Code:

from nltk.classify import NaiveBayesClassifier

def run_classifier(self, cltype):
    # map each training file to its sense label
    texts = {}
    texts['act'] = 'act'
    texts['art'] = 'art'
    texts['animal'] = 'anim'
    texts['country'] = 'country'
    texts['company'] = 'comp'
    train_set = []
    # `features` (the words read from each file) and `sense` (the file's
    # label from `texts`) are prepared elsewhere; each word becomes one
    # (feature_dict, label) pair
    train_set = train_set + [(self.get_feature(word), sense) for word in features]
    # len(train_set) == 545668; it would be better if we could push
    # 100000 examples at a time
    classifier = NaiveBayesClassifier.train(train_set)


Is there any way to train the classifier in batches, or to reduce the load without affecting the results?

Best answer

You can always switch to the Naive Bayes implementation in scikit-learn, which supports batch learning via partial_fit:

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB


  partial_fit(X, y, classes=None, sample_weight=None)

  Incremental fit on a batch of samples. This method is expected to be
  called several times consecutively on different chunks of a dataset so
  as to implement out-of-core or online learning. This is especially
  useful when the whole dataset is too big to fit in memory at once. This
  method has some performance overhead, hence it is better to call
  partial_fit on chunks of data that are as large as possible (as long as
  they fit the memory budget) to hide the overhead.
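
As a rough illustration, the batched training loop could look like the sketch below. It assumes get_feature(word) is the asker's feature extractor and that labelled_words is a hypothetical list of (word, label) pairs standing in for the undefined features/sense above; FeatureHasher, the n_features value and the 100000 batch size are illustrative choices, not part of the original answer.

# Minimal sketch of out-of-core Naive Bayes training (assumptions noted above).
from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

def batches(items, size):
    # yield successive fixed-size slices of a list
    for start in range(0, len(items), size):
        yield items[start:start + size]

# FeatureHasher vectorises feature dicts without keeping a vocabulary in
# memory; alternate_sign=False keeps the values non-negative, which
# MultinomialNB requires (very old scikit-learn versions spell this
# non_negative=True)
hasher = FeatureHasher(n_features=2 ** 20, input_type='dict', alternate_sign=False)
clf = MultinomialNB()
labels = ['act', 'art', 'anim', 'country', 'comp']  # every class, known up front

for batch in batches(labelled_words, 100000):
    X = hasher.transform(get_feature(word) for word, _ in batch)
    y = [label for _, label in batch]
    # classes= must be given on the first call, since partial_fit never
    # sees the full dataset at once
    clf.partial_fit(X, y, classes=labels)

Because the model only ever sees one chunk at a time, the full label set has to be declared on the first partial_fit call; after that, each 100000-example chunk updates the same model, matching the chunk size wished for in the question's comment.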

Regarding python - training a classifier in batches, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26907220/
