Semi-supervised Naive Bayes with NLTK

Question

I have built a semi-supervised version of NLTK's Naive Bayes in Python based on EM (the expectation-maximization algorithm). However, in some iterations of EM I am getting negative log-likelihoods (the log-likelihoods of EM must be positive in every iteration), so I believe there must be a mistake in my code. After carefully reviewing my code, I have no idea why this is happening. I would really appreciate it if someone could spot any mistakes in my code below:

(Reference material on semi-supervised Naive Bayes)

Main loop of the EM algorithm

#initial assumptions:
#Bernoulli NB: only feature presence (value 1) or absence (value None) is computed

#initial data:
#C: classifier trained with labeled data
#labeled_data: an array of tuples (feature dic, label)
#features: dictionary that outputs feature dictionary for a given document id

for iteration in range(1, self.maxiter):

  #Expectation: compute probabilities for each class for each unlabeled document
  #An array of tuples (feature dictionary, probability dist) is built
  unlabeled_data = [(features[id],C.prob_classify(features[id])) for id in U]

  #Maximization: given the probability distributions of previous step,
  #update label, feature-label counts and update classifier C
  #gen_freqdists is a custom function, see below
  #gen_probdists is the original NLTK function
  l_freqdist_act,ft_freqdist_act, ft_values_act = self.gen_freqdists(labeled_data,unlabeled_data)
  l_probdist_act, ft_probdist_act = self.gen_probdists(l_freqdist_act, ft_freqdist_act, ft_values_act, ELEProbDist)
  C = nltk.NaiveBayesClassifier(l_probdist_act, ft_probdist_act)

  #Compute log-likelihood
  #NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
  #for labeled data, sum logprobs output by the classifier for the label
  #for unlabeled data, sum logprobs output by the classifier for each label
  log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
  log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])

  #Continue until convergence
  if log_lh_old == "first":
    if self.debug: print "\tM: #iteration 1",log_lh,"(FIRST)"
    log_lh_old =  log_lh
  else:
    log_lh_diff = log_lh - log_lh_old
    if self.debug: print "\tM: #iteration",iteration,log_lh_old,"->",log_lh,"(",log_lh_diff,")"
    if log_lh_diff < self.log_lh_diff_min: break
    log_lh_old =  log_lh

Custom function gen_freqdists, used to create the needed frequency distributions

def gen_freqdists(self, instances_l, instances_ul):
    l_freqdist = FreqDist() #frequency distrib. of labels
    ft_freqdist= defaultdict(FreqDist) #dictionary of freq. distrib. for ft-label pairs
    ft_values = defaultdict(set) #dictionary of possible values for each ft (only 1/None)
    fts = set() #set of all fts

    #counts for labeled data
    for (ftdic,label) in instances_l:
      l_freqdist.inc(label,1)
      for f in ftdic.keys():
        fts.add(f)
        ft_freqdist[label,f].inc(1,1)
        ft_values[f].add(1)

    #counts for unlabeled data
    #we must compute maximum a posteriori label estimate
    #and update label/ft occurrences accordingly
    for (ftdic,probs) in instances_ul:
      map_l = probs.max() #label with highest probability
      map_p = probs.prob(map_l) #probability of map_l
      l_freqdist.inc(map_l,count=map_p)
      for f in ftdic.keys():
        fts.add(f)
        ft_freqdist[map_l,f].inc(1,count=map_p)
        ft_values[f].add(1)

    #features not appearing in documents get implicit None values
    for l in l_freqdist.samples():
      num_samples = l_freqdist[l]
      for f in fts:
        count = ft_freqdist[l,f].N()
        ft_freqdist[l,f].inc(None, num_samples-count)
        ft_values[f].add(None)

    #return computed frequency distributions
    return l_freqdist, ft_freqdist, ft_values

Recommended answer

I think you're summing the wrong values.

This is your code that is supposed to compute the sum of the log probs:

  #Compute log-likelihood
  #NLTK Naive bayes classifier prob_classify func gives logprob(class) + logprob(doc|class))
  #for labeled data, sum logprobs output by the classifier for the label
  #for unlabeled data, sum logprobs output by the classifier for each label

  log_lh = sum([C.prob_classify(ftdic).prob(label) for (ftdic,label) in labeled_data])
  log_lh += sum([C.prob_classify(ftdic).prob(label) for (ftdic,ignore) in unlabeled_data for label in l_freqdist_act.samples()])

According to the NLTK documentation, prob_classify (on NaiveBayesClassifier) returns a ProbDistI object (not logprob(class) + logprob(doc|class)). When you get this object, you're calling the prob method on it for a given label. You probably want to call logprob instead, and negate the returned value as well.
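To make the difference concrete, here is a minimal sketch that does not use NLTK itself. TinyProbDist is a hypothetical stand-in that only mimics the prob/logprob pair of ProbDistI (NLTK's logprob returns a base-2 logarithm), and the posteriors are made-up numbers, not output from your classifier:

```python
import math


class TinyProbDist:
    """Hypothetical stand-in that mimics the prob/logprob pair of NLTK's ProbDistI."""

    def __init__(self, probs):
        self._probs = dict(probs)

    def prob(self, label):
        return self._probs.get(label, 0.0)

    def logprob(self, label):
        # Like ProbDistI.logprob, this returns the base-2 logarithm.
        p = self.prob(label)
        return math.log2(p) if p > 0 else float("-inf")


# Made-up posteriors for three labeled documents: (distribution, true label).
docs = [
    (TinyProbDist({"pos": 0.9, "neg": 0.1}), "pos"),
    (TinyProbDist({"pos": 0.2, "neg": 0.8}), "neg"),
    (TinyProbDist({"pos": 0.6, "neg": 0.4}), "pos"),
]

# What the code in the question computes: a sum of plain probabilities.
sum_of_probs = sum(dist.prob(label) for dist, label in docs)

# A sum of log probabilities: never above zero, because each probability is <= 1.
sum_of_logprobs = sum(dist.logprob(label) for dist, label in docs)

# Negating the sum gives a non-negative quantity.
neg_log_lh = -sum_of_logprobs

print(sum_of_probs, sum_of_logprobs, neg_log_lh)
```

The sum of prob values is not a log-likelihood at all, while the sum of logprob values is at most zero, so it only becomes a positive number after negation. In your loop that would mean swapping .prob(label) for .logprob(label) in both sums, assuming the rest of the convergence check is adjusted to the sign you choose.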