# Bayesian Methods to Create an Anti-Spammer
We can construct P(Spam | Word) for every (meaningful) word we encounter during training.
When analyzing a new mail, we multiply these per-word probabilities together to get the probability that it is spam.
This assumes the presence of each word is independent of the others, which is one reason the method is called "Naive Bayes".

In other words: ignore the relationships between words, compute each word's contribution to the 'spam' score on its own, and then classify a new mail from the combined contributions of all the words it contains.
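As a minimal sketch of that idea (separate from the scikit-learn code below, with made-up word likelihoods and priors rather than values estimated from real training mail), combining per-word probabilities under the independence assumption looks roughly like this:

```python
from math import log, exp

# Hypothetical per-word likelihoods and class priors; in a real filter these
# would be estimated (with smoothing) from counts in the training mails.
p_word_given_spam = {'free': 0.30, 'viagra': 0.20, 'golf': 0.01}
p_word_given_ham  = {'free': 0.05, 'viagra': 0.001, 'golf': 0.10}
p_spam, p_ham = 0.4, 0.6

def spam_probability(words):
    # "Naive" step: treat the words as independent, so the joint likelihood is
    # just the product of per-word likelihoods (summed in log space to avoid
    # numerical underflow on long messages; unknown words get a tiny floor).
    log_spam = log(p_spam) + sum(log(p_word_given_spam.get(w, 1e-6)) for w in words)
    log_ham  = log(p_ham)  + sum(log(p_word_given_ham.get(w, 1e-6)) for w in words)
    # Bayes' rule: normalize the two joint scores into a posterior probability.
    return exp(log_spam) / (exp(log_spam) + exp(log_ham))

print(spam_probability(['free', 'viagra']))  # close to 1.0 -> spam
print(spam_probability(['golf']))            # well below 0.5 -> ham
```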

The code is below.
First we read the data in with pandas, then use scikit-learn to build a spam classifier, and finally use that classifier to predict whether two example strings should be classified as spam or ham.


```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Author: hezhb
# Created Time: Tue 01 May 2018 11:49:35 AM CST

import os
import io
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory and yield (filepath, body) for every mail file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []

            f = io.open(filepath, 'r', encoding='latin1')
            for line in f:
                # The body starts after the first blank line; everything
                # before that is the mail header, which we skip.
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True

            f.close()
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # One row per mail, labelled with the given class ('spam' or 'ham').
    rows = []
    index = []

    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)


PATH = './hands-on/emails/'

# DataFrame.append was removed in recent pandas, so concatenate the two
# labelled frames instead.
data = pd.concat([dataFrameFromDirectory(PATH + 'spam', 'spam'),
                  dataFrameFromDirectory(PATH + 'ham', 'ham')])
#print(data.head())

"""
Now we will use CountVectorizer to split up each message into its list of words
and throw that into a MultinomialNB classifier; call fit() and we've got
a trained spam filter ready to go.
"""

vectorizer = CountVectorizer(encoding='latin1')
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

# Now we can try this classifier out.
examples = ['Free viagra Now', 'Hi Bob, how about a game of golf tomorrow.']
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
print(predictions)
```
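If the training data is representative, you would expect the printed predictions to label the first string spam and the second ham. Spot-checking two strings says little about overall quality, though; a hold-out test is the usual next step. Below is a minimal sketch of that (an illustrative extension, not part of the original script, reusing the `data` frame from above with scikit-learn's `train_test_split`):

```python
from sklearn.model_selection import train_test_split

# Split the labelled mails into a training and a test set, retrain,
# and score accuracy on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(
    data['message'].values, data['class'].values,
    test_size=0.2, random_state=42)

vec = CountVectorizer(encoding='latin1')
clf = MultinomialNB()
clf.fit(vec.fit_transform(X_train), y_train)      # fit only on the training mail
print(clf.score(vec.transform(X_test), y_test))   # accuracy on unseen mail
```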

