I seriously hate to post a question about an entire chunk of code, but I've been working on this for the past 3 hours and I can't wrap my head around what is happening. I have approximately 600 tweets that I am retrieving from a CSV file, each with a score (between -2 and 2) reflecting the sentiment towards a presidential candidate.
However, when I run the classifier trained on this sample against any other data, only one value is returned (positive). I have checked whether the scores are being added correctly, and they are. It just doesn't make sense to me that 85,000 tweets would all be rated "positive" given a diverse training set of 600. Does anyone know what is happening here? Thanks!
import ast
import csv

import nltk

tweets = []

with open('romney.csv', 'rb') as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        tweet = row[1]
        try:
            score = ast.literal_eval(row[12])
            if score > 0:
                print score
                print tweet
                # store (word list, label) pairs for training
                tweets.append((tweet.split(), "positive"))
            elif score < 0:
                print score
                print tweet
                tweets.append((tweet.split(), "negative"))
        except ValueError:
            tweet = ""

def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

word_features = get_word_features(get_words_in_tweets(tweets))
training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

c = 0
with open('usa.csv', "rU") as csvfile:
    mycsv = csv.reader(csvfile)
    for row in mycsv:
        try:
            tweet = row[0]
            c = c + 1
            print classifier.classify(extract_features(tweet.split()))
        except IndexError:
            tweet = ""

A Naive Bayes classifier usually works best when it evaluates only the words that appear in a document and ignores the words that are absent. Since you use
features['contains(%s)' % word] = (word in document_words)
each document is mostly represented by features whose value is False. Try something like this instead:
if word in document_words:
    features['contains(%s)' % word] = True
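
To get a feel for how lopsided the dense representation is, you can count the True and False features for a single tweet. This is only a rough sketch: the lexicon and the tweet are made up for illustration, and `extract_dense` is a hypothetical name for the scheme used in the question, with the lexicon passed in as a parameter.

```python
# Hypothetical lexicon and tweet, for illustration only.
word_features = ['great', 'terrible', 'economy', 'jobs', 'taxes',
                 'debate', 'win', 'lose', 'vote', 'plan']
tweet = 'great debate tonight'


def extract_dense(document, word_features):
    """Dense scheme from the question: one True/False entry per lexicon word."""
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}


features = extract_dense(tweet.split(), word_features)
n_true = sum(1 for v in features.values() if v)
n_false = len(features) - n_true
print('%d True, %d False' % (n_true, n_false))  # 2 True, 8 False
```

With a real lexicon of thousands of words and tweets of a dozen words or so, the share of False entries is far higher than in this toy case.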
(You should probably also change the for loop to something more efficient: rather than looping over every word in the lexicon, loop over the words that occur in the document.)
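
A minimal sketch of what that could look like, assuming the lexicon is kept in a set so that membership checks are cheap. Passing `word_features` as a second parameter is my own change to keep the example self-contained (the question's version reads a global), and the sample lexicon and tweet are invented.

```python
def extract_features(document, word_features):
    """Return only the 'contains(word)' features that are present.

    document: list of tokens from one tweet.
    word_features: set of lexicon words (passed in explicitly here, which
    is an assumption; the question's version uses a module-level variable).
    """
    features = {}
    for word in set(document):       # loop over the tweet, not the lexicon
        if word in word_features:    # constant-time lookup in a set
            features['contains(%s)' % word] = True
    return features


# Hypothetical data, for illustration only.
lexicon = set(['great', 'terrible', 'economy', 'jobs'])
print(extract_features('great jobs plan great'.split(), lexicon))
```

Since `nltk.classify.apply_features` calls the feature function with a single argument, you would need to bind the lexicon first, for example with `functools.partial(extract_features, word_features=lexicon)`.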