


I have a small Python script that finds the 10 most frequent words, the 10 least frequent words, and the total number of words in a .txt document. According to the assignment, a word is defined as 2 letters or more. The 10 most frequent and 10 least frequent words print fine, but when I print the total number of words in the document, it counts every word, including single-letter words (such as "a"). How can I make the total count ONLY the words that have 2 letters or more?


from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1

"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1

# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number]

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)

# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words


I am not an expert with Python; this is for a Python class I am currently taking. The neatness and formatting of my code are graded in this assignment, so if possible, can someone also tell me whether the format of this code is considered "good practice"?



The list comprehension method:

def countWords(s):
    words = s.split()
    return len([word for word in words if len(word)>=2])


The verbose method:

def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count
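
One caveat: both versions above measure the raw token, so something like "a," still counts because the trailing comma makes it two characters long. Below is a minimal sketch of a stricter count, assuming you want the same punctuation stripping and regex the question already applies before updating the counter. The name count_valid_words is hypothetical, and the snippet is written for Python 3:

```python
import re
from string import punctuation

# Same definition of a word as in the question: two or more lowercase letters.
WORDS_ONLY = re.compile(r'^[a-z]{2,}$')

def count_valid_words(s):
    """Count tokens in s that are 2+ letter words after stripping punctuation."""
    return sum(1 for token in s.split()
               if WORDS_ONLY.match(token.strip(punctuation).lower()))
```

Calling count_valid_words(line) in place of count_words(line) inside the file loop should keep the total consistent with what ends up in the counter.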


As an aside, kudos on using defaultdict, but I would go with collections.Counter:

import collections

# Count only tokens of 2+ characters; filepath is the path to the .txt file.
words = collections.Counter(word for line in open(filepath)
                            for word in line.split() if len(word) >= 2)
mostFrequent = [w[0] for w in words.most_common(10)]
leastFrequent = [w[0] for w in words.most_common()[-10:]]
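
Putting it together, here is one possible sketch of the whole task, not the only way to do it. It assumes the question's regex and punctuation stripping define what a word is, and it relies on the fact that once only valid words are counted, the total is just the sum of the counts. The function name word_stats and its parameters are hypothetical, and it is written for Python 3:

```python
import re
from collections import Counter
from string import punctuation

# Same definition of a word as in the question: two or more lowercase letters.
WORDS_ONLY = re.compile(r'^[a-z]{2,}$')

def word_stats(lines, number=10):
    """Return (total, most_frequent, least_frequent) for an iterable of lines."""
    counter = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(punctuation).lower()
            if WORDS_ONLY.match(word):
                counter[word] += 1
    # Only valid words were counted, so the sum of the counts is the total.
    total = sum(counter.values())
    # Ties are broken alphabetically, as in the question's sort keys.
    most = sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:number]
    least = sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]
    return total, most, least
```

An open file object can be passed directly as lines, for example inside a "with open('charactermask.txt') as txt_file:" block, since iterating over a file yields its lines.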


Hope this helps

