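I'm new to Python and I'm trying to write a script that finds the most common words shared by two files. I can find the most common words in each file separately, but I don't know how to get, say, the top 5 words that are common to both files: the words must appear in both files, and their frequency across the two files should be the highest.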

import re
from collections import Counter

# Punctuation to strip. The hyphen is escaped so that ")-=" is not
# read as a character range (which would also remove digits).
PUNCTUATION = r'[,.<;:)\-=!>_(?"]'

# Read test3.txt, strip punctuation, lowercase, and split into words.
with open("test3.txt", "r") as hfFile:
    words1 = re.sub(PUNCTUATION, '', hfFile.read()).lower().split()

# One entry per line from test2.txt (loaded but not used below).
with open('test2.txt', 'r') as f:
    sWords = [line.strip() for line in f]

# Same processing for test4.txt.
with open("test4.txt", "r") as tsFile:
    words = re.sub(PUNCTUATION, '', tsFile.read()).lower().split()

mc = Counter(words).most_common()
mc2 = Counter(words1).most_common()

print(len(mc))
print(len(mc2))

Sample test3 and test4 files are shown below.
test3:
Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

test4:
Essays are generally scholarly pieces of writing giving the author's own argument, but the definition is vague, overlapping with those of an article, a pamphlet and a short story.

Essays can consist of a number of elements, including: literary criticism, political manifestos, learned arguments, observations of daily life, recollections, and reflections of the author. Almost all modern essays are written in prose, but works in verse have been dubbed essays (e.g. Alexander Pope's An Essay on Criticism and An Essay on Man). While brevity usually defines an essay, voluminous works like John Locke's An Essay Concerning Human Understanding and Thomas Malthus's An Essay on the Principle of Population are counterexamples. In some countries (e.g., the United States and Canada), essays have become a major part of formal education. Secondary students are taught structured essay formats to improve their writing skills, and admission essays are often used by universities in selecting applicants and, in the humanities and social sciences, as a way of assessing the performance of students during final exams.

Best answer

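You just need to take the intersection of the two Counter objects with the & operator (N below is the number of top words you want, e.g. 5):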

mc = Counter(words)
mc2 = Counter(words1)
total=mc&mc2
mos=total.most_common(N)

Example:
>>> d1={'a':5,'f':2,'c':1,'h':2,'t':4}
>>> d2={'a':3,'b':2,'e':1,'h':5,'t':6}
>>> c1=Counter(d1)
>>> c2=Counter(d2)
>>> t=c1&c2
>>> t
Counter({'t': 4, 'a': 3, 'h': 2})
>>> t.most_common(2)
[('t', 4), ('a', 3)]
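
Applied to two pieces of text, the same idea could look roughly like the sketch below. The helper names `tokenize` and `top_common_words` and the two sample strings are hypothetical stand-ins (they are not from the question), and the tokenizer uses `re.findall` rather than the question's `re.sub` cleanup, so treat it as one possible approach rather than the asker's exact pipeline.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words (runs of letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def top_common_words(text_a, text_b, n=5):
    """Return the n most frequent words that occur in both texts.

    Each shared word is scored by the smaller of its two counts,
    which is what the Counter `&` operator produces.
    """
    shared = Counter(tokenize(text_a)) & Counter(tokenize(text_b))
    return shared.most_common(n)

# Hypothetical sample strings standing in for the contents of the two files.
text_a = "the cat and the dog and the bird"
text_b = "the dog saw the cat and the bird and the dog"

print(top_common_words(text_a, text_b, n=2))  # [('the', 3), ('and', 2)]
```

To use it with real files, the strings could be replaced by the result of reading each file, for example `open("test3.txt").read()`.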

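Note, however, that & returns the minimum count between the two counters. You can also use the union operator | to get the maximum counts, and then a simple dict comprehension to keep the maximum counts for the common words only: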
>>> m=c1|c2
>>> m
Counter({'t': 6, 'a': 5, 'h': 5, 'b': 2, 'f': 2, 'c': 1, 'e': 1})
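>>> max_counts={i:j for i,j in m.items() if i in t}
>>> max_counts
{'a': 5, 'h': 5, 't': 6}

Finally, if you want the total count of each common word across both files, you can add the counters together:
>>> s=Counter(max_counts)+t
>>> s
Counter({'t': 10, 'a': 8, 'h': 7})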
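
Since the minimum and the maximum of two counts always add up to their sum, the same totals can also be computed directly from the original counters. The following is a minimal sketch using the example data above; the names `common` and `combined` are introduced here for illustration only.

```python
from collections import Counter

c1 = Counter({'a': 5, 'f': 2, 'c': 1, 'h': 2, 't': 4})
c2 = Counter({'a': 3, 'b': 2, 'e': 1, 'h': 5, 't': 6})

# Words present in both counters (dict key views support set intersection).
common = c1.keys() & c2.keys()

# Combined frequency of each shared word across both counters.
combined = Counter({word: c1[word] + c2[word] for word in common})

print(combined.most_common(2))  # [('t', 10), ('a', 8)]
```

This avoids building the intermediate union and intersection counters when only the combined totals are needed.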

About "python - Most common words between 2 files using Python": a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30661138/
