Problem description
I am trying to solve a difficult problem and am getting lost.
Here is what I'm supposed to do:
INPUT: file
OUTPUT: dictionary
Return a dictionary whose keys are all the words in the file (broken by
whitespace). The value for each word is a dictionary containing each word
that can follow the key and a count for the number of times it follows it.
You should lowercase everything.
Use strip and string.punctuation to strip the punctuation from the words.
Example:
>>> #example.txt is a file containing: "The cat chased the dog."
>>> with open('../data/example.txt') as f:
...     word_counts(f)
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}
Here's all I have so far, in trying to at least pull out the correct words:
def word_counts(f):
    i = 0
    orgwordlist = f.split()
    for word in orgwordlist:
        if i<len(orgwordlist)-1:
            print orgwordlist[i]
            print orgwordlist[i+1]

with open('../data/example.txt') as f:
    word_counts(f)
I'm thinking I need to somehow use the .count method and eventually zip some dictionaries together, but I'm not sure how to count the second words for each first word.
I know I'm nowhere near solving the problem, but trying to take it one step at a time. Any help is appreciated, even just tips pointing in the right direction.
Recommended answer
We can solve this in two passes:
- in a first pass, we construct a Counter and count the tuples of two consecutive words using zip(..); and
- then we turn that Counter into a dictionary of dictionaries.
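As a small illustration of the first pass, the sketch below pairs each word with its successor using zip and counts the pairs. The word list is made up for this illustration (already lowercased and free of punctuation) and is not read from a file.

```python
from collections import Counter

# Hypothetical sample: the words of "The cat chased the dog." after cleaning.
words = ["the", "cat", "chased", "the", "dog"]

# zip stops at the shorter sequence, so the last word gets no successor.
pairs = list(zip(words, words[1:]))

# Each key is a (word, next_word) tuple; each value is how often it occurs.
pair_counts = Counter(pairs)

print(pairs)
print(pair_counts)
```

Because every key is a (word, next_word) tuple, the second pass only has to regroup these counts by the first element of each tuple.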
This results in the following code:
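from collections import Counter, defaultdict
import string

def word_counts(f):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    st = [w.strip(string.punctuation) for w in f.read().lower().split()]
    # First pass: count every pair of consecutive words.
    ctr = Counter(zip(st, st[1:]))
    # Second pass: regroup the pair counts by the first word.
    dc = defaultdict(dict)
    for (k1, k2), v in ctr.items():
        dc[k1][k2] = v
    return dict(dc)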
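The same result can probably be built in a single pass as well. The following is a minimal sketch, not part of the original answer: it assumes f is a text-mode file object, strips punctuation with string.punctuation as the assignment asks, and uses a defaultdict of Counter objects. The name word_counts_single_pass and the io.StringIO stand-in for example.txt are illustrative choices.

```python
import io
import string
from collections import Counter, defaultdict


def word_counts_single_pass(f):
    """Map each word to a dict counting the words that directly follow it."""
    # Lowercase, split on whitespace, strip leading/trailing punctuation.
    words = [w.strip(string.punctuation) for w in f.read().lower().split()]

    # followers[word] is a Counter of the words seen right after `word`.
    followers = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1

    # Convert to plain dicts so the result has the requested shape.
    return {word: dict(counts) for word, counts in followers.items()}


# io.StringIO stands in for the opened example.txt from the question.
sample = io.StringIO("The cat chased the dog.")
result = word_counts_single_pass(sample)
print(result)
```

For the sample sentence this should give 'the' two followers ('cat' and 'dog') with a count of 1 each. The last word, 'dog', never appears as a key because nothing follows it, which matches the expected output in the question.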