I have some serial code like this that computes a concordance, i.e. it counts collocated word pairs. The program below works; the sentence list is hard-coded purely for illustration.
from collections import defaultdict

GLOBAL_CONCORDANCE = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def BuildConcordance(sentences):
    global GLOBAL_CONCORDANCE
    for sentenceIndex, sentence in enumerate(sentences):
        words = sentence.split()
        for index, word in enumerate(words):
            # Pair each word with every word at or after it in the sentence;
            # i is the collocate's offset from the word (0 = the word itself).
            for i, collocate in enumerate(words[index:]):
                GLOBAL_CONCORDANCE[word][collocate][i].append(sentenceIndex)

def main():
    sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
    BuildConcordance(sentences)
    print(GLOBAL_CONCORDANCE)

if __name__ == "__main__":
    main()
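To make the intended output concrete (a toy example of my own, not part of the program above): the nested mapping goes word -> collocate -> offset -> list of sentence indices, with offset 0 pairing a word with itself.

from collections import defaultdict

concordance = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
for sent_index, sentence in enumerate(["the cat sat", "the dog sat"]):
    words = sentence.split()
    for i, word in enumerate(words):
        for offset, collocate in enumerate(words[i:]):
            concordance[word][collocate][offset].append(sent_index)

print(concordance["the"]["sat"][2])  # [0, 1]: "sat" occurs 2 words after "the" in both sentences
print(concordance["cat"]["cat"][0])  # [0]: offset 0 pairs a word with itself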
It seems to me that the first for loop can be parallelized, since the values computed for each sentence are independent. The catch is that the data structure being modified is a global one.
I tried using Python's Pool module, but I ran into some pickling problems, which makes me suspect I'm not using the right design pattern. Can anyone suggest a good way to parallelize this code?
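For context, here is a minimal reproduction of the kind of failure I'm seeing (my own reduced sketch, assuming the errors come from the lambda default factories, which cannot be pickled):

import pickle
from collections import defaultdict

d = defaultdict(lambda: [])  # lambda as the default factory
try:
    # multiprocessing pickles everything it sends between processes,
    # and lambdas are not picklable
    pickle.dumps(d)
except Exception as exc:
    print("pickling failed:", exc)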
Best answer

Multiprocessing is usually easiest when you work in a functional style. In this case, my suggestion is to have each invocation of the worker function return a list of result tuples; the extra complexity of the nested defaultdict doesn't really buy you anything inside the workers. Something like this:
from collections import defaultdict
from multiprocessing import Pool

GLOBAL_CONCORDANCE = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def concordance_worker(index_sentence):
    # Return plain tuples, which pickle cleanly across processes.
    sent_index, sentence = index_sentence
    words = sentence.split()
    return [(word, colo_word, colo_index, sent_index)
            for i, word in enumerate(words)
            for colo_index, colo_word in enumerate(words[i:])]

def build_concordance(sentences):
    global GLOBAL_CONCORDANCE
    pool = Pool(8)
    results = pool.map(concordance_worker, enumerate(sentences))
    # Only the parent process mutates the nested defaultdict.
    for result in results:
        for word, colo_word, colo_index, sent_index in result:
            GLOBAL_CONCORDANCE[word][colo_word][colo_index].append(sent_index)
    print(len(GLOBAL_CONCORDANCE))

def main():
    sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
    build_concordance(sentences)

if __name__ == "__main__":
    main()
Let me know if that doesn't produce what you're looking for.
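As a quick sanity check, you can query the resulting structure like this (a hypothetical usage sketch that reuses the definitions above; the sentences are mine, and the Pool call must stay under the __main__ guard on platforms that spawn worker processes):

if __name__ == "__main__":
    # Hypothetical usage with real sentences instead of the placeholders.
    build_concordance(["the cat sat", "the dog sat"])
    # "sat" occurs two words after "the" in sentences 0 and 1:
    print(GLOBAL_CONCORDANCE["the"]["sat"][2])  # [0, 1]

The design point is that the workers return plain tuples, which pickle cleanly, while only the parent process mutates GLOBAL_CONCORDANCE, so no locks or shared memory are needed.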
Source: "python - How can I parallelize this word counting function?" on Stack Overflow: https://stackoverflow.com/questions/8016561/