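I have some serial code like the following that computes a word concordance, i.e. it records collocated word pairs. The program below works; the list of sentences is hard-coded purely for illustration.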

import sys
from collections import defaultdict

GLOBAL_CONCORDANCE = defaultdict(lambda: defaultdict(lambda: defaultdict(lambda: [])))

def BuildConcordance(sentences):
    global GLOBAL_CONCORDANCE
    for sentenceIndex, sentence in enumerate(sentences):
        words = [word for word in sentence.split()]

        for index, word in enumerate(words):
            for i, collocate in enumerate(words[index:len(words)]):
                GLOBAL_CONCORDANCE[word][collocate][i].append(sentenceIndex)

def main():
    sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
    BuildConcordance(sentences)
    print(GLOBAL_CONCORDANCE)

if __name__ == "__main__":
    main()

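It seems to me that the first for loop can be parallelized, because the computation for each sentence is independent of the others. However, the data structure being modified is a global one.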

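I tried using Python's multiprocessing Pool, but I ran into pickling problems, which makes me wonder whether I am using the right design pattern. Can someone suggest a good way to parallelize this code?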

Best answer

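In general, multiprocessing is easiest when you use a functional style. In this case, my suggestion is to return a list of result tuples from each call of the worker function and merge them in the parent process. Having the workers build the nested defaultdict themselves adds complexity without gaining you anything. Something like this: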

from collections import defaultdict
from multiprocessing import Pool

GLOBAL_CONCORDANCE = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

def concordance_worker(index_sentence):
    sent_index, sentence = index_sentence
    words = sentence.split()

    return [(word, colo_word, colo_index, sent_index)
            for i, word in enumerate(words)
            for colo_index, colo_word in enumerate(words[i:])]

def build_concordance(sentences):
    global GLOBAL_CONCORDANCE
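    pool = Pool(8)

    # Each worker returns plain tuples, which pickle cleanly between processes.
    results = pool.map(concordance_worker, enumerate(sentences))
    pool.close()
    pool.join()

    # Merge in the parent process, so only one process touches the global.
    for result in results:
        for word, colo_word, colo_index, sent_index in result:
            GLOBAL_CONCORDANCE[word][colo_word][colo_index].append(sent_index)

    print(len(GLOBAL_CONCORDANCE))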


def main():
    sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
    build_concordance(sentences)

if __name__ == "__main__":
    main()
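
The question does not show the code that failed to pickle, so the following is only a guess at one likely cause: a nested defaultdict whose default factories are lambdas cannot be pickled, whereas the flat tuples returned by concordance_worker can. A minimal sketch, where can_pickle is a hypothetical helper introduced for illustration:

```python
import pickle
from collections import defaultdict

# Same shape as GLOBAL_CONCORDANCE: the default factories are lambdas.
nested = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
nested["a"]["b"][0].append(1)

# The kind of value concordance_worker returns: only str and int inside tuples.
flat = [("a", "b", 0, 1)]


def can_pickle(obj):
    """Hypothetical helper: True if pickle.dumps(obj) succeeds, else False."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


print(can_pickle(nested))  # False: the lambda factory cannot be pickled
print(can_pickle(flat))    # True
```

Anything passed to or returned from a Pool worker has to survive this round trip, which is why keeping the nested structure in the parent process sidesteps the problem.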

Let me know if this doesn't produce what you're looking for.
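
One way to check that the parallel version produces what the serial one does is to run the same worker through both the built-in map and pool.map and compare the merged results. The sketch below is written for Python 3 and makes a few assumptions that are not in the answer: the merge builds plain nested dicts instead of defaultdicts so the two results compare directly, the map_fn parameter and the merge function are hypothetical names, and the two sample sentences are made up.

```python
from collections import defaultdict  # not needed here; plain dicts are used instead
from multiprocessing import Pool


def concordance_worker(index_sentence):
    """Return (word, collocate, offset, sentence_index) tuples for one sentence."""
    sent_index, sentence = index_sentence
    words = sentence.split()
    return [(word, colo_word, offset, sent_index)
            for i, word in enumerate(words)
            for offset, colo_word in enumerate(words[i:])]


def merge(results):
    """Fold the per-sentence tuple lists into one plain nested dict."""
    concordance = {}
    for result in results:
        for word, colo_word, offset, sent_index in result:
            by_offset = concordance.setdefault(word, {}).setdefault(colo_word, {})
            by_offset.setdefault(offset, []).append(sent_index)
    return concordance


def build_concordance(sentences, map_fn=map):
    """map_fn is a hypothetical hook: pass pool.map to run the workers in parallel."""
    return merge(map_fn(concordance_worker, list(enumerate(sentences))))


if __name__ == "__main__":
    sentences = ["the cat sat", "the cat ran"]
    serial = build_concordance(sentences)

    pool = Pool(2)
    try:
        parallel = build_concordance(sentences, pool.map)
    finally:
        pool.close()
        pool.join()

    # pool.map returns results in input order, so the two should match.
    print(serial == parallel)
```

Because pool.map preserves the order of its inputs, the sentence indices are appended in the same order in both runs, so the comparison should hold for any input list.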

Regarding "python - How can I parallelize this word-count function?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8016561/
