I have a function that I want to parallelize.
import re
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()
# create the multiprocessing pool
pool = Pool(cores)

# n_pattern, n_dict and tok are defined earlier (not shown here)
def clean_preprocess(text):
    """
    Given a string of text, the function:
    1. Removes all punctuation and numbers and converts the text to lower case.
    2. Handles negation words defined above.
    3. Tokenizes words that are longer than one character.
    """
    cores = mp.cpu_count()
    pool = Pool(cores)
    lower = re.sub(r'[^a-zA-Z\s\']', "", text).lower()
    lower_neg_handled = n_pattern.sub(lambda x: n_dict[x.group()], lower)
    letters_only = re.sub(r'[^a-zA-Z\s]', "", lower_neg_handled)
    words = [i for i in tok.tokenize(letters_only) if len(i) > 1]  # parallelize this?
    return ' '.join(words)
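I have been reading the multiprocessing documentation, but I am still a bit confused about how to parallelize my function properly. I would be grateful if someone could point me in the right direction.
Best answer
Functionally, you can parallelize this by splitting the text into several sub-parts, applying the tokenization to each sub-part, and then merging the results.
Something like: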
mid = len(text) // 2  # integer division; len(text)/2 is a float in Python 3
text0 = text[:mid]
text1 = text[mid:]
Then process the two parts as follows:
# Here I assume that clean_preprocess is the sequential version
# (no pool created inside it) and that we manage the pool outside of it.
with Pool(2) as p:
    words0, words1 = p.map(clean_preprocess, [text0, text1])
# clean_preprocess returns a space-joined string, so join the halves with a space
words = words0 + ' ' + words1
# or keep working with words0 and words1 separately to save the cost of joining them
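One detail the snippet above glosses over is that cutting the text exactly in the middle can split a word in two. The following is a minimal, self-contained sketch of the same split/map/merge idea that only cuts at whitespace. It rests on several assumptions: it uses the standard-library multiprocessing.Pool rather than pathos, and simple_clean is a hypothetical stand-in for clean_preprocess (the original depends on n_pattern, n_dict and tok, which are not shown) that returns a list of words instead of a joined string. The helper names split_at_whitespace and parallel_clean are also made up for illustration.

```python
import re
from multiprocessing import Pool

_WHITESPACE = re.compile(r"\s")


def split_at_whitespace(text, n_parts):
    """Split text into at most n_parts chunks, cutting only at whitespace."""
    chunks = []
    start = 0
    target = max(1, len(text) // n_parts)  # approximate chunk size
    while start < len(text) and len(chunks) < n_parts - 1:
        cut = start + target
        if cut >= len(text):
            break
        # Move the cut forward to the next whitespace character so that
        # no word ends up divided between two chunks.
        match = _WHITESPACE.search(text, cut)
        if match is None:
            break
        chunks.append(text[start:match.start()])
        start = match.start()
    chunks.append(text[start:])  # whatever is left is the last chunk
    return chunks


def simple_clean(text):
    """Hypothetical stand-in for clean_preprocess (no negation handling)."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return [word for word in letters_only.split() if len(word) > 1]


def parallel_clean(text, n_parts=2):
    """Clean each chunk in its own worker process and merge the word lists."""
    chunks = split_at_whitespace(text, n_parts)
    with Pool(len(chunks)) as pool:
        word_lists = pool.map(simple_clean, chunks)
    return [word for words in word_lists for word in words]


if __name__ == "__main__":
    sample = "Don't split THIS text in the middle of a word, please! " * 3
    # The parallel result should match the sequential one.
    print(parallel_clean(sample, n_parts=2) == simple_clean(sample))
```

Keeping the pool outside the cleaning function, as here, means the cleaning function stays purely sequential and can be tested on its own.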
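However, your function appears to be memory-bound, so you should not expect a dramatic speedup (a factor of 2 is typically the most we can hope for on a standard computer these days). See, for example, "How much does parallelization help the performance if the program is memory-bound?" or "What do the terms "CPU bound" and "I/O bound" mean?"
So you can try splitting the text into more than two parts, but it may not be any faster. You may even get disappointing performance, because splitting the text can cost more than processing it.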
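Whether splitting pays off is easiest to settle by measuring. The sketch below is one possible way to compare a sequential run against pools of different sizes. It is an illustration under assumptions, not a benchmark of the original code: count_words is a hypothetical stand-in for clean_preprocess, the sample text is synthetic, and the parallel timings deliberately include the cost of starting the pool and sending the chunks to the workers, since that overhead is part of what you pay in practice.

```python
import re
import time
from multiprocessing import Pool


def count_words(text):
    """Hypothetical stand-in for clean_preprocess: count words longer than 1."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return len([word for word in letters_only.split() if len(word) > 1])


def best_time(fn, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def run_sequential(chunks):
    return [count_words(chunk) for chunk in chunks]


def run_parallel(chunks, workers):
    # Pool start-up and data transfer are included in the measurement.
    with Pool(workers) as pool:
        return pool.map(count_words, chunks)


if __name__ == "__main__":
    # Synthetic input: four identical chunks of sample text.
    chunks = ["some sample text, repeated many times. " * 20_000] * 4
    print("sequential:", best_time(run_sequential, chunks))
    for workers in (2, 4):
        print(workers, "workers:", best_time(run_parallel, chunks, workers))
```

If the parallel timings are not clearly below the sequential one on your real data, the extra complexity is probably not worth it.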
Source: the Stack Overflow question on parallelizing a function in a for loop in Python: https://stackoverflow.com/questions/54699149/