This article looks at the question: when sharing dictionaries between threads in Python, can the locking overhead be avoided?

Problem description

I have a multi-threaded application in Python in which threads read very large dicts (loaded from disk and never modified; too large to copy into thread-local storage). The threads then process huge amounts of data, using the dicts as read-only data:

# single-threaded version
import sys

d1, d2, d3 = read_dictionaries()   # large read-only dicts loaded from disk
for line in sys.stdin:
    sys.stdout.write(compute(line, d1, d2, d3) + line)

I am trying to speed this up with threads, each of which would read its own input and write its own output; but since the dicts are huge, I want the threads to share that storage.

IIUC, every time a thread reads from a dict it has to lock it, which imposes a performance cost on the application. This locking is unnecessary, because the dicts are read-only.

Does CPython actually lock the data individually, or does it just use the GIL?

If there really is per-dict locking, is there a way to avoid it?

Recommended answer

Multithreading is of little use for this kind of work in Python: because of the GIL, only one thread executes Python bytecode at a time, so threads give a benefit only in a limited number of cases (mostly I/O-bound ones). It is better to use the multiprocessing module.

Without any code examples from your side, I can only recommend splitting your big dictionary into several parts, processing each part with Pool.map, and merging the results in the main process, as sketched below.

Unfortunately, it is not possible to share a large amount of memory between different Python processes efficiently (we are not talking about the shared-memory pattern based on mmap). But you can read different parts of your dictionary in different processes, or read the entire dictionary in the main process and hand small chunks to the child processes.
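The sketch below is one way to read these two suggestions together; it is not code from the answer, and read_dictionary, process_part and split_dict are placeholders for the real code. The dictionary is read in the main process, split into parts, each part is handed to a child process through Pool.map, and the partial results are merged back in the main process.

from multiprocessing import Pool

def read_dictionary():
    # placeholder: the real code reads a very large dict from disk
    return {i: str(i) * 10 for i in range(1000)}

def process_part(part):
    # placeholder for the real per-part computation
    return {key: len(value) for key, value in part.items()}

def split_dict(d, n):
    # cut the dict into n roughly equal parts
    items = list(d.items())
    step = (len(items) + n - 1) // n
    return [dict(items[i:i + step]) for i in range(0, len(items), step)]

if __name__ == '__main__':
    parts = split_dict(read_dictionary(), 4)      # read in the main process
    with Pool(processes=4) as pool:
        results = pool.map(process_part, parts)   # each part goes to a child
    merged = {}
    for partial in results:                       # merge in the main process
        merged.update(partial)
    print(len(merged))

Note that each part is pickled and copied into its child process, so the per-process memory cost is the size of the part, not of the whole dictionary.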

Also, a warning: be very careful with multiprocessing algorithms, because every extra megabyte is multiplied by the number of processes.

So, based on your pseudocode example, I can assume two possible algorithms, depending on how the compute function is structured:

# "Stateless"
for line in stdin:
    res = compute_1(line) + compute_2(line) + compute_3(line)
    print res, line

# "Shared" state
for line in stdin:
    res = compute_1(line)
    res = compute_2(line, res)
    res = compute_3(line, res)
    print res, line

In the first case you can create several workers, each based on the Process class and loading one of the dictionaries (which keeps the per-process memory usage down), and run the computation like a production line.
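A minimal sketch of such a production line, under the assumption that real load_dict() and compute() functions exist (the ones below are placeholders): each stage runs in its own Process, owns exactly one dictionary, adds its contribution to a running total, and forwards it to the next stage through a Queue.

import sys
from multiprocessing import Process, Queue

def load_dict(path):
    # placeholder: the real code reads a large dict from disk
    return {"path": path}

def compute(line, d):
    # placeholder for compute_1/compute_2/compute_3 from the pseudocode
    return len(line) + len(d)

def stage(path, q_in, q_out):
    d = load_dict(path)                        # this dict exists only in this process
    for line, total in iter(q_in.get, None):   # None marks the end of the input
        q_out.put((line, total + compute(line, d)))
    q_out.put(None)                            # pass the end marker downstream

if __name__ == '__main__':
    q0, q1, q2, q3 = Queue(), Queue(), Queue(), Queue()
    stages = [Process(target=stage, args=(path, q_in, q_out))
              for path, q_in, q_out in [('d1', q0, q1), ('d2', q1, q2), ('d3', q2, q3)]]
    for proc in stages:
        proc.start()
    for line in sys.stdin:
        q0.put((line, 0))                      # seed the pipeline with a zero total
    q0.put(None)
    for line, total in iter(q3.get, None):     # results come back in input order
        sys.stdout.write(str(total) + line)
    for proc in stages:
        proc.join()

With very large inputs you would want bounded queues (Queue(maxsize=...)) and to feed the pipeline from a separate thread, so that the whole input is never buffered in memory at once.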

In the second case you have shared state: each worker needs the result of the previous one, which is the worst case for multithreaded or multiprocess programming. But you can write the algorithm so that several workers use the same Queue, pushing results onto it without waiting for the whole cycle to finish; you just share a Queue instance between your processes.
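A minimal sketch of that last point, with illustrative names only: a single Queue instance is passed to the child process, the worker pushes every result onto it as soon as the result is ready, and the parent consumes from the same queue instead of waiting for the whole run to finish.

from multiprocessing import Process, Queue

def worker(lines, results):
    for line in lines:
        results.put(line.upper())   # placeholder for the real compute step
    results.put(None)               # signal that this worker is done

if __name__ == '__main__':
    results = Queue()               # the one shared Queue instance
    p = Process(target=worker, args=(['a\n', 'b\n'], results))
    p.start()
    for res in iter(results.get, None):   # consume results as they arrive
        print(res, end='')
    p.join()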
