使用多处理功能共享巨大的字典

使用多处理功能共享巨大的字典

本文介绍了python:使用多处理功能共享巨大的字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用多处理处理大量数据,存储在字典中。基本上我正在做的是加载一些存储在字典中的签名,从中构建一个共享的dict对象(获取由Manager.dict()返回的代理对象),并将此代理作为参数传递给具有要在多处理中执行。



只是为了澄清:

 签名= dict()
load_signatures(signatures)
[...]
manager = Manager()
signaturesProxy = manager.dict(signatures)
[... ]
result = pool.map(myfunction,[signaturesProxy] * NUM_CORES)

如果签名不到200万条目,那么一切都会奏效。无论如何,我必须处理具有5.8M密钥的字典(以二进制格式进行酸洗签名生成4.8 GB文件)。在这种情况下,该进程在创建代理对象期间会死机:

 追溯(最近的最后一次呼叫):
文件matrix.py,第617行,在< module>
signaturesProxy = manager.dict(signatures)
文件/usr/lib/python2.6/multiprocessing/managers.py,第634行,临时
令牌,exp = self._create (typeid,* args,** kwds)
文件/usr/lib/python2.6/multiprocessing/managers.py,第534行,_create
id,exposed = dispatch(conn,None ,'create',(typeid,)+ args,kwds)
文件/usr/lib/python2.6/multiprocessing/managers.py,第79行,在调度
raise convert_to_error(kind,结果)
multiprocessing.managers.RemoteError:
---------------------------------- -----------------------------------------
追溯(最近的电话最后):
文件/usr/lib/python2.6/multiprocessing/managers.py,第173行,handle_request
request = c.recv()
EOFError
-------------------------------------------------- -------------------------


$ b $我知道数据结构是巨大的,但我正在一台配备有32GB RAM的机器上运行,我看到p加载签名后,占用7GB的RAM。然后,它开始构建代理对象,并且RAM使用率高达〜17GB的RAM,但从不接近32.此时,RAM使用率开始迅速减少,并且过程以上述错误结束。所以我猜这不是由于内存不足的错误...



任何想法或建议?



谢谢,



Davide

解决方案>

如果字典是只读的,则在大多数操作系统中您不需要代理对象。



只需在启动工作之前加载字典,然后放入他们在某个地方可以到达;最简单的地方是全球范围内的一个模块。他们可以从工人那里读取。

 从多处理导入池

buf =

def f(x):
buf.find(x)
返回0

如果__name__ =='__main__':
buf =a* 1024 * 1024 * 1024
池=池(进程= 1)
result = pool.apply_async(f,[10])
print result.get = 5)

这只使用1GB的内存组合,而不是每个进程1GB,因为任何现代的操作系统将在fork之前创建的数据写入阴影。只要记住,其他工作人员将不会看到对数据的更改,当然,内存将被分配给您更改的任何数据。



它将使用一些内存:包含引用计数的每个对象的页面将被修改,因此将被分配。这是否取决于数据。



这将适用于实现普通分支的任何操作系统。它不会在Windows上工作;其(瘫痪)流程模型需要重新启动每个工作人员的整个过程,因此共享数据不是很好。


I'm processing very large amounts of data, stored in a dictionary, using multiprocessing. Basically all I'm doing is loading some signatures, stored in a dictionary, building a shared dict object out of it (getting the 'proxy' object returned by Manager.dict() ) and passing this proxy as argument to the function that has to be executed in multiprocessing.

Just to clarify:

signatures = dict()
load_signatures(signatures)
[...]
manager = Manager()
signaturesProxy = manager.dict(signatures)
[...]
result = pool.map ( myfunction , [ signaturesProxy ]*NUM_CORES )

Now, everything works perfectly if signatures is less than 2 million entries or so. Anyways, I have to process a dictionary with 5.8M keys (pickling signatures in binary format generates a 4.8 GB file). In this case, the process dies during the creation of the proxy object:

Traceback (most recent call last):
  File "matrix.py", line 617, in <module>
signaturesProxy = manager.dict(signatures)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 634, in temp
token, exp = self._create(typeid, *args, **kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 534, in _create
id, exposed = dispatch(conn, None, 'create', (typeid,)+args, kwds)
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 79, in dispatch
raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.6/multiprocessing/managers.py", line 173, in handle_request
    request = c.recv()
EOFError
---------------------------------------------------------------------------

I know the data structure is huge but I'm working on a machine equipped w/ 32GB of RAM, and running top I see that the process, after loading the signatures, occupies 7GB of RAM. It then starts building the proxy object and the RAM usage goes up to ~17GB of RAM but never gets close to 32. At this point, the RAM usage starts diminishing quickly and the process terminates with the above error. So I guess this is not due to an out-of-memory error...

Any idea or suggestion?

Thank you,

Davide

解决方案

If the dictionaries are read-only, you don't need proxy objects in most operating systems.

Just load the dictionaries before starting the workers, and put them somewhere they'll be reachable; the simplest place is globally to a module. They'll be readable from the workers.

from multiprocessing import Pool

buf = ""

def f(x):
    buf.find("x")
    return 0

if __name__ == '__main__':
    buf = "a" * 1024 * 1024 * 1024
    pool = Pool(processes=1)
    result = pool.apply_async(f, [10])
    print result.get(timeout=5)

This only uses 1GB of memory combined, not 1GB for each process, because any modern OS will make a copy-on-write shadow of the data created before the fork. Just remember that changes to the data won't be seen by other workers, and memory will, of course, be allocated for any data you change.

It will use some memory: the page of each object containing the reference count will be modified, so it'll be allocated. Whether this matters depends on the data.

This will work on any OS that implements ordinary forking. It won't work on Windows; its (crippled) process model requires relaunching the entire process for each worker, so it's not very good at sharing data.

这篇关于python:使用多处理功能共享巨大的字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-03 09:17
查看更多