Problem description
I am trying to run a parallel process in Python, in which I have to extract certain polygons from a large array based on some conditions. The large array has 10k+ polygons that are indexed.
I pass (array, index) to an extract_polygon function. Based on the index, the function either returns the polygon corresponding to that index or returns nothing, depending on the conditions defined. The array is never changed and is only read to look up the polygon for the given index.
Since the array is very large, I am running into an out-of-memory error during parallel processing. How can I avoid that? (In a way, how can I effectively use a shared array in multiprocessing?)
Here is my sample code:
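import multiprocessing as mp
import time

import numpy as np
from scipy import ndimage

MIN_AREA = 100
MAX_AREA = 10000

def extract_polygon(array, index):
    # Bounding box of the cells labelled `index` (mask cast to 0/1 integer labels).
    slices = ndimage.find_objects((array == index).astype(np.uint8))
    if not slices:
        return None  # index not present in the array
    poly = array[slices[0]]
    area = np.count_nonzero(poly)
    # Keep only polygons whose area lies strictly between the two thresholds.
    if MIN_AREA < area < MAX_AREA:
        return poly
    return None

if __name__ == "__main__":
    # `array` is the large indexed array; `indices` is a list of all the indexes we have.
    start = time.time()
    pool = mp.Pool(10)
    results = pool.starmap(extract_polygon, [(array, index) for index in indices])
    pool.close()
    pool.join()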
Can I use any other library, like ray, in this case?
Recommended answer
You can definitely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray

ray.init()

# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)

@ray.remote
def extract_polygon(array, index):
    # Change this to actually extract the polygon.
    return index

# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]

# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
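If adding a dependency is not an option, a similar effect may be achievable with the standard library's multiprocessing.shared_memory module (Python 3.8+). The following is only a minimal sketch and not part of the original answer. It assumes the array is a 2-D integer label array, and it computes the bounding box with plain NumPy rather than ndimage.find_objects. The helper names bounding_box, share_array and run_parallel are hypothetical, introduced here for illustration. Only the name, shape and dtype of the shared block are sent to each worker, so the array itself is not pickled per task.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

MIN_AREA, MAX_AREA = 100, 10000  # thresholds from the question


def bounding_box(mask):
    """Return (row_slice, col_slice) around the True cells of a 2-D mask, or None."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return slice(rows[0], rows[-1] + 1), slice(cols[0], cols[-1] + 1)


def extract_polygon(shm_name, shape, dtype, index):
    """Attach to the shared block by name and crop the polygon labelled `index`."""
    shm = shared_memory.SharedMemory(name=shm_name)
    labels = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # a view, not a copy
    box = bounding_box(labels == index)
    poly = None if box is None else labels[box].copy()
    del labels  # drop the view before close(); an open view makes close() fail
    shm.close()
    if poly is None:
        return None
    area = np.count_nonzero(poly)
    return poly if MIN_AREA < area < MAX_AREA else None


def share_array(array):
    """Copy `array` into a new shared-memory block; the caller must unlink it."""
    shm = shared_memory.SharedMemory(create=True, size=array.nbytes)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array
    del view
    return shm


def run_parallel(array, indices, processes=4):
    """Run extract_polygon over `indices`, sending only the block name to workers."""
    shm = share_array(array)
    try:
        tasks = [(shm.name, array.shape, array.dtype.str, i) for i in indices]
        with Pool(processes) as pool:
            return pool.starmap(extract_polygon, tasks)
    finally:
        shm.close()
        shm.unlink()
```

In a script, run_parallel(array, indices) would be called from under an if __name__ == "__main__": guard so that worker processes can import the module safely. Each returned polygon is a copy, so results stay valid after the shared block is unlinked.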