python - Parallelizing millions of iterations of a function

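I am running the function compare below on 20 million different combinations of its parameter sample, where sample is a one-dimensional array of 100 ones and zeros.

compare uses sample together with several other arrays: it takes dot products with them, exponentiates those dot products, and then compares the results with each other. These other arrays stay the same across calls.

On my laptop, it takes about an hour to get through all 20 million combinations.

I am looking for ways to make it run faster. I am open to improving the code below, and also to using libraries such as dask that take advantage of parallel processing.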

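Notes:

The comment on each line of compare gives a very rough estimate of how long that line takes on my machine. They are the results of %%timeit on the individual lines and on the function itself.
In my use case, the inputs to compare are not actually randomly generated.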

import numpy as np

def compare(sample, competition_exp_dot, preferences): # 140 µs
    # exponentiated dot product of every preference row with the sample
    sample_exp_dot = np.exp(preferences @ sample) #30.3 µs
    # put the sample's column next to the competitors' columns
    all_competitors = np.append(sample_exp_dot.reshape(-1, 1), competition_exp_dot, 1) # 5 µs
    # normalize each row so that its shares sum to 1
    all_results = all_competitors/all_competitors.sum(axis=1)[:,None] #27.4 µs

    # average share per column across all preference rows
    return all_results.mean(axis=0) #20.6 µs

#these inputs to the above function stay the same
preferences = np.random.random((1000,100))
competition = np.array([np.random.randint(0,2,100), np.random.randint(0,2,100)])
competition_exp_dot = np.exp(preferences @ competition.T)

# the function is run with 20,000,000 variations of sample
population = np.random.randint(0,2,(20000000,100))
result = [compare(sample, competition_exp_dot, preferences) for sample in population]

Best answer

There are many ways to speed up simple array-programming code like this:

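You can use a tool like Numba, which will fuse some of the work and gives you some options for single-node multi-core parallelism.
You can use a tool like Dask to scale out across multiple cores on a single machine (which Numba can also do) or across a cluster.
You can use one of the GPU array libraries, such as Torch, TensorFlow, CuPy, or Jax, to run this on a GPU.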

You can also mix and match the above.
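
Whichever of these tools is chosen, it usually helps to first express the work over a batch of samples rather than one sample per Python call. The following is a rough sketch of that idea in plain NumPy. It is not part of the original answer: the name compare_batch, the batch size, and the reduced population size are assumptions made for illustration.

```python
import numpy as np

def compare_batch(samples, competition_exp_dot, preferences):
    """Hypothetical batched variant of compare; samples has shape (batch, n_features)."""
    n_prefs = preferences.shape[0]
    # (n_prefs, batch): exponentiated dot product per preference row and sample
    sample_exp_dot = np.exp(preferences @ samples.T)
    # (n_prefs, batch): the sample's term plus the sum of the competitors' terms
    totals = sample_exp_dot + competition_exp_dot.sum(axis=1)[:, None]
    # (batch,): mean share of each sample over all preference rows
    own_share = (sample_exp_dot / totals).mean(axis=0)
    # (batch, n_comp): mean share of each competitor, computed as one matrix product
    comp_share = (1.0 / totals).T @ competition_exp_dot / n_prefs
    return np.column_stack([own_share, comp_share])

# Stand-in inputs; the population is kept small here on purpose
rng = np.random.default_rng(0)
preferences = rng.random((1000, 100))
competition = rng.integers(0, 2, (2, 100))
competition_exp_dot = np.exp(preferences @ competition.T)
population = rng.integers(0, 2, (20_000, 100))

batch_size = 5_000  # assumed value; trades memory for fewer Python-level calls
result = np.concatenate([
    compare_batch(population[start:start + batch_size], competition_exp_dot, preferences)
    for start in range(0, len(population), batch_size)
])
```

Each row of result matches what compare returns for the corresponding sample. A batch function of this shape is also a convenient unit of work to hand to Dask per chunk, or to port to one of the GPU array libraries, though the best batch size will depend on available memory.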

Source: a similar question on Stack Overflow, "python - Parallelizing millions of iterations of a function": https://stackoverflow.com/questions/59516878/