Problem Description
I have written code using numpy that takes an array of size (m x n)... The rows (m) are individual observations comprised of (n) features... and creates a square distance matrix of size (m x m). This distance matrix is the distance of a given observation from all other observations. E.g. row 0 column 9 is the distance between observation 0 and observation 9.
import numpy as np
#import cupy as np

def l1_distance(arr):
    return np.linalg.norm(arr, 1)

X = np.random.randint(low=0, high=255, size=(700, 4096))
distance = np.empty((700, 700))
for i in range(700):
    for j in range(700):
        distance[i, j] = l1_distance(X[i, :] - X[j, :])
I attempted this on the GPU with CuPy by uncommenting the second import statement, but the double for loop is obviously drastically inefficient: NumPy takes approximately 6 seconds, while CuPy takes 26 seconds. I understand why, but it's not immediately clear to me how to parallelize this process.
I know I'm going to need to write a reduction kernel of some sort, but I can't think of how to construct one CuPy array from iterative operations on the elements of another array.
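For clarity, the per-pair quantity the loop computes is the L1 (Manhattan) distance: np.linalg.norm(v, 1) on a 1-D vector is the sum of absolute values. A minimal sketch (shapes shrunk here from (700, 4096) so the O(m²) loop finishes quickly) confirms the loop produces a symmetric matrix with a zero diagonal, as any distance matrix must:

```python
import numpy as np

# Shapes shrunk from the question's (700, 4096) so the double loop stays fast.
X = np.random.randint(low=0, high=255, size=(40, 64))
m = X.shape[0]

distance = np.empty((m, m))
for i in range(m):
    for j in range(m):
        # ord=1 on a 1-D vector: sum of absolute differences (Manhattan distance)
        distance[i, j] = np.linalg.norm(X[i, :] - X[j, :], 1)

# Basic sanity checks: symmetric, zero diagonal, non-negative.
assert np.allclose(distance, distance.T)
assert np.allclose(np.diag(distance), 0.0)
assert (distance >= 0).all()
```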
Recommended Answer
Using broadcasting, CuPy takes 0.10 seconds on an A100 GPU, compared to 6.6 seconds for NumPy:
for i in range(700):
    distance[i, :] = np.abs(np.broadcast_to(X[i, :], X.shape) - X).sum(axis=1)
This vectorizes the inner loop: the distances from one vector to all other vectors are computed in parallel.
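A minimal NumPy sketch (shapes shrunk for speed; assumed smaller than the question's) showing the row-wise broadcast is equivalent to the double loop. Note that the explicit np.broadcast_to is optional, since NumPy broadcasts an (n,) row against an (m, n) array automatically; a fully vectorized (m, m, n) variant also works, but only when the temporary fits in memory (700 × 700 × 4096 float64 would need roughly 16 GB, so the single loop over rows is a reasonable compromise):

```python
import numpy as np

# Small shapes so the O(m*m*n) check below stays cheap; the question uses (700, 4096).
X = np.random.randint(low=0, high=255, size=(50, 64)).astype(np.float64)
m = X.shape[0]

# Naive double loop from the question.
naive = np.empty((m, m))
for i in range(m):
    for j in range(m):
        naive[i, j] = np.abs(X[i, :] - X[j, :]).sum()

# Row-at-a-time broadcasting (the answer's approach). X[i, :] - X already
# broadcasts the (n,) row against the (m, n) array, so broadcast_to is optional.
rowwise = np.empty((m, m))
for i in range(m):
    rowwise[i, :] = np.abs(X[i, :] - X).sum(axis=1)

# Fully vectorized: builds one (m, m, n) temporary, feasible only when it fits in RAM.
full = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)

assert np.allclose(naive, rowwise)
assert np.allclose(naive, full)
```

The same row-wise code runs unchanged under CuPy (with `import cupy as np`), which is where the reported speedup comes from.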