Problem description
I know it sounds weird, but here is my scenario:
I need to do a matrix-matrix multiplication (A(n*k)*B(k*n)), but I only need the diagonal elements of the output matrix. I searched the cublas library and didn't find any level 2 or level 3 function that can do that. So I decided to distribute each row of A and each column of B across CUDA threads. For each thread (idx), I need to calculate the dot product "A[idx,:]*B[:,idx]" and save it as the corresponding diagonal output. Since this dot product also takes some time, I wonder whether I could call a cublas function (say cublasSdot) from within the kernel to compute it.
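The one-thread-per-diagonal-element scheme described above can be sketched without any cublas call. This is only a minimal illustration, not the asker's actual code: the names `diag_of_product_kernel` and `diag_of_product` are hypothetical, the column-major layout is an assumption (chosen because it is the cuBLAS convention), and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per diagonal element: d[i] = sum_j A[i,j] * B[j,i].
// ASSUMPTION: column-major storage (the cuBLAS convention), so
// A[i,j] is A[i + j*n] (A is n x k) and B[j,i] is B[j + i*k] (B is k x n).
__global__ void diag_of_product_kernel(const float* A, const float* B,
                                       float* d, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;  // guard the last, partially filled block

    float acc = 0.0f;
    for (int j = 0; j < k; ++j)
        acc += A[i + (size_t)j * n] * B[j + (size_t)i * k];
    d[i] = acc;
}

// Hypothetical host wrapper: copies A and B to the device, launches the
// kernel, and returns the n diagonal values. CUDA error checks are omitted.
std::vector<float> diag_of_product(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   int n, int k)
{
    const size_t bytesAB = sizeof(float) * (size_t)n * k;
    const size_t bytesD  = sizeof(float) * (size_t)n;

    float *dA = nullptr, *dB = nullptr, *dD = nullptr;
    cudaMalloc(&dA, bytesAB);
    cudaMalloc(&dB, bytesAB);
    cudaMalloc(&dD, bytesD);
    cudaMemcpy(dA, A.data(), bytesAB, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytesAB, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    diag_of_product_kernel<<<blocks, threads>>>(dA, dB, dD, n, k);

    std::vector<float> d(n);
    // This blocking copy runs after the kernel on the default stream.
    cudaMemcpy(d.data(), dD, bytesD, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dD);
    return d;
}
```

With this layout, neighbouring threads read neighbouring elements of A, while their reads of B are k elements apart. If B can be stored transposed, both access patterns become contiguous across threads.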
If I missed a cublas function that achieves this directly (calculating only the diagonal elements of a matrix-matrix product), this question can be discarded.
Recommended answer
Yes, it can (up to, but not including, CUDA 10).
"The language interface and Device Runtime API available in CUDA C/C++ is a subset of the CUDA Runtime API available on the Host. The syntax and semantics of the CUDA Runtime API have been retained on the device in order to facilitate ease of code reuse for API routines that may run in either the host or device environments. A kernel can also call GPU libraries such as CUBLAS directly without needing to return to the CPU." Source
Here you can see a matrix-vector multiplication using CUDA and the CUBLAS library function cublasSgemv.
Bear in mind, however, that there is no longer a device CUBLAS capability in CUDA 10. To quote Robert_Crovella:
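"The current recommendation would be to see if CUTLASS 2 will help (it is mostly focused on GEMM-related activities). If not, write your own code to perform the function, or call cublas from host code."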
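For the diagonal-only problem in the question, the last option in that quote (calling cublas from host code) might look like the sketch below. It issues one `cublasSdot` call per diagonal element and uses the strides to pick out a row of A and a column of B. The function name `diag_of_product_cublas` is hypothetical, column-major storage is assumed, and status checks are omitted. Launching n separate dot products is unlikely to be efficient for large n, so treat this as an illustration of the call pattern rather than a tuned solution.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: diagonal of A*B using host-side cublasSdot calls.
// ASSUMPTION: column-major storage, A is n x k (lda = n), B is k x n (ldb = k).
// cuBLAS and CUDA status codes are not checked, for brevity.
std::vector<float> diag_of_product_cublas(const std::vector<float>& A,
                                          const std::vector<float>& B,
                                          int n, int k)
{
    const size_t bytesAB = sizeof(float) * (size_t)n * k;

    float *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, bytesAB);
    cudaMalloc(&dB, bytesAB);
    cudaMemcpy(dA, A.data(), bytesAB, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytesAB, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    std::vector<float> d(n);
    for (int i = 0; i < n; ++i) {
        // Row i of A starts at dA + i with stride n between elements;
        // column i of B starts at dB + i*k and is contiguous (stride 1).
        // In the default (host) pointer mode the result is written to
        // host memory and the call returns once it is available.
        cublasSdot(handle, k,
                   dA + i, n,
                   dB + (size_t)i * k, 1,
                   &d[i]);
    }

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    return d;
}
```

Building this requires linking against cuBLAS, for example with `nvcc file.cu -lcublas`.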
Nonetheless, there are currently several implementations of matrix-vector multiplication available online, for instance 1, 2, among others.