Question
My ultimate goal is to accelerate a matrix-vector product in Python, possibly by using a CUDA-enabled GPU. The matrix A is about 15k x 15k and sparse (density ~0.05), the vector x is dense with 15k elements, and I am computing Ax. I have to perform this computation many times, so I want it to be as fast as possible.
My current non-GPU "optimization" is to represent A as a scipy.sparse.csc_matrix object and then simply compute A.dot(x). I was hoping to speed this up on a VM with a couple of NVIDIA GPUs attached, using only Python if possible (i.e. without writing out the detailed kernel functions by hand). I’ve succeeded in accelerating dense matrix-vector products using the cudamat library, but not the sparse case. There are a handful of suggestions for the sparse case online, such as using pycuda, scikit-cuda, or anaconda’s accelerate package, but there’s not a ton of information, so it’s hard to know where to begin.
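For reference, the CPU baseline described above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the matrix is random, the size `n` is reduced so the snippet runs quickly (the question's case would be `n = 15_000`), and the helper `time_matvec` is a hypothetical name introduced here. It also times a CSR copy of the same matrix, since row-oriented storage is often a better fit for A.dot(x) than CSC; whether it helps on a given machine is something to measure rather than assume.

```python
import time

import numpy as np
import scipy.sparse as sp

# Assumed, illustrative sizes; the question's matrix is about 15_000 x 15_000.
n = 2_000
density = 0.05

rng = np.random.default_rng(0)
A_csc = sp.random(n, n, density=density, format="csc", random_state=0)
A_csr = A_csc.tocsr()          # same matrix, row-oriented storage
x = rng.standard_normal(n)     # dense vector


def time_matvec(A, x, repeats=20):
    """Return the average wall-clock time in seconds of one A.dot(x) call."""
    start = time.perf_counter()
    for _ in range(repeats):
        A.dot(x)
    return (time.perf_counter() - start) / repeats


t_csc = time_matvec(A_csc, x)
t_csr = time_matvec(A_csr, x)
print(f"CSC: {t_csc * 1e3:.3f} ms per product")
print(f"CSR: {t_csr * 1e3:.3f} ms per product")
```

Any GPU approach would need to beat these numbers including the cost of moving x to the device and the result back, so a baseline like this is worth recording first.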
I don’t need greatly detailed instructions, but if anyone has solved this before and could provide a "big picture" roadmap for the simplest way of doing this, or has an idea of the sort of speedup a GPU-based sparse matrix-vector product would have over scipy’s sparse algorithms, that would be very helpful.
Recommended answer
As pointed out in comments, NVIDIA ship the cuSPARSE library which includes functions for sparse matrix products with dense vectors.
Numba now has Python bindings for the cuSPARSE library via the pyculib package.
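As a rough illustration of what a Python-only GPU version could look like, here is a minimal sketch. Note the assumptions: it uses CuPy's `cupyx.scipy.sparse` module rather than pyculib (a different library from the one named in this answer, swapped in here because its interface mirrors scipy.sparse), and the helper name `make_matvec` is hypothetical. If CuPy or a usable GPU is not available, the sketch falls back to the scipy product on the CPU so that it still runs.

```python
import numpy as np
import scipy.sparse as sp


def make_matvec(A):
    """Return a function computing A.dot(x), on the GPU when CuPy is usable.

    The matrix is copied to the device once, so repeated products only
    pay for transferring x and the result.
    """
    A_csr = sp.csr_matrix(A)  # convert once to CSR on the host
    try:
        import cupy as cp
        import cupyx.scipy.sparse as cusp

        A_gpu = cusp.csr_matrix(A_csr)  # host -> device copy of the matrix
    except Exception:
        # Assumption: any failure here means no usable CuPy/GPU; use scipy.
        return lambda x: A_csr.dot(x)

    def matvec(x):
        x_gpu = cp.asarray(x)                 # host -> device copy of x
        return cp.asnumpy(A_gpu.dot(x_gpu))   # product on device, copy back

    return matvec


# Small illustrative problem; the question's case is about 15_000 x 15_000.
A = sp.random(500, 500, density=0.05, format="csc", random_state=0)
x = np.random.default_rng(0).standard_normal(500)

matvec = make_matvec(A)
y = matvec(x)
print(np.allclose(y, A.dot(x)))
```

When the product is applied many times, keeping x and the result on the device between iterations (instead of copying on every call as above) is usually where most of the remaining time can be saved.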