本文介绍了CPU和GPU中的SVD速度的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在Matlab R2014a中测试svd,似乎没有CPU vs GPU加速.我正在使用GTX 460卡和Core 2 duo E8500.

I'm testing svd in Matlab R2014a and it seems that there is no CPU vs GPU speedup. I'm using a GTX 460 card and a Core 2 duo E8500.

这是我的代码:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

此外,每次运行的时间也不同,但是CPUGPU版本大致相同.为什么没有加速?

Also, the run times are different from run to run, but the CPU and GPU versions are about the same. Why there is no speedup?

这是一些测试

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.

推荐答案

通常,SVD很难并行化例程.您可以在 此处 使用高端Tesla卡时,加速效果不是很好.

Generally SVD is a difficult to paralellize routine. You can check here that with a high end Tesla card, the speedup is not very impressive.

您有一张GTX460卡- Fermi架构 .该卡针对游戏(单精度计算)而非HPC(双精度计算)进行了优化. 单精度/双精度吞吐率为12.因此,该卡具有873 GFLOPS SP/72 GFLOPS DP .在 此处进行检查> .

You have a GTX460 card - Fermi architecture. The card is optimized for gaming (single precision computations), not HPC (double precision computation). The Single Precision / Double Precision throughput ratio is 12. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.

因此,如果Md数组使用双精度元素,则对其的计算将相当慢.另外,很有可能在调用CPU例程时,所有CPU内核都会被利用,从而降低了在GPU上运行例程的可能收益.另外,在GPU运行中,您需要花费时间将缓冲区传输到设备上..

So if the Md array uses double precision elements, then the computation on it would be rather slow. Also there's a high chance that when calling the CPU routine, all CPU cores will get utilized, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay time for transferring the buffer to the device.

根据Divakar的建议,您可以使用Md = single(Md)将数组转换为单精度并再次运行基准测试.您可以尝试使用更大的数据集来查看是否有所更改.我希望在您的GPU上使用此例程不会有太大收获.

Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can try and go with a bigger dataset size to see if something changes. I don't expect to much gain for this routine on your GPU.

更新1:

发布结果后,我发现DP/SP的时间比例为2.在CPU端,这是正常的,因为在SSE寄存器中,double的值可以少2倍.但是,在GPU端只有2的比率意味着gpu代码没有充分利用SM内核-因为理论比率是12.换句话说,我希望对于与DP相比优化的代码.似乎并非如此.

After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because you can fit 2 times less double values in SSE registers. However, a ratio of only 2 on the GPU side means that the gpu code does not make best use of the SM cores - because the theoretical ratio is 12. In other words, I would have expected much better SP performance for an optimized code, compared to DP. It seems that this is not the case.

这篇关于CPU和GPU中的SVD速度的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-03 09:32