Problem description
I'm testing svd in Matlab R2014a and there seems to be no CPU vs GPU speedup. I'm using a GTX 460 card and a Core 2 Duo E8500.
Here is my code:
%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc
Also, the run times vary from run to run, but the CPU and GPU versions take about the same time. Why is there no speedup?
Here are some tests:
for i=1:10
clear;
m= 10000;
n= 100;
%host
Mh= rand(m,n);
tic
[Uh,Sh,Vh]= svd(Mh);
toc
%device
Md = gpuArray.rand(m,n);
tic
[Ud,Sd,Vd]= svd(Md);
toc
end
>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.
for i=1:10
clear;
m= 10000;
n= 100;
%host
Mh= rand(m,n,'single');
tic
[Uh,Sh,Vh]= svd(Mh);
toc
%device
Md = gpuArray.rand(m,n,'single');
tic
[Ud,Sd,Vd]= svd(Md);
toc
end
>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.
Recommended answer
Generally, SVD is a routine that is difficult to parallelize. You can check here that even with a high-end Tesla card, the speedup is not very impressive.
You have a GTX 460 card (Fermi architecture). The card is optimized for gaming (single-precision computation), not HPC (double-precision computation). Its single-precision to double-precision throughput ratio is 12: the card delivers 873 GFLOPS SP versus 72 GFLOPS DP. Check here.
So if the Md array holds double-precision elements, the computation on it will be rather slow. There is also a high chance that the CPU routine uses all CPU cores, which reduces the possible gain from running the routine on the GPU. In addition, the GPU run pays for transferring the buffer to the device.
Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can also try a bigger dataset to see whether anything changes. I don't expect too much gain for this routine on your GPU.
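The benchmark suggested above could be set up roughly as follows. This is a minimal sketch, not the original poster's script: the matrix size is reduced for illustration, the transfer to the device is deliberately placed inside the timed region, and wait(gpuDevice) is added because tic/toc around GPU calls may otherwise stop the clock before the device has finished. The variable names (hasGpu, tCpu, tGpu, relErr) are made up for this example.

```matlab
% Sketch: CPU vs GPU svd in single precision, with the host-to-device
% transfer included in the GPU timing. Sizes are illustrative only.
m = 2000; n = 100;                 % smaller than the 10000-by-100 case in the question
Mh = rand(m, n, 'single');         % host matrix in single precision

tic
Sh = svd(Mh);                      % singular values only, computed on the CPU
tCpu = toc;

% Assumption: Parallel Computing Toolbox may be absent, so probe for a GPU safely.
try
    hasGpu = gpuDeviceCount > 0;
catch
    hasGpu = false;
end

if hasGpu
    dev = gpuDevice;               % currently selected GPU device
    tic
    Md = gpuArray(Mh);             % transfer cost is inside the timed region
    Sd = svd(Md);
    wait(dev);                     % block until the GPU has finished its work
    tGpu = toc;
    relErr = norm(Sh - gather(Sd)) / norm(Sh);
    fprintf('CPU %.3f s, GPU incl. transfer %.3f s, rel. diff %.2e\n', ...
            tCpu, tGpu, relErr);
else
    fprintf('CPU %.3f s (no supported GPU found)\n', tCpu);
end
```

Timing the transfer separately from the svd call would show how much of the GPU time is spent moving data rather than computing.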
Update 1:
After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because an SSE register holds half as many double values as single values. However, a ratio of only 2 on the GPU side means that the GPU code does not make the best use of the SM cores, because the theoretical ratio is 12. In other words, for optimized code I would have expected much better SP performance relative to DP. That does not seem to be the case.
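The ratios discussed here can be recomputed from the timings posted in the question. The sketch below assumes that, within each loop iteration, the first printed time is the CPU run and the second is the GPU run, which is the order of the tic/toc pairs in the posted loop. The peak figures (873 and 72 GFLOPS) are the ones quoted in this answer, and the variable names are made up for this example.

```matlab
% Timings (seconds) copied from the question's output, in printed order.
tDouble = [43.124130 43.842277 42.993283 44.293410 42.924541 43.730343 ...
           43.125938 43.645095 43.492129 43.459277 43.327012 44.040959 ...
           43.242291 43.390881 43.275379 43.408705 43.320387 44.232156 ...
           42.984002 43.702430];
tSingle = [21.140301 21.334361 21.275991 21.582602 21.093408 21.305413 ...
           21.482931 21.327842 21.120969 21.701752 21.117268 21.384318 ...
           21.359225 21.911570 21.086259 21.263040 21.472175 21.561370 ...
           21.330314 21.546260];

% Assumption: odd entries are CPU runs, even entries are GPU runs.
cpuRatio = mean(tDouble(1:2:end)) / mean(tSingle(1:2:end));
gpuRatio = mean(tDouble(2:2:end)) / mean(tSingle(2:2:end));

% Theoretical SP/DP throughput ratio from the peak GFLOPS quoted in the answer.
peakRatio = 873 / 72;

fprintf('DP/SP time ratio: CPU %.2f, GPU %.2f (peak SP/DP ratio %.2f)\n', ...
        cpuRatio, gpuRatio, peakRatio);
```

Both measured ratios come out close to 2, far from the peak ratio of about 12, which is the gap this update points out.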