Question
I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.
We need to do forward FFTs on a bunch of small planes of size 256x256. I'm running the FFTs on features with a depth of 32, so I use batched mode to do 32 FFTs per function call. Typically, I make 8 FFT function calls of size 256x256, each with a batch size of 32.
In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};

//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL

float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version

fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2,       //rank
                                                 n,       //dims -- this doesn't include zero-padding
                                                 depth,   //howmany
                                                 h_in,    //in
                                                 inembed, //inembed
                                                 depth,   //istride
                                                 1,       //idist
                                                 h_freq,  //out
                                                 onembed, //onembed
                                                 depth,   //ostride
                                                 1,       //odist
                                                 FFTW_PATIENT /*flags*/);

double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
    fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
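(A note on how this benchmark is structured: the FFTW_PATIENT plan is created once, before the timer starts, and only the executions are timed. fftwf_execute_dft_r2c belongs to FFTW's new-array execute interface, which the FFTW manual describes as thread-safe, so calling it on the same plan from several OpenMP threads is legitimate.)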
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.
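The cuFFT listing itself did not survive on this page. Below is a rough reconstruction of what the original (slow) version presumably looked like, pieced together from the FFTW setup above and the "WAS:" comments in the answer below. The stream setup mentioned in the text is omitted; the d_freq name, the execute loop, and the final synchronize are my assumptions, while CHECK_CUDART/CHECK_CUFFT, forwardPlan, d_in and read_timer() come from the code elsewhere on this page.

//Hypothetical reconstruction -- not the poster's verbatim code
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1);
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)};

float* d_in;
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
cufftComplex* d_freq = reinterpret_cast<cufftComplex*>(d_in); //in-place, as in the FFTW version

cufftHandle forwardPlan;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
                          2,          //rank
                          n,          //dimensions = {nRows, nCols}
                          inembed,    //inembed
                          depth,      //istride -- interleaved layout, matching the FFTW code
                          1,          //idist
                          onembed,    //onembed
                          depth,      //ostride
                          1,          //odist
                          CUFFT_R2C,  //cufftType
                          depth       /*batch*/));

double start = read_timer();
for(int i=0; i<nIter; i++){
    CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize()); //assume the reported 21.7ms includes a final sync
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);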
Other notes
- In the GPU version, cudaMemcpys between the CPU and GPU are not included in my computation time (a sketch of this style of timing follows this list).
- The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (10 experiments in total, so 80 FFT function calls).
- I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.
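For reference, here is a minimal sketch of how GPU-only timing of this kind can be done with CUDA events, so that host-device copies stay outside the measured interval. This is an assumption about the methodology rather than the poster's actual harness; it reuses forwardPlan, d_in and the CHECK_CUDART/CHECK_CUFFT macros from the code on this page.

#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

//Sketch only: time nIter in-place R2C executions with CUDA events, excluding
//any host<->device copies. Assumes forwardPlan and d_in were set up as in the
//question, and that CHECK_CUDART/CHECK_CUFFT are the usual error-check macros.
static float timeBatchedFFTs(cufftHandle forwardPlan, float* d_in, int nIter){
    cufftComplex* d_freq = reinterpret_cast<cufftComplex*>(d_in); //in-place
    cudaEvent_t startEvt, stopEvt;
    CHECK_CUDART(cudaEventCreate(&startEvt));
    CHECK_CUDART(cudaEventCreate(&stopEvt));

    CHECK_CUDART(cudaEventRecord(startEvt, 0));
    for(int i=0; i<nIter; i++){
        CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
    }
    CHECK_CUDART(cudaEventRecord(stopEvt, 0));
    CHECK_CUDART(cudaEventSynchronize(stopEvt));

    float ms = 0.0f;
    CHECK_CUDART(cudaEventElapsedTime(&ms, startEvt, stopEvt));
    CHECK_CUDART(cudaEventDestroy(startEvt));
    CHECK_CUDART(cudaEventDestroy(stopEvt));
    return ms; //GPU time for the FFT calls only
}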
Answer
The question might be outdated, but here is a possible explanation for the slowness of cuFFT.
When structuring your data for cufftPlanMany, the data arrangement is not very GPU-friendly. Indeed, using an istride and ostride of 32 means no data read is coalesced. The read pattern (see cuFFT's advanced data layout documentation) is:
    input[b * idist + (x * inembed[1] + y) * istride]
    output[b * odist + (x * onembed[1] + y) * ostride]
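To make that concrete, here is a small standalone illustration (mine, not part of the original answer) that plugs the question's numbers into the formula above: with the interleaved layout (istride = depth = 32, idist = 1), neighbouring y elements are 32 floats (128 bytes) apart, while with the planar layout used below they are contiguous.

#include <stdio.h>

//Illustration only: float offset of real element (x, y) in batch b under
//cuFFT's advanced data layout, input[b*idist + (x*inembed[1] + y)*istride].
static long offsetOf(int b, int x, int y, int inembed1, int istride, long idist){
    return (long)b*idist + ((long)x*inembed1 + y)*istride;
}

int main(void){
    int nRows = 256, nCols = 256, depth = 32;
    int cols_padded = 2*(nCols/2 + 1); //258 floats per padded row
    //Interleaved layout from the question: istride = depth, idist = 1
    long gap_interleaved = offsetOf(0, 10, 21, cols_padded, depth, 1)
                         - offsetOf(0, 10, 20, cols_padded, depth, 1);
    //Planar layout from the answer: istride = 1, idist = nRows*cols_padded
    long gap_planar = offsetOf(0, 10, 21, cols_padded, 1, (long)nRows*cols_padded)
                    - offsetOf(0, 10, 20, cols_padded, 1, (long)nRows*cols_padded);
    printf("interleaved: adjacent y elements are %ld floats apart\n", gap_interleaved); //32
    printf("planar:      adjacent y elements are %ld floats apart\n", gap_planar);      //1
    return 0;
}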
With an istride/ostride of 32, reads are therefore very unlikely to be coalesced or optimal (b above is the batch number). Here are the changes I applied:
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
                          2,                  //rank
                          n,                  //dimensions = {nRows, nCols}
                          inembed,            //inembed
                          1,                  // WAS: depth, //istride
                          nRows*cols_padded,  // WAS: 1, //idist
                          onembed,            //onembed
                          1,                  // WAS: depth, //ostride
                          nRows*cols_padded,  // WAS: 1, //odist
                          CUFFT_R2C,          //cufftType
                          depth               /*batch*/));
Running this, I got an unspecified launch failure caused by an illegal memory access. You might want to change the memory allocation: cufftComplex is two floats, so you need a x2 in your allocation size (looks like a typo).
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));
When running it this way, I got an 8x performance improvement on my card.