CUDA float2 coalescing
Problem description
I'm having trouble coalescing reads when using the float2 datatype in CUDA.
I've tried to make a simple example to run in the Visual Profiler, but it always reports noncoalesced reads. If anyone could shed some light on this, I would be really grateful. Thanks.
#include <stdio.h>
#include <cuda_runtime_api.h>

__global__ void kernel(float2 *in, float2 *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float2 d = in[idx];
    d.x = 100.f;
    out[idx] = d;
}

int main() {
    const int dataSize = 32;
    float2 *in;
    cudaMalloc((void**)&in, dataSize * sizeof(float2));
    float2 *out;
    cudaMalloc((void**)&out, dataSize * sizeof(float2));
    kernel<<<1, 32>>>(in, out);
    return 0;
}
Recommended answer
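I asked this question on the NVIDIA forums. It turns out that vector loads are not optimized in debug mode: in a debug build the compiler may split each float2 load into two separate 32-bit loads, which the profiler then reports as noncoalesced. Building without debug flags should give one 64-bit load per thread again.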
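To check this yourself, it may help to profile a version of the test program that initializes its input, waits for the kernel to finish, and verifies the result. The following is only a sketch under a few assumptions: the `CHECK` macro is a hypothetical helper that is not part of the original post, the `const` qualifier on `in` is my addition, and the build commands in the comments assume a standard `nvcc` toolchain.

```cuda
// Build WITHOUT device debug info before profiling, e.g.:
//   nvcc -O3 -o float2_test float2_test.cu
// For comparison, a debug build (nvcc -G ...) disables device optimizations.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Same kernel as in the question: each thread copies one float2
// and overwrites its .x component.
__global__ void kernel(const float2 *in, float2 *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float2 d = in[idx];
    d.x = 100.f;
    out[idx] = d;
}

int main() {
    const int dataSize = 32;
    float2 host[dataSize];
    for (int i = 0; i < dataSize; ++i) host[i] = make_float2(0.f, (float)i);

    float2 *in, *out;
    CHECK(cudaMalloc((void**)&in, dataSize * sizeof(float2)));
    CHECK(cudaMalloc((void**)&out, dataSize * sizeof(float2)));
    CHECK(cudaMemcpy(in, host, sizeof(host), cudaMemcpyHostToDevice));

    kernel<<<1, dataSize>>>(in, out);
    CHECK(cudaGetLastError());       // catches launch-time errors
    CHECK(cudaDeviceSynchronize());  // waits for the kernel to finish

    CHECK(cudaMemcpy(host, out, sizeof(host), cudaMemcpyDeviceToHost));
    for (int i = 0; i < dataSize; ++i) {
        if (host[i].x != 100.f || host[i].y != (float)i) {
            fprintf(stderr, "mismatch at index %d\n", i);
            return EXIT_FAILURE;
        }
    }
    printf("ok\n");

    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

If the profiler still reports noncoalesced reads for the optimized build, one option is to dump the generated machine code with `cuobjdump -sass float2_test` and look for whether the `float2` access shows up as a single 64-bit load or as two 32-bit loads. The exact instruction names depend on the GPU architecture.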