cuda 3D texture interpolation

Problem description

I am trying to interpolate a 3D array with cuda using texture memory with the code below. I have plotted the input f[x][y][z] for a fixed z value, then I interpolate my array for x and y and plot it again, and the two plots look totally different. I also tried this in 1 dimension (with different code) and there it works, so I assume that there must be an error in my code. Can you help me find it?

    #include <cuda_runtime.h>
    #include <cuda.h>
    #include <iostream>
    #include <fstream>

    typedef float myType;
    texture<myType, 3> tex;
    cudaArray *d_volumeArray = 0;

    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess)
        {
            fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) { getchar(); exit(code); }
        }
    }

    __global__ void getInterpolatedFunctionValue(double x, double y, double z){
        //http://stackoverflow.com/questions/10643790/texture-memory-tex2d-basics
        printf("%f \n", tex3D(tex, x+0.5f, y+0.5f, z+0.5f));
    }

    using namespace std;

    int main(){
        int nx=100, ny=100, nz=10;
        myType f[nx][ny][nz];
        for(int i=0; i<nx; i++)
            for(int j=0; j<ny; j++)
                for(int k=0; k<nz; k++){
                    f[i][j][k] = sin(i/10.0)*cos(j/10.0)+k;
                }

        const cudaExtent extend = make_cudaExtent(nx, ny, nz);
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<myType>();
        gpuErrchk(cudaMalloc3DArray(&d_volumeArray, &channelDesc, extend));

        cudaMemcpy3DParms copyParams = {0};
        copyParams.srcPtr = make_cudaPitchedPtr((void*)f, extend.width*sizeof(myType), extend.width, extend.height);
        copyParams.dstArray = d_volumeArray;
        copyParams.extent = extend;
        copyParams.kind = cudaMemcpyHostToDevice;
        gpuErrchk(cudaMemcpy3D(&copyParams));

        tex.normalized = false;
        tex.filterMode = cudaFilterModeLinear;
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        tex.addressMode[2] = cudaAddressModeClamp;
        gpuErrchk(cudaBindTextureToArray(tex, d_volumeArray, channelDesc));

        for(int i=0; i<nx*2; i++){
            for(int j=0; j<ny*2; j++){
                getInterpolatedFunctionValue <<<1, 1>>> (float(i)/2, float(j)/2, 3.0);
                gpuErrchk(cudaPeekAtLastError());
                gpuErrchk(cudaDeviceSynchronize());
            }
        }

        gpuErrchk(cudaUnbindTexture(tex));
        gpuErrchk(cudaFreeArray(d_volumeArray));
        return 0;
    }

Update:

@Robert Crovella: In my opinion you can see my problem better if you plot the output and compare the interpolation with the original. I will add the plots below. The integer division was not planned and I fixed it, but that was not the reason for my problem.

@JackOLantern: I know this post, and your code there was the template for my version. But it seems to me that it does not work as I would have expected.

Since I do not have enough reputation to upload images here, I will link the two images. Figure [1] shows a plot of my input values for a fixed z value, and figure [2] shows the interpolation done by my code. The original data are in the range [2,4], while the interpolated values are in [-2,10] and the structure is totally different. I hope this helps in understanding my problem better.

[1] http://www.directupload.net/file/d/3731/9j45kzs4_png.htm
[2] http://www.directupload.net/file/d/3731/qteqvclr_png.htm

Recommended answer

The principal issue seems to be that you have your underlying texture storage index order reversed. The x dimension is the rapidly-varying matrix dimension (the 3rd subscript, in this case) and the rapidly-varying thread dimension within a warp (although that is irrelevant for this example).
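To make the index-order point concrete, here is a small host-side C++ sketch (not part of the original answer) that compares linear memory offsets. The helper names `texel_offset` and `c_array_offset` are hypothetical, introduced only for illustration. It assumes a tightly packed source buffer, as in the `make_cudaPitchedPtr` call above, where x runs fastest, then y, then z.

```cpp
#include <cstddef>

// Offset (in elements) at which a 3D copy with extent (nx, ny, nz) expects
// the texel (x, y, z) in a tightly packed host buffer: x varies fastest.
// Hypothetical helper, for illustration only.
inline std::size_t texel_offset(std::size_t x, std::size_t y, std::size_t z,
                                std::size_t nx, std::size_t ny) {
    return (z * ny + y) * nx + x;
}

// Offset (in elements) of a[i][j][k] for a C array declared as a[d0][d1][d2]:
// the last subscript varies fastest. Hypothetical helper, for illustration only.
inline std::size_t c_array_offset(std::size_t i, std::size_t j, std::size_t k,
                                  std::size_t d1, std::size_t d2) {
    return (i * d1 + j) * d2 + k;
}
```

With nx=100, ny=100, nz=10, the element written as `f[i][j][k]` in an array declared `f[nx][ny][nz]` sits at `c_array_offset(i, j, k, ny, nz)`, which in general differs from `texel_offset(i, j, k, nx, ny)`. Declaring the array as `f[nz][ny][nx]` and writing `f[k][j][i]` gives `c_array_offset(k, j, i, ny, nx)`, which matches the texel offset.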
In your code, I think the following summarizes the necessary changes:

    myType f[nz][ny][nx];
    for(int i=0; i<nx; i++)
        for(int j=0; j<ny; j++)
            for(int k=0; k<nz; k++){
                f[k][j][i] = sin(i/10.0f)*cos(j/10.0f)+k;
            }

There's quite a bit more that can be said about texturing with linear interpolation, so if you dig into it further, I suggest getting a solid understanding of the material presented here. For linear filtering with non-normalized coordinates, the range of interpolation (not including the clamp regions) will be of dimension N-1 for a texture with N data points in a particular direction. This relationship would normally be handled by careful application of the table lookup equation in the previously linked material. For your example, however, to make the minimum number of changes, we can dispense with that and simply be careful about how we compute our expected functional value as well as the x, y, and z values passed to the texture lookup.

Here's an example with the primary modification being the storage order.
Since I didn't want to plot the data, I chose to modify your code to inject validation checking.

    #include <iostream>
    #include <fstream>
    #include <cstdio>   // printf, fprintf
    #include <cstdlib>  // exit
    #include <cmath>    // sin, cos, fabs

    #define NX 100
    #define NY 100
    #define NZ 10
    #define TOL 0.003f   // allowed |texture result - expected value|
    #define I_FACT 2     // number of sample points per data-point interval

    typedef float myType;
    texture<myType, 3> tex;
    cudaArray *d_volumeArray = 0;

    // Fetch one interpolated value; the +0.5f shifts to texel centers.
    __global__ void getInterpolatedFunctionValue(myType x, myType y, myType z, myType *result){
        *result = tex3D(tex, x+0.5f, y+0.5f, z+0.5f);
    }

    #define cudaCheckErrors(msg) \
        do { \
            cudaError_t __err = cudaGetLastError(); \
            if (__err != cudaSuccess) { \
                fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                    msg, cudaGetErrorString(__err), \
                    __FILE__, __LINE__); \
                fprintf(stderr, "*** FAILED - ABORTING\n"); \
                exit(1); \
            } \
        } while (0)

    using namespace std;

    int main(){
        int nx=NX, ny=NY, nz=NZ;
        // Storage order: x is the fastest-varying (last) subscript.
        myType f[nz][ny][nx];
        for(int ix=0; ix<nx; ix++)
            for(int iy=0; iy<ny; iy++)
                for(int iz=0; iz<nz; iz++){
                    f[iz][iy][ix] = sin(ix/(float)10)*cos(iy/(float)10)+iz;
                }

        const cudaExtent extent = make_cudaExtent(nx, ny, nz);
        cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<myType>();
        cudaMalloc3DArray(&d_volumeArray, &channelDesc, extent);
        cudaCheckErrors("cudaMalloc3D error");

        cudaMemcpy3DParms copyParams = {0};
        copyParams.srcPtr = make_cudaPitchedPtr((void*)f, extent.width*sizeof(myType), extent.width, extent.height);
        copyParams.dstArray = d_volumeArray;
        copyParams.extent = extent;
        copyParams.kind = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&copyParams);
        cudaCheckErrors("cudaMemcpy3D fail");

        tex.normalized = false;
        tex.filterMode = cudaFilterModeLinear;
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        tex.addressMode[2] = cudaAddressModeClamp;
        cudaBindTextureToArray(tex, d_volumeArray, channelDesc);
        cudaCheckErrors("bind fail");

        myType my_result;
        myType *d_result, *h_result = &my_result;
        cudaMalloc(&d_result, sizeof(myType));

        // Sample only the interpolation range (N-1 intervals per direction)
        // and compare each fetch against the analytic function value.
        for(int i=0; i<(nx-1)*I_FACT; i++)
            for(int j=0; j<(ny-1)*I_FACT; j++)
                for (int k = 0; k <(nz-1)*I_FACT; k++){
                    myType test_val = sin(i/(float)(10*I_FACT))*cos(j/(float)(10*I_FACT)) + k/(float)(I_FACT);
                    getInterpolatedFunctionValue <<<1, 1>>> (i/(float)I_FACT, j/(float)I_FACT, k/(float)I_FACT, d_result);
                    cudaDeviceSynchronize();
                    cudaCheckErrors("kernel fail");
                    cudaMemcpy(h_result, d_result, sizeof(myType), cudaMemcpyDeviceToHost);
                    cudaCheckErrors("cudaMemcpy fail");
                    if (fabs(my_result - test_val) > TOL) {
                        printf("mismatch at x:%f, y:%f, z:%f, was:%f, should be: %f\n", i/(float)I_FACT, j/(float)I_FACT, k/(float)I_FACT, my_result, test_val);
                        return 1;
                    }
                }
        printf("success!\n");

        cudaUnbindTexture(tex);
        cudaCheckErrors("unbind fail");
        cudaFreeArray(d_volumeArray);
        cudaCheckErrors("free fail");
        return 0;
    }

This code seems to run correctly for me; it takes about 30 seconds on a K40c with CUDA 6.5. In the future, rather than expecting others to plot your data to determine validity, it would be helpful if you built validation checking into your request for help. This makes it easier for others to help you, and it also explicitly declares the nature of the results you are expecting.

The tolerance built into the code above is probably not adequate to cover a wide range of cases. The texture hardware has coefficients that are stored with 8 bits of fractional precision (refer to the previous link), and in the 3D case you'll be multiplying 3 of these coefficients together. So at most the tolerance probably needs to be perhaps 0.005 times the maximum value of the data stored in the texture, but I haven't done a careful tolerance analysis.

Increasing the I_FACT parameter will dramatically increase the runtime of the test code above.
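The linear-filtering behaviour discussed above (non-normalized coordinates, the +0.5 offset, clamp addressing, and interpolation weights with 8 fractional bits) can be emulated on the host to see where the N-1 interpolation range and the tolerance come from. The following C++ sketch is a one-dimensional reference, not the answer's code: the function name `fetch_linear_1d` is hypothetical, and rounding the weight to the nearest 1/256 is an assumption about how the fraction is quantized. In 3D the same lookup would apply per axis, with the three weights multiplied together.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Host-side reference for a 1D linearly filtered fetch with non-normalized
// coordinates and clamp addressing. Hypothetical helper, for illustration only.
// Assumes T is non-empty.
inline float fetch_linear_1d(const std::vector<float>& T, float x) {
    const int n = static_cast<int>(T.size());
    const float xb = x - 0.5f;            // shift so texel centers sit at integers
    const float fl = std::floor(xb);
    // Interpolation weight kept with 8 fractional bits.
    // Assumption: round to nearest; the exact hardware rounding may differ.
    const float alpha = std::round((xb - fl) * 256.0f) / 256.0f;
    // Clamp both neighbor indices into [0, n-1].
    const int i0 = std::min(std::max(static_cast<int>(fl), 0), n - 1);
    const int i1 = std::min(std::max(static_cast<int>(fl) + 1, 0), n - 1);
    return (1.0f - alpha) * T[i0] + alpha * T[i1];
}
```

For example, with T = {0, 10, 20, 30}, calling `fetch_linear_1d(T, 1.25f + 0.5f)` interpolates a quarter of the way between T[1] and T[2] and returns 12.5. Coordinates of the form x + 0.5 with x in [0, N-1] cover the N-1 intervals between data points; outside that range the fetch returns the clamped edge value.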