Bad data from cudaMemcpy2D

Question

If this sort of question has already been asked, I apologize; please link me to the thread!

Anyhow, I am new to CUDA (I'm coming from OpenCL) and wanted to try generating an image with it. The relevant CUDA code is:

__global__
void mandlebrot(uint8_t *pixels, size_t pitch, unsigned long width, unsigned long height) {
  unsigned block_size = blockDim.x;
  uint2 location = {blockIdx.x*block_size, blockIdx.y*block_size};
  ulong2 pixel_location = {threadIdx.x, threadIdx.y};
  ulong2 real_location = {location.x + pixel_location.x, location.y + pixel_location.y};
  if (real_location.x >= width || real_location.y >= height)
    return;
  uint8_t *row = (uint8_t *)((char *)pixels + real_location.y * pitch);
  row[real_location.x * 4+0] = 0;
  row[real_location.x * 4+1] = 255;
  row[real_location.x * 4+2] = 0;
  row[real_location.x * 4+3] = 255;
}

cudaError_t err = cudaSuccess;

#define CUDA_ERR(e) \
  if ((err = e) != cudaSuccess) { \
    fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err)); \
    exit(-1); \
  }


int main(void) {
  ulong2 dims = {1000, 1000};
  unsigned long block_size = 500;
  dim3 threads_per_block(block_size, block_size);
  dim3 remainders(dims.x % threads_per_block.x, dims.y % threads_per_block.y);
  dim3 blocks(dims.x / threads_per_block.x + (remainders.x == 0 ? 0 : 1), dims.y / threads_per_block.y + (remainders.y == 0 ? 0 : 1));

  size_t pitch;
  uint8_t *pixels, *h_pixels = NULL;
  CUDA_ERR(cudaMallocPitch(&pixels, &pitch, dims.x * 4 * sizeof(uint8_t), dims.y));
  mandlebrot<<<blocks, threads_per_block>>>(pixels, pitch, dims.x, dims.y);

  h_pixels = (uint8_t *)malloc(dims.x * 4 * sizeof(uint8_t) * dims.y);
  memset(h_pixels, 0, dims.x * 4 * sizeof(uint8_t) * dims.y);
  CUDA_ERR(cudaMemcpy2D(h_pixels, dims.x * 4 * sizeof(uint8_t), pixels, pitch, dims.x, dims.y, cudaMemcpyDeviceToHost));

  save_png("out.png", h_pixels, dims.x, dims.y);

  CUDA_ERR(cudaFree(pixels));
  free(h_pixels);

  CUDA_ERR(cudaDeviceReset());
  puts("Success");
  return 0;
}

The save_png function is a utility function I wrote that takes a block of pixel data and saves it to a PNG file:

void save_png(const char *filename, uint8_t *buffer, unsigned long width, unsigned long height) {
  png_structp png_ptr = png_create_write_struct(PNG_LIBPNG_VER_STRING, NULL, NULL, NULL);
  if (!png_ptr) {
    std::cerr << "Failed to create png write struct" << std::endl;
    return;
  }
  png_infop info_ptr = png_create_info_struct(png_ptr);
  if (!info_ptr) {
    std::cerr << "Failed to create info_ptr" << std::endl;
    png_destroy_write_struct(&png_ptr, NULL);
    return;
  }
  FILE *fp = fopen(filename, "wb");
  if (!fp) {
    std::cerr << "Failed to open " << filename << " for writing" << std::endl;
    png_destroy_write_struct(&png_ptr, &info_ptr);
    return;
  }
  if (setjmp(png_jmpbuf(png_ptr))) {
    png_destroy_write_struct(&png_ptr, &info_ptr);
    std::cerr << "Error from libpng!" << std::endl;
    return;
  }
  png_init_io(png_ptr, fp);
  png_set_IHDR(png_ptr, info_ptr, width, height, 8, PNG_COLOR_TYPE_RGBA, PNG_INTERLACE_NONE, PNG_COMPRESSION_TYPE_DEFAULT, PNG_FILTER_TYPE_DEFAULT);
  png_write_info(png_ptr, info_ptr);
  png_byte *row_pnts[height];
  size_t i;
  for (i = 0; i < height; i++) {
    row_pnts[i] = buffer + width * 4 * i;
  }
  png_write_image(png_ptr, row_pnts);
  png_write_end(png_ptr, info_ptr);
  png_destroy_write_struct(&png_ptr, &info_ptr);
  fclose(fp);
}

Anyway, the image that's generated is a weird whitish strip speckled with randomly colored pixels, which can be seen here.

Is there something glaring I did wrong? I tried to follow the introductory documentation on the CUDA site. Otherwise, can anyone help me fix this? Here I'm simply trying to fill the pixels buffer with green pixels.

I am using a MacBook Pro Retina with an NVIDIA GeForce GT 650M discrete graphics card. I can run print_devices from the CUDA sample code and paste its output if need be.

Note that there are no errors or warnings during compilation with the following makefile:

all:
    nvcc -c mandlebrot.cu -o mandlebrot.cu.o
    nvcc mandlebrot.cu.o -o mandlebrot -lpng

and no errors at runtime.

Recommended answer

It's better if you provide complete code that someone can copy, paste, compile, and run without adding or changing anything. Stripping off the include headers isn't helpful, in my opinion, and if you want help, making your test code depend on a PNG library that others may not have is not productive either.

Your error checking on kernel launches is broken. You may want to review proper CUDA error checking. If you had proper error checking, or ran your code with cuda-memcheck, you would discover an error 9 (cudaErrorInvalidConfiguration) on the kernel launch. This is an invalid configuration. If you print out your blocks and threads_per_block variables, you'll see something like this:

blocks: 2, 2
threads: 500, 500

You are in fact setting threads per block to 500,500 here:

unsigned long block_size = 500;
dim3 threads_per_block(block_size, block_size);

That is illegal, as you are requesting 500x500 threads per block (i.e. 250000 threads), which exceeds the maximum limit of 1024 threads per block.
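As a rough illustration of the arithmetic above, here is a host-only C++ sketch that checks a block shape against the limit and computes the grid size by rounding up. `Dim2`, `kMaxThreadsPerBlock`, `block_is_legal`, and `blocks_needed` are hypothetical names made up for this example; they are not part of the CUDA API, and the 1024 limit is hard-coded here as an assumption (on a real device you would read `maxThreadsPerBlock` from `cudaGetDeviceProperties`).

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without the toolkit.
struct Dim2 { unsigned x, y; };

// Assumed per-block thread limit; query the device instead of hard-coding it
// in real code.
const unsigned long kMaxThreadsPerBlock = 1024;

// True when a 2D block of this shape stays within the per-block thread limit.
bool block_is_legal(Dim2 block) {
  return (unsigned long)block.x * block.y <= kMaxThreadsPerBlock;
}

// Number of blocks needed to cover `extent` pixels along one axis, rounding up.
unsigned blocks_needed(unsigned long extent, unsigned block) {
  assert(block > 0);
  return (unsigned)((extent + block - 1) / block);
}
```

Under these assumptions, `block_is_legal({500, 500})` is false while `block_is_legal({16, 16})` is true, and `blocks_needed(1000, 16)` gives 63 blocks per axis, which covers 1008 pixels; the bounds check at the top of the kernel discards the extra threads.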

So your kernel is not running at all and you're getting garbage.

You can fix this error pretty simply by changing your block_size definition:

unsigned long block_size = 16;

After that there is still an issue, as you've misinterpreted the parameters for cudaMemcpy2D:

CUDA_ERR(cudaMemcpy2D(h_pixels, dims.x * 4 * sizeof(uint8_t), pixels, pitch, dims.x, dims.y, cudaMemcpyDeviceToHost));

The documentation states for the 5th parameter:

width - Width of matrix transfer (columns in bytes)

but you've passed the width in elements (groups of 4 bytes) rather than bytes.
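To make the bytes-versus-elements distinction concrete, here is a host-only C++ model of a pitched 2D copy. It is only a sketch of how the width, height, and pitch arguments are interpreted, not the CUDA implementation; `copy_2d` and `bytes_copied` are hypothetical helpers invented for this example, and the source pitch passed in is an arbitrary stand-in for whatever cudaMallocPitch would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// For each of `height` rows, copy `width_bytes` bytes, stepping through each
// buffer by its own pitch (row stride in bytes).
void copy_2d(uint8_t *dst, size_t dpitch, const uint8_t *src, size_t spitch,
             size_t width_bytes, size_t height) {
  assert(width_bytes <= dpitch && width_bytes <= spitch);
  for (size_t row = 0; row < height; row++)
    std::memcpy(dst + row * dpitch, src + row * spitch, width_bytes);
}

// Fills a padded source image with 0xFF, copies it into a zeroed, tightly
// packed RGBA destination using `width_bytes`, and returns how many
// destination bytes were actually overwritten.
size_t bytes_copied(size_t width_px, size_t height, size_t spitch,
                    size_t width_bytes) {
  const size_t dpitch = width_px * 4;  // 4 bytes per pixel, no host padding
  std::vector<uint8_t> src(spitch * height, 0xFF);
  std::vector<uint8_t> dst(dpitch * height, 0x00);
  copy_2d(dst.data(), dpitch, src.data(), spitch, width_bytes, height);
  size_t n = 0;
  for (uint8_t b : dst) n += (b != 0);
  return n;
}
```

For a 1000x1000 RGBA image with an assumed source pitch of 4096 bytes, `bytes_copied(1000, 1000, 4096, 1000)` fills only 1,000,000 of the 4,000,000 destination bytes (the first quarter of every row), while passing 4000 as the width fills all of them.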

This will fix that:

CUDA_ERR(cudaMemcpy2D(h_pixels, dims.x * 4 * sizeof(uint8_t), pixels, pitch, dims.x*4, dims.y, cudaMemcpyDeviceToHost));

With the above changes, I was able to get good results with a test version of your code:

#include <stdio.h>
#include <stdint.h>

__global__
void mandlebrot(uint8_t *pixels, size_t pitch, unsigned long width, unsigned long height) {
  unsigned block_size = blockDim.x;
  uint2 location = {blockIdx.x*block_size, blockIdx.y*block_size};
  ulong2 pixel_location = {threadIdx.x, threadIdx.y};
  ulong2 real_location = {location.x + pixel_location.x, location.y + pixel_location.y};
  if (real_location.x >= width || real_location.y >= height)
    return;
  uint8_t *row = (uint8_t *)((char *)pixels + real_location.y * pitch);
  row[real_location.x * 4+0] = 0;
  row[real_location.x * 4+1] = 255;
  row[real_location.x * 4+2] = 0;
  row[real_location.x * 4+3] = 255;
}

cudaError_t err = cudaSuccess;

#define CUDA_ERR(e) \
  if ((err = e) != cudaSuccess) { \
    fprintf(stderr, "CUDA call failed (error code %s)!\n", cudaGetErrorString(err)); \
    exit(-1); \
  }

int main(void) {
  ulong2 dims = {1000, 1000};
  dim3 threads_per_block(16, 16);
  dim3 remainders(dims.x % threads_per_block.x, dims.y % threads_per_block.y);
  dim3 blocks(dims.x / threads_per_block.x + (remainders.x == 0 ? 0 : 1), dims.y / threads_per_block.y + (remainders.y == 0 ? 0 : 1));

  size_t pitch;
  uint8_t *pixels, *h_pixels = NULL;
  CUDA_ERR(cudaMallocPitch(&pixels, &pitch, dims.x * 4 * sizeof(uint8_t), dims.y));

  printf("blocks: %u, %u\n", blocks.x, blocks.y);
  printf("threads: %u, %u\n", threads_per_block.x, threads_per_block.y);
  mandlebrot<<<blocks, threads_per_block>>>(pixels, pitch, dims.x, dims.y);
  CUDA_ERR(cudaGetLastError());  // report launch failures such as an invalid configuration

  h_pixels = (uint8_t *)malloc(dims.x * 4 * sizeof(uint8_t) * dims.y);
  memset(h_pixels, 0, dims.x * 4 * sizeof(uint8_t) * dims.y);
  CUDA_ERR(cudaMemcpy2D(h_pixels, dims.x * 4 * sizeof(uint8_t), pixels, pitch, dims.x*4, dims.y, cudaMemcpyDeviceToHost));

//  save_png("out.png", h_pixels, dims.x, dims.y);
  for (unsigned long row = 0; row < dims.y; row++)
    for (unsigned long col = 0; col < dims.x; col++){
      // Index of this pixel's first byte in the tightly packed RGBA host buffer.
      size_t idx = (row*dims.x*4) + col*4;
      if (h_pixels[idx   ] !=   0) {printf("mismatch 0 at %lu,%lu: was: %u should be: %u\n", row, col, (unsigned)h_pixels[idx   ],   0u); return 1;}
      if (h_pixels[idx +1] != 255) {printf("mismatch 1 at %lu,%lu: was: %u should be: %u\n", row, col, (unsigned)h_pixels[idx +1], 255u); return 1;}
      if (h_pixels[idx +2] !=   0) {printf("mismatch 2 at %lu,%lu: was: %u should be: %u\n", row, col, (unsigned)h_pixels[idx +2],   0u); return 1;}
      if (h_pixels[idx +3] != 255) {printf("mismatch 3 at %lu,%lu: was: %u should be: %u\n", row, col, (unsigned)h_pixels[idx +3], 255u); return 1;}
      }
  CUDA_ERR(cudaFree(pixels));
  free(h_pixels);

  CUDA_ERR(cudaDeviceReset());
  puts("Success");
  return 0;
}

Note that the above is complete code that you can copy, paste, compile, and run.
