Optimized CUDA matrix Hamming distance

Question

Is anyone aware of an optimized CUDA kernel for computing a GEMM-style Hamming distance between two matrices of dimensions A x N and N x B? The problem is nearly identical to GEMM, except that for each row/column vector pair it computes sum( a_n != b_n ) over n = 1 ... N instead of multiplying and summing the vector elements.

I wanted to verify before writing my own, since this problem is relatively common, but I haven't had success in finding code for it yet. Suggestions for code to modify would be excellent as well.
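For reference, the operation described above can be written as a naive CUDA kernel with one thread per output element. This is only a baseline sketch for checking results, not an optimized kernel. The int element type, the row-major layout, and the names hammingNaive, hammingNaiveHost, and hammingCpu are assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Naive kernel: one thread computes one output element d[i][j].
// a is rowsA x n, b is n x colsB, d is rowsA x colsB, all row-major.
__global__ void hammingNaive(const int *a, const int *b, unsigned int *d,
                             int rowsA, int n, int colsB)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of a
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of b
    if (i >= rowsA || j >= colsB) return;

    unsigned int count = 0;
    for (int k = 0; k < n; ++k)
        count += (a[i * n + k] != b[k * colsB + j]);  // compare-and-add
    d[i * colsB + j] = count;
}

// Hypothetical host wrapper (CUDA error checking omitted for brevity).
std::vector<unsigned int> hammingNaiveHost(const std::vector<int> &a,
                                           const std::vector<int> &b,
                                           int rowsA, int n, int colsB)
{
    int *da = nullptr, *db = nullptr;
    unsigned int *dd = nullptr;
    size_t outCount = (size_t)rowsA * colsB;
    cudaMalloc(&da, a.size() * sizeof(int));
    cudaMalloc(&db, b.size() * sizeof(int));
    cudaMalloc(&dd, outCount * sizeof(unsigned int));
    cudaMemcpy(da, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), b.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((colsB + block.x - 1) / block.x, (rowsA + block.y - 1) / block.y);
    hammingNaive<<<grid, block>>>(da, db, dd, rowsA, n, colsB);

    std::vector<unsigned int> d(outCount);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(d.data(), dd, outCount * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dd);
    return d;
}

// CPU reference with the same layout, useful for validating GPU output.
std::vector<unsigned int> hammingCpu(const std::vector<int> &a,
                                     const std::vector<int> &b,
                                     int rowsA, int n, int colsB)
{
    std::vector<unsigned int> d((size_t)rowsA * colsB, 0u);
    for (int i = 0; i < rowsA; ++i)
        for (int j = 0; j < colsB; ++j)
            for (int k = 0; k < n; ++k)
                d[i * colsB + j] += (a[i * n + k] != b[k * colsB + j]);
    return d;
}
```

Calling hammingNaiveHost(a, b, rowsA, n, colsB) and comparing the result with hammingCpu on the same inputs gives a correctness baseline for any faster kernel.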

EDIT:

In addition to kangshiyin's suggestions below, I found this walk-through of an optimized SGEMM implementation to be extraordinarily helpful in understanding steps beyond the basic shared memory matrix multiplication example in the CUDA C Programming Guide.

Answer

You are right that you could write your kernel by modifying gemm() code. The CUDA samples include a simple implementation of gemm(), but it is too simple: its performance is bound by shared memory access, and it reaches only ~250 Gflops on Kepler devices. For higher performance, you may want to look at the gemm() code in MAGMA.

http://icl.cs.utk.edu/magma/index.html
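As a starting point, the basic shared-memory tiling used by the simple gemm() example carries over almost unchanged: only the inner multiply-add becomes a compare-and-add. The sketch below is not MAGMA's kernel and does not include the register blocking that tuned implementations rely on. The int element type, the 16 x 16 tile size, and the names hammingTiled and hammingTiledHost are assumptions.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile size; tuned kernels pick this per architecture

// a is rowsA x n, b is n x colsB, d is rowsA x colsB, all row-major.
__global__ void hammingTiled(const int *a, const int *b, unsigned int *d,
                             int rowsA, int n, int colsB)
{
    __shared__ int tileA[TILE][TILE];
    __shared__ int tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    unsigned int count = 0;

    for (int t = 0; t < n; t += TILE) {
        int ka = t + threadIdx.x;  // column of a loaded by this thread
        int kb = t + threadIdx.y;  // row of b loaded by this thread
        // Out-of-range entries are padded with 0 in both tiles, so the
        // padded pairs compare equal and add nothing to the count.
        tileA[threadIdx.y][threadIdx.x] =
            (row < rowsA && ka < n) ? a[row * n + ka] : 0;
        tileB[threadIdx.y][threadIdx.x] =
            (kb < n && col < colsB) ? b[kb * colsB + col] : 0;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            count += (tileA[threadIdx.y][k] != tileB[k][threadIdx.x]);
        __syncthreads();  // finish reading before the next tile overwrites
    }

    if (row < rowsA && col < colsB)
        d[row * colsB + col] = count;
}

// Hypothetical host wrapper (CUDA error checking omitted for brevity).
std::vector<unsigned int> hammingTiledHost(const std::vector<int> &a,
                                           const std::vector<int> &b,
                                           int rowsA, int n, int colsB)
{
    int *da = nullptr, *db = nullptr;
    unsigned int *dd = nullptr;
    size_t outCount = (size_t)rowsA * colsB;
    cudaMalloc(&da, a.size() * sizeof(int));
    cudaMalloc(&db, b.size() * sizeof(int));
    cudaMalloc(&dd, outCount * sizeof(unsigned int));
    cudaMemcpy(da, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), b.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((colsB + TILE - 1) / TILE, (rowsA + TILE - 1) / TILE);
    hammingTiled<<<grid, block>>>(da, db, dd, rowsA, n, colsB);

    std::vector<unsigned int> d(outCount);
    cudaMemcpy(d.data(), dd, outCount * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dd);
    return d;
}
```

Each element of a and b is read from global memory once per tile instead of once per output element, which is the same reuse argument as in tiled gemm().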

These two papers also tell you how to implement and tune gemm().

http://staff.kfupm.edu.sa/ics/ahkhan/Resources/Papers/Autotuning/Autotuning%2520GEMM%2520Kernels%2520for%2520the%2520Fermi%2520GPU.pdf

http://www.netlib.org/lapack/lawnspdf/lawn267.pdf

Unlike gemm(), which has hardware support in the form of the FMA instruction for fast multiply-and-add, your compare-and-add operation may need more instructions per element, so the performance should be lower. Given that the peak performance of gemm() on Kepler is ~3 Tflops, you may be able to get 0.5~2 Tflops for the Hamming distance matrix calculation.
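One way to cut the instruction count, which applies only if the elements are single bits (an assumption the question does not state), is to pack 32 elements into one word and replace 32 compare-and-add steps with one XOR and one __popc. The sketch below assumes both operands are packed along N, with the second matrix stored transposed as colsB x words. The names hammingPacked and hammingPackedHost are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// a is rowsA x words, bT is colsB x words (b transposed), row-major.
// Each 32-bit word holds 32 one-bit elements along the N dimension.
__global__ void hammingPacked(const unsigned int *a, const unsigned int *bT,
                              unsigned int *d, int rowsA, int words, int colsB)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rowsA || j >= colsB) return;

    unsigned int count = 0;
    for (int w = 0; w < words; ++w)
        count += __popc(a[i * words + w] ^ bT[j * words + w]);  // 32 bits per step
    d[i * colsB + j] = count;
}

// Hypothetical host wrapper (CUDA error checking omitted for brevity).
std::vector<unsigned int> hammingPackedHost(const std::vector<unsigned int> &a,
                                            const std::vector<unsigned int> &bT,
                                            int rowsA, int words, int colsB)
{
    unsigned int *da = nullptr, *db = nullptr, *dd = nullptr;
    size_t outCount = (size_t)rowsA * colsB;
    cudaMalloc(&da, a.size() * sizeof(unsigned int));
    cudaMalloc(&db, bT.size() * sizeof(unsigned int));
    cudaMalloc(&dd, outCount * sizeof(unsigned int));
    cudaMemcpy(da, a.data(), a.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, bT.data(), bT.size() * sizeof(unsigned int), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((colsB + block.x - 1) / block.x, (rowsA + block.y - 1) / block.y);
    hammingPacked<<<grid, block>>>(da, db, dd, rowsA, words, colsB);

    std::vector<unsigned int> d(outCount);
    cudaMemcpy(d.data(), dd, outCount * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dd);
    return d;
}
```

If N is not a multiple of 32, the unused bits of the last word must be set identically in both operands (for example, zero) so that they do not contribute to the count. The same shared-memory tiling as for gemm() can be layered on top of this inner loop.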
