Problem description
CUDA 7.5 supports 16-bit floating point variables. Can anyone provide sample code demonstrating their use?
Recommended answer
There are a few things to note up-front:

- Refer to the half-precision intrinsics.
- Note that most or all of these intrinsics are only supported in device code. (However, @njuffa has created a set of host-usable conversion functions here; a simplified sketch of that kind of conversion follows this list.)
- Note that devices of compute capability 5.2 and below do not natively support half-precision arithmetic. This means that any arithmetic operations to be performed must be done on some supported type, such as float. Devices of compute capability 5.3 (Tegra TX1, currently) and presumably future devices will support "native" half-precision arithmetic operations, but these are currently exposed through intrinsics such as __hmul. An intrinsic like __hmul will be undefined on devices that do not support native operations.
- You should include cuda_fp16.h in any file where you intend to make use of these types and intrinsics in device code.
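To illustrate the second note, here is a simplified sketch (my illustration, not from the original answer or from @njuffa's code) of what a host-side float-to-half conversion has to do. It truncates instead of rounding to nearest and flushes half denormals to zero, so it is not a substitute for a properly rounded implementation:

#include <string.h>

// Simplified host-side float -> half bit-pattern conversion (illustrative only):
// truncates the mantissa rather than rounding, and flushes results in the
// half-denormal range to zero.
static unsigned short float_to_half_bits(float f)
{
  unsigned int bits;
  memcpy(&bits, &f, sizeof(bits));           // reinterpret the float's bits
  unsigned int s = (bits >> 16) & 0x8000u;   // sign bit, moved into half position
  unsigned int e = (bits >> 23) & 0xFFu;     // biased float exponent
  unsigned int m = bits & 0x7FFFFFu;         // float mantissa
  if (e == 0xFFu)                            // Inf or NaN
    return (unsigned short)(s | 0x7C00u | (m ? 0x0200u : 0u));
  if (e >= 143u)                             // magnitude too large for half -> Inf
    return (unsigned short)(s | 0x7C00u);
  if (e <= 112u)                             // half-denormal range -> flush to zero
    return (unsigned short)s;
  return (unsigned short)(s | ((e - 112u) << 10) | (m >> 13));
}

In CUDA 7.5, half is defined in cuda_fp16.h as a small struct wrapping an unsigned short x member, so a bit pattern produced this way can be stored into that member from host code.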
With the above points in mind, here is a simple code that takes a set of float quantities, converts them to half quantities, and scales them by a scale factor:
$ cat t924.cu
#include <stdio.h>
#include <cuda_fp16.h>
#define DSIZE 4
#define SCF 0.5f
#define nTPB 256
__global__ void half_scale_kernel(float *din, float *dout, int dsize){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    // convert the scale factor and the input value to half
    half scf = __float2half(SCF);
    half kin = __float2half(din[idx]);
    half kout;
#if __CUDA_ARCH__ >= 530
    // cc5.3 and higher: multiply natively in half precision
    kout = __hmul(kin, scf);
#else
    // older devices: convert to float, multiply, convert back
    kout = __float2half(__half2float(kin)*__half2float(scf));
#endif
    dout[idx] = __half2float(kout);
  }
}

int main(){
  float *hin, *hout, *din, *dout;
  hin = (float *)malloc(DSIZE*sizeof(float));
  hout = (float *)malloc(DSIZE*sizeof(float));
  for (int i = 0; i < DSIZE; i++) hin[i] = i;
  cudaMalloc(&din, DSIZE*sizeof(float));
  cudaMalloc(&dout, DSIZE*sizeof(float));
  cudaMemcpy(din, hin, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  half_scale_kernel<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(din, dout, DSIZE);
  cudaMemcpy(hout, dout, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++) printf("%f\n", hout[i]);
  return 0;
}
$ nvcc -o t924 t924.cu
$ cuda-memcheck ./t924
========= CUDA-MEMCHECK
0.000000
0.500000
1.000000
1.500000
========= ERROR SUMMARY: 0 errors
$
If you study the above code, you'll note that, except in the case of cc5.3 and higher devices, the arithmetic is being done as a regular float operation. This is consistent with note 3 above.
The takeaways are as follows:

- On devices of cc5.2 and below, the half datatype may still be useful, but principally as a storage optimization (and, relatedly, perhaps a memory bandwidth optimization, since e.g. a given 128-bit vector load could load 8 half quantities at once). For example, if you have a large neural network, and you've determined that the weights can tolerate being stored as half-precision quantities (thereby doubling the storage density, or approximately doubling the size of the neural network that can be represented in the storage space of a GPU), then you could store the neural network weights as half-precision. Then, when you need to perform a forward pass (inference) or a backward pass (training), you could load the weights in from memory, convert them on-the-fly (using the intrinsics) to float quantities, perform the necessary operation (perhaps including adjusting the weight due to training), then (if necessary) store the weight again as a half quantity. A sketch of this load/convert/compute/store pattern is shown after this list.
- For cc5.3 and future devices, if the algorithm will tolerate it, it may be possible to perform a similar operation as above, but without conversion to float (and perhaps back to half), instead leaving all data in half representation and doing the necessary arithmetic directly (using e.g. __hmul or __hadd intrinsics).
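As an illustration of the first takeaway, here is a minimal sketch of such a weight-update kernel; the kernel name, the learning-rate parameter lr, and the gradient array grad are hypothetical, and the arithmetic is done in float so it works on any compute capability:

#include <cuda_fp16.h>

// Weights are stored as half for density; the arithmetic is done in float.
__global__ void update_half_weights(half *weights, const float *grad, float lr, int n)
{
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n){
    // widen the stored half weight to float
    float w = __half2float(weights[idx]);
    // perform the update in float (supported on all compute capabilities)
    w -= lr * grad[idx];
    // narrow back to half for compact storage
    weights[idx] = __float2half(w);
  }
}

Per the second takeaway, on a cc5.3+ device the same kernel could instead keep both the weight and a half-precision gradient in half form and use the native arithmetic intrinsics directly, avoiding the conversions.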
Although I haven't demonstrated it here, the half datatype is "usable" in host code. By that, I mean you can allocate storage for items of that type, and perform e.g. cudaMemcpy operations on it. But the host code doesn't know anything about the half data type (e.g. how to do arithmetic on it, or print it out, or do type conversions), and the intrinsics are not usable in host code. Therefore, you could certainly allocate storage for a large array of half data type if you wanted to (perhaps to store a set of neural network weights), but you could only directly manipulate that data with any ease from device code, not host code.
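For example (my sketch, not from the original answer), host code can allocate and copy half data, while any conversion or arithmetic is delegated to a kernel:

#include <stdlib.h>
#include <cuda_fp16.h>

// Device-side conversion, since the conversion intrinsics are not usable in host code.
__global__ void float_to_half_kernel(const float *in, half *out, int n){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < n) out[idx] = __float2half(in[idx]);
}

int main(){
  const int n = 1024;
  float *hin = (float *)malloc(n*sizeof(float));
  for (int i = 0; i < n; i++) hin[i] = i;
  // The host can allocate storage for half items...
  half *hhalf = (half *)malloc(n*sizeof(half));
  float *dfloat;
  half *dhalf;
  cudaMalloc(&dfloat, n*sizeof(float));
  cudaMalloc(&dhalf, n*sizeof(half));
  // ...and cudaMemcpy them like any other plain data...
  cudaMemcpy(dfloat, hin, n*sizeof(float), cudaMemcpyHostToDevice);
  float_to_half_kernel<<<(n+255)/256, 256>>>(dfloat, dhalf, n);
  cudaMemcpy(hhalf, dhalf, n*sizeof(half), cudaMemcpyDeviceToHost);
  // ...but it cannot directly do arithmetic on hhalf[i] or print it.
  free(hin); free(hhalf);
  cudaFree(dfloat); cudaFree(dhalf);
  return 0;
}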
A couple of final notes:

- The CUBLAS library implements a matrix-matrix multiply designed to work directly on half data (a hedged sketch of such a call follows this list). The description above should give some insight as to what is likely going on "under the hood" for different device types (i.e. compute capabilities).
- A related question about the use of half in thrust is here.
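For the CUBLAS note, a hedged sketch of a fully half-precision GEMM might look like the following. The cublasHgemm path requires a cc5.3+ device, the matrix dimensions are arbitrary, and the bit-pattern initialization of the half scalars assumes the CUDA 7.5 definition of half as a struct with an unsigned short x member:

#include <cuda_fp16.h>
#include <cublas_v2.h>

// Sketch: C = 1.0*A*B + 0.0*C with A, B, C (and the scalars) all in half precision.
// Requires a device of compute capability 5.3 or higher.
void half_gemm(const half *dA, const half *dB, half *dC, int m, int n, int k)
{
  cublasHandle_t handle;
  cublasCreate(&handle);
  // No host-side __float2half in CUDA 7.5, so build the scalars from their
  // IEEE-754 half bit patterns (1.0 = 0x3C00, 0.0 = 0x0000).
  half alpha, beta;
  alpha.x = 0x3C00;
  beta.x  = 0x0000;
  // cublasHgemm mirrors cublasSgemm, but every operand is half precision.
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k,
              &alpha, dA, m,
              dB, k,
              &beta, dC, m);
  cublasDestroy(handle);
}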