Question
On a CUDA device with caching enabled, does locality of reference across consecutive atomic operations on global memory addresses, issued by a single thread, benefit from the L2 cache?
For example, I have an atomic operation in a CUDA kernel that uses the returned value.
uint a = atomicAnd( &(GM_addr[index]), b );
I'm thinking that if the same thread is about to issue another atomic later in the same kernel, and I confine the address of that new atomic operation to the 32-byte interval [ &(GM_addr[index&0xFFFFFFF8]), &(GM_addr[index|7]) ], I'll get a hit in the L2 cache (which has a 32-byte cache line). Is this speculation correct? Or are there exceptions associated with global atomics?
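A minimal sketch of what I mean (hypothetical kernel; the offset arithmetic just forces the second atomic into the same 32-byte window as the first):

__global__ void confined_pair(uint* GM_addr, uint index, uint b) {
    uint a = atomicAnd( &(GM_addr[index]), b );   // first RMW: its 32-byte line should now sit in L2
    uint neighbor = (index & ~7u) | (a & 7u);     // some index inside [ index & ~7 , index | 7 ]
    a = atomicAdd( &(GM_addr[neighbor]), a );     // second RMW in the same window, hopefully an L2 hit
}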
Answer
I'm answering here to share my approach to finding out the impact of L2 cache utilization on global atomics. I'm not accepting this answer, because I don't yet consider myself aware of what happens with atomics in the L2 cache from an architectural point of view.
I created a simple CUDA program:
#include <stdio.h>
#include <stdlib.h> // for exit() and EXIT_FAILURE/EXIT_SUCCESS
static void HandleError( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        fprintf( stderr, "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
// Non-atomic read-modify-write on consecutive uints (4-byte stride):
// eight successive accesses share one 32-byte L2 cache line.
__global__ void address_confined(uint* data, uint nElems) {
    uint tmp, a = 1;
    for (uint index = 0; index < nElems; ++index) {
        tmp = data[index];
        data[index] += a;
        a = tmp;
    }
}
// Non-atomic read-modify-write with a 32-byte stride (8 uints):
// every access lands in a different L2 cache line.
__global__ void address_not_confined(uint* data, uint nElems) {
    uint tmp, a = 1;
    for (uint index = 0; index < nElems; index += 8) {
        tmp = data[index];
        data[index] += a;
        a = tmp;
    }
}
// Atomic read-modify-write on consecutive uints; the returned value is
// used, so the compiler cannot turn the atomic into a plain reduction.
__global__ void address_confined_atomics(uint* data, uint nElems) {
    uint a = 1;
    for (uint index = 0; index < nElems; ++index) {
        a = atomicAdd( &(data[index]), a );
    }
}
// Atomic read-modify-write with a 32-byte stride; each atomic touches
// a different L2 cache line.
__global__ void address_not_confined_atomics(uint* data, uint nElems) {
    uint a = 1;
    for (uint index = 0; index < nElems; index += 8) {
        a = atomicAdd( &(data[index]), a );
    }
}
int main() {
    const unsigned int nElems = 1 << 23;
    unsigned int* dev_data;
    HANDLE_ERROR( cudaMalloc((void**) &dev_data, nElems * sizeof(unsigned int)) );
    HANDLE_ERROR( cudaMemset(dev_data, 0, nElems * sizeof(unsigned int)) ); // size in bytes: zero the whole buffer

    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    float dt_ms;

    // A single thread (<<<1,1>>>) isolates the effect of L2 locality.
    HANDLE_ERROR( cudaEventRecord(start) );
    address_confined<<<1,1>>>(dev_data, nElems >> 3); // nElems>>3 iterations: same RMW count as the strided kernel
    HANDLE_ERROR( cudaPeekAtLastError() );
    HANDLE_ERROR( cudaEventRecord(stop) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    HANDLE_ERROR( cudaEventElapsedTime(&dt_ms, start, stop) );
    fprintf( stdout, "Address-confined global access took %f (ms).\n", dt_ms);

    HANDLE_ERROR( cudaEventRecord(start) );
    address_not_confined<<<1,1>>>(dev_data, nElems);
    HANDLE_ERROR( cudaPeekAtLastError() );
    HANDLE_ERROR( cudaEventRecord(stop) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    HANDLE_ERROR( cudaEventElapsedTime(&dt_ms, start, stop) );
    fprintf( stdout, "Address-NOT-confined global access took %f (ms).\n", dt_ms);

    HANDLE_ERROR( cudaEventRecord(start) );
    address_confined_atomics<<<1,1>>>(dev_data, nElems >> 3); // again nElems>>3, matching the strided op count
    HANDLE_ERROR( cudaPeekAtLastError() );
    HANDLE_ERROR( cudaEventRecord(stop) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    HANDLE_ERROR( cudaEventElapsedTime(&dt_ms, start, stop) );
    fprintf( stdout, "Address-confined atomics took %f (ms).\n", dt_ms);

    HANDLE_ERROR( cudaEventRecord(start) );
    address_not_confined_atomics<<<1,1>>>(dev_data, nElems);
    HANDLE_ERROR( cudaPeekAtLastError() );
    HANDLE_ERROR( cudaEventRecord(stop) );
    HANDLE_ERROR( cudaDeviceSynchronize() );
    HANDLE_ERROR( cudaEventElapsedTime(&dt_ms, start, stop) );
    fprintf( stdout, "Address-NOT-confined atomics took %f (ms).\n", dt_ms);

    HANDLE_ERROR( cudaEventDestroy(start) );
    HANDLE_ERROR( cudaEventDestroy(stop) );
    HANDLE_ERROR( cudaFree(dev_data) );
    return EXIT_SUCCESS;
}
In the above four kernels, only one active thread tries to perform read-modify-writes on integers in global memory. I chose a single thread to eliminate the possible effects of other threads. Two kernels use 32-byte hops so that every access skips past whatever has been cached in L2, and the other two access consecutive integers. Two kernels use atomics and two don't.
I compiled it for CC 3.5 with the -O3 flag on Ubuntu 12.04 using CUDA 6.0, and ran it on a GeForce GTX 780 (Kepler GK110).
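For reference, the corresponding build command would be something like the following (source file name assumed):

nvcc -O3 -arch=sm_35 -o atomics_l2 atomics_l2.cu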
I got the following results:
Address-confined global access took 286.206207 (ms).
Address-NOT-confined global access took 398.450348 (ms).
Address-confined atomics took 231.808640 (ms).
Address-NOT-confined atomics took 349.534637 (ms).
You can see from the above results that utilization of L2 has an equal or even greater effect on atomics than it has on ordinary global memory accesses.
I got the results below from profiling the atomic kernels:
-- address_not_confined_atomics --
L2 Write Transactions: 1048582
L2 Read Transactions: 1069849
Device Memory Write Transactions: 1048578
Device Memory Read Transactions: 1877877
L2 Throughput (Writes): 96.753 (MB/s)
L2 Throughput (Reads): 98.716 (MB/s)
-- address_confined_atomics --
L2 Write Transactions: 1048581
L2 Read Transactions: 1061095
Device Memory Write Transactions: 1046652
Device Memory Read Transactions: 672616
L2 Throughput (Writes): 147.380 (MB/s)
L2 Throughput (Reads): 149.139 (MB/s)
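Counters like these can be collected with nvprof, e.g. (executable name assumed; I believe these are the standard Kepler metric names corresponding to the figures above):

nvprof --metrics l2_read_transactions,l2_write_transactions,dram_read_transactions,dram_write_transactions,l2_read_throughput,l2_write_throughput ./atomics_l2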
I don't include the non-atomic profiling results here because they are more or less similar to their corresponding versions above. It seems to me that the performance gain comes from enhanced L2 cache throughput, especially since the reduction in kernel execution time is proportional to the increase in L2 cache throughput. In both the atomic and the non-atomic versions, the L2 cache reduces the number of read transactions required from device global memory, and hence the overall read latency. To recap: it appears to be just as important for atomic operations (those that use the returned value) as for non-atomic accesses to have locality in their global memory references. Beware that atomics that don't use the returned value produce a different device instruction, so the above evaluation cannot be relied on for them.
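To illustrate that last caveat, here is a minimal hypothetical kernel showing both forms; on Kepler, an atomic whose return value is discarded is typically emitted as a RED (reduction) instruction, while one whose return value is consumed is emitted as an ATOM instruction:

__global__ void atomic_variants(uint* data) {
    atomicAdd( &(data[0]), 1u );            // return value ignored: typically compiles to RED
    uint old = atomicAdd( &(data[1]), 1u ); // return value used: compiles to ATOM
    data[2] = old;                          // consume 'old' so the ATOM form is kept
}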