Problem description
I am trying to profile some CUDA Rodinia benchmarks in terms of their SM and memory utilization, power consumption, etc. For that, I execute the benchmark and the profiler simultaneously; the profiler essentially spawns a pthread that samples the GPU execution using the NVML library.
The issue is that the execution time of a benchmark is much higher (about 3 times) when I do not invoke the profiler along with it than when the benchmark executes together with the profiler. The frequency-scaling governor for the CPU is userspace, so I do not think the CPU frequency is changing. Is it due to flickering in the GPU frequency? Below is the code for the profiler.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "nvml.h"
#define NUM_THREADS 1

/* Profiling thread: samples power, utilization, and SM clock every 500 ms. */
void *PrintHello(void *threadid)
{
    long tid = (long)threadid;
    (void)tid; /* thread id currently unused */

    nvmlReturn_t result;
    nvmlDevice_t device;
    nvmlUtilization_t utilization;
    unsigned int device_count, powergpu, clo;
    char version[80];

    result = nvmlInit();
    if (result != NVML_SUCCESS) {
        fprintf(stderr, "Failed to initialize NVML: %s\n", nvmlErrorString(result));
        return NULL;
    }
    result = nvmlSystemGetDriverVersion(version, sizeof(version));
    if (result == NVML_SUCCESS)
        printf("\n Driver version: %s \n\n", version);

    result = nvmlDeviceGetCount(&device_count);
    printf("Found %d device%s\n\n", device_count,
           device_count != 1 ? "s" : "");
    printf("Listing devices:\n");

    result = nvmlDeviceGetHandleByIndex(0, &device);
    if (result != NVML_SUCCESS) {
        fprintf(stderr, "Failed to get device handle: %s\n", nvmlErrorString(result));
        return NULL;
    }

    while (1)
    {
        if (nvmlDeviceGetPowerUsage(device, &powergpu) == NVML_SUCCESS)
            printf("\n%d\n", powergpu);         /* milliwatts */

        if (nvmlDeviceGetUtilizationRates(device, &utilization) == NVML_SUCCESS)
        {
            printf("%d\n", utilization.gpu);    /* percent */
            printf("%d\n", utilization.memory); /* percent */
        }

        if (nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clo) == NVML_SUCCESS)
            printf("%d\n", clo);                /* MHz */

        usleep(500000); /* sample every 500 ms */
    }
    /* Not reached: the loop above never exits. */
    pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
pthread_t threads[NUM_THREADS];
int rc;
long t;
for(t=0; t<NUM_THREADS; t++){
printf("In main: creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
/* Last thing that main() should do */
pthread_exit(NULL);
}
With your profiler running, the GPU(s) are being pulled out of their sleep state (due to the access to the NVML API, which is querying data from the GPUs). This makes them respond much more quickly to a CUDA application, and so the application appears to run "faster" if you time the entire application execution (e.g. using the Linux time command).
One solution is to place the GPUs in "persistence mode" with the nvidia-smi command (use nvidia-smi --help to get command-line help).
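As a sketch, persistence mode can usually be enabled like this (root privileges are required; the daemon is reset on reboot):

```shell
# Enable persistence mode on all GPUs.
sudo nvidia-smi -pm 1
# Or target a single GPU by index:
sudo nvidia-smi -i 0 -pm 1
```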
Another solution would be to do the timing from within the application, and exclude the CUDA start-up time from the timing measurement, perhaps by executing a CUDA command such as cudaFree(0); prior to the start of timing.