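I want to measure the cache behavior of GPU global memory, and below is the microbenchmark I designed. The idea is to load from the global memory address r_addr0 and store the value into shared memory at s_tvalue[0]. For certain reasons, I need to replace the global memory load with inline PTX code.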

i = *r_addr0;
//asm("ldu.global.f64.cs %1, [%2];":"=l"(i):"l"(r_addr0));
s_tvalue[0] = i;


However, when I compile it with nvcc, it fails with the following error:

error: Internal Compiler Error (codegen): "asm operand index requested is larger than the number of asm operands provided!"


Does anyone know what is wrong with my code?

The complete code is as follows:

__global__ void global_latency (long long * my_array,
                                long long array_length, int position,
                                long long *d_time)
{

    unsigned int start_time, end_time;

    __shared__ long long s_tvalue[2];//2: number of threads per block

    int k;
    long long i, j;
    for(k=0; k<2; k++)
        s_tvalue[k] = 0L;
    long long addr0,addr1;

    addr0=(long long)my_array;

    addr1 = ( addr0 ^ (1 << position));

    long long *r_addr0, *r_addr1;
    r_addr0 = (long long *)addr0;
    r_addr1 = (long long *)addr1;

    start_time = clock();
    //i = *r_addr0;
    asm("ldu.global.f64.cs %1, [%2];":"=l"(i):"l"(r_addr0));

    s_tvalue[0] = i;
    //j = *r_addr1;
    asm("ld.global.f64.cs %3, [%4];" : "=l"(j):"l"(r_addr1));
    s_tvalue[1] = j;


    end_time = clock();

    d_time[0] = end_time-start_time;
    d_time[1] = s_tvalue[0];
    printf("[%p]=%lld\n",addr0,d_time[1]);
    d_time[2] = s_tvalue[1];
    printf("[%p]=%lld\n",addr1,d_time[2]);
}

Best answer

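In my experience, the operand tokens are numbered from zero. Since you only have 2 operands, they are %0 and %1. You are using %2, which is "larger than the number of asm operands provided". The numbering also restarts in every asm statement, so the second statement, which uses %3 and %4, needs %0 and %1 as well.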
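The zero-based numbering can be illustrated with a minimal sketch. The helper name load_streaming, the kernel load_two, and the host driver are hypothetical and not part of the original post. The sketch also assumes, beyond what the answer states, that the cache operator is written before the type (ld.global.cs.u64) and that an integer type is used to match the "l" (64-bit integer) constraint; treat it as one plausible correction rather than the definitive fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): load one 64-bit value
// through inline PTX. Operands are numbered from zero within each asm
// statement: %0 is the output register, %1 is the input address.
// Assumption: ".cs" precedes the type, and ".u64" matches the "l" constraint.
__device__ long long load_streaming(const long long *addr)
{
    long long value;
    asm volatile("ld.global.cs.u64 %0, [%1];"
                 : "=l"(value)   // operand %0
                 : "l"(addr)     // operand %1
                 : "memory");
    return value;
}

// Hypothetical kernel: performs two independent loads. Each call expands to
// its own asm statement, so both use %0 and %1.
__global__ void load_two(const long long *data, long long *out)
{
    out[0] = load_streaming(data);
    out[1] = load_streaming(data + 1);
}

int main()
{
    const long long host_in[2] = {11, 22};
    long long host_out[2] = {0, 0};
    long long *dev_in = nullptr;
    long long *dev_out = nullptr;

    cudaMalloc(&dev_in, sizeof(host_in));
    cudaMalloc(&dev_out, sizeof(host_out));
    cudaMemcpy(dev_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice);

    load_two<<<1, 1>>>(dev_in, dev_out);

    // Copy the loaded values back and print them for inspection.
    cudaMemcpy(host_out, dev_out, sizeof(host_out), cudaMemcpyDeviceToHost);
    printf("%lld %lld\n", host_out[0], host_out[1]);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return 0;
}
```

Applied to the kernel in the question, the same pattern would mean writing %0 and %1 in both asm statements, once for i and r_addr0 and once for j and r_addr1.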

Source: the Stack Overflow question "c++ - How to interpret CUDA inline PTX Internal Compiler Error": https://stackoverflow.com/questions/37667559/
