Problem description
Sorry for my English. I have a CUDA kernel that returns different result values from time to time. The kernel computes the sum of a series and consists of four parts. The first part distributes the iterations among the threads (adapted from an external source). The second part shows how every thread computes its partial sum. After the second part we must place __syncthreads(), because after the second part we start using shared memory. In the third part I sum the results of all threads in the block and store the total in the thread whose threadIdx.x equals 0 (adapted from page 22 of an external source). In the fourth part I sum the results of all thread blocks and store the total in dSum[0].
Did I place __syncthreads() correctly? Where is the error? Why does the kernel give a wrong result with 64 blocks of 768 threads, but a correct result with 768 blocks of 64 threads?
__global__ void sumSeries(double* dSum, int totalThreadNumber) {
    volatile __shared__ double data[768];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int myend;
    double var;

    // part_1: get tid's start iteration value and end iteration value.
    int mystart = (INT_MAX / totalThreadNumber) * tid;
    if (INT_MAX % totalThreadNumber > tid)
    {
        mystart += tid;
        myend = mystart + (INT_MAX / totalThreadNumber) + 1;
    }
    else
    {
        mystart += INT_MAX % totalThreadNumber;
        myend = mystart + (INT_MAX / totalThreadNumber);
    }

    // part_2: get halfsum
    data[threadIdx.x] = 0;
    for (int i = mystart; i < myend; ++i) {
        var = i;
        data[threadIdx.x] += (var*var + var + 1) / (var*var*var + var*var + var + 1);
    }
    __syncthreads();

    // part_3: sum all results in every block
    for (int s = blockDim.x / 2; s > 32; s >>= 1)
    {
        if (threadIdx.x < s)
            data[threadIdx.x] += data[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x < 32)
    {
        data[threadIdx.x] += data[threadIdx.x + 32];
        data[threadIdx.x] += data[threadIdx.x + 16];
        data[threadIdx.x] += data[threadIdx.x + 8];
        data[threadIdx.x] += data[threadIdx.x + 4];
        data[threadIdx.x] += data[threadIdx.x + 2];
        data[threadIdx.x] += data[threadIdx.x + 1];
    }
    if (threadIdx.x == 0)
    {
        dSum[blockIdx.x] = data[0];
    }
    __syncthreads();

    // part_4
    if (tid == 0)
        for (int t = 1; t < 8; ++t)
            dSum[0] = dSum[0] + dSum[t];
}
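The block reduction in part 3 can be traced on the host to see what it does for a given block size. The following is only a sketch under stated assumptions: the function name `emulateBlockReduction` is hypothetical, every thread's partial sum from part 2 is set to 1.0 (so a correct reduction returns exactly the block size), and the 32 threads of the unrolled stage are assumed to run in lockstep, which is what the `volatile` idiom in the kernel relies on.

```cpp
#include <cassert>
#include <vector>

// Host-side trace of part 3 of the kernel for ONE block.
// Assumption: each thread's partial sum is 1.0, so the correct total is blockDim.
double emulateBlockReduction(int blockDim) {
    assert(blockDim >= 64 && blockDim <= 768);  // same capacity as data[768]
    std::vector<double> data(768, 0.0);
    for (int t = 0; t < blockDim; ++t) data[t] = 1.0;

    // Tree phase with the kernel's loop bounds: s = blockDim/2, halved while s > 32.
    for (int s = blockDim / 2; s > 32; s >>= 1) {
        for (int t = 0; t < s; ++t) data[t] += data[t + s];
    }

    // Unrolled phase for threads 0..31 with offsets 32, 16, 8, 4, 2, 1.
    // Visiting t in ascending order reads data[t + s] before it is updated in
    // the same step, which matches lockstep execution of one warp.
    for (int s = 32; s >= 1; s >>= 1) {
        for (int t = 0; t < 32; ++t) data[t] += data[t + s];
    }
    return data[0];
}
```

Under these assumptions, `emulateBlockReduction(64)` returns 64, while `emulateBlockReduction(768)` returns 896 instead of 768. For 768 threads the tree phase ends at s = 48, so the unrolled step with offset 32 re-adds elements 48 to 63, which were already folded in. Block sizes that are powers of two do not hit this case.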
Recommended answer
So your sum is the series
(n^2+n+1)/(n^3+n^2+n+1) = (n^3-1)/(n^4-1)
which has the same convergence behavior as the harmonic series over
1/n
namely none: it diverges, very very slowly, towards infinity. The harmonic sum from 1 to N has a value between log(N+1) and 1+log(N).
The result of any finite summation of these series is very sensitive to the order of summation. Summing forward from 1 to N, with decreasing terms, suppresses all terms where 1==1+1/n in floating point, which happens at a rather small n for floats. Summing backwards from some N to 1 accumulates the small terms first and preserves their cumulative contribution.
So, depending on the order in which the partial sums arrive, and especially on when the partial sum containing 1 comes in, the total sum will show noticeable differences.
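The effect of summation order can be made visible with a small host-side experiment. This is a sketch, not part of the original kernel: it uses the plain harmonic terms 1/n in single precision (an assumption chosen so the effect shows up after a few million terms), and the function names `sumForward`, `sumBackward` and `sumReference` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Sum 1/n for n = 1..N in single precision, largest terms first.
float sumForward(int N) {
    assert(N >= 1);
    float s = 0.0f;
    for (int n = 1; n <= N; ++n) s += 1.0f / static_cast<float>(n);
    return s;
}

// Same terms in single precision, smallest terms first.
float sumBackward(int N) {
    assert(N >= 1);
    float s = 0.0f;
    for (int n = N; n >= 1; --n) s += 1.0f / static_cast<float>(n);
    return s;
}

// Double-precision reference, smallest terms first.
double sumReference(int N) {
    assert(N >= 1);
    double s = 0.0;
    for (int n = N; n >= 1; --n) s += 1.0 / static_cast<double>(n);
    return s;
}
```

For N = 10000000 the forward float sum stops growing once the terms fall below the rounding granularity of the running total, so it ends up below the backward sum, and the backward sum lies closer to the double-precision reference. The same mechanism applies to doubles, only at much larger n.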
Both terms in the following decomposition are monotonically decreasing for x >= 1:
f(x) = (x^2+x+1)/(x^3+x^2+x+1) = 0.5/(x+1)+0.5*(x+1)/(x^2+1)
An antiderivative of this function is
F(x) = 0.5*ln(x+1)+0.25*ln(x^2+1)+0.5*arctan(x)
so that
f(n+1) <= F(n+1)-F(n) <= f(n) <= F(n)-F(n-1)
Summing this up gives
F(N+1)-F(m) <= sum(n=m to N) f(n) <= F(N)-F(m-1)
To all three of these terms one has to add the initial part of the sum, that is, the terms with n < m.
So set m=1000, compute
S = sum(n=0 to 999) f(n), then
S+F(2^32 )-F(1000) = 23.459829390459243
S+F(2^32-1)-F( 999) = 23.460829890558995
are the lower and upper bounds for the summation from 0 to 2^32-1, far away from any of the numerical results.
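The bounds above can be recomputed on the host as a cross-check. This is a minimal sketch: `f` and `F` follow the formulas given in the answer, while the `Bounds` struct and the function `seriesBounds` are hypothetical names introduced only for illustration. The upper limit N is passed as a double because 2^32-1 does not fit into a 32-bit int.

```cpp
#include <cassert>
#include <cmath>

// Series term f(x) = (x^2+x+1)/(x^3+x^2+x+1).
double f(double x) {
    return (x * x + x + 1.0) / (x * x * x + x * x + x + 1.0);
}

// Antiderivative F(x) = 0.5*ln(x+1) + 0.25*ln(x^2+1) + 0.5*arctan(x).
double F(double x) {
    return 0.5 * std::log(x + 1.0) + 0.25 * std::log(x * x + 1.0) + 0.5 * std::atan(x);
}

struct Bounds { double lower, upper; };

// Bounds for sum(n = 0 to N) f(n): terms 0..m-1 are summed exactly,
// the remaining terms m..N are enclosed by integrals of f.
Bounds seriesBounds(int m, double N) {
    assert(m >= 1);
    double S = 0.0;
    for (int n = m - 1; n >= 0; --n) S += f(static_cast<double>(n));  // small terms first
    return { S + F(N + 1.0) - F(static_cast<double>(m)),
             S + F(N) - F(static_cast<double>(m) - 1.0) };
}
```

Calling `seriesBounds(1000, 4294967295.0)` should reproduce the two values quoted above up to rounding. The gap between them is about 0.001, so a larger m tightens the enclosure if more digits are needed.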