本文介绍了辛普森的方法将实值函数与CUDA集成的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想在CUDA中通过Simpson的方法进行代码集成。

I'm trying to code integration by Simpson's method in CUDA.

这是辛普森规则的公式

其中 x_k = a + k * h

这是我的代码

    __device__ void initThreadBounds(int *n_start, int *n_end, int n, 
                                        int totalBlocks, int blockWidth)
    {
        int threadId = blockWidth * blockIdx.x + threadIdx.x;
        int nextThreadId = threadId + 1;

        int threads = blockWidth * totalBlocks;

        *n_start = (threadId * n)/ threads;
        *n_end =  (nextThreadId * n)/ threads;
    }

    __device__ float reg_func (float x)
    {
        return x;
    }

    typedef float (*p_func) (float);

    __device__ p_func integrale_f = reg_func;

    __device__ void integralSimpsonMethod(int totalBlocks, int totalThreads, 
                    double a, double b, int n, float p_function(float), float* result)
    {
        *result = 0;

        float h = (b - a)/n; 
        //*result = p_function(a)+p_function(a + h * n);
        //parallel
        int idx_start;
        int idx_end;
        initThreadBounds(&idx_start, &idx_end, n-1, totalBlocks, totalThreads);
        //parallel_ends
        for (int i = idx_start; i < idx_end; i+=2) {
            *result +=  ( p_function(a + h*(i-1)) + 
                          4 * p_function(a + h*(i)) + 
                          p_function(a + h*(i+1)) ) * h/3;

        }   
    } 


    __global__ void integralSimpson(int totalBlocks, int totalThreads,  float* result)
    {
        float res = 0;

        integralSimpsonMethod(totalBlocks, totalThreads, 0, 10, 1000, integrale_f, &res);
        result[(blockIdx.x*totalThreads + threadIdx.x)] = res;

        //printf ("Simpson method\n");
    }


    __host__ void inttest()
    {

        const int blocksNum = 32;
        const int threadNum = 32;

        float   *device_resultf; 
        float   host_resultf[threadNum*blocksNum]={0};


        cudaMalloc((void**) &device_resultf, sizeof(float)*threadNum*blocksNum);
            integralSimpson<<<blocksNum, threadNum>>>(blocksNum, threadNum, device_resultf);
        cudaThreadSynchronize();

        cudaMemcpy(host_resultf, device_resultf, sizeof(float) *threadNum*blocksNum, 
                      cudaMemcpyDeviceToHost);

        float sum = 0;
        for (int i = 0; i != blocksNum*threadNum; ++i) {
            sum += host_resultf[i];
            //  printf ("result in %i cell = %f \n", i, host_resultf[i]);
        }
        printf ("sum = %f \n", sum);
        cudaFree(device_resultf);
    }

int main(int argc, char* argv[])
{


   inttest();


    int i;
    scanf ("%d",&i);

}

问题是:<$ c $对于从 0 10 的积分,结果是〜99 ,但是当 n = 100000 或更大时,它工作正常,结果是〜50

The problem is: it works wrong when n is lower than 100000. For an integral from 0 to 10, the result is ~99, but when n = 100000 or larger it works fine and the result is ~50.

有什么问题?

推荐答案

您的 integralSimpsonMethod()函数设计为每个线程都是在整数域中每个子间隔采样至少3个正交点。因此,如果选择n使其小于内核调用中的线程数的四倍,则不可避免的是每个子间隔将重叠,并且所得到的积分将是不正确的。你需要确保代码检查和缩放线程计数或n,以便它们在计算积分时不会产生重叠。

Your integralSimpsonMethod() function is designed such that each thread is sampling at least 3 quadrature points per sub-interval in the integral domain. Therefore, if you choose n so that it is less than four times the number of threads in the kernel call, it is inevitable that each sub interval will overlap and the resulting integral will be incorrect. You need to make sure that the code checks and scales the thread count or n so that they don't produce overlap when the integral is computed.

如果你这样做对于除自我化之外的任何内容,我建议您查找Simpson规则的复合版本。这更好地适合并行实现,如果正确实现,性能会更好。

If you are doing this for anything other than self-edification, then I recommend you look up the composite version of Simpson's rule. This is much better suited to parallel implementation and will be considerably more performant if implemented correctly.

这篇关于辛普森的方法将实值函数与CUDA集成的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-27 19:23