本文介绍了为什么我的程序慢慢循环完全8192元素?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是从有问题的程序的提取。矩阵 img [] [] 的大小为SIZE×SIZE,初始化为:



img [j] [i] = 2 * j + i



res [] [] ,这里的每个字段都是img矩阵中的9个字段的平均值。

  for(i = 1; i  res [j] [i] = 0; 
for(k = -1; k for(l = -1; l res [j] [i] + = img [j + l] [i + k];
res [j] [i] / = 9;
}

这是程序的全部。为了完整起见,这里是以前的。没有代码。正如你所看到的,它只是初始化。

  #define SIZE 8192 
float img [SIZE] [SIZE] // input image
float res [SIZE] [SIZE]; // mean of mean filter
int i,j,k,l; (i = 0; i for(j = 0; j img [j] [i] =(2 * j + i )%8196;

基本上,当SIZE是2048的倍数时,执行时间:

  SIZE = 8191:3.44 secs 
SIZE = 8192:7.20 secs
SIZE = 8193:3.18 secs

编译器是GCC。
从我知道,这是因为内存管理,但我不太了解这个主题,这就是为什么我在这里问。



我已经知道malloc / free了,但是如果有人能解释这些执行时间,我已经足够快乐。 ,但是问题不是使用的内存量,它只是执行时间,所以我不知道这将如何帮助。

解决方案

p>这是因为以下相关问题导致了同样的超级对齐问题:








但是这只是因为代码还有一个问题。



从原始循环开始:



<$ p (j = 1; j res [j])中的(i = 1; i [i] = 0;
for(k = -1; k for(l = -1; l res [j] [i] + = img [j + l] [i + k];
res [j] [i] / = 9;
}

首先注意两个内部循环是微不足道的。它们可以展开如下:

  for(i = 1; i  for(j = 1; j  res [j] [i] = 0; 
res [j] [i] + = img [j-1] [i-1];
res [j] [i] + = img [j] [i-1];
res [j] [i] + = img [j + 1] [i-1];
res [j] [i] + = img [j-1] [i];
res [j] [i] + = img [j] [i];
res [j] [i] + = img [j + 1] [i];
res [j] [i] + = img [j-1] [i + 1];
res [j] [i] + = img [j] [i + 1];
res [j] [i] + = img [j + 1] [i + 1];
res [j] [i] / = 9;
}
}

这样就留下了两个外循环,



现在我们可以看到这个问题是一样的:



您是逐列迭代矩阵而不是逐行迭代。






要解决这个问题,您应该互换两个循环。

  for(j = 1; j  for(i = 1; i  res [j] [i] = 0; 
res [j] [i] + = img [j-1] [i-1];
res [j] [i] + = img [j] [i-1];
res [j] [i] + = img [j + 1] [i-1];
res [j] [i] + = img [j-1] [i];
res [j] [i] + = img [j] [i];
res [j] [i] + = img [j + 1] [i];
res [j] [i] + = img [j-1] [i + 1]
res [j] [i] + = img [j] [i + 1];
res [j] [i] + = img [j + 1] [i + 1]
res [j] [i] / = 9;
}
}

这完全消除了所有非顺序访问






Core i7 920 @ 3.5 GHz



原始代码:

  8191:1.499秒
8192:2.122秒
8193:1.582秒

互换外环:

  8191:0.376秒
8192:0.357秒
8193:0.351秒


Here is the extract from the program in question. The matrix img[][] has the size SIZE×SIZE, and is initialized at:

img[j][i] = 2 * j + i

Then, you make a matrix res[][], and each field in here is made to be the average of the 9 fields around it in the img matrix. The border is left at 0 for simplicity.

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

That's all there's to the program. For completeness' sake, here is what comes before. No code comes after. As you can see, it's just initialization.

#define SIZE 8192
float img[SIZE][SIZE]; // input image
float res[SIZE][SIZE]; //result of mean filter
int i,j,k,l;
for(i=0;i<SIZE;i++) 
    for(j=0;j<SIZE;j++) 
        img[j][i] = (2*j+i)%8196;

Basically, this program is slow when SIZE is a multiple of 2048, e.g. the execution times:

SIZE = 8191: 3.44 secs
SIZE = 8192: 7.20 secs
SIZE = 8193: 3.18 secs

The compiler is GCC.From what I know, this is because of memory management, but I don't really know too much about that subject, which is why I'm asking here.

Also how to fix this would be nice, but if someone could explain these execution times I'd already be happy enough.

I already know of malloc/free, but the problem is not amount of memory used, it's merely execution time, so I don't know how that would help.

解决方案

The difference is caused by the same super-alignment issue from the following related questions:

But that's only because there's one other problem with the code.

Starting from the original loop:

for(i=1;i<SIZE-1;i++) 
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        for(k=-1;k<2;k++) 
            for(l=-1;l<2;l++) 
                res[j][i] += img[j+l][i+k];
        res[j][i] /= 9;
}

First notice that the two inner loops are trivial. They can be unrolled as follows:

for(i=1;i<SIZE-1;i++) {
    for(j=1;j<SIZE-1;j++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

So that leaves the two outer-loops that we're interested in.

Now we can see the problem is the same in this question: Why does the order of the loops affect performance when iterating over a 2D array?

You are iterating the matrix column-wise instead of row-wise.


To solve this problem, you should interchange the two loops.

for(j=1;j<SIZE-1;j++) {
    for(i=1;i<SIZE-1;i++) {
        res[j][i]=0;
        res[j][i] += img[j-1][i-1];
        res[j][i] += img[j  ][i-1];
        res[j][i] += img[j+1][i-1];
        res[j][i] += img[j-1][i  ];
        res[j][i] += img[j  ][i  ];
        res[j][i] += img[j+1][i  ];
        res[j][i] += img[j-1][i+1];
        res[j][i] += img[j  ][i+1];
        res[j][i] += img[j+1][i+1];
        res[j][i] /= 9;
    }
}

This eliminates all the non-sequential access completely so you no longer get random slow-downs on large powers-of-two.


Core i7 920 @ 3.5 GHz

Original code:

8191: 1.499 seconds
8192: 2.122 seconds
8193: 1.582 seconds

Interchanged Outer-Loops:

8191: 0.376 seconds
8192: 0.357 seconds
8193: 0.351 seconds

这篇关于为什么我的程序慢慢循环完全8192元素?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-31 22:09