arrayfun can be significantly slower than an explicit loop in MATLAB. Why?

Question

Consider the following simple speed test for arrayfun:

T = 4000;
N = 500;
x = randn(T, N);
Func1 = @(a) (3*a^2 + 2*a - 1);

tic
Soln1 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln1(t, n) = Func1(x(t, n));
    end
end
toc

tic
Soln2 = arrayfun(Func1, x);
toc

On my machine (MATLAB 2011b on Linux Mint 12), the output of this test is:

Elapsed time is 1.020689 seconds.
Elapsed time is 9.248388 seconds.

What the?!? arrayfun, while admittedly a cleaner-looking solution, is an order of magnitude slower. What is going on here?

Further, I ran a similar test for cellfun and found it to be about 3 times slower than an explicit loop. Again, this result is the opposite of what I expected.

My question is: why are arrayfun and cellfun so much slower? And given this, are there any good reasons to use them (other than to make the code look good)?

Note: I'm talking about the standard version of arrayfun here, NOT the GPU version from the Parallel Computing Toolbox.

Edit: Just to be clear, I'm aware that Func1 above can be vectorized, as pointed out by Oli. I only chose it because it yields a simple speed test for the purposes of the actual question.
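Oli's vectorized version is not reproduced in this thread. A minimal sketch of what it presumably looks like, assuming it is just the element-wise form of Func1 applied to the whole matrix (the variable name SolnVec is made up for illustration):

```matlab
% Assumed form of the vectorized solution: element-wise operators
% applied to the whole matrix at once.
T = 4000;
N = 500;
x = randn(T, N);

tic
SolnVec = 3*x.^2 + 2*x - 1;   % .^ squares each element; no loop, no function handle
toc

% Sanity check against the anonymous function on a single entry
Func1 = @(a) (3*a^2 + 2*a - 1);
disp(abs(SolnVec(1, 1) - Func1(x(1, 1))) < 1e-12)
```

The only change from Func1 is `.^` in place of `^`, so the expression works on arrays of any size.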

Edit: Following the suggestion of grungetta, I re-did the test with feature accel off. The results are:

Elapsed time is 28.183422 seconds.
Elapsed time is 23.525251 seconds.

In other words, it would appear that a big part of the difference is that the JIT accelerator does a much better job of speeding up the explicit for loop than it does of speeding up arrayfun. This seems odd to me, since arrayfun actually provides more information, i.e., its use reveals that the order of the calls to Func1 does not matter. Also, I noted that whether the JIT accelerator is switched on or off, my system only ever uses one CPU...

Recommended answer

You can get the idea by running other versions of your code. Consider explicitly writing out the computations instead of using a function in your loop:

tic
Soln3 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Time to compute on my computer:

Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every entry of x is a huge overhead. Just explicitly writing out the computation gave us a factor 5 speedup. I guess this shows that MATLAB's JIT compiler does not support inlining functions. According to an answer by gnovice, it is actually better to write a normal function rather than an anonymous one. Try it.
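One way to try this is sketched below: a function file (assumed to be saved as compare_handles.m) that times the same double loop once with an anonymous function and once with a named local function. The names compare_handles and poly1 are hypothetical, and the matrix is smaller than in the question to keep the run short. No particular outcome is claimed; the ratio depends on the MATLAB release.

```matlab
function [tAnon, tNamed] = compare_handles()
% Hypothetical helper: time a double loop that calls an anonymous
% function versus one that calls a named local function.
T = 400;
N = 50;                      % smaller than the original test
x = randn(T, N);
anonF = @(a) (3*a^2 + 2*a - 1);

tic
S1 = ones(T, N);
for n = 1:N
    for t = 1:T
        S1(t, n) = anonF(x(t, n));   % call through the anonymous handle
    end
end
tAnon = toc;

tic
S2 = ones(T, N);
for n = 1:N
    for t = 1:T
        S2(t, n) = poly1(x(t, n));   % direct call to the named function
    end
end
tNamed = toc;

% Both variants must compute the same values
assert(max(abs(S1(:) - S2(:))) < 1e-12);
fprintf('anonymous: %.4f s, named: %.4f s\n', tAnon, tNamed);
end

function y = poly1(a)
% Named (local) version of Func1 from the question
y = 3*a^2 + 2*a - 1;
end
```

Call it as `[tAnon, tNamed] = compare_handles();` and compare the two timings on your own machine.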

Next step - remove (vectorize) the inner loop:

tic
Soln4 = ones(T, N);
for t = 1:T
    Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc

Soln4  0.053926 seconds.

Another factor 5 speedup: there is something to those statements saying you should avoid loops in MATLAB... Or is there really? Have a look at this, then:

tic
Soln5 = ones(T, N);
for n = 1:N
    Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc

Soln5   0.013875 seconds.

Much closer to the 'fully' vectorized version. MATLAB stores matrices column-wise (column-major order). You should always (when possible) structure your computations to be vectorized 'column-wise'.
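The column-wise layout can be seen directly through linear indexing. A small illustration (the matrix A and the variable names are chosen here only for the example):

```matlab
% A 2-by-3 matrix; in memory its elements are laid out column by column.
A = [1 2 3; 4 5 6];

% A(:) lists the elements in storage order: down each column first.
colOrder = A(:).';                    % [1 4 2 5 3 6]

% sub2ind maps (row, column) subscripts to that linear position.
idxBelow = sub2ind(size(A), 2, 1);    % 2: next element in the same column
idxRight = sub2ind(size(A), 1, 2);    % 3: next column starts after a full column

disp(colOrder)
```

So x(:, n) touches one contiguous block of memory, while x(t, :) strides across memory with a step of T elements, which is consistent with Soln5 being faster than Soln4.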

We can go back to Soln3 now. The loop order there is 'row-wise'. Let's change it:

tic
Soln6 = ones(T, N);
for n = 1:N
    for t = 1:T
        Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Soln6  0.201661 seconds.

Better, but still very bad. Single loop - good. Double loop - bad. I guess MATLAB did some decent work on improving the performance of loops, but the loop overhead is still there. If you had some heavier work inside, you would not notice it. But since this computation is memory-bandwidth bound, you do see the loop overhead. And you will see the overhead of calling Func1 there even more clearly.

So what's up with arrayfun? No function inlining there either, so a lot of overhead. But why so much worse than a double nested loop? Actually, the topic of using cellfun/arrayfun has been discussed extensively many times elsewhere. These functions are simply slow; you cannot use them for such fine-grained computations. You can use them for code brevity and fancy conversions between cells and arrays. But the function needs to be heavier than what you wrote:

tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc

Soln7  0.016786 seconds.

Note that Soln7 is a cell array now.. sometimes that is useful. Code performance is quite good now, and if you need a cell array as output, you do not have to convert a matrix afterwards, as you would after using the fully vectorized solution.
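If a plain matrix is needed after all, the cell array can be concatenated back. A short sketch, with M, Ref and maxDiff as illustrative names:

```matlab
T = 4000;
N = 500;
x = randn(T, N);

% One arrayfun call per column; each call returns a T-by-1 vector in a cell.
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);

% Soln7 is a 1-by-N cell array; horizontal concatenation of its contents
% rebuilds the T-by-N matrix.
M = [Soln7{:}];

% Compare with the element-wise expression on the whole matrix.
Ref = 3*x.^2 + 2*x - 1;
maxDiff = max(abs(M(:) - Ref(:)));
fprintf('class: %s, max difference: %g\n', class(Soln7), maxDiff);
```

Here arrayfun is called only N times, with a whole column of work per call, which is why its overhead no longer dominates.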

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since there is no source code available. You can only guess that since arrayfun is a general-purpose function, which handles all kinds of different data structures and arguments, it is not necessarily very fast in simple cases that you can express directly as loop nests. Where the overhead comes from, we cannot know. Could the overhead be avoided by a better implementation? Maybe not. But unfortunately the only thing we can do is study the performance to identify the cases in which it works well and those in which it doesn't.

Update: Since the execution time of this test is short, I have now added a loop around the tests to get reliable results:

for i=1:1000
   % compute
end
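Filled in for two of the variants, the repetition loop might look like the sketch below. The repetition count nReps and the per-run averaging are assumptions made here for illustration; the answer used 1000 repetitions and reported total times.

```matlab
T = 4000;
N = 500;
x = randn(T, N);
nReps = 20;                  % assumption: far fewer than the 1000 used above

% Column-wise single loop (Soln5), repeated
tic
for i = 1:nReps
    Soln5 = ones(T, N);
    for n = 1:N
        Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
    end
end
tLoop = toc / nReps;         % average seconds per run

% Fully vectorized expression, repeated
tic
for i = 1:nReps
    SolnVec = 3*x.^2 + 2*x - 1;
end
tVec = toc / nReps;

fprintf('column loop: %.6f s, vectorized: %.6f s per run\n', tLoop, tVec);
```

Repeating the work amortizes one-off costs such as the first-call warm-up. In newer MATLAB releases the built-in timeit function does this kind of repetition for a function handle automatically.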

Some timings are given below:

Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.

You see that arrayfun is still bad, but at least not three orders of magnitude worse than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. Results for Soln5 and Soln7 do not change if I switch to 2 cores. In Soln5 I would have to use a parfor to get it parallelized. Forget about speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Oli's vectorized version, on the other hand:

Oli  5.508085 seconds.
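For reference, a parfor variant of Soln5 could be sketched as follows. This assumes the Parallel Computing Toolbox is available to actually run the iterations on several workers, and the name Soln5p is made up here. Nothing is claimed about the resulting speed, since the computation is memory-bound and parfor adds data-transfer overhead of its own.

```matlab
T = 4000;
N = 500;
x = randn(T, N);

Soln5p = ones(T, N);
% Each iteration reads one column of x and writes one column of Soln5p,
% so both are "sliced" variables and iterations are independent.
parfor n = 1:N
    Soln5p(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end

% Check against the vectorized expression
Ref = 3*x.^2 + 2*x - 1;
maxDiff = max(abs(Soln5p(:) - Ref(:)));
disp(maxDiff < 1e-12)
```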
