Performance difference between numpy and matlab

Question

I am computing the backpropagation algorithm for a sparse autoencoder. I have implemented it in python using numpy and in matlab. The code is almost the same, but the performance is very different. Matlab takes 0.252454 seconds to complete the task, while numpy takes 0.973672151566 seconds, which is almost four times longer. I will call this code several times later in a minimization problem, so this difference leads to several minutes of delay between the implementations. Is this normal behaviour? How could I improve the performance in numpy?

Numpy implementation:

sparse.rho is a tuning parameter, sparse.nodes is the number of nodes in the hidden layer (25), sparse.input (64) is the number of nodes in the input layer, theta1 and theta2 are the weight matrices for the first and second layer respectively, with dimensions 25x64 and 64x25, m is equal to 10000, rhoest has a dimension of (25,), x has a dimension of 10000x64, a3 10000x64 and a2 10000x25.

UPDATE: I have introduced changes in the code following some of the ideas of the responses. The performance is now numpy: 0.65 vs matlab: 0.25.

import time
import numpy as np

# sum2 does not depend on i; it is assumed to be computed once before this point
partial_j1 = np.zeros(sparse.theta1.shape)
partial_j2 = np.zeros(sparse.theta2.shape)
partial_b1 = np.zeros(sparse.b1.shape)
partial_b2 = np.zeros(sparse.b2.shape)
t = time.time()

# output-layer error for every sample at once, one column per sample
delta3t = (-(x-a3)*a3*(1-a3)).T

for i in range(m):

    delta3 = delta3t[:,i:(i+1)]
    sum1 =  np.dot(sparse.theta2.T,delta3)
    delta2 = ( sum1 + sum2 ) * a2[i:(i+1),:].T* (1 - a2[i:(i+1),:].T)
    partial_j1 += np.dot(delta2, a1[i:(i+1),:])
    partial_j2 += np.dot(delta3, a2[i:(i+1),:])
    partial_b1 += delta2
    partial_b2 += delta3

print("Backprop time:", time.time() - t)

Matlab implementation:

tic
for i = 1:m

    delta3 = -(data(i,:)-a3(i,:)).*a3(i,:).*(1 - a3(i,:));
    delta3 = delta3.';
    sum1 =  W2.'*delta3;
    sum2 = beta*(-sparsityParam./rhoest + (1 - sparsityParam) ./ (1.0 - rhoest) );
    delta2 = ( sum1 + sum2 ) .* a2(i,:).' .* (1 - a2(i,:).');
    W1grad = W1grad + delta2* a1(i,:);
    W2grad = W2grad + delta3* a2(i,:);
    b1grad = b1grad + delta2;
    b2grad = b2grad + delta3;
end
toc

Recommended answer

It would be wrong to say "Matlab is always faster than NumPy" or vice versa. Often their performance is comparable. When using NumPy, to get good performance you have to keep in mind that NumPy's speed comes from calling underlying functions written in C/C++/Fortran. It performs well when you apply those functions to whole arrays. In general, you get poorer performance when you call those NumPy functions on smaller arrays or scalars in a Python loop.

What's wrong with a Python loop, you ask? Every iteration through the Python loop is a call to a next method. Every use of [] indexing is a call to a __getitem__ method. Every += is a call to __iadd__. Every dotted attribute lookup (such as in np.dot) involves function calls. Those function calls add up to a significant hindrance to speed. These hooks give Python expressive power: indexing for strings means something different than indexing for dicts, for example. Same syntax, different meanings. The magic is accomplished by giving the objects different __getitem__ methods.
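
To make those hidden calls visible, here is a small sketch. The class Traced is hypothetical, written only for this illustration; it simply records each time Python dispatches to one of its special methods.

```python
class Traced:
    """Hypothetical container that logs which special methods Python calls."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = []

    def __iter__(self):
        self.calls.append("__iter__")
        return iter(self.values)

    def __getitem__(self, index):
        self.calls.append("__getitem__")
        return self.values[index]

    def __iadd__(self, other):
        self.calls.append("__iadd__")
        self.values = [v + other for v in self.values]
        return self


t = Traced([1, 2, 3])
first = t[0]        # dispatches to Traced.__getitem__
t += 10             # dispatches to Traced.__iadd__
for value in t:     # dispatches to Traced.__iter__, then next() on the iterator
    pass

print(t.calls)
```

Each of these dispatches is cheap on its own; the cost only becomes noticeable when it is repeated thousands of times inside a loop.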

But that expressive power comes at a cost in speed. So when you don't need all that dynamic expressivity, try to limit yourself to NumPy function calls on whole arrays to get better performance.

So, remove the for-loop; use "vectorized" equations when possible. For example, instead of

for i in range(m):
    delta3 = -(x[i,:]-a3[i,:])*a3[i,:]* (1 - a3[i,:])

you can compute delta3 for each i all at once:

delta3 = -(x-a3)*a3*(1-a3)

In the for-loop, delta3 is a vector (one row per iteration); with the vectorized equation, delta3 is a matrix that holds all of those rows at once.
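
As a quick check of that claim, the sketch below compares the two forms on small random arrays. The sizes and the seed are arbitrary stand-ins chosen for illustration, not the sizes from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4                      # small stand-ins for m = 10000, n = 64
x = rng.random((m, n))
a3 = rng.random((m, n))

# Loop version: each delta3 is a vector of shape (n,)
rows = []
for i in range(m):
    rows.append(-(x[i, :] - a3[i, :]) * a3[i, :] * (1 - a3[i, :]))
looped = np.array(rows)

# Vectorized version: one matrix of shape (m, n)
delta3 = -(x - a3) * a3 * (1 - a3)

print(rows[0].shape, delta3.shape)
print(np.allclose(looped, delta3))
```

Row i of the vectorized result is the vector that iteration i of the loop would have produced.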

Some of the computations in the for-loop do not depend on i and therefore should be lifted outside the loop. For example, sum2 looks like a constant:

sum2 = sparse.beta*(-float(sparse.rho)/rhoest + float(1.0 - sparse.rho) / (1.0 - rhoest) )
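
A minimal sketch of that hoisting follows. The values of beta and rho are made up for illustration, and rhoest is random. It also shows why sum2 may need reshaping to a column before it is added to a (25, 1) array.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 25                                  # hidden nodes, as in the question
beta, rho = 3.0, 0.01                   # assumed values, for illustration only
rhoest = rng.uniform(0.05, 0.95, p)     # shape (25,), kept away from 0 and 1

# Computed once, before any loop or matrix product
sum2 = beta * (-rho / rhoest + (1.0 - rho) / (1.0 - rhoest))
sum2 = sum2[:, np.newaxis]              # shape (25, 1)

sum1 = rng.random((p, 1))               # stand-in for np.dot(theta2.T, delta3)
print((sum1 + sum2).shape)              # column plus column
print((sum1 + sum2[:, 0]).shape)        # column plus flat vector broadcasts to a square
```

Without the reshape, adding a (25,) array to a (25, 1) array broadcasts to (25, 25), which is probably not what is wanted here.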

Here is a runnable example with an alternative implementation (alt) of your code (orig).

My timeit benchmark shows a 6.8x improvement in speed:

In [52]: %timeit orig()
1 loops, best of 3: 495 ms per loop

In [53]: %timeit alt()
10 loops, best of 3: 72.6 ms per loop

import numpy as np


class Bunch(object):
    """ http://code.activestate.com/recipes/52308 """
    def __init__(self, **kwds):
        self.__dict__.update(kwds)

m, n, p = 10 ** 4, 64, 25

sparse = Bunch(
    theta1=np.random.random((p, n)),
    theta2=np.random.random((n, p)),
    b1=np.random.random((p, 1)),
    b2=np.random.random((n, 1)),
)

x = np.random.random((m, n))
a3 = np.random.random((m, n))
a2 = np.random.random((m, p))
a1 = np.random.random((m, n))
sum2 = np.random.random((p, ))
sum2 = sum2[:, np.newaxis]

def orig():
    partial_j1 = np.zeros(sparse.theta1.shape)
    partial_j2 = np.zeros(sparse.theta2.shape)
    partial_b1 = np.zeros(sparse.b1.shape)
    partial_b2 = np.zeros(sparse.b2.shape)
    delta3t = (-(x - a3) * a3 * (1 - a3)).T
    for i in range(m):
        delta3 = delta3t[:, i:(i + 1)]
        sum1 = np.dot(sparse.theta2.T, delta3)
        delta2 = (sum1 + sum2) * a2[i:(i + 1), :].T * (1 - a2[i:(i + 1), :].T)
        partial_j1 += np.dot(delta2, a1[i:(i + 1), :])
        partial_j2 += np.dot(delta3, a2[i:(i + 1), :])
        partial_b1 += delta2
        partial_b2 += delta3
        # delta3: (64, 1)
        # sum1: (25, 1)
        # delta2: (25, 1)
        # a1[i:(i+1),:]: (1, 64)
        # partial_j1: (25, 64)
        # partial_j2: (64, 25)
        # partial_b1: (25, 1)
        # partial_b2: (64, 1)
        # a2[i:(i+1),:]: (1, 25)
    return partial_j1, partial_j2, partial_b1, partial_b2


def alt():
    delta3 = (-(x - a3) * a3 * (1 - a3)).T
    sum1 = np.dot(sparse.theta2.T, delta3)
    delta2 = (sum1 + sum2) * a2.T * (1 - a2.T)
    # delta3: (64, 10000)
    # sum1: (25, 10000)
    # delta2: (25, 10000)
    # a1: (10000, 64)
    # a2: (10000, 25)
    partial_j1 = np.dot(delta2, a1)
    partial_j2 = np.dot(delta3, a2)
    partial_b1 = delta2.sum(axis=1)
    partial_b2 = delta3.sum(axis=1)
    return partial_j1, partial_j2, partial_b1, partial_b2

answer = orig()
result = alt()
for a, r in zip(answer, result):
    try:
        assert np.allclose(np.squeeze(a), r)
    except AssertionError:
        print(a.shape)
        print(r.shape)
        raise

Tip: Notice that I left in the comments the shape of all the intermediate arrays. Knowing the shape of the arrays helped me understand what your code was doing. The shape of the arrays can help guide you toward the right NumPy functions to use. Or at least, paying attention to the shapes can help you know if an operation is sensible. For example, when you compute

np.dot(A, B)

and A.shape = (n, m) and B.shape = (m, p), then np.dot(A, B) will be an array of shape (n, p).
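
The shape rule can be checked directly. This is a small sketch with arbitrary sizes; the arrays are filled with ones only to keep the result easy to read.

```python
import numpy as np

n, m, p = 3, 4, 2
A = np.ones((n, m))
B = np.ones((m, p))

C = np.dot(A, B)
print(C.shape)          # inner dimension m is summed away

# Mismatched inner dimensions raise an error instead of silently broadcasting
try:
    np.dot(B, A)        # (4, 2) dot (3, 4): 2 != 3
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```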

It can help to build the arrays in C_CONTIGUOUS-order (at least, if using np.dot). There might be as much as a 3x speed up by doing so:

Below, x is the same as xf except that x is C_CONTIGUOUS and xf is F_CONTIGUOUS, and the same relationship holds for y and yf.

import numpy as np

m, n, p = 10 ** 4, 64, 25
x = np.random.random((n, m))
xf = np.asarray(x, order='F')

y = np.random.random((m, n))
yf = np.asarray(y, order='F')

assert np.allclose(x, xf)
assert np.allclose(y, yf)
assert np.allclose(np.dot(x, y), np.dot(xf, y))
assert np.allclose(np.dot(x, y), np.dot(xf, yf))

%timeit benchmarks show the difference in speed:

In [50]: %timeit np.dot(x, y)
100 loops, best of 3: 12.9 ms per loop

In [51]: %timeit np.dot(xf, y)
10 loops, best of 3: 27.7 ms per loop

In [56]: %timeit np.dot(x, yf)
10 loops, best of 3: 21.8 ms per loop

In [53]: %timeit np.dot(xf, yf)
10 loops, best of 3: 33.3 ms per loop
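
If you are unsure which layout an array has, its flags attribute will tell you, and np.ascontiguousarray returns a C-ordered array. The sketch below uses a smaller array than the benchmark above. How much the layout matters in practice is likely to depend on the NumPy version and the BLAS library it is linked against, so it is worth timing on your own machine.

```python
import numpy as np

x = np.random.random((64, 1000))
xf = np.asarray(x, order='F')        # same values, Fortran (column-major) layout

print(x.flags['C_CONTIGUOUS'], x.flags['F_CONTIGUOUS'])
print(xf.flags['C_CONTIGUOUS'], xf.flags['F_CONTIGUOUS'])

xc = np.ascontiguousarray(xf)        # copy back into C (row-major) layout
print(xc.flags['C_CONTIGUOUS'])

# Transposing a C-ordered array yields an F-ordered view, with no copy made
print(x.T.flags['F_CONTIGUOUS'])
```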

Regarding benchmarking in Python:

It can be misleading to use the difference in pairs of time.time() calls to benchmark the speed of code in Python. You need to repeat the measurement many times. It's better to disable the automatic garbage collector. It is also important to measure large spans of time (such as at least 10 seconds worth of repetitions) to avoid errors due to poor resolution in the clock timer and to reduce the significance of time.time call overhead. Instead of writing all that code yourself, Python provides you with the timeit module. I'm essentially using that to time the pieces of code, except that I'm calling it through an IPython terminal for convenience.
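
Outside IPython, roughly the same measurement can be made with the timeit module directly. This is a minimal sketch; the function name work, the array sizes, and the repeat and number values are arbitrary choices for illustration.

```python
import timeit
import numpy as np

x = np.random.random((64, 1000))
y = np.random.random((1000, 64))

def work():
    # The code under test; replace with the function you want to time
    return np.dot(x, y)

# 5 independent runs of 20 calls each; timeit disables garbage collection
# while timing by default
number = 20
times = timeit.repeat(work, repeat=5, number=number)
best = min(times) / number
print("best of 5: %.3f ms per call" % (best * 1e3))
```

Taking the minimum over the repeats is the usual convention, since the slower runs mostly reflect interference from other processes rather than the code itself.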

I'm not sure if this is affecting your benchmarks, but be aware it could make a difference. In the question I linked to, according to time.time two pieces of code differed by a factor of 1.7x while benchmarks using timeit showed the pieces of code ran in essentially identical amounts of time.
