This article looks at NumPy performance: uint8 vs. float arithmetic, and multiplication vs. division. The question and accepted answer below should be a useful reference for anyone running into the same problem.

Problem description


I have just noticed that the execution time of one of my scripts is nearly halved simply by changing a multiplication to a division.

To investigate this, I have written a small example:

import numpy as np
import timeit

# uint8 array
arr1 = np.random.randint(0, high=256, size=(100, 100), dtype=np.uint8)

# float32 array
arr2 = np.random.rand(100, 100).astype(np.float32)
arr2 *= 255.0


def arrmult(a):
    """
    mult, read-write iterator
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) * 0.5

def arrmult2(a):
    """
    mult, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
        b[i, j] = (b[i, j] + 5) * 0.5

def arrmult3(a):
    """
    mult, vectorized
    """
    b = a.copy()
    b = (b + 5) * 0.5

def arrdiv(a):
    """
    div, read-write iterator
    """
    b = a.copy()
    for item in np.nditer(b, op_flags=["readwrite"]):
        item[...] = (item + 5) / 2

def arrdiv2(a):
    """
    div, index iterator
    """
    b = a.copy()
    for i, j in np.ndindex(b.shape):
        b[i, j] = (b[i, j] + 5) / 2

def arrdiv3(a):
    """
    div, vectorized
    """
    b = a.copy()
    b = (b + 5) / 2




def print_time(name, t):
    print("{: <10}: {: >6.4f}s".format(name, t))

timeit_iterations = 100

print("uint8 arrays")
print_time("arrmult", timeit.timeit("arrmult(arr1)", "from __main__ import arrmult, arr1", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr1)", "from __main__ import arrmult2, arr1", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr1)", "from __main__ import arrmult3, arr1", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr1)", "from __main__ import arrdiv, arr1", number=timeit_iterations))
print_time("arrdiv2", timeit.timeit("arrdiv2(arr1)", "from __main__ import arrdiv2, arr1", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr1)", "from __main__ import arrdiv3, arr1", number=timeit_iterations))

print("\nfloat32 arrays")
print_time("arrmult", timeit.timeit("arrmult(arr2)", "from __main__ import arrmult, arr2", number=timeit_iterations))
print_time("arrmult2", timeit.timeit("arrmult2(arr2)", "from __main__ import arrmult2, arr2", number=timeit_iterations))
print_time("arrmult3", timeit.timeit("arrmult3(arr2)", "from __main__ import arrmult3, arr2", number=timeit_iterations))
print_time("arrdiv", timeit.timeit("arrdiv(arr2)", "from __main__ import arrdiv, arr2", number=timeit_iterations))
print_time("arrdiv2", timeit.timeit("arrdiv2(arr2)", "from __main__ import arrdiv2, arr2", number=timeit_iterations))
print_time("arrdiv3", timeit.timeit("arrdiv3(arr2)", "from __main__ import arrdiv3, arr2", number=timeit_iterations))

This prints the following timings:

uint8 arrays
arrmult   : 2.2004s
arrmult2  : 3.0589s
arrmult3  : 0.0014s
arrdiv    : 1.1540s
arrdiv2   : 2.0780s
arrdiv3   : 0.0027s

float32 arrays
arrmult   : 1.2708s
arrmult2  : 2.4120s
arrmult3  : 0.0009s
arrdiv    : 1.5771s
arrdiv2   : 2.3843s
arrdiv3   : 0.0009s

I always thought a multiplication is computationally cheaper than a division. However, for uint8 a division seems to be nearly twice as fast. Does this somehow relate to the fact that * 0.5 has to perform the multiplication in floating point and then cast the result back to an integer?

At least for floats, multiplication seems to be faster than division. Is this generally true?

Why is a multiplication in uint8 more expensive than in float32? I thought an 8-bit unsigned integer should be much faster to calculate with than a 32-bit float?!

Can someone "demystify" this?

EDIT: to have more data, I've included vectorized functions (as suggested) and added index iterators as well. The vectorized functions are much faster, thus not really comparable. However, if timeit_iterations is set much higher for the vectorized functions, it turns out that multiplication is faster for both uint8 and float32. I guess this confuses even more?!
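For reference, here is one way that re-timing might look, reusing the functions and arrays defined above; the iteration count is an arbitrary choice, just large enough that the vectorized work dominates the measurement noise:

import timeit

vec_iterations = 100000  # arbitrary, but much larger than before
for fname in ("arrmult3", "arrdiv3"):
    for aname in ("arr1", "arr2"):
        t = timeit.timeit("{}({})".format(fname, aname),
                          "from __main__ import {}, {}".format(fname, aname),
                          number=vec_iterations)
        print_time(fname + " " + aname, t)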

Maybe multiplication is in fact always faster than division, but the main performance leak in the for-loops is not the arithmetic operation itself, but the loop. Although this does not explain why the loops behave differently for different operations.

EDIT2: Like @jotasi already stated, we are looking for a full explanation of division vs. multiplication and int (or uint8) vs. float (or float32). Additionally, explaining the different trends of the vectorized approaches and the iterators would be interesting, as in the vectorized case, the division seems to be slower, whereas it is faster in the iterator case.

Solution

The problem is your assumption that you are measuring the time needed for the division or multiplication itself. That is not true: you are measuring the overhead needed to perform a division or multiplication.

One really has to look at the exact code to explain every effect, and the effects can vary from version to version. This answer can only give an idea of what has to be considered.

The problem is that a simple int is not simple at all in Python: it is a real object that must be registered with the garbage collector, and it grows in size with its value. You pay for all of that: for example, an 8-bit integer needs 24 bytes of memory! The same goes for Python floats.

On the other hand, a NumPy array consists of simple C-style integers/floats without that overhead. You save a lot of memory, but you pay for it when accessing an element of the array: a[i] means a Python integer object must be constructed and registered with the garbage collector, and only then can it be used - there is a lot of overhead.
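A small illustration of both points (a sketch; exact sizes and scalar types depend on the Python and NumPy versions):

import sys
import numpy as np

# A small Python int is a full object: roughly 28 bytes on 64-bit CPython 3
# (24 bytes on Python 2, the figure quoted above).
print(sys.getsizeof(5))

# Every element access on a NumPy array builds a fresh scalar object.
# (On current NumPy this is a numpy.uint8 scalar rather than a plain Python
# int, but the point stands: an object is created and tracked per access.)
a = np.zeros(10, dtype=np.uint8)
print(type(a[3]))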

Consider this code:

import numpy as np

li1 = [x % 256 for x in range(10**4)]   # a plain Python list of small ints
arr1 = np.array(li1, np.uint8)          # the same values as a uint8 array

def arrmult(a):
    # multiply every element in place, one Python-level iteration per element
    for i in range(len(a)):
        a[i] *= 5

arrmult(li1) is about 25 times faster than arrmult(arr1), because the integers in the list are already Python ints and don't have to be created! The lion's share of the calculation time goes into creating these objects - everything else can almost be neglected.
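To check that ratio on your own machine, one could wrap the snippet above in timeit, handing each call a fresh copy so the list values don't keep growing across repetitions (the exact speed-up depends on the Python and NumPy versions; ~25x is what is reported here):

import timeit

# Fresh copies per call: repeatedly multiplying the same list by 5 would turn
# its elements into ever larger arbitrary-precision ints and skew the timing.
t_list = timeit.timeit(lambda: arrmult(list(li1)), number=200)
t_arr = timeit.timeit(lambda: arrmult(arr1.copy()), number=200)
print("list loop: {:.4f}s   array loop: {:.4f}s   array/list: {:.1f}x".format(
    t_list, t_arr, t_arr / t_list))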


Let's take a look at your code, first the multiplication:

def arrmult2(a):
    ...
    b[i, j] = (b[i, j] + 5) * 0.5

In the uint8 case, the following must happen (I neglect the +5 for simplicity):

  1. a Python int must be created
  2. it must be cast to a float (creating a Python float), in order to be able to do the float multiplication
  3. and the result must be cast back to a Python int and/or uint8

For float32, there is less work to do (the multiplication itself does not cost much):

  1. a Python float is created
  2. it is cast back to float32

So the float32 version should be faster, and it is.
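A quick type check makes the casts in this scalar path visible (a sketch; the exact scalar dtypes may differ between NumPy versions):

import numpy as np

u = np.uint8(200)        # roughly what b[i, j] yields for the uint8 array
f = np.float32(200.0)    # ... and for the float32 array

print(type(u * 0.5))            # a floating-point scalar: the uint8 value was promoted
print(type(np.uint8(u * 0.5)))  # ... and has to be squeezed back into uint8 on assignment
print(type(f * 0.5))            # stays a float scalar: no int <-> float round trip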


Now let's take a look at the division:

def arrdiv2(a):
    ...
    b[i, j] = (b[i, j] + 5)  / 2

The pitfall here: all operations are integer operations. Compared to the multiplication, there is no need to cast to a Python float, so there is less overhead than in the multiplication case. That is why division is "faster" than multiplication for uint8 in your case.
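For illustration, an integer-only division on the same kind of scalar. Note that the reasoning above relies on Python 2, where / between integers is floor division; on Python 3 one would write // to get the same integer-only path:

import numpy as np

u = np.uint8(200)

# Integer division keeps everything in integer types - no detour through a float.
# The exact result dtype depends on the NumPy version, but it stays an integer.
print(type((u + 5) // 2))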

For float32, however, division and multiplication are equally fast/slow, because almost nothing changes in this case - we still need to create a Python float.


Now the vectorized versions: under the hood they work on C-style "raw" float32/uint8 values, without the conversion to the corresponding Python objects (and its cost!). To get meaningful results you should increase the number of iterations (right now the running times are too small to say anything with certainty).

  1. division and multiplication for float32 could have the same running time, because I would expect numpy to replace the division by 2 with a multiplication by 0.5 (but to be sure, one would have to look into the code).

  2. multiplication for uint8 should be slower, because every uint8 integer must be cast to a float before the multiplication by 0.5 and then cast back to uint8 afterwards (the dtype check after this list makes this promotion visible).

  3. for the uint8 case, numpy cannot replace the division by 2 with a multiplication by 0.5, because it is an integer division. Integer division is slower than float multiplication on many architectures - this is the slowest vectorized operation.
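The dtypes of the vectorized results show the same promotion (a small check with a uint8 array like arr1 above; values are kept below 251 so the +5 does not wrap around):

import numpy as np

b = np.random.randint(0, 250, size=(100, 100), dtype=np.uint8)

print(((b + 5) * 0.5).dtype)   # float64: the uint8 data is promoted for the multiplication
print(((b + 5) // 2).dtype)    # uint8: a pure integer division, no floats involved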


PS: I would not dwell too much on the cost of multiplication vs. division - there are too many other things that can have a bigger impact on performance. For example, creating unnecessary temporary objects, or a numpy array that is so large it does not fit into the cache: then memory access becomes the bottleneck and you will see no difference between multiplication and division at all.
