python - Size of a numpy strided array/broadcast array in memory?

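I'm trying to create memory-efficient broadcast arrays in numpy, e.g. a set of shape=[1000,1000,1000] arrays that hold only 1000 distinct elements, each repeated 1e6 times. This can be done with either np.lib.stride_tricks.as_strided or np.broadcast_arrays.

However, I'm having trouble verifying that nothing is duplicated in memory. This matters because tests that actually replicate the arrays in memory tend to crash my machine without leaving a traceback.

I've tried checking the size of the arrays with .nbytes, but that doesn't seem to match the actual memory usage: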

>>> import numpy as np
>>> import resource
>>> initial_memuse = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> pagesize = resource.getpagesize()
>>>
>>> x = np.arange(1000)
>>> memuse_x = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of x = {0} MB".format(x.nbytes/1e6))
Size of x = 0.008 MB
>>> print("Memory used = {0} MB".format((memuse_x-initial_memuse)*resource.getpagesize()/1e6))
Memory used = 150.994944 MB
>>>
>>> y = np.lib.stride_tricks.as_strided(x, [1000,10,10], strides=x.strides + (0, 0))
>>> memuse_y = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of y = {0} MB".format(y.nbytes/1e6))
Size of y = 0.8 MB
>>> print("Memory used = {0} MB".format((memuse_y-memuse_x)*resource.getpagesize()/1e6))
Memory used = 201.326592 MB
>>>
>>> z = np.lib.stride_tricks.as_strided(x, [1000,100,100], strides=x.strides + (0, 0))
>>> memuse_z = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
>>> print("Size of z = {0} MB".format(z.nbytes/1e6))
Size of z = 80.0 MB
>>> print("Memory used = {0} MB".format((memuse_z-memuse_y)*resource.getpagesize()/1e6))
Memory used = 0.0 MB

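So .nbytes reports the "theoretical" size of the array, but apparently not its actual size. The resource check is a little awkward: it looks like something is being loaded and cached (perhaps?), so that the first strided array appears to take up some memory while subsequent ones take none.

tl;dr: How do you determine the actual size in memory of a numpy array or array view?

Best answer

One approach is to examine the array's .base attribute, which references the object from which the array "borrows" its memory. For example: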

x = np.arange(1000)
print(x.flags.owndata)      # x "owns" its data
# True
print(x.base is None)       # its base is therefore 'None'
# True

a = x.reshape(100, 10)      # a is a reshaped view onto x
print(a.flags.owndata)      # it therefore "borrows" its data
# False
print(a.base is x)          # its .base is x
# True

The situation with np.lib.stride_tricks is slightly more complicated:

b = np.lib.stride_tricks.as_strided(x, [1000,100,100], strides=x.strides + (0, 0))

print(b.flags.owndata)
# False
print(b.base)
# <numpy.lib.stride_tricks.DummyArray object at 0x7fb40c02b0f0>

Here, b.base is a numpy.lib.stride_tricks.DummyArray instance, which looks like this:

class DummyArray(object):
    """Dummy object that just exists to hang __array_interface__ dictionaries
    and possibly keep alive a reference to a base array.
    """

    def __init__(self, interface, base=None):
        self.__array_interface__ = interface
        self.base = base

We can therefore examine b.base.base:

print(b.base.base is x)
# True
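
As an independent check that does not depend on how the .base chain is laid out, the strides and np.shares_memory can be inspected directly. The following is a minimal sketch rather than part of the original answer; the array c, built with np.broadcast_to, and the explicit dtype=np.int64 (chosen so the strides are the same on every platform) are assumptions added for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Fix the dtype so that each element is 8 bytes regardless of platform.
x = np.arange(1000, dtype=np.int64)

# Two ways of repeating x along two new axes without copying it.
b = as_strided(x, (1000, 100, 100), strides=x.strides + (0, 0))
c = np.broadcast_to(x[:, None, None], (1000, 100, 100))

for name, arr in (("b", b), ("c", c)):
    # A stride of 0 means every index along that axis maps to the same bytes.
    print(name, arr.strides, arr.flags.owndata, np.shares_memory(arr, x))
# b (8, 0, 0) False True
# c (8, 0, 0) False True
```

Both arrays report an .nbytes of 80000000, yet the zero strides and the shared memory show that each one is only a view onto the 8000 bytes owned by x.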

Once you have the base array, its .nbytes attribute should accurately reflect the amount of memory it occupies.
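
For a rough empirical cross-check, Python's tracemalloc module can be used in place of the resource module from the question. Note that ru_maxrss is a peak (high-water) figure, reported in kilobytes on Linux, so differences between successive readings only show when a new peak is reached. The sketch below assumes a NumPy version that reports its data-buffer allocations to tracemalloc; the helper traced_growth is a hypothetical name introduced here, not something from the original answer.

```python
import tracemalloc

import numpy as np
from numpy.lib.stride_tricks import as_strided


def traced_growth(make_array):
    """Build an array and return it with the growth in traced bytes.

    Hypothetical helper: it only sees allocations reported to tracemalloc.
    """
    before, _ = tracemalloc.get_traced_memory()
    arr = make_array()
    after, _ = tracemalloc.get_traced_memory()
    return arr, after - before


x = np.arange(1000, dtype=np.int64)

tracemalloc.start()
# Warm-up call so that any one-off allocations are not attributed to the view.
as_strided(x, (1000, 2, 2), strides=x.strides + (0, 0))

# A strided view whose nominal size is 80 MB.
view, view_growth = traced_growth(
    lambda: as_strided(x, (1000, 100, 100), strides=x.strides + (0, 0))
)
# A genuinely new 8 MB buffer, for comparison.
copy, copy_growth = traced_growth(lambda: np.arange(1_000_000, dtype=np.int64))
tracemalloc.stop()

print("view: nbytes =", view.nbytes, "traced growth =", view_growth)
print("copy: nbytes =", copy.nbytes, "traced growth =", copy_growth)
```

Under these assumptions the view should add only a small amount of bookkeeping overhead, while the real allocation should grow the traced total by at least its .nbytes.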

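In principle, it is possible to have a view of a view of an array, or to create a strided array from another strided array. Assuming that your view or strided array is ultimately backed by another numpy array, you can follow its .base attribute recursively. Once you find an object whose .base is None, you have found the underlying object from which your array borrows its memory: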

def find_base_nbytes(obj):
    if obj.base is not None:
        return find_base_nbytes(obj.base)
    return obj.nbytes

As expected,

print(find_base_nbytes(x))
# 8000

print(find_base_nbytes(y))
# 8000

print(find_base_nbytes(z))
# 8000
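
The recursive function above assumes that every object in the chain has a .base attribute and that the final owner is a numpy array. A slightly more defensive variant might look like the sketch below; find_backing_nbytes is a hypothetical name, and the fallback to memoryview assumes that a non-array owner (for example a bytearray passed to np.frombuffer) supports the buffer protocol.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def find_backing_nbytes(arr):
    """Follow .base links to the owning object and return its size in bytes.

    Hypothetical helper: assumes the owner is either an ndarray or an
    object that exposes the buffer protocol.
    """
    obj = arr
    # Stop at the first object with no .base attribute or with .base set to None.
    while getattr(obj, "base", None) is not None:
        obj = obj.base
    if isinstance(obj, np.ndarray):
        return obj.nbytes
    # Assumption: a non-ndarray owner can be wrapped in a memoryview.
    return memoryview(obj).nbytes


x = np.arange(1000, dtype=np.int64)
strided = as_strided(x, (1000, 100, 100), strides=x.strides + (0, 0))
restrided = as_strided(strided, (1000, 10), strides=(x.strides[0], 0))
from_buffer = np.frombuffer(bytearray(8000), dtype=np.uint8)[:10]

print(find_backing_nbytes(x))            # 8000
print(find_backing_nbytes(strided))      # 8000
print(find_backing_nbytes(restrided))    # 8000
print(find_backing_nbytes(from_buffer))  # 8000
```

Note that this reports the size of the whole backing buffer, even for a small slice: the 10-element from_buffer view still keeps all 8000 bytes alive.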

Regarding "python - Size of a numpy strided array/broadcast array in memory?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34637875/