Problem description
In my program I fill a large numpy array with elements, the number of which I do not know in advance. Since adding a single element at a time to a numpy array is inefficient, I grow it in chunks of length 10000 initialized with zeros. As a result, I end up with an array that has a tail of zeros, while what I would like to have is an array whose length is exactly the number of meaningful elements (later on I cannot distinguish the junk zeros from actual data points with the value zero). Straightforwardly copying a slice, however, doubles the RAM consumption, which is really undesirable since my arrays are quite large. I looked into the numpy.split functions, but they seem to split arrays only into chunks of equal size, which of course does not suit me.
I illustrate the problem with the following code:
import numpy, os, random

def check_memory(mode_peak=True, mark=''):
    """Function for measuring the memory consumption (Linux only)"""
    pid = os.getpid()
    with open('/proc/{}/status'.format(pid), 'r') as ifile:
        for line in ifile:
            if line.startswith('VmPeak' if mode_peak else 'VmSize'):
                # /proc reports the value in kB; convert it to GB
                memory = line[:-1].split(':')[1].strip().split()[0]
                memory = int(memory) / (1024 * 1024)
                break
    mode_str = 'Peak' if mode_peak else 'Current'
    print('{}{} RAM consumption: {:.3f} GB'.format(mark, mode_str, memory))

def generate_element():
    """Test element generator"""
    for i in range(12345678):
        yield numpy.array(random.randrange(0, 1000), dtype='i4')

check_memory(mode_peak=False, mark='#1 ')
a = numpy.zeros(10000, dtype='i4')
i = 0
for element in generate_element():
    if i == len(a):
        # Array is full: grow it by another zero-initialized chunk
        # (concatenate allocates a new array and copies)
        a = numpy.concatenate((a, numpy.zeros(10000, dtype='i4')))
    a[i] = element
    i += 1
check_memory(mode_peak=False, mark='#2 ')
a = a[:i]  # trim the tail of zeros
check_memory(mode_peak=False, mark='#3 ')
check_memory(mode_peak=True, mark='#4 ')
This outputs:
#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#3 Current RAM consumption: 0.118 GB
#4 Peak RAM consumption: 0.164 GB
Can anyone help me find a solution that does not significantly penalize runtime or RAM consumption?
I tried using
a = numpy.delete(a, numpy.s_[i:])
as well as
a = numpy.split(a, (i,))[0]
However, both result in the same doubled memory consumption.
Recommended answer
Finally I figured it out. In fact, the extra memory was consumed not only during the trimming stage, but also during the concatenation. Therefore, introducing a peak memory check at point #2 outputs:
#2 Peak RAM consumption: 0.164 GB
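Concretely, that extra check is just the existing helper called once more in peak mode at point #2 of the first listing (a minimal sketch of the added line):

check_memory(mode_peak=False, mark='#2 ')
check_memory(mode_peak=True, mark='#2 ')  # newly added peak check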
However, there is the resize() method, which changes the size/shape of an array in place:
check_memory(mode_peak=False, mark='#1 ')
page_size = 10000
a = numpy.zeros(page_size, dtype='i4')
i = 0
for element in generate_element():
    if (i != 0) and (i % page_size == 0):
        # Grow the array in place by one page instead of
        # concatenating into a freshly allocated copy
        a.resize(i + page_size)
    a[i] = element
    i += 1
a.resize(i)  # trim the tail of zeros in place
check_memory(mode_peak=False, mark='#2 ')
check_memory(mode_peak=True, mark='#2 ')
This leads to the output:
#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#2 Peak RAM consumption: 0.118 GB
In addition, as there are no more copying reallocations, performance improved significantly as well.
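Two notes worth adding beyond the answer itself. First, ndarray.resize only works in place when no other references to the array exist; otherwise it raises a ValueError, which can be bypassed with refcheck=False at your own risk. Second, when the elements come from a plain iterable of scalars, numpy.fromiter manages the growing buffer internally and returns an array of exactly the right length. A minimal sketch, where generate_scalar is a hypothetical stand-in for generate_element above that yields plain ints instead of 0-d arrays:

import numpy, random

def generate_scalar():
    """Hypothetical stand-in for generate_element, yielding plain ints"""
    for i in range(12345678):
        yield random.randrange(0, 1000)

# fromiter grows its internal buffer as items arrive and trims it to the
# exact item count; passing count=... up front avoids the regrowing when
# the length is known in advance.
a = numpy.fromiter(generate_scalar(), dtype='i4')
print(len(a))  # 12345678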