


In my program I fill a large numpy array with elements whose number I do not know in advance. Since appending one element at a time to a numpy array is inefficient, I grow the array in chunks of 10000 elements initialized with zeros. As a result, I end up with an array that has a tail of zeros. What I would like to have is an array whose length is exactly the number of meaningful elements (because later I cannot distinguish the junk zeros from actual data points with zero value). Straightforward trimming by copying or slicing, however, doubles the RAM consumption, which is really undesirable since my arrays are quite large. I looked into the numpy.split functions, but they seem to split arrays only into chunks of equal size, which of course does not suit me.


I illustrate the problem with the following code:

import numpy, os, random

def check_memory(mode_peak = True, mark = ''):
    """Function for measuring the memory consumption (Linux only)"""
    pid = os.getpid()
    with open('/proc/{}/status'.format(pid), 'r') as ifile:
        for line in ifile:
            if line.startswith('VmPeak' if mode_peak else 'VmSize'):
                memory = line[: -1].split(':')[1].strip().split()[0]
                memory = int(memory) / (1024 * 1024)
    mode_str = 'Peak' if mode_peak else 'Current'
    print('{}{} RAM consumption: {:.3f} GB'.format(mark, mode_str, memory))

def generate_element():
    """Test element generator"""
    for i in range(12345678):
        yield numpy.array(random.randrange(0, 1000), dtype = 'i4')

check_memory(mode_peak = False, mark = '#1 ')
a = numpy.zeros(10000, dtype = 'i4')
i = 0
for element in generate_element():
    if i == len(a):
        a = numpy.concatenate((a, numpy.zeros(10000, dtype = 'i4')))
    a[i] = element
    i += 1
check_memory(mode_peak = False, mark = '#2 ')
a = a[: i]
check_memory(mode_peak = False, mark = '#3 ')
check_memory(mode_peak = True, mark = '#4 ')


#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#3 Current RAM consumption: 0.118 GB
#4 Peak RAM consumption: 0.164 GB


Can anyone help me find a solution that does not significantly penalize runtime or RAM consumption?


a = numpy.delete(a, numpy.s_[i: ])


a = numpy.split(a, (i, ))[0]


However, both variants above result in the same doubled memory consumption.
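To see which trimming variants copy the data and which only create a view onto the original buffer, a small check with numpy.shares_memory may help. This is a minimal sketch on a toy array; the names sliced, deleted and split_head are only illustrative. A view does not allocate new data, but it keeps the whole original buffer (including the zero tail) alive, while a copy temporarily needs memory for both arrays.

```python
import numpy

a = numpy.zeros(20, dtype='i4')   # toy stand-in for the large padded array
i = 12                            # number of meaningful elements

sliced = a[:i]                              # basic slicing
deleted = numpy.delete(a, numpy.s_[i:])     # numpy.delete
split_head = numpy.split(a, (i,))[0]        # numpy.split

# True means the result is a view that still references the buffer of `a`
print(numpy.shares_memory(a, sliced))       # True: a view
print(numpy.shares_memory(a, deleted))      # False: a new array (copy)
print(numpy.shares_memory(a, split_head))   # True: a view
```

So slicing and numpy.split do not release the tail, and numpy.delete allocates a second array before the first one can be freed.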



Finally I figured it out. In fact, extra memory was consumed not only during the trimming stage, but also during the concatenation. Introducing a peak memory check at point #2 confirms this:

#2 Peak RAM consumption: 0.164 GB
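The reason for the peak can be illustrated on a small scale. The sketch below (with the illustrative name grown) shows that numpy.concatenate returns a new array with its own buffer, so for a moment both the old array and the enlarged one exist in memory:

```python
import numpy

a = numpy.arange(10000, dtype='i4')
grown = numpy.concatenate((a, numpy.zeros(10000, dtype='i4')))

# The result does not share memory with `a`: the data was copied
print(numpy.shares_memory(a, grown))   # False
# Both buffers are alive until `a` is rebound or deleted
print(a.nbytes, grown.nbytes)          # 40000 80000
```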


However, numpy arrays have the resize() method, which changes the size/shape of an array in place:

check_memory(mode_peak = False, mark = '#1 ')
page_size = 10000
a = numpy.zeros(page_size, dtype = 'i4')
i = 0
for element in generate_element():
    if (i != 0) and (i % page_size == 0):
        a.resize(i + page_size)
    a[i] = element
    i += 1
check_memory(mode_peak = False, mark = '#2 ')
check_memory(mode_peak = True, mark = '#2 ')


#1 Current RAM consumption: 0.070 GB
#2 Current RAM consumption: 0.118 GB
#2 Peak RAM consumption: 0.118 GB


In addition, since numpy.concatenate no longer allocates a new array and copies all the data on every growth step, the performance improved significantly as well.
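The same method can presumably also be used for the final trimming asked about in the question. Below is a minimal sketch that wraps the loop into a hypothetical helper collect() (the name and its parameters are my own, not part of numpy) and shrinks the array to the number of filled elements at the end. It passes refcheck=False, which skips numpy's reference-count check; this is an assumption that is only safe when no other array or view refers to the same buffer.

```python
import numpy

def collect(elements, page_size=10000, dtype='i4'):
    """Fill an array page by page and trim it in place (hypothetical helper)."""
    a = numpy.zeros(page_size, dtype=dtype)
    n = 0
    for element in elements:
        if n == len(a):
            # Grow in place; new entries are filled with zeros.
            # refcheck=False assumes no views of `a` exist.
            a.resize(len(a) + page_size, refcheck=False)
        a[n] = element
        n += 1
    # Shrink in place to the number of meaningful elements
    a.resize(n, refcheck=False)
    return a

data = collect(range(25000))
print(len(data), data[-1])   # 25000 24999
```

With the default refcheck=True, resize() raises a ValueError if it detects that the array is referenced elsewhere, which is the safer choice when that is not known for certain.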


07-23 03:45