This article covers numpy memmap memory usage when you want to iterate over a large on-disk array exactly once.

Problem description

Let's say I have some big matrix saved on disk. Storing it all in memory is not really feasible, so I use memmap to access it:

A = np.memmap(filename, dtype='float32', mode='r', shape=(3000000,162))

Now let's say I want to iterate over this matrix (not necessarily in an ordered fashion) such that each row will be accessed exactly once:

p = some_permutation_of_0_to_2999999()

I want to do something like this:

start = 0
end = 3000000
num_rows_to_load_at_once = some_size_that_will_fit_in_memory()
while start < end:
    indices_to_access = p[start:start+num_rows_to_load_at_once]
    do_stuff_with(A[indices_to_access, :])
    start = min(end, start+num_rows_to_load_at_once)

As this process goes on, my computer becomes slower and slower, and my RAM and virtual memory usage explodes.

Is there some way to force np.memmap to use only up to a certain amount of memory? (I know I won't need more than the number of rows I plan to read at a time, and that caching won't really help me, since I'm accessing each row exactly once.)
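
One workaround, not part of the original question or answer, is to copy each batch of rows out of the memmap and then hint to the kernel that the mapped file pages can be dropped. The sketch below assumes Linux and Python 3.8+ (for mmap.madvise) and relies on numpy.memmap's private _mmap attribute, so treat it as an assumption rather than a supported API; take_rows, n_rows and n_cols are hypothetical names:

import mmap
import numpy as np

def take_rows(filename, row_indices, n_rows, n_cols, dtype='float32'):
    # map the file read-only and copy the requested rows into ordinary RAM
    A = np.memmap(filename, dtype=dtype, mode='r', shape=(n_rows, n_cols))
    rows = np.array(A[row_indices, :])   # fancy indexing copies the selected rows
    # hint to the kernel that the cached file pages can be evicted
    # (Linux only; madvise needs Python 3.8+, _mmap is a private numpy attribute)
    A._mmap.madvise(mmap.MADV_DONTNEED)
    del A                                # unmap the file
    return rows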

Alternatively, is there some other way to iterate (generator-like) over an np array in a custom order? I could write it manually using file.seek, but that turns out to be much slower than the np.memmap implementation.
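
For the generator-like iteration, one option (a sketch that reuses the question's sizes and placeholder names, not something from the accepted answer below) is to yield one permuted chunk at a time as an explicit in-memory copy, so only chunk_size rows are resident at once:

import numpy as np

def iter_permuted_chunks(filename, p, chunk_size, n_rows=3000000, n_cols=162):
    # p is any permutation of range(n_rows); each row is visited exactly once
    A = np.memmap(filename, dtype='float32', mode='r', shape=(n_rows, n_cols))
    for start in range(0, n_rows, chunk_size):
        idx = p[start:start + chunk_size]
        yield np.array(A[idx, :])        # copy this chunk out of the memmap

# for chunk in iter_permuted_chunks(filename, p, num_rows_to_load_at_once):
#     do_stuff_with(chunk)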

do_stuff_with() does not keep any reference to the array it receives, so there are no "memory leaks" in that respect.

Thanks

Recommended answer

This has been an issue that I've been trying to deal with for a while. I work with large image datasets, and numpy.memmap offers a convenient solution for working with these large sets.

However, as you've pointed out, if I need to access each frame (or each row, in your case) to perform some operation, RAM usage will eventually max out.

Fortunately, I recently found a solution that allows you to iterate through the entire memmap array while capping RAM usage.

Solution:

import numpy as np

# create a memmap array
input = np.memmap('input', dtype='uint16', shape=(10000,800,800), mode='w+')

# create a memmap array to store the output
output = np.memmap('output', dtype='uint16', shape=(10000,800,800), mode='w+')

def iterate_efficiently(input, output, chunk_size):
    # create an empty array to hold each chunk
    # the size of this array determines the amount of RAM used
    holder = np.zeros([chunk_size, 800, 800], dtype='uint16')

    # step through the input one chunk at a time
    # (assumes chunk_size evenly divides input.shape[0])
    for i in range(0, input.shape[0], chunk_size):
        holder[:] = input[i:i+chunk_size]   # read one chunk from the memmapped input
        holder += 5                         # perform some operation in RAM
        output[i:i+chunk_size] = holder     # write the chunk to the memmapped output

def iterate_inefficiently(input, output):
    output[:] = input[:] + 5

Timing results:

In [11]: %timeit iterate_efficiently(input,output,1000)
1 loop, best of 3: 1min 48s per loop

In [12]: %timeit iterate_inefficiently(input,output)
1 loop, best of 3: 2min 22s per loop

The size of the array on disk is ~12 GB. Using the iterate_efficiently function keeps memory usage at 1.28 GB, whereas the iterate_inefficiently function eventually reaches 12 GB of RAM.
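
The 1.28 GB figure matches the size of the holder buffer: chunk_size frames of 800x800 uint16 values at 2 bytes each:

chunk_size = 1000
frame_bytes = 800 * 800 * 2              # one 800x800 uint16 frame
print(chunk_size * frame_bytes / 1e9)    # 1.28 -> GB resident per chunk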

This was tested on Mac OS.
