I generate an npz file as follows:

import numpy as np
import os

# Generate npz file
dataset_text_filepath = 'test_np_load.npz'
texts = []
for text_number in range(30000):
    # np.random.randint replaces the deprecated np.random.random_integers
    # (randint's upper bound is exclusive, hence 20001 and 101)
    texts.append(np.random.randint(0, 20001,
                 size=np.random.randint(0, 101)))
# dtype=object is required for a ragged array of variable-length sub-arrays
texts = np.array(texts, dtype=object)
np.savez(dataset_text_filepath, texts=texts)


This gives me a ~7 MiB npz file (essentially a single variable, texts, which is a NumPy array of NumPy arrays).


I load it with numpy.load():

# Load data (allow_pickle is needed because texts is an object array)
dataset = np.load(dataset_text_filepath, allow_pickle=True)


If I query it as follows, it takes several minutes:

# Querying data: the slow way
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(dataset['texts']), size=10)
    dataset['texts'][random_indices]


Whereas if I query it as follows, it takes less than 5 seconds:

# Querying data: the fast way
data_texts = dataset['texts']
for i in range(20):
    print('Run {0}'.format(i))
    random_indices = np.random.randint(0, len(data_texts), size=10)
    data_texts[random_indices]


Why is the second way so much faster than the first?

Best Answer

Every use of dataset['texts'] reads the file again. np.load on an .npz file only returns a file loader, not the actual data. It is a lazy loader that loads a particular array only when it is accessed. The load docs could be clearer, but they say:

- If the file is a ``.npz`` file, the returned value supports the context
  manager protocol in a similar fashion to the open function::

    with load('foo.npz') as data:
        a = data['a']

  The underlying file descriptor is closed when exiting the 'with' block.
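
Applied to the file from the question, that pattern looks roughly like this (a minimal sketch; allow_pickle=True is only needed on recent NumPy versions because texts is an object array of variable-length sub-arrays):

import numpy as np

# Extract the array once; the underlying file can then be closed
with np.load('test_np_load.npz', allow_pickle=True) as data:
    data_texts = data['texts']

# Subsequent queries index the in-memory array, not the file
random_indices = np.random.randint(0, len(data_texts), size=10)
print(data_texts[random_indices])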


And from the savez docs:

  When opening the saved ``.npz`` file with `load` a `NpzFile` object is
  returned. This is a dictionary-like object which can be queried for
  its list of arrays (with the ``.files`` attribute), and for the arrays
  themselves.
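
For the file in the question, that dictionary-like interface looks like this (a short sketch reusing the dataset object loaded above):

# The NpzFile returned by np.load is dict-like
print(dataset.files)          # ['texts'] -- the keyword names passed to np.savez
first = dataset['texts'][0]   # indexing a name reads that array back from the file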


More details are in help(np.lib.npyio.NpzFile).
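
A rough way to confirm the lazy re-reading, again reusing the dataset object from the question (an illustrative sketch; absolute timings will vary with hardware and NumPy version):

import time

start = time.perf_counter()
for _ in range(5):
    _ = dataset['texts']          # each access re-reads and unpickles the object array
print('repeated file access: {0:.2f} s'.format(time.perf_counter() - start))

data_texts = dataset['texts']     # read once into memory
start = time.perf_counter()
for _ in range(5):
    _ = data_texts[:10]           # pure in-memory indexing
print('in-memory access: {0:.6f} s'.format(time.perf_counter() - start))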
