python - Appending NumPy arrays to HDF5 is slow

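This is my first time using HDF5. Could you help me figure out what is wrong and why adding 3D NumPy arrays is slow?
Preprocessing takes 3 s; adding one 3D NumPy array (100x512x512) takes 30 s, and the time grows with each sample.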

First, I create the HDF5 file with:

import h5py
import numpy as np


def create_h5(fname_):
  """
  Run only once
  to create h5 file for dicom images
  """
  f = h5py.File(fname_, 'w', libver='latest')

  dtype_ = h5py.special_dtype(vlen=bytes)


  num_samples_train = 1397
  num_samples_test = 1595 - 1397
  num_slices = 100

  f.create_dataset('X_train', (num_samples_train, num_slices, 512, 512),
    dtype=np.int16, maxshape=(None, None, 512, 512),
    chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('y_train', (num_samples_train,), dtype=np.int16,
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('i_train', (num_samples_train,), dtype=dtype_,
    maxshape=(None, ), chunks=True, compression="gzip", compression_opts=4)
  f.create_dataset('X_test', (num_samples_test, num_slices, 512, 512),
    dtype=np.int16, maxshape=(None, None, 512, 512), chunks=True,
    compression="gzip", compression_opts=4)
  f.create_dataset('y_test', (num_samples_test,), dtype=np.int16, maxshape=(None, ), chunks=True,
    compression="gzip", compression_opts=4)
  f.create_dataset('i_test', (num_samples_test,), dtype=dtype_,
    maxshape=(None, ),
    chunks=True, compression="gzip", compression_opts=4)

  f.flush()
  f.close()
  print('HDF5 file created')

Then I run this code to update the HDF5 file:

import os
import time

import h5py
import pandas as pd

num_samples_train = 1397
num_samples_test = 1595 - 1397

lbl = pd.read_csv(lbl_fldr + 'stage1_labels.csv')

patients = os.listdir(dicom_fldr)
patients.sort()

f = h5py.File(h5_fname, 'a')  # also tried 'r+'

train_counter = -1
test_counter = -1

for sample in range(0, len(patients)):

    sw_start = time.time()

    pat_id = patients[sample]
    print('id: %s sample: %d \t train_counter: %d test_counter: %d' %(pat_id, sample, train_counter+1, test_counter+1), flush=True)

    sw_1 = time.time()
    patient = load_scan(dicom_fldr + patients[sample])
    patient_pixels = get_pixels_hu(patient)
    patient_pixels = select_slices(patient_pixels)

    if patient_pixels.shape[0] != 100:
        raise ValueError('Slices != 100: ', patient_pixels.shape[0])



    row = lbl.loc[lbl['id'] == pat_id]

    if row.shape[0] > 1:
        raise ValueError('Found duplicate ids: ', row.shape[0])

    print('Time preprocessing: %0.2f' %(time.time() - sw_1), flush=True)



    sw_2 = time.time()
    #found test patient
    if row.shape[0] == 0:
        test_counter += 1

        f['X_test'][test_counter] = patient_pixels
        f['i_test'][test_counter] = pat_id
        f['y_test'][test_counter] = -1


    #found train
    else:
        train_counter += 1

        f['X_train'][train_counter] = patient_pixels
        f['i_train'][train_counter] = pat_id
        # row is a one-row DataFrame; store the scalar label, not the Series
        f['y_train'][train_counter] = int(row['cancer'].iloc[0])

    print('Time saving: %0.2f' %(time.time() - sw_2), flush=True)

    sw_el = time.time() - sw_start
    sw_rem = sw_el* (len(patients) - sample)
    print('Elapsed: %0.2fs \t rem: %0.2fm %0.2fh ' %(sw_el, sw_rem/60, sw_rem/3600), flush=True)


f.flush()
f.close()

Best answer

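The slowness is almost certainly due to the compression and chunking. It's hard to get this right. In past projects I often had to turn compression off because it was too slow, although I have not given up on the idea of compression in HDF5 in general.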

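First, you should try to confirm that compression and chunking are the cause of the performance problem. Turn off chunking and compression (i.e. leave out the chunks=True, compression="gzip", compression_opts=4 parameters) and try again. I suspect it will be a lot faster.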
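As a rough way to run that check in isolation, the sketch below times per-sample writes into a contiguous, uncompressed dataset and into one created with the question's settings. The helper name time_sample_writes, the dataset name "X", and the small array shape are made up for illustration; real timings depend on your data and disk.

```python
import os
import tempfile
import time

import h5py
import numpy as np


def time_sample_writes(path, num_samples=4, shape=(20, 128, 128), **dataset_kwargs):
    """Write the same int16 volume into each sample slot; return seconds spent writing."""
    rng = np.random.default_rng(0)
    # Generate the data up front so only the HDF5 writes are timed.
    data = rng.integers(-1000, 1000, size=shape, dtype=np.int16)
    with h5py.File(path, "w") as f:
        ds = f.create_dataset("X", (num_samples,) + shape,
                              dtype=np.int16, **dataset_kwargs)
        start = time.perf_counter()
        for i in range(num_samples):
            ds[i] = data
        f.flush()
        return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    # No chunks/compression arguments: contiguous, uncompressed storage.
    plain = time_sample_writes(os.path.join(tmp, "plain.h5"))
    # Same settings as in the question.
    packed = time_sample_writes(os.path.join(tmp, "gzip.h5"),
                                chunks=True, compression="gzip",
                                compression_opts=4)
    print(f"contiguous, uncompressed: {plain:.3f}s")
    print(f"chunks=True + gzip(4):    {packed:.3f}s")
```

Note that a contiguous dataset cannot be resizable, so the maxshape argument has to be dropped for this test as well.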

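If you want to use compression, you must understand how chunking works, because HDF compresses the data chunk by chunk. Google it, but at least read the section on chunking from the h5py docs. The following quote is crucial: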

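"Chunking has performance implications. It's recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk."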
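To relate that guideline to the dataset in the question, here is a small calculation of the uncompressed size of a few candidate chunk shapes for int16 data. The helper chunk_nbytes and the candidate shapes are illustrative choices, not something taken from the h5py docs.

```python
import math

import numpy as np


def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of one chunk with the given shape and dtype."""
    return math.prod(chunk_shape) * np.dtype(dtype).itemsize


# Candidate chunk shapes for a (samples, slices, 512, 512) int16 dataset.
for shape in [(1, 1, 512, 512), (1, 100, 512, 512), (5, 100, 512, 512)]:
    mib = chunk_nbytes(shape, np.int16) / 2**20
    print(f"{shape}: {mib:g} MiB")
```

A single 512x512 slice is 0.5 MiB and fits the recommended range, while a chunk holding a whole 100-slice sample is 50 MiB.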

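By setting chunks=True you let h5py determine the chunk sizes automatically (print the chunks property of the dataset to see what they are). Let's say the chunk size in the first dimension (your sample dimension) is 5. This would mean that when you add one sample, the underlying HDF library reads every chunk that contains that sample from disk (so in total it reads those 5 samples completely). For every chunk, HDF reads it, decompresses it, adds the new data, compresses it, and writes it back to disk. Needless to say, this is slow. It is mitigated by the fact that HDF has a chunk cache, so that uncompressed chunks can reside in memory. However, the chunk cache seems to be rather small (see here), so I think all the chunks are swapped in and out of the cache in every iteration of your for loop. I couldn't find any setting in h5py to alter the chunk cache size.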
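The sketch below shows how to print the chunk shape h5py picked for a dataset created like X_train in the question. It also reopens the file with a larger chunk cache through the rdcc_nbytes argument of h5py.File; that argument is an addition on my part and assumes h5py 2.9 or newer. The 64 MiB value is an arbitrary example.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inspect.h5")

    with h5py.File(path, "w") as f:
        # Same arguments as X_train in the question; no data is written,
        # so nothing large is allocated on disk.
        ds = f.create_dataset("X_train", (1397, 100, 512, 512),
                              dtype=np.int16,
                              maxshape=(None, None, 512, 512),
                              chunks=True, compression="gzip",
                              compression_opts=4)
        auto_chunks = ds.chunks
        print("auto-selected chunk shape:", auto_chunks)

    # Assumption: h5py >= 2.9, where h5py.File accepts rdcc_nbytes
    # (raw data chunk cache size in bytes, applied per dataset).
    with h5py.File(path, "a", rdcc_nbytes=64 * 1024**2) as f:
        print("chunk shape after reopening:", f["X_train"].chunks)
```

A larger cache only helps if the chunks touched by consecutive writes actually fit in it, so it is worth comparing the cache size against the chunk size in bytes.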

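You can set the chunk size explicitly by passing a tuple as the chunks keyword argument. With all this in mind, you can experiment with different chunk sizes. My first experiment would be to set the chunk size in the first (sample) dimension to 1, so that individual samples can be accessed without reading other samples into the cache. Let me know if this helped, I'm curious to know.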
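A minimal sketch of that experiment follows. Only the 1 in the sample dimension comes from the suggestion above; using one full 2D slice per chunk, i.e. (1, 1, rows, cols), is my own starting guess, and the function name create_per_sample_chunked is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def create_per_sample_chunked(path, num_samples, num_slices=100, rows=512, cols=512):
    """Create X_train with explicit chunks that never span more than one sample."""
    with h5py.File(path, "w") as f:
        ds = f.create_dataset(
            "X_train", (num_samples, num_slices, rows, cols),
            dtype=np.int16,
            maxshape=(None, None, rows, cols),
            # Assumed layout: one 2D slice of one sample per chunk.
            chunks=(1, 1, rows, cols),
            compression="gzip", compression_opts=4)
        return ds.chunks


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "explicit_chunks.h5")
    print("chunk shape:", create_per_sample_chunked(path, num_samples=3, num_slices=4))
    with h5py.File(path, "a") as f:
        # Writing sample 1 only touches chunks that belong to sample 1.
        f["X_train"][1] = np.ones((4, 512, 512), dtype=np.int16)
```

Other tuples such as (1, 10, 512, 512) are worth timing too; which one wins depends on the data and the access pattern.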

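Even if you find a chunk size that works well for writing the data, reading may still be slow, depending on which slices you read. When choosing the chunk size, keep in mind how your application typically reads the data. You may have to adapt your file-creation routine to these chunk sizes (e.g. fill your datasets chunk by chunk). Or you can decide that it is simply not worth the effort and create uncompressed HDF5 files.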
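One possible way to fill a dataset chunk by chunk is to buffer as many samples as a chunk spans in the first dimension and write them in a single slice assignment, so each chunk is compressed and written once. The function write_in_chunk_aligned_batches and the small shapes below are hypothetical, for illustration only.

```python
import os
import tempfile

import h5py
import numpy as np


def write_in_chunk_aligned_batches(ds, samples):
    """Write samples (arrays shaped ds.shape[1:]) in batches of ds.chunks[0]."""
    step = ds.chunks[0]  # number of samples covered by one chunk
    buffer, start = [], 0
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == step:
            ds[start:start + step] = np.stack(buffer)
            start += step
            buffer = []
    if buffer:  # final partial batch
        ds[start:start + len(buffer)] = np.stack(buffer)
        start += len(buffer)
    return start  # number of samples written


with tempfile.TemporaryDirectory() as tmp:
    with h5py.File(os.path.join(tmp, "batched.h5"), "w") as f:
        ds = f.create_dataset("X", (6, 4, 32, 32), dtype=np.int16,
                              chunks=(2, 4, 32, 32),
                              compression="gzip", compression_opts=4)
        volumes = (np.full((4, 32, 32), i, dtype=np.int16) for i in range(5))
        print("samples written:", write_in_chunk_aligned_batches(ds, volumes))
```

The trade-off is memory: the buffer holds one chunk's worth of samples at a time.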

Finally, I would set shuffle=True in the create_dataset calls. This may give you a better compression ratio. It shouldn't influence the performance, however.
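To see what the shuffle filter does on your own data, a sketch like the following writes the same array with and without shuffle=True and compares the stored size. The helper stored_size, the synthetic ramp data, and the chunk shape are assumptions for illustration; the ratio on real CT volumes may differ.

```python
import os
import tempfile

import h5py
import numpy as np


def stored_size(path, data, shuffle):
    """Write data gzip-compressed; return (bytes stored on disk, shuffle flag)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("X", data=data,
                         chunks=(1, 1) + data.shape[2:],
                         compression="gzip", compression_opts=4,
                         shuffle=shuffle)
    # Reopen so the size reflects what was actually written to disk.
    with h5py.File(path, "r") as f:
        ds = f["X"]
        return ds.id.get_storage_size(), ds.shuffle


# Synthetic, smoothly varying int16 data standing in for real volumes.
data = np.linspace(-1000, 1000, num=4 * 16 * 64 * 64).astype(np.int16)
data = data.reshape(4, 16, 64, 64)

with tempfile.TemporaryDirectory() as tmp:
    for flag in (False, True):
        size, applied = stored_size(os.path.join(tmp, f"s{int(flag)}.h5"), data, flag)
        print(f"shuffle={applied}: {size} bytes stored")
```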

Regarding python - Appending NumPy arrays to HDF5 is slow, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41771992/