HDF5 core driver (H5FD_CORE): loading selected datasets


Question

Currently, I load HDF5 data in Python via h5py and read a dataset into memory:

f = h5py.File('myfile.h5', 'r')
dset = f['mydataset'][:]

This works, but if 'mydataset' is the only dataset in myfile.h5, then the following is more efficient:

f = h5py.File('myfile.h5', 'r', driver='core')
dset = f['mydataset'][:]

I believe this is because the 'core' driver memory-maps the entire file, which is an optimised way of loading data into memory.

My question is: is it possible to use the 'core' driver on selected dataset(s) only? In other words, on loading the file I only wish to memory-map selected datasets and/or groups. I have a file with many datasets and I would like to load each one into memory sequentially. I cannot load them all at once, since in aggregate they will not fit in memory.

I understand one alternative is to split my single HDF5 file with many datasets into many HDF5 files with one dataset each. However, I am hoping there might be a more elegant solution, possibly using the h5py low-level API.
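For completeness, the splitting alternative mentioned above is easy to script. The snippet below is only an illustrative sketch and not part of the original question; the file name and dataset layout are assumptions. It copies every top-level dataset of 'myfile.h5' into its own single-dataset file.

import h5py

# Illustrative sketch (assumed names): copy each top-level dataset of
# 'myfile.h5' into its own single-dataset HDF5 file.
with h5py.File('myfile.h5', 'r') as src:
    for name in src:                       # iterate over top-level members
        obj = src[name]
        if isinstance(obj, h5py.Dataset):  # skip groups in this simple sketch
            with h5py.File(f'{name}.h5', 'w') as dst:
                src.copy(obj, dst, name=name)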

Update: Even if what I am asking is not possible, can someone explain why using driver='core' has substantially better performance when reading in a whole dataset? Is reading the only dataset of an HDF5 file into memory very different from memory-mapping it via the core driver?

Answer

I guess it is the same problem as when you read the file by looping over an arbitrary axis without setting a proper chunk cache size.

If you read it with the core driver, the whole file is guaranteed to be read sequentially from disk, and everything else (decompression, assembling chunked data into a contiguous array, ...) is done entirely in RAM.

I used the simplest form of the fancy slicing example from https://stackoverflow.com/a/48405220/4045774 to write the data.
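The exact write code is in the linked answer. Purely to keep this article self-contained, the following is an assumed stand-in that creates a chunked, compressed dataset named 'Test' in 'Test.h5' for the benchmark below; the shape, chunk size and compression are illustrative and not taken from the link.

import numpy as np
import h5py

# Assumed stand-in writer (not the code from the linked answer): creates a
# chunked, gzip-compressed dataset 'Test' in 'Test.h5' for the read benchmark.
data = np.random.rand(2000, 2000).astype(np.float32)
with h5py.File('Test.h5', 'w') as f:
    f.create_dataset('Test', data=data, chunks=(200, 200), compression='gzip')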

import time

import h5py as h5
import h5py_cache as h5c  # third-party helper that exposes the chunk cache size


def Reading():
    File_Name_HDF5 = 'Test.h5'

    # Core driver: the whole file is read into memory up front.
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r', driver='core')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with a 500 MB chunk cache.
    t1 = time.time()
    f = h5c.File(File_Name_HDF5, 'r', chunk_cache_mem_size=1024**2 * 500)
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)

    # Default driver with the default 1 MB chunk cache.
    t1 = time.time()
    f = h5.File(File_Name_HDF5, 'r')
    dset = f['Test'][:]
    f.close()
    print(time.time() - t1)


if __name__ == "__main__":
    Reading()

On my machine this gives 2.38 s (core driver), 2.29 s (with a 500 MB chunk cache size) and 4.29 s (with the default chunk cache size of 1 MB).
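As a side note not in the original answer: recent h5py releases (2.9 and later) accept the chunk cache settings directly when opening a file, so the h5py_cache helper above is no longer strictly required. A minimal sketch, assuming the same 'Test.h5' and 'Test' names as in the benchmark:

import h5py

# rdcc_nbytes sets the raw data chunk cache size in bytes for each dataset
# in this file (h5py >= 2.9); here it is raised to 500 MB to match the
# benchmark above.
with h5py.File('Test.h5', 'r', rdcc_nbytes=500 * 1024**2) as f:
    dset = f['Test'][:]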
