Does HDF5 support concurrent reads, or writes to different files?

Question

I'm trying to understand the limits of HDF5 concurrency.

There are two builds of HDF5: parallel HDF5 and the default build. The parallel version is currently supplied in Ubuntu, and the default in Anaconda (judging by the --enable-parallel flag).

I know that parallel writes to the same file are impossible. However, I don't fully understand to what extent the following actions are possible with the default build or with the parallel build:

  • several processes reading from the same file
  • several processes reading from different files
  • several processes writing to different files.

Also, are there any reasons anaconda does not have --enable-parallel flag on by default? (https://github.com/conda/conda-recipes/blob/master/hdf5/build.sh)

Recommended answer

AFAICT, there are three ways to build libhdf5:

  • with neither thread-safety nor MPI support (as in the conda recipe you posted)
  • with MPI support but no thread safety
  • with thread safety but no MPI support

That is, the --enable-threadsafe and --enable-parallel flags are mutually exclusive (https://www.hdfgroup.org/hdf5-quest.html#p5thread).

As for concurrent reads on one or even multiple files, the answer is that you need thread safety (https://www.hdfgroup.org/hdf5-quest.html#tsafe):

Users are often surprised to learn that (1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library. Although each thread in these examples is accessing different data, the HDF5 library modifies global data structures that are independent of a particular HDF5 dataset or HDF5 file. HDF5 relies on a semaphore around the library API calls in the thread-safe version of the library to protect the data structure from corruption by simultaneous manipulation from different threads. Examples of HDF5 library global data structures that must be protected are the freespace manager and open file lists.

The links above no longer work because the HDF Group reorganised their website. There is a page titled "Questions about thread-safety and concurrent access" in the HDF5 Knowledge Base that contains some useful information.

While only concurrent threads on a single process are mentioned in the passage, it appears to apply equally to forked subprocesses: see this h5py multiprocessing example.
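For reference, a commonly used arrangement with subprocesses is to open each file only inside the worker that reads it, so that no HDF5 handle is shared between processes. The sketch below assumes h5py and NumPy are installed. The helper names `write_file`, `read_sum`, and `read_many` and the dataset name `data` are illustrative and are not taken from the linked example. It does not establish that this is safe on every libhdf5 build.

```python
import multiprocessing as mp


def write_file(path, values):
    """Create a small HDF5 file with one dataset called 'data'."""
    # Imported inside the function so this module loads even where h5py
    # is absent (a choice made for this sketch only).
    import h5py
    import numpy as np

    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=np.asarray(values, dtype="f8"))


def read_sum(path):
    """Open the file inside the worker and return the sum of 'data'."""
    import h5py

    with h5py.File(path, "r") as f:
        return float(f["data"][...].sum())


def read_many(paths, workers=2):
    """Read several files, one task per file, each in a worker process."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(read_sum, paths)
```

On platforms where `multiprocessing` starts workers with the spawn method, call `read_many` from under an `if __name__ == "__main__":` guard.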

Now, for parallel access, you might want to use "Parallel HDF5", but those features require MPI. This pattern is supported by h5py but is more complicated and esoteric, and probably even less portable than thread-safe mode. More importantly, naively attempting concurrent reads with a parallel build of libhdf5 will lead to unexpected results because the library isn't thread-safe.

Besides efficiency, one limitation of the thread-safe build flag is lack of Windows support (https://www.hdfgroup.org/hdf5-quest.html#gconc).

Getting weird corrupt results when reading (different!) files from Python is definitely unexpected and frustrating given how concurrent read access is one of the touted "features" of HDF5. Perhaps a better default recipe for conda would be to include --enable-threadsafe on those platforms that support it, but I guess then you would end up with platform-specific behavior. Maybe there ought to be separate packages for the three build modes instead?
