This article describes how to concatenate two large pandas.HDFStore HDF5 files; it may serve as a useful reference if you run into the same problem.

Problem description

This question is related to "concatenating a large number of HDF5 files".

I have several huge HDF5 files (~20 GB compressed) which cannot fit in RAM. Each of them stores several pandas.DataFrames of identical format, with indexes that do not overlap.

I'd like to concatenate them into a single HDF5 file with all DataFrames properly concatenated. One way to do this is to read each file chunk-by-chunk and then save the chunks to a single file, but indeed that would take quite a lot of time.
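A minimal sketch of that chunk-by-chunk approach, assuming the frames are stored in table format under a hypothetical key 'df' and hypothetical file names:

import pandas as pd

sources = ['store_1.h5', 'store_2.h5']  # hypothetical source files

with pd.HDFStore('combined.h5', mode='w') as out:
    for path in sources:
        with pd.HDFStore(path, mode='r') as src:
            # select() with chunksize yields DataFrames of bounded size,
            # so memory use stays small regardless of the file size
            for chunk in src.select('df', chunksize=500000):
                out.append('df', chunk)  # accumulate into one table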

Are there any special tools or methods to do this without iterating through the files?

Recommended answer

See the docs for the odo project (formerly into). Note that if you use the into library, the argument order is switched (that was the motivation for renaming the project, to avoid confusion!).

You can basically do:

from odo import odo

# copy the source table into the destination store; odo streams
# the data in chunks rather than loading it all into memory
odo('hdfstore://path_store_1::table_name',
    'hdfstore://path_store_new_name::table_name')

Doing multiple operations like this will append to the right-hand-side (destination) store.

This will automatically do the chunked operations for you.
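For example, to fold several stores into one destination you could loop over them; a sketch with hypothetical store paths (each call appends to the same destination table):

from odo import odo

sources = ['path_store_1', 'path_store_2']  # hypothetical source stores
for path in sources:
    # each odo call appends the source table to the destination table
    odo('hdfstore://%s::table_name' % path,
        'hdfstore://path_store_new_name::table_name')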

That concludes this article on concatenating two large pandas.HDFStore HDF5 files. We hope the recommended answer above is helpful.
