How do I combine multiple pandas dataframes into an HDF5 object under one key/group?

Problem description

I am parsing data from a large CSV file (800 GB). For each line of data, I save it as a pandas dataframe.

import csv
import pandas as pd

readcsvfile = csv.reader(csvfile)   # csvfile is an already opened file handle
for i, line in enumerate(readcsvfile):
    # parse each row into a dict of csv field -> value, "dictionary_line"
    # save as a single-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])

Now, I would like to save this in HDF5 format and query the h5 as if it were the entire csv file.

import pandas as pd
store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]

My approach so far has been:

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse each row into a dict of csv field -> value, "dictionary_line"
    # save as a single-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    # append the single-row dataframe to the same key in the store
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)

That is, I try to save each dataframe df into the HDF5 under one key. However, this fails:

  Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'

So I could try to save everything into one pandas dataframe first, i.e.

import csv
import pandas as pd

store = pd.HDFStore("pathname/file.h5")

hdf5_key = "single_key"

csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse each row into a dict of csv field -> value, "dictionary_line"
    # save as a single-row pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])   # accumulates one big dataframe in memory

and then store it in HDF5 format:

    store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)

However, I don't think I have the RAM/storage to accumulate all csv lines in total_df before writing it out to HDF5.

So, how do I append each "single-line" df to the HDF5 file so that it ends up as one big dataframe (like the original csv)?

Here's a concrete example of a csv file with different data types:

 order    start    end    value    
 1        1342    1357    category1
 1        1459    1489    category7
 1        1572    1601    category23
 1        1587    1599    category2
 1        1591    1639    category1
 ....
 15        792     813    category13
 15        892     913    category5
 ....
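
One thing worth noting about a string column like value when appending piece by piece: HDFStore sizes string columns from the first chunk it sees, so a longer string in a later append (e.g. category23 vs category1) can be rejected unless a width is reserved up front. A minimal sketch, assuming the column is named value and 64 characters is wide enough:

import pandas as pd

store = pd.HDFStore("pathname/file.h5")
chunk = pd.DataFrame({"order": [1], "start": [1342], "end": [1357], "value": ["category1"]})
# min_itemsize reserves room for longer strings that show up in later appends (assumed width)
store.append("single_key", chunk, data_columns=True, min_itemsize={"value": 64})
store.close()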

Recommended answer

Your code should work. Can you try the following code:

import pandas as pd
import numpy as np

store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]
for i in range(10):
    # one random single-row dataframe per iteration, mimicking the row-by-row appends
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()

If this code works, then there is something wrong with your data.
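
As a side note on the memory concern: instead of building one dataframe per csv row, the file can be read and appended in moderately sized chunks, which keeps memory bounded and is much faster than single-row appends. A minimal sketch, assuming the csv has a header row with the column names (the path and chunk size are placeholders):

import pandas as pd

store = pd.HDFStore("pathname/file.h5", "w")
hdf5_key = "single_key"

# read the 800 GB csv in chunks of e.g. 100,000 rows; each chunk arrives as a dataframe
for chunk in pd.read_csv("pathname/file.csv", chunksize=100000):
    store.append(hdf5_key, chunk, data_columns=True, index=False)
store.close()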
