Question
I am parsing data from a large CSV file, about 800 GB. For each line of data, I save it as a pandas DataFrame.
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field -> value, "dictionary_line"
    # save as a pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
Now, I would like to save this into an HDF5 format, and query the h5 as if it were the entire csv file.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
My approach so far has been:
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field -> value, "dictionary_line"
    # save as a pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
That is, I try to save each DataFrame df into the HDF5 under one key. However, this fails:
Attribute 'superblocksize' does not exist in node: '/hdf5_key/_i_table/index'
So, I could try to save everything into one pandas dataframe first, i.e.
import pandas as pd
store = pd.HDFStore("pathname/file.h5")
hdf5_key = "single_key"
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame()
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of csv field -> value, "dictionary_line"
    # save as a pandas DataFrame
    df = pd.DataFrame(dictionary_line, index=[i])
    total_df = pd.concat([total_df, df])  # grows one big DataFrame in memory
and then store it in HDF5 format:
store.append(hdf5_key, total_df, data_columns=csv_columns, index=False)
However, I don't think I have the RAM/storage to hold all the csv lines in total_df before writing it out in HDF5 format.
So, how do I append each "single-line" df into an HDF5 so that it ends up as one big dataframe (like the original csv)?
Here's a concrete example of a csv file with different data types:
order start end value
1 1342 1357 category1
1 1459 1489 category7
1 1572 1601 category23
1 1587 1599 category2
1 1591 1639 category1
....
15 792 813 category13
15 892 913 category5
....
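One detail worth noting with mixed dtypes like the sample above: when appending string columns (such as value) to an HDFStore table, pandas fixes the column width from the first append, so a later, longer string will fail unless min_itemsize reserves enough room. A minimal sketch using a few of the sample rows (file name and the width of 30 are illustrative guesses):

```python
import pandas as pd

# A few rows matching the csv layout shown above
df = pd.DataFrame({
    "order": [1, 1, 15],
    "start": [1342, 1459, 792],
    "end":   [1357, 1489, 813],
    "value": ["category1", "category7", "category13"],
})

store = pd.HDFStore("example.h5", "w")
# Reserve space for the longest string "value" may ever hold; without this,
# appending a longer category string later raises a ValueError.
store.append("single_key", df, data_columns=True, index=False,
             min_itemsize={"value": 30})
store.close()
```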
Answer
Your code should work. Can you try the following code?
import pandas as pd
import numpy as np

store = pd.HDFStore("file.h5", "w")
hdf5_key = "single_key"
csv_columns = ["COL%d" % i for i in range(1, 56)]

for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append(hdf5_key, df, data_columns=csv_columns, index=False)
store.close()
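If that runs, the resulting table can then be queried on disk much like the original csv, since the columns were passed as data_columns. A sketch of such a query (the COL1 condition is just an example):

```python
import numpy as np
import pandas as pd

# Build a small store the same way as in the answer above
store = pd.HDFStore("file.h5", "w")
csv_columns = ["COL%d" % i for i in range(1, 56)]
for i in range(10):
    df = pd.DataFrame(np.random.randn(1, len(csv_columns)), columns=csv_columns)
    store.append("single_key", df, data_columns=csv_columns, index=False)

# Query the on-disk table without loading all of it: rows where COL1 > 0
subset = store.select("single_key", where="COL1 > 0")
store.close()
```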
If the code works, then something is wrong with your data.
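As a side note, appending one row at a time is very slow for an 800 GB file. If the csv is well-formed, pd.read_csv with chunksize lets pandas do the parsing and keeps memory bounded; each chunk is a DataFrame that can be appended directly. A sketch, with a small in-memory stand-in for the real file (chunk size, file names, and the min_itemsize value are illustrative):

```python
from io import StringIO

import pandas as pd

# Stand-in for the real 800 GB csv file
csv_data = StringIO(
    "order,start,end,value\n"
    "1,1342,1357,category1\n"
    "15,792,813,category13\n"
)

store = pd.HDFStore("chunked.h5", "w")
# For real data use a large chunksize, e.g. 100_000 rows per chunk
for chunk in pd.read_csv(csv_data, chunksize=1):
    store.append("single_key", chunk, data_columns=True, index=False,
                 min_itemsize={"value": 30})
store.close()
```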