Problem Description
I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
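(In fact, read_csv does not have to load everything at once: its chunksize parameter makes it return an iterator of DataFrames, which the accepted answer below builds on. A minimal sketch, with a hypothetical file name:)

import pandas as pd

# Read the CSV in bounded 1M-row pieces instead of all at once;
# 'huge.csv' is a hypothetical file name.
for chunk in pd.read_csv('huge.csv', chunksize=1_000_000, dtype='float32'):
    print(chunk.shape)  # each chunk is an ordinary DataFrame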
Recommended Answer
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5; format='table' makes the dataset appendable
df.to_hdf(filename, key='data', mode='w', format='table')
del df  # allow df to be garbage collected

# Append more data to the same key
df2 = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10, columns=['A', 'B'])
df2.to_hdf(filename, key='data', append=True)

print(pd.read_hdf(filename, key='data'))
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which cannot be appended to.
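To see the difference concretely, here is a small sketch (the exact error text varies across pandas versions, but appending to a 'fixed'-format dataset fails):

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0]})
df.to_hdf('/tmp/fixed.h5', key='data', mode='w')  # format='fixed' by default
try:
    df.to_hdf('/tmp/fixed.h5', key='data', append=True)
except ValueError as err:
    print(err)  # e.g. "Can only append to Tables"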
Thus, you can process the CSVs one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame, or use del df, to allow the old DataFrame to be garbage collected.
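Putting the pieces together, the conversion loop might look like the sketch below. This is not the answerer's exact code: the input glob, chunk size, and dtype are hypothetical, and read_csv's chunksize keeps memory bounded even when a single CSV is itself too large to load.

import glob
import pandas as pd

filename = '/tmp/big.h5'
csv_files = sorted(glob.glob('data/*.csv'))  # hypothetical input location

first = True
for path in csv_files:
    # Stream each CSV in bounded-size pieces
    for chunk in pd.read_csv(path, chunksize=1_000_000, dtype='float32'):
        if first:
            # The first write creates the file with an appendable table
            chunk.to_hdf(filename, key='data', mode='w', format='table')
            first = False
        else:
            chunk.to_hdf(filename, key='data', append=True)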
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

# Open in write mode so the file starts empty, then append two batches
store = pd.HDFStore(filename, mode='w')
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()

# Reopen and read everything back
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
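One small usage note (an addition, not part of the original answer): pd.HDFStore also works as a context manager, so the explicit close() calls can be replaced with with-blocks:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

# The with-block closes the store automatically, even on errors
with pd.HDFStore(filename, mode='w') as store:
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
        store.append('data', df)

with pd.HDFStore(filename) as store:
    print(store['data'])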