Converting a large CSV to HDF5

Question

I have a 100M-line csv file (actually many separate csv files) totaling 84GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.

How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.

I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.

Recommended answer

Use append=True in the call to to_hdf:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5
df.to_hdf(filename, key='data', mode='w', format='table')
del df    # allow df to be garbage collected

# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, key='data', append=True)

print(pd.read_hdf(filename, 'data'))

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90

Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.

Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
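
The loop described above might look like the following. This is only a minimal sketch, not the asker's actual pipeline: it assumes every CSV has the same all-numeric columns, and the helper name csvs_to_hdf, the file names part0.csv etc., and the chunk size are hypothetical. Passing chunksize to read_csv means that even a single oversized CSV is read in pieces rather than all at once.

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd


def csvs_to_hdf(csv_paths, h5_path, key="data", chunksize=100_000):
    """Append every CSV in csv_paths to one appendable table in h5_path.

    Hypothetical helper; assumes all files share the same numeric columns.
    Returns the total number of rows written.
    """
    if os.path.exists(h5_path):
        os.remove(h5_path)  # start fresh so a rerun does not duplicate rows
    total = 0
    for path in csv_paths:
        # With chunksize set, read_csv yields DataFrames of at most
        # `chunksize` rows, so only one chunk is in memory at a time.
        for chunk in pd.read_csv(path, chunksize=chunksize, dtype=np.float64):
            chunk.to_hdf(h5_path, key=key, mode="a", format="table", append=True)
            total += len(chunk)
    return total


# Small demonstration with three tiny CSV files in a temporary directory.
tmpdir = tempfile.mkdtemp()
for i in range(3):
    demo = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=["A", "B"])
    demo.to_csv(os.path.join(tmpdir, f"part{i}.csv"), index=False)

csv_paths = sorted(glob.glob(os.path.join(tmpdir, "part*.csv")))
h5_path = os.path.join(tmpdir, "combined.h5")

# chunksize=2 is deliberately tiny here to exercise the chunked path.
n_rows = csvs_to_hdf(csv_paths, h5_path, chunksize=2)
result = pd.read_hdf(h5_path, "data")
print(n_rows, result.shape)
```

Forcing dtype=np.float64 keeps every chunk's column types identical, which an appendable table requires; for mixed-type data that argument would need to be adjusted.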

Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'
store = pd.HDFStore(filename, mode='w')  # 'w' starts from an empty file; the default 'a' would add to existing data

for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)

store.close()

store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()

yields

    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
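
For a file too large to read back in one piece, the same HDFStore can also be read in chunks. The sketch below is an illustration under assumptions, not part of the original answer: the file name store_demo.h5 and the chunk size of 4 are made up for the demonstration. It uses HDFStore as a context manager so the file is closed even if an error occurs, and select with chunksize to iterate over the stored table.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical demo file in a temporary directory.
h5_path = os.path.join(tempfile.mkdtemp(), "store_demo.h5")

# Write: the context manager closes the store automatically.
with pd.HDFStore(h5_path, mode="w") as store:
    for i in range(3):
        values = np.arange(10, dtype=np.float64).reshape((5, 2)) * 10**i
        df = pd.DataFrame(values, columns=["A", "B"])
        store.append("data", df)  # append always uses the table format

# Read: iterate over the table a few rows at a time instead of loading it all.
with pd.HDFStore(h5_path, mode="r") as store:
    n_rows = store.get_storer("data").nrows  # row count without reading the data
    col_sums = np.zeros(2)
    for chunk in store.select("data", chunksize=4):
        col_sums += chunk.to_numpy().sum(axis=0)

print(n_rows, col_sums)
```

Running totals such as the column sums here are one way to sanity-check a large conversion without ever holding the full dataset in memory.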

08-06 09:49