Problem Description
I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time, so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
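(In fact, read_csv does not have to load everything at once: its chunksize parameter makes it return an iterator of DataFrames, which the accepted answer below builds on. A minimal sketch, with a hypothetical file name:)

import pandas as pd

# Read the CSV in bounded 1M-row pieces instead of all at once;
# 'huge.csv' is a hypothetical file name.
for chunk in pd.read_csv('huge.csv', chunksize=1_000_000, dtype='float32'):
    print(chunk.shape)  # each chunk is an ordinary DataFrame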
Recommended Answer
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

df = pd.DataFrame(np.arange(10).reshape((5, 2)), columns=['A', 'B'])
print(df)
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5
# 3  6  7
# 4  8  9

# Save to HDF5; format='table' makes the dataset appendable
df.to_hdf(filename, key='data', mode='w', format='table')
del df  # allow df to be garbage collected

# Append more data to the same key
df2 = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10, columns=['A', 'B'])
df2.to_hdf(filename, key='data', append=True)

print(pd.read_hdf(filename, key='data'))
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which cannot be appended to.
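To see the difference concretely, here is a small sketch (the exact error text varies across pandas versions, but appending to a 'fixed'-format dataset fails):

import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0]})
df.to_hdf('/tmp/fixed.h5', key='data', mode='w')  # format='fixed' by default
try:
    df.to_hdf('/tmp/fixed.h5', key='data', append=True)
except ValueError as err:
    print(err)  # e.g. "Can only append to Tables"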
Thus, you can process the CSVs one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame, or use del df, to allow the old DataFrame to be garbage collected.
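Putting the pieces together, the conversion loop might look like the sketch below. This is not the answerer's exact code: the input glob, chunk size, and dtype are hypothetical, and read_csv's chunksize keeps memory bounded even when a single CSV is itself too large to load.

import glob
import pandas as pd

filename = '/tmp/big.h5'
csv_files = sorted(glob.glob('data/*.csv'))  # hypothetical input location

first = True
for path in csv_files:
    # Stream each CSV in bounded-size pieces
    for chunk in pd.read_csv(path, chunksize=1_000_000, dtype='float32'):
        if first:
            # The first write creates the file with an appendable table
            chunk.to_hdf(filename, key='data', mode='w', format='table')
            first = False
        else:
            chunk.to_hdf(filename, key='data', append=True)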
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

# Open in write mode so the file starts empty, then append two batches
store = pd.HDFStore(filename, mode='w')
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()

# Reopen and read everything back
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
    A   B
0   0   1
1   2   3
2   4   5
3   6   7
4   8   9
0   0  10
1  20  30
2  40  50
3  60  70
4  80  90
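One small usage note (an addition, not part of the original answer): pd.HDFStore also works as a context manager, so the explicit close() calls can be replaced with with-blocks:

import numpy as np
import pandas as pd

filename = '/tmp/test.h5'

# The with-block closes the store automatically, even on errors
with pd.HDFStore(filename, mode='w') as store:
    for i in range(2):
        df = pd.DataFrame(np.arange(10).reshape((5, 2)) * 10**i, columns=['A', 'B'])
        store.append('data', df)

with pd.HDFStore(filename) as store:
    print(store['data'])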