Question
Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?
Answer
The easiest way is to pickle it using to_pickle:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
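To keep the dataframe available between runs, a common pattern is to cache the pickle next to the CSV and only re-parse when the cache is missing. A minimal sketch (the file names here are hypothetical):

import os
import pandas as pd

CSV_PATH = 'data.csv'   # hypothetical source file
PKL_PATH = 'data.pkl'   # hypothetical cache file

if os.path.exists(PKL_PATH):
    # Fast path: reload the snapshot saved by a previous run
    df = pd.read_pickle(PKL_PATH)
else:
    # Slow path: parse the CSV once, then cache it for next time
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(PKL_PATH)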
Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
import pandas as pd

store = pd.HDFStore('store.h5')
store['df'] = df        # save it
df = store['df']        # load it back
store.close()           # release the file handle when done
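If you'd rather not manage the store object yourself, the to_hdf/read_hdf convenience wrappers open and close the HDFStore for you (they also require PyTables); a quick sketch with toy data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})        # toy data for illustration
df.to_hdf('store.h5', key='df', mode='w')  # save under the key 'df'
df2 = pd.read_hdf('store.h5', 'df')        # load it back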
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
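For completeness, the msgpack round-trip of that era looked like the sketch below. Note that to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0, so this only runs on older versions:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})   # toy data for illustration
df.to_msgpack('data.msg')             # save (removed in pandas 1.0)
df2 = pd.read_msgpack('data.msg')     # load (removed in pandas 1.0)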