Question
Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?
Answer
The easiest way is to pickle it using to_pickle:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
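To keep the dataframe available between runs, a common pattern is to cache the pickle next to the CSV and only re-parse when the cache is missing. A minimal sketch (the file names here are hypothetical):

import os
import pandas as pd

CSV_PATH = 'data.csv'   # hypothetical source file
PKL_PATH = 'data.pkl'   # hypothetical cache file

if os.path.exists(PKL_PATH):
    # Fast path: reload the snapshot saved by a previous run
    df = pd.read_pickle(PKL_PATH)
else:
    # Slow path: parse the CSV once, then cache it for next time
    df = pd.read_csv(CSV_PATH)
    df.to_pickle(PKL_PATH)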
Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (pytables) which offers very fast access times for large datasets:
import pandas as pd

store = pd.HDFStore('store.h5')
store['df'] = df        # save it
df = store['df']        # load it back
store.close()           # release the file handle when done
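If you'd rather not manage the store object yourself, the to_hdf/read_hdf convenience wrappers open and close the HDFStore for you (they also require PyTables); a quick sketch with toy data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})        # toy data for illustration
df.to_hdf('store.h5', key='df', mode='w')  # save under the key 'df'
df2 = pd.read_hdf('store.h5', 'df')        # load it back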
More advanced strategies are discussed in the cookbook.
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
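For completeness, the msgpack round-trip of that era looked like the sketch below. Note that to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0, so this only runs on older versions:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})   # toy data for illustration
df.to_msgpack('data.msg')             # save (removed in pandas 1.0)
df2 = pd.read_msgpack('data.msg')     # load (removed in pandas 1.0)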