Problem description
I've been working with very large DataFrames for a while, using the csv format to store input data and results. I've noticed that a lot of time goes into reading and writing these files, which, for example, dramatically slows down batch processing of data. I was wondering whether the file format itself matters. Is there a preferred file format for faster reading/writing of Pandas DataFrames and/or NumPy arrays?
Recommended answer
Use HDF5. It beats writing flat files hands down, and you can query the stored data. See the HDF5 (PyTables) section of the pandas IO docs.
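To make the "you can query" point concrete, here is a minimal sketch of writing a DataFrame to HDF5 in both layouts and reading it back. The helper name `hdf_roundtrip`, the keys, the file name, and the column names are made up for illustration, and it assumes the optional PyTables package is installed (the HDF5 part is skipped otherwise).

```python
import importlib.util
import os
import tempfile

import numpy as np
import pandas as pd

# pandas' HDF5 support depends on the optional PyTables package ("tables").
HAVE_PYTABLES = importlib.util.find_spec("tables") is not None


def hdf_roundtrip(df, path):
    """Write df in both HDF5 layouts, then read it back (fully and by query)."""
    # "fixed" layout: quick to write and read, but not queryable or appendable.
    df.to_hdf(path, key="fixed_demo", mode="w", format="fixed")
    # "table" layout: supports appends and on-disk queries.
    # data_columns=["a"] makes column "a" usable in a `where` expression.
    df.to_hdf(path, key="table_demo", mode="a", format="table", data_columns=["a"])
    full = pd.read_hdf(path, key="fixed_demo")
    # Only rows with a > 0 are read from disk.
    subset = pd.read_hdf(path, key="table_demo", where="a > 0")
    return full, subset


# Illustrative data only; shape and column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 3)), columns=["a", "b", "c"])

if HAVE_PYTABLES:
    with tempfile.TemporaryDirectory() as tmp:
        full, subset = hdf_roundtrip(df, os.path.join(tmp, "demo.h5"))
    print(full.shape, subset.shape)
else:
    print("PyTables is not installed; skipping the HDF5 demo")
```

The fixed layout is the one to reach for when you just dump and reload whole frames; the table layout is what enables the `where` query.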
Here's a perf comparison vs SQL, updated to show SQL/HDF_fixed/HDF_table/CSV write and read performance.
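If you want to check this on your own data, a rough timing harness along these lines may help. It is only a sketch: `time_call` and `compare_formats` are hypothetical helpers, the SQL case is left out, the HDF5 timings are collected only when PyTables is installed, and the numbers you get will depend on your machine and data.

```python
import importlib.util
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(fn):
    """Return the wall-clock seconds taken by a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def compare_formats(df, directory):
    """Time one write and one read per format; returns {label: seconds}."""
    csv_path = os.path.join(directory, "bench.csv")
    timings = {
        "csv_write": time_call(lambda: df.to_csv(csv_path, index=False)),
        "csv_read": time_call(lambda: pd.read_csv(csv_path)),
    }
    # HDF5 needs the optional PyTables package; skip it when missing.
    if importlib.util.find_spec("tables") is not None:
        for fmt in ("fixed", "table"):
            h5_path = os.path.join(directory, f"bench_{fmt}.h5")
            timings[f"hdf_{fmt}_write"] = time_call(
                lambda: df.to_hdf(h5_path, key="df", mode="w", format=fmt)
            )
            timings[f"hdf_{fmt}_read"] = time_call(
                lambda: pd.read_hdf(h5_path, key="df")
            )
    return timings


# Illustrative data only; use a frame shaped like your real workload.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((20_000, 5)), columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmp:
    timings = compare_formats(df, tmp)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

A single run per format is noisy; repeat the calls a few times and take the best or median if you need numbers you can rely on.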
The docs now include a performance section.