Question
I have the following questions about HDF5 performance and concurrency:
- Does HDF5 support concurrent write access?
- Concurrency considerations aside, how does HDF5 perform in terms of I/O (and do compression rates affect performance)?
- Since I use HDF5 with Python, how does its performance compare to SQLite?
References:
- http://www.sqlite.org/faq.html#q5
- Locking sqlite file on NFS filesystem possible?
- http://pandas.pydata.org/
Accepted Answer
Updated to use pandas 0.13.1
1) No, HDF5 does not support concurrent write access. See http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to work around this, e.g. have your different threads/processes write out their computation results to separate files, then have a single process combine them; a minimal sketch of that pattern follows.
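Here is one way that write-separately-then-combine pattern can look, assuming hypothetical file names and a toy compute() worker (neither is from the original answer); each process writes only its own HDF5 file, so no two writers ever touch the same file:

import multiprocessing as mp

import pandas as pd
from numpy.random import randn

def compute(i):
    # Each worker writes its own HDF5 file -- no shared writer.
    df = pd.DataFrame(randn(1000, 2), columns=list('AB'))
    path = 'part_%d.h5' % i   # hypothetical per-worker file name
    df.to_hdf(path, 'results', mode='w')
    return path

if __name__ == '__main__':
    pool = mp.Pool(4)
    paths = pool.map(compute, range(4))
    pool.close()
    pool.join()

    # A single process combines the per-worker pieces into one store.
    combined = pd.concat([pd.read_hdf(p, 'results') for p in paths],
                         ignore_index=True)
    combined.to_hdf('combined.h5', 'results', mode='w')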
2) Depending on the type of data you store, how you store it, and how you want to retrieve it, HDF5 can offer vastly better performance. Storing float data in an HDFStore as a single compressed array (in other words, in the fixed format, which does not allow querying) will store/read amazingly fast. Even the table format (which slows down write performance) offers quite good write performance. You can look at http://www.pytables.org/ for some detailed comparisons of PyTables (which is what HDFStore uses under the hood); the site has a nice benchmark chart, and since PyTables 2.3 queries are indexed, so perf is actually MUCH better than shown there. So to answer your question: if you want any kind of performance, HDF5 is the way to go. A sketch contrasting the two formats follows.
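Here is a minimal sketch contrasting the two storage formats; the file names are hypothetical, and complib/complevel control the compression that the question asks about:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(1000000, 2), columns=list('AB'))

# Fixed format: fastest I/O, not queryable; blosc compression is cheap
# and can even speed things up by shrinking the bytes hitting disk.
df.to_hdf('fixed_blosc.h5', 'df', mode='w', complib='blosc', complevel=9)

# Table format: slower writes, but supports on-disk queries
# (data_columns makes column A queryable).
df.to_hdf('table.h5', 'df', mode='w', format='table', data_columns=['A'])

# Read back only the rows you need instead of the whole file.
subset = pd.read_hdf('table.h5', 'df', where='A > 0')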
Writing:
In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop
In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop
In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop
In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop
Reading:
In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop
In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop
In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop
In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
Here is the code:
import sqlite3
import os

import pandas as pd
from pandas import DataFrame
from pandas.io import sql  # pandas 0.13.x SQL helpers (write_frame/read_frame)
from numpy.random import randn

In [3]: df = DataFrame(randn(1000000,2),columns=list('AB'))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null values
B    1000000 non-null values
dtypes: float64(2)

def test_sql_write(df):
    # recreate the SQLite file from scratch on every run
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()

def test_hdf_fixed_write(df):
    # fixed format: fastest, but not queryable
    df.to_hdf('test_fixed.hdf','test',mode='w')

def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf','test')

def test_hdf_table_write(df):
    # table format: slower writes, but supports on-disk queries
    df.to_hdf('test_table.hdf','test',format='table',mode='w')

def test_hdf_table_read():
    pd.read_hdf('test_table.hdf','test')

def test_csv_write(df):
    df.to_csv('test.csv',mode='w')

def test_csv_read():
    pd.read_csv('test.csv',index_col=0)
Of course, YMMV.