Problem description
I am trying to ptrepack an HDF file that was created with the pandas HDFStore PyTables interface. The main index of the dataframe is time, but I declared some additional columns as data_columns so that I can filter the data on disk via these data_columns.
Now I would like to sort the HDF file by one of those columns (selection is too slow for my taste on this 84 GB file), using ptrepack with the sortby option like so:
()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5
I get this error message:
Traceback (most recent call last):
  File "/usr/local/epd/bin/ptrepack", line 10, in <module>
    sys.exit(main())
  File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 480, in main
    upgradeflavors=upgradeflavors)
  File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 225, in copyChildren
    raise RuntimeError("Please check that the node names are not "
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that rootUEP is not fooling you.
Does this mean that I cannot sort an HDF file by an indexed column because its index is not a 'full' index?
Recommended answer
Here is a complete example.
Create the frame with a data_column. Reset the index on that column to a full index. Use ptrepack to sort by it.
In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)
In [17]: store = pd.HDFStore('test.h5')
In [18]: store
Out[18]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])
In [19]: store.get_storer('df').group.table
Out[19]:
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')
In [21]: store.get_storer('df').group.table
Out[21]:
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoIndex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(9, full, shuffle, zlib(1)).is_CSI=True}
In [22]: store.close()
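The difference between the two table displays above is the kind of the index on B (medium before, full after). As a rough pre-flight check before a long repack, the colindexes lines can be parsed from the printed description. This is only a sketch: the helper names parse_colindexes and can_sortby are hypothetical, the regular expression assumes the exact text layout shown in this post, and the rule "the sort column needs an index of kind full" is the premise of this answer rather than something the snippet verifies against PyTables.

```python
import re

# Matches lines such as:
#   "B": Index(9, full, shuffle, zlib(1)).is_CSI=True}
# Case-insensitive because the dumps in this post print both is_CSI and is_csi.
_INDEX_LINE = re.compile(
    r'"(?P<col>[^"]+)":\s*Index\((?P<optlevel>\d+),\s*(?P<kind>\w+),.*?\)'
    r'\.is_csi=(?P<csi>True|False)',
    re.IGNORECASE,
)


def parse_colindexes(dump_text):
    """Return {column: {"optlevel", "kind", "is_csi"}} parsed from dump text."""
    result = {}
    for m in _INDEX_LINE.finditer(dump_text):
        result[m.group("col")] = {
            "optlevel": int(m.group("optlevel")),
            "kind": m.group("kind"),
            "is_csi": m.group("csi").lower() == "true",
        }
    return result


def can_sortby(dump_text, column):
    """Assumed rule: the column must carry an index whose kind is 'full'."""
    info = parse_colindexes(dump_text).get(column)
    return info is not None and info["kind"] == "full"


# The colindexes lines copied from the two displays above.
BEFORE = '''
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
'''
AFTER = '''
"index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
"B": Index(9, full, shuffle, zlib(1)).is_CSI=True}
'''

print(can_sortby(BEFORE, "B"), can_sortby(AFTER, "B"))
```

The same check could be pointed at the output of ptdump -v for a large file, provided its colindexes lines follow the layout above.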
In [25]: !ptdump -avd test.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.0',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['B'],
encoding := None,
index_cols := [(0, 'index')],
info := {'index': {}},
levels := 1,
nan_rep := b'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := b'frame_table',
pandas_version := b'0.10.1',
table_type := b'appendable_frame',
values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(9, full, shuffle, zlib(1)).is_csi=True}
/df/table._v_attrs (AttributeSet), 15 attributes:
[B_dtype := b'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 10,
TITLE := '',
VERSION := '2.6',
index_kind := b'integer',
values_block_0_dtype := b'float64',
values_block_0_kind := ['A']]
Data dump:
[0] (0, [1.10989047288066], 0.396613633081911)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (2, [-0.2429293157073629], -1.779366453624283)
[3] (3, [0.7305529521507728], 1.243565083939927)
[4] (4, [-0.1480724789512519], 0.5260130757651649)
[5] (5, [1.2560020435792643], 0.5455842491255144)
[6] (6, [1.20129355706986], 0.47930635538027244)
[7] (7, [0.9973598999689721], 0.8602929579025727)
[8] (8, [-0.40070941088441786], 0.7622228032635253)
[9] (9, [0.35865804118145655], 0.29939126149826045)
This is another way to create a completely sorted index (as opposed to writing it this way).
In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5
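Outside IPython, the same repack can be launched from a script. The sketch below only assembles the argument list; build_ptrepack_cmd is a hypothetical helper, and it uses just the flags that appear in this post (--chunkshape, --propindexes, --complevel, --complib, --sortby, --overwrite-nodes). Actually running the command assumes that ptrepack from PyTables is on the PATH.

```python
import shlex


def build_ptrepack_cmd(src, dst, sortby, complevel=None, complib=None,
                       propindexes=False, overwrite_nodes=False):
    """Build an argument list for ptrepack (hypothetical helper)."""
    cmd = ["ptrepack"]
    if propindexes:
        cmd += ["--chunkshape=auto", "--propindexes"]
    if complevel is not None:
        cmd.append(f"--complevel={complevel}")
    if complib is not None:
        cmd.append(f"--complib={complib}")
    cmd.append(f"--sortby={sortby}")
    if overwrite_nodes:
        cmd.append("--overwrite-nodes")
    cmd += [src, dst]
    return cmd


# The small example from this answer.
small = build_ptrepack_cmd("test.h5", "test_sorted.h5", sortby="B")
print(shlex.join(small))

# The invocation from the question.
big = build_ptrepack_cmd("C9.h5", "C9_sorted.h5", sortby="clat",
                         complevel=9, complib="blosc", propindexes=True)
print(shlex.join(big))

# To run it for real (not executed here):
# import subprocess
# subprocess.run(small, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues with file names; shlex.join is used here only to print a copy-pasteable command line.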
In [26]: !ptdump -avd test_sorted.h5
/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/df (Group) ''
/df._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['B'],
encoding := None,
index_cols := [(0, 'index')],
info := {'index': {}},
levels := 1,
nan_rep := b'nan',
non_index_axes := [(1, ['A', 'B'])],
pandas_type := b'frame_table',
pandas_version := b'0.10.1',
table_type := b'appendable_frame',
values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
/df/table._v_attrs (AttributeSet), 15 attributes:
[B_dtype := b'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 10,
TITLE := '',
VERSION := '2.6',
index_kind := b'integer',
values_block_0_dtype := b'float64',
values_block_0_kind := ['A']]
Data dump:
[0] (2, [-0.2429293157073629], -1.779366453624283)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (9, [0.35865804118145655], 0.29939126149826045)
[3] (0, [1.10989047288066], 0.396613633081911)
[4] (6, [1.20129355706986], 0.47930635538027244)
[5] (4, [-0.1480724789512519], 0.5260130757651649)
[6] (5, [1.2560020435792643], 0.5455842491255144)
[7] (8, [-0.40070941088441786], 0.7622228032635253)
[8] (7, [0.9973598999689721], 0.8602929579025727)
[9] (3, [0.7305529521507728], 1.243565083939927)
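To confirm that the repacked table really is ordered by B, the two data dumps can be compared. The sketch below copies the (index, B) pairs from the dumps in this post (the A values are left out for brevity), sorts the original rows by B in plain Python, and checks that the result matches the row order ptrepack produced. The variable names are illustrative only.

```python
# (index, B) pairs from the dump of test.h5, in on-disk order.
original = [
    (0, 0.396613633081911),
    (1, -0.9209780702446433),
    (2, -1.779366453624283),
    (3, 1.243565083939927),
    (4, 0.5260130757651649),
    (5, 0.5455842491255144),
    (6, 0.47930635538027244),
    (7, 0.8602929579025727),
    (8, 0.7622228032635253),
    (9, 0.29939126149826045),
]

# (index, B) pairs from the dump of test_sorted.h5, in on-disk order.
repacked = [
    (2, -1.779366453624283),
    (1, -0.9209780702446433),
    (9, 0.29939126149826045),
    (0, 0.396613633081911),
    (6, 0.47930635538027244),
    (4, 0.5260130757651649),
    (5, 0.5455842491255144),
    (8, 0.7622228032635253),
    (7, 0.8602929579025727),
    (3, 1.243565083939927),
]

# B must be non-decreasing from one row to the next in the repacked file.
b_values = [b for _, b in repacked]
is_sorted = all(x <= y for x, y in zip(b_values, b_values[1:]))

# Sorting the original rows by B should reproduce the repacked order.
same_rows = sorted(original, key=lambda row: row[1]) == repacked

print(is_sorted, same_rows)
```

Note also that the sorted file's table description no longer lists colindexes: the repack above was run without --propindexes, so the indexes were not carried over.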