Problem Description
What is the purpose of the cols method in PyTables? I have a big dataset and I am interested in reading only one column from it.
These two methods give me the same time, but totally different memory consumption:
import tables
from sys import getsizeof
f = tables.open_file(myhdf5_path, 'r')
# These two methods take the same amount of time
x = f.root.set1[:500000]['param1']
y = f.root.set1.cols.param1[:500000]
# But totally different memory consumption:
print(getsizeof(x)) # gives me 96
print(getsizeof(y)) # gives me 2000096
They are both the same numpy array data type. Can anybody explain to me what the purpose of the cols method is?
%time x = f.root.set1[:500000]['param1'] # gives ~7ms
%time y = f.root.set1.cols.param1[:500000] # also gives about 7ms
Recommended Answer
Your question caught my curiosity. I typically use table.read(field='name') because it complements the other table.read_* methods I use (for example, .read_where() and .read_coordinates()).
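For context, here is a minimal sketch of how those methods each pull a single field. The condition string and coordinate list are made-up values for illustration, and the file is the one created in the test code further below:

import tables as tb
h5f = tb.open_file('SO_55254831.h5', 'r')
tbl = h5f.root.set1
p1 = tbl.read(start=0, stop=500000, field='param1')        # one field over a row range
hits = tbl.read_where('param1 > 100.0', field='param1')    # one field, rows matching a condition
picks = tbl.read_coordinates([0, 10, 20], field='param1')  # one field at explicit row numbers
h5f.close()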
After reviewing the docs, I found at least 4 ways to read one column of table data with PyTables. You showed 2, and there are 2 more:
table.read(field='name')
table.col('name') (singular)
I ran some tests with all 4, plus 2 tests on the entire table (dataset) for additional comparison. I called getsizeof() for all 6 objects, and the size varies based on the method. Although all 4 behave the same with numpy indexing, I suspect there's a difference in the returned objects. However, I'm not a PyTables developer, so this is more inference than fact. It could also be that getsizeof() interprets the objects differently.
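One plausible mechanism (this is my reading of numpy behavior, not anything PyTables-specific): getsizeof() only counts the bytes an array actually owns, so a view that borrows its data from another array reports little more than the header. A pure-numpy sketch with the same shape and dtype as the test data; exact header sizes vary by platform:

import numpy as np
from sys import getsizeof
rec = np.zeros(500000, dtype=[('param1', 'f8'), ('param2', 'f8'), ('param3', 'f8')])
view = rec['param1']         # field access returns a view into rec
copy = rec['param1'].copy()  # an independent, owned buffer
print(getsizeof(view))       # small (header only); the 4 MB of data belong to rec
print(getsizeof(copy))       # ~4000096: header plus 500000 * 8 bytes
print(view.base is rec)      # True -> the view does not own its data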
The code:
import tables as tb
import numpy as np
from sys import getsizeof
# Create h5 file with 1 dataset
h5f = tb.open_file('SO_55254831.h5', 'w')
mydtype = np.dtype([('param1', float), ('param2', float), ('param3', float)])
arr = np.arange(3.0 * 500000.0).reshape(500000, 3)
recarr = np.core.records.array(arr, dtype=mydtype)
h5f.create_table('/', 'set1', obj=recarr)
# Close, then Reopen file READ ONLY
h5f.close()
h5f = tb.open_file('SO_55254831.h5', 'r')
testds_1 = h5f.root.set1
print ("\nFOR: testds_1 = h5f.root.set1")
print (testds_1.dtype)
print (testds_1.shape)
print (getsizeof(testds_1)) # gives 128
testds_2 = h5f.root.set1.read()
print ("\nFOR: testds_2 = h5f.root.set1.read()")
print (getsizeof(testds_2)) # gives 12000096
x = h5f.root.set1[:500000]['param1']
print ("\nFOR: x = h5f.root.set1[:500000]['param1']")
print(getsizeof(x)) # gives 96
print ("\nFOR: y = h5f.root.set1.cols.param1[:500000]")
y = h5f.root.set1.cols.param1[:500000]
print(getsizeof(y)) # gives 4000096
print ("\nFOR: z = h5f.root.set1.read(stop=500000,field='param1')")
z = h5f.root.set1.read(stop=500000,field='param1')
print(getsizeof(z)) # also gives 4000096
print ("\nFOR: a = h5f.root.set1.col('param1')")
a = h5f.root.set1.col('param1')
print(getsizeof(a)) # also gives 4000096
h5f.close()
Output from above:
FOR: testds_1 = h5f.root.set1
[('param1', '<f8'), ('param2', '<f8'), ('param3', '<f8')]
(500000,)
128
FOR: testds_2 = h5f.root.set1.read()
12000096
FOR: x = h5f.root.set1[:500000]['param1']
96
FOR: y = h5f.root.set1.cols.param1[:500000]
4000096
FOR: z = h5f.root.set1.read(stop=500000,field='param1')
4000096
FOR: a = h5f.root.set1.col('param1')
4000096
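To connect those sizes back to the question: my guess is that x is a numpy view into the full 12 MB record array materialized by h5f.root.set1[:500000], while y, z, and a each own their 4 MB buffers. A quick check, run before h5f.close() using the variables from the code above (the expected values are my assumption):

print(x.flags.owndata, x.base is not None)  # expect: False True -> x is a view
print(y.flags.owndata)                      # expect: True -> y owns its buffer
print(x.nbytes, y.nbytes)                   # both 4000000 -> same logical data size

If that holds, getsizeof(x) == 96 is misleading: the slice keeps the whole 12 MB base array alive, so the cols/read()/col() variants are not actually the memory-hungry ones.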