python - Indexing a large 3D HDF5 dataset for subsetting based on a 2D condition

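I have a large 3D HDF5 dataset that represents location (X, Y) and time for a certain variable. I also have a 2D numpy array containing a classification for the same (X, Y) locations. What I want to achieve is to extract all time series from the 3D HDF5 dataset that fall under a certain class in the 2D array.
Here's my example: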

import numpy as np
import h5py

# Open the HDF5 dataset
NDVI_file = 'NDVI_values.hdf5'
f_NDVI = h5py.File(NDVI_file,'r')
NDVI_data = f_NDVI["NDVI"]

# See what's in the dataset
NDVI_data
<HDF5 dataset "NDVI": shape (1319, 2063, 53), type "<f4">

# Let's make a random 1319 x 2063 classification containing class numbers 0-4
classification = np.random.randint(5, size=(1319, 2063))

Now we have the 3D HDF5 dataset and a 2D classification. Let's look for the pixels that fall under class number '3'.

# Look for the X,Y locations that have class number '3'
idx = np.where(classification == 3)

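This returns a tuple of size 2 containing the X,Y pairs that match the condition; in my random example the number of pairs is 544433. How should I now use this idx variable to create a 2D array of size (544433, 53) that contains the 544433 time series of the pixels with classification class number '3'?
I did some testing with fancy indexing on a pure 3D numpy array, and this example works just fine: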

subset = numpy_array_3D[idx[0],idx[1],:]
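
A small in-memory sketch of that indexing may make the target shape concrete. The shapes (4 x 5 pixels, 3 time steps) and the name `cube` are made up for illustration and are not part of the real data:

```python
import numpy as np

# Hypothetical stand-in for the real data: 4 x 5 pixels, 3 time steps
rng = np.random.default_rng(0)
cube = rng.random((4, 5, 3)).astype("f4")
classification = rng.integers(0, 5, size=(4, 5))

# np.where on a 2D array returns a tuple of two index arrays (rows, columns)
idx = np.where(classification == 3)

# Fancy indexing with both arrays picks one time series per matching pixel
subset = cube[idx[0], idx[1], :]
print(subset.shape)  # (number of matching pixels, 3)

# On a NumPy array, a 2D boolean mask selects the same rows in the same order
same = cube[classification == 3]
```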

However, the HDF5 dataset is too large to convert to a numpy array; when I try to use the same indexing method directly on the HDF5 dataset:

# Try to use fancy indexing directly on HDF5 dataset
NDVI_subset = np.array(NDVI_data[idx[0],idx[1],:])

it gives me an error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper     (C:\aroot\work\h5py\_objects.c:2584)
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2543)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\dataset.py", line 431, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 95, in select
sel[args]
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 429, in __getitem__
raise TypeError("Indexing elements must be in increasing order")
TypeError: Indexing elements must be in increasing order

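Another thing I tried was to create a 3D classification array by repeating the classification along the third dimension so that it matches the shape of the HDF5 dataset. The idx variable then becomes a tuple of size 3:

classification_3D = np.repeat(np.reshape(classification,(1319,2063,1)),53,axis=2)
idx = np.where(classification_3D == 3)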

But the following statement throws exactly the same error:

NDVI_subset = np.array(NDVI_data[idx])

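Is this because an HDF5 dataset works differently from a pure numpy array? The documentation says that "Selection coordinates must be given in increasing order".
In that case, does anyone have a suggestion for how I could get this to work without having to read the full HDF5 dataset into memory (which does not work)?
Thank you very much!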

Best answer

Advanced/fancy indexing in h5py is not nearly as general as with np.ndarray.
Set up a small test case:

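import h5py
import numpy as np

# Small test dataset: 5 x 3 x 2 integers, 0..29
f=h5py.File('test.h5','w')
dset=f.create_dataset('data',(5,3,2),dtype='i')
dset[...]=np.arange(5*3*2).reshape(5,3,2)
# Same values as an in-memory numpy array, for comparison
x=np.arange(5*3*2).reshape(5,3,2)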

ind=np.where(x%2)

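On the numpy array I can select all the odd values with this index tuple, but the same index fails on the dataset: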

In [202]: ind
Out[202]:
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=int32),
 array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32))

In [203]: x[ind]
Out[203]: array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])
In [204]: dset[ind]
...
TypeError: Indexing elements must be in increasing order

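I can index with a list on a single dimension, e.g. dset[[1,2,3],...], but repeating index values or changing their order produces an error, as in dset[[1,1,2,2],...] or dset[[2,1,0],...]. dset[:,[0,1],:] is fine.
Several slices are fine, dset[0:3,1:3,:], as is a slice combined with a list, dset[0:3,[1,2],:].
But 2 lists produce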

TypeError: Only one indexing vector or array is currently allowed for advanced selection

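So an index tuple such as dset[[0,1,2],[1,2],:] is wrong in several ways, and so is the tuple returned by np.where: it supplies more than one index array, and those arrays contain repeated and non-increasing values.
I don't know how much of this is a constraint of the HDF5 storage, and how much is just incomplete development in the h5py module. Maybe a bit of both.
So you need to load simpler chunks from the file and perform the fancier indexing on the resulting numpy arrays.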
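
For the single-axis case, one possible way to do this is to read the sorted, unique indices once and then restore order and repeats in memory. This is only a sketch: the helper name `read_rows` is hypothetical, and the in-memory file (`driver="core"`, `backing_store=False`) is used just so that nothing is written to disk:

```python
import h5py
import numpy as np

def read_rows(dset, rows):
    """Read dset[rows, ...] for an arbitrary list of first-axis indices.

    h5py wants a single, increasing index list without repeats, so the
    sorted unique rows are read once and then expanded with NumPy.
    (Hypothetical helper, not part of h5py.)
    """
    rows = np.asarray(rows)
    unique_rows, inverse = np.unique(rows, return_inverse=True)
    block = dset[unique_rows, ...]  # one increasing index array: allowed
    return block[inverse]           # restore order and duplicates in memory

# In-memory HDF5 file so the sketch leaves nothing on disk
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    x = np.arange(5 * 3 * 2).reshape(5, 3, 2)
    dset = f.create_dataset("data", data=x)
    result = read_rows(dset, [2, 1, 1, 0])  # unordered, with a repeat
    expected = x[[2, 1, 1, 0]]              # same selection on the numpy array
```

Only the unique rows are read from the file, so repeated indices cost no extra I/O.
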
In my case, I just need to do:

In [225]: dset[:,:,1]
Out[225]:
array([[ 1,  3,  5],
       [ 7,  9, 11],
       [13, 15, 17],
       [19, 21, 23],
       [25, 27, 29]])
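
Applied to the question's 2D classification, the same idea could look roughly like the sketch below: read the dataset in blocks of rows using plain slices, then apply the boolean mask to each block in memory. The function name `extract_class_timeseries`, the `rows_per_block` parameter and the small random test data are assumptions for illustration, not something from the question:

```python
import h5py
import numpy as np

def extract_class_timeseries(dset, classification, class_id, rows_per_block=256):
    """Return an (n_pixels, n_times) array of the series whose class is class_id.

    Only plain slices are used on the HDF5 dataset; the masking happens on
    the in-memory block. (Hypothetical helper, not part of h5py.)
    """
    mask = classification == class_id
    out = np.empty((int(mask.sum()), dset.shape[2]), dtype=dset.dtype)
    filled = 0
    for start in range(0, dset.shape[0], rows_per_block):
        stop = min(start + rows_per_block, dset.shape[0])
        block_mask = mask[start:stop]
        if not block_mask.any():
            continue  # no matching pixels in these rows, skip the read
        block = dset[start:stop, :, :]  # slices only, so h5py accepts it
        selected = block[block_mask]    # numpy boolean indexing in memory
        out[filled:filled + len(selected)] = selected
        filled += len(selected)
    return out

# Small random stand-in for the NDVI data, held in an in-memory HDF5 file
rng = np.random.default_rng(0)
values = rng.random((13, 20, 5)).astype("f4")
classification = rng.integers(0, 5, size=(13, 20))
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("NDVI", data=values)
    subset = extract_class_timeseries(dset, classification, 3, rows_per_block=4)

# Reference result from fancy indexing on the in-memory array
idx = np.where(classification == 3)
expected = values[idx[0], idx[1], :]
```

Because the blocks are visited in row order, the rows of the result come out in the same order as the pairs returned by np.where. Peak memory is one block plus the output, and `rows_per_block` can be tuned to the available memory and the dataset's chunk layout.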

Regarding python - Indexing a large 3D HDF5 dataset for subsetting based on a 2D condition, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38761878/