Problem description
Say I have an xarray.Dataset object, loaded using xarray.open_dataset(..., decode_times=False), that looks like this when printed:
<xarray.Dataset>
Dimensions: (bnds: 2, lat: 15, lon: 34, plev: 8, time: 3650)
Coordinates:
* time (time) float64 3.322e+04 3.322e+04 3.322e+04 3.322e+04 ...
* plev (plev) float64 1e+05 8.5e+04 7e+04 5e+04 2.5e+04 1e+04 5e+03 ...
* lat (lat) float64 40.46 43.25 46.04 48.84 51.63 54.42 57.21 60.0 ...
* lon (lon) float64 216.6 219.4 222.2 225.0 227.8 230.6 233.4 236.2 ...
Dimensions without coordinates: bnds
Data variables:
time_bnds (time, bnds) float64 3.322e+04 3.322e+04 3.322e+04 3.322e+04 ...
lat_bnds (lat, bnds) float64 39.07 41.86 41.86 44.65 44.65 47.44 47.44 ...
lon_bnds (lon, bnds) float64 215.2 218.0 218.0 220.8 220.8 223.6 223.6 ...
hus (time, plev, lat, lon) float64 0.006508 0.007438 0.008751 ...
What would be the best way to subset this given multiple ranges for lat, lon, and time? I've tried chaining a series of conditions and using xarray.Dataset.where, but I get an error saying:
IndexError: The indexing operation you are attempting to perform is not valid on netCDF4.Variable object. Try loading your data into memory first by calling .load().
I can't load the entire dataset into memory, so what would be the typical way to do this?
Recommended answer
NetCDF4 doesn't support all of the multi-dimensional indexing operations supported by NumPy. It does, however, support slicing (which is very fast) and one-dimensional indexing (which is somewhat slower).
A few things worth trying:
- Index with slices (e.g., .sel(time=slice(start, end))) before indexing with 1-dimensional arrays. This should offload the array-based indexing from netCDF4 to Dask/NumPy.
- Split up your indexing operations into more intermediate operations that index along fewer dimensions at once. It sounds like you've already tried this one, but maybe it's worth exploring a little more.
- To optimize performance, try different Dask chunking schemes using .chunk().
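As a rough illustration of the first two suggestions, here is a minimal sketch that slices down to a bounding box first and then applies 1-dimensional indexers one dimension at a time. The dataset is a small synthetic stand-in built in memory, and the coordinate values and range bounds are made up for the example; with a file-backed dataset from xarray.open_dataset the same .sel and .isel calls would apply, though whether this avoids the IndexError in your case is something you would need to verify.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the dataset in the question (sizes and values are assumptions).
time = np.arange(33220.0, 33260.0)      # 40 undecoded time values, as with decode_times=False
lat = np.linspace(40.0, 60.0, 9)        # 40.0, 42.5, ..., 60.0
lon = np.linspace(216.0, 240.0, 13)     # 216.0, 218.0, ..., 240.0
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"hus": (("time", "lat", "lon"), rng.random((time.size, lat.size, lon.size)))},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Step 1: cut down to the overall bounding box using slices only.
# Label-based slices in .sel are inclusive at both ends.
box = ds.sel(
    time=slice(33225.0, 33250.0),
    lat=slice(42.0, 58.0),
    lon=slice(220.0, 236.0),
)

# Step 2: pick several disjoint ranges along ONE dimension with a 1-D integer indexer.
t = box["time"].values
keep_time = ((t >= 33225) & (t <= 33229)) | ((t >= 33240) & (t <= 33244))
subset = box.isel(time=np.flatnonzero(keep_time))

# Step 3: repeat for the next dimension as a separate intermediate operation.
la = subset["lat"].values
keep_lat = ((la >= 42) & (la <= 46)) | ((la >= 54) & (la <= 58))
subset = subset.isel(lat=np.flatnonzero(keep_lat))

print(subset["hus"].shape)
```

Each step indexes along a single dimension, so no step asks the backend for a multi-dimensional array-based lookup. The masks are computed from the coordinate arrays only, which are small compared with the data variable.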
If that doesn't work, post a full self-contained example to the xarray issue tracker on GitHub and we can look into it in more detail.