python - How do I subdivide/refine a dimension in an xarray Dataset?

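Summary: I have a dataset that was collected in such a way that its dimensions were not initially available. I would like to take what is essentially one big block of undifferentiated data and add dimensions to it so that it can be queried, subsetted, and so on. That is the core of the following question.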

Here is the xarray Dataset I have:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     (rows) int64 0 1 2 3 4 5 6 ... 23994 23995 23996 23997 23998 23999
Data variables:
    obs      (chain, draw, rows) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

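The rows dimension here corresponds to a number of sub-dimensions that I need to restore to the data. In particular, the 24,000 rows correspond to 100 samples from each of 240 conditions (each set of 100 samples sits in a contiguous block). The conditions are combinations of gate, input, growth medium, and od.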

I would like to end up with something like this:

<xarray.Dataset>
Dimensions:  (chain: 1, draw: 2000, gate: 1, input: 4, growth_medium: 3, sample: 100, rows: 24000)
Coordinates:
  * chain    (chain) int64 0
  * draw     (draw) int64 0 1 2 3 4 5 6 7 ... 1993 1994 1995 1996 1997 1998 1999
  * rows     *MultiIndex*
  * gate     (gate) int64 'AND'
  * input    (input) int64 '00', '01', '10', '11'
  * growth_medium (growth_medium) 'standard', 'rich', 'slow'
  * sample   (sample) int64 0 1 2 3 4 5 6 7 ... 95 96 97 98 99
Data variables:
    obs      (chain, draw, gate, input, growth_medium, samples) float64 4.304 3.985 4.612 ... 6.343 5.538 6.475
Attributes:
    created_at:                 2019-12-27T17:16:13.847972
    inference_library:          pymc3
    inference_library_version:  3.8

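I have a pandas DataFrame that specifies the values of gate, input, and growth medium: each row gives one set of gate, input, and growth medium values, plus an index specifying where (in rows) the corresponding set of 100 samples appears. The intent is for this DataFrame to serve as a guide for labeling the Dataset.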

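I looked at the xarray docs on "Reshaping and reorganizing data", but I don't see how to combine those operations to do what I need. I suspect I need to combine them with GroupBy, but I don't see how. Thanks!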

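Later: I have a solution to this problem, but it is ugly, and I am hoping someone can explain what I am doing wrong and what more elegant approach is possible.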

First, I extract all the data from the original Dataset into raw numpy form:

foo = qm.idata.posterior_predictive['obs'].squeeze('chain').values.T
foo.shape # (24000, 2000)

Then I reshape it as needed:

bar = np.reshape(foo, (240, 100, 2000))

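This gives me roughly the shape I want: there are 240 different experimental conditions, each has 100 variants, and for each of those variants I have 2000 Monte Carlo samples in my data set.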
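The reshape above relies on the 100 samples of each condition sitting in contiguous rows. As a small sanity check, here is a sketch with made-up sizes (3 conditions, 4 samples, 2 draws standing in for 240, 100, 2000) showing where a given flat row lands after a C-order reshape. The array contents are placeholders, not the real data.

```python
import numpy as np

# Hypothetical small sizes standing in for 240 conditions, 100 samples, 2000 draws.
n_conditions, n_samples, n_draws = 3, 4, 2

# Flat layout, as in foo: entry [r, d] is the value for row r and draw d.
flat = np.arange(n_conditions * n_samples * n_draws).reshape(
    n_conditions * n_samples, n_draws
)

# A C-order reshape splits the leading axis into (condition, sample).
blocks = flat.reshape(n_conditions, n_samples, n_draws)

# Flat row r ends up at condition r // n_samples, sample r % n_samples.
r = 9
condition, sample = divmod(r, n_samples)
assert (blocks[condition, sample] == flat[r]).all()
```

If the samples were interleaved rather than stored in contiguous blocks, the same reshape would silently mix conditions, so the check is worth running on a few rows of the real data.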

Now I extract the information about the 240 experimental conditions from the pandas DataFrame:

import pandas as pd
# qdf is the original dataframe with the experimental conditions and some
# extraneous information in other columns
new_df = qdf[['gate', 'input', 'output', 'media', 'od_lb', 'od_ub', 'temperature']]
idx = pd.MultiIndex.from_frame(new_df)

Finally, I reassemble a DataArray from the numpy array and the pandas MultiIndex:

xr.DataArray(bar, name='obs', dims=['regions', 'conditions', 'draws'],
             coords={'regions': idx, 'conditions': range(100), 'draws': range(2000)})

The resulting DataArray has these coordinates, as I wished:

Coordinates:
  * regions      (regions) MultiIndex
  - gate         (regions) object 'AND' 'AND' 'AND' 'AND' ... 'AND' 'AND' 'AND'
  - input        (regions) object '00' '10' '10' '10' ... '01' '01' '11' '11'
  - output       (regions) object '0' '0' '0' '0' '0' ... '0' '0' '0' '1' '1'
  - media        (regions) object 'standard_media' ... 'high_osm_media_five_percent'
  - od_lb        (regions) float64 0.0 0.001 0.001 ... 0.0001 0.0051 0.0051
  - od_ub        (regions) float64 0.0001 0.0051 0.0051 2.0 ... 0.0003 2.0 2.0
  - temperature  (regions) int64 30 30 37 30 37 30 37 ... 37 30 37 30 37 30 37
  * conditions   (conditions) int64 0 1 2 3 4 5 6 7 ... 92 93 94 95 96 97 98 99
  * draws        (draws) int64 0 1 2 3 4 5 6 ... 1994 1995 1996 1997 1998 1999

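But that was terrible, and it seems wrong that I had to punch through all the nice layers of xarray abstraction to get to this point. Especially since this does not seem like an unusual part of a scientific workflow: getting a relatively raw data set together with a spreadsheet of metadata that needs to be combined with the data. So what am I doing wrong? What is the more elegant solution?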

Best answer

Given a starting Dataset similar to:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

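You can chain several pure xarray commands to subdivide the dimension (keeping the data in the same shape but indexed by a multiindex) or even to reshape the Dataset. To subdivide the dimension, the following code can be used: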

multiindex_ds = ds.assign_coords(
    dim_0=["a", "b", "c"], dim_1=[0,1], dim_2=range(4)
).stack(
    dim=("dim_0", "dim_1", "dim_2")
).reset_index(
    "row", drop=True
).rename(
    row="dim"
)
multiindex_ds

Its output is:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, dim) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Moreover, the multiindex can then be unstacked, effectively reshaping the Dataset:

reshaped_ds = multiindex_ds.unstack("dim")
reshaped_ds

Output:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2)
Coordinates:
  * draw     (draw) int32 0 1
  * dim_0    (dim_0) object 'a' 'b' 'c'
  * dim_1    (dim_1) int64 0 1
  * dim_2    (dim_2) int64 0 1 2 3
Data variables:
    obs      (draw, dim_0, dim_1, dim_2) int32 0 1 2 3 4 5 ... 42 43 44 45 46 47

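I don't think this alone covers your needs completely, because you want to convert one dimension into two dimensions, one of which is a multiindex. All the building blocks are here, though.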

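For example, you could follow these steps (including the unstack) with regions and conditions, and then follow them again (this time without the unstack) to convert regions to a multiindex. Another option would be to use all the dimensions from the start, unstack them, and then stack them again, leaving conditions outside the final multiindex.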
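The two-stage idea (first split the flat dimension into a region dimension and a sample dimension, then turn the region dimension into a multiindex) could be sketched as follows. Note the assumptions: this sketch uses set_index on per-row label coordinates rather than the stack/reset_index/rename chain shown above, the sizes are toy stand-ins for 240 and 100, and the metadata table `meta` with its column values is hypothetical, with one row per block of samples in block order.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small sizes standing in for 240 regions, 100 samples, 2000 draws.
n_regions, n_samples, n_draws = 4, 3, 2

ds = xr.Dataset(
    {"obs": (("draw", "row"),
             np.arange(n_draws * n_regions * n_samples).reshape(n_draws, -1))},
    coords={"draw": np.arange(n_draws), "row": np.arange(n_regions * n_samples)},
)

# Hypothetical metadata table: one row per contiguous block, in block order.
meta = pd.DataFrame({"gate": ["AND"] * 4, "input": ["00", "01", "10", "11"]})

# Stage 1: label every row with its (region, sample) pair, then unstack.
split = (
    ds.assign_coords(
        region=("row", np.repeat(np.arange(n_regions), n_samples)),
        sample=("row", np.tile(np.arange(n_samples), n_regions)),
    )
    .set_index(row=["region", "sample"])
    .unstack("row")
)

# Stage 2: replace the integer region labels with a MultiIndex of metadata columns.
labelled = split.assign_coords(
    gate=("region", meta["gate"].to_numpy()),
    input=("region", meta["input"].to_numpy()),
).set_index(region=["gate", "input"])

# Label-based query on the metadata levels.
subset = labelled["obs"].sel(gate="AND", input="10")
```

Here `sample` stays a regular dimension while `region` carries the experimental metadata, which is close to the layout the question arrived at by hand.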

Detailed answer

The answer combines several quite unrelated commands, and it can be tricky to see what each of them is doing.

assign_coords

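The first step is to create the new dimensions and coordinates and add them to the Dataset. This is necessary because the methods that follow need these dimensions and coordinates to already be present in the Dataset.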

Stopping right after assign_coords yields the following Dataset:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 2, dim_2: 4, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim_0    (dim_0) <U1 'a' 'b' 'c'
  * dim_1    (dim_1) int32 0 1
  * dim_2    (dim_2) int32 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

stack

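The Dataset now contains 3 new dimensions whose sizes multiply to 24 elements (3 × 2 × 4). However, the data is currently flat with respect to these 24 elements, so we have to stack the dimensions into a single 24-element multiindex to make the shapes compatible.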

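I find assign_coords followed by stack the most natural solution. Another possibility would be to generate a multiindex similar to the one above and call assign_coords directly with that multiindex, which makes the stack unnecessary.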

This step combines all 3 new dimensions into a single one:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) int32 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Note that, as desired, we now have 2 dimensions of size 24.
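The alternative mentioned in this section, building the multiindex directly instead of stacking, might look roughly like the sketch below. It is an illustration under assumptions, not part of the original answer: the toy Dataset is rebuilt by hand, the multiindex is attached to the existing row dimension (so it is named row rather than dim), and the branch on `xr.Coordinates.from_pandas_multiindex` is there because newer xarray releases ask for the MultiIndex to be wrapped explicitly while older ones promote it implicitly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy Dataset mirroring the example above (values are placeholders).
ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": [0, 1], "row": np.arange(24)},
)

# The same 3 x 2 x 4 product that assign_coords + stack would generate.
midx = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1], range(4)], names=["dim_0", "dim_1", "dim_2"]
)

if hasattr(getattr(xr, "Coordinates", None), "from_pandas_multiindex"):
    # Newer releases: wrap the MultiIndex explicitly, then assign it.
    midx_coords = xr.Coordinates.from_pandas_multiindex(midx, "row")
    multiindex_ds = ds.drop_vars("row").assign_coords(midx_coords)
else:
    # Older releases: a pandas MultiIndex is promoted implicitly.
    multiindex_ds = ds.assign_coords(row=midx)

# Unstacking then reshapes obs to (draw, dim_0, dim_1, dim_2).
reshaped_ds = multiindex_ds.unstack("row")
```

Because the multiindex replaces the row coordinate in place, no reset_index or rename step is needed in this variant.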

reset_index

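At this point the final dimension is present in the Dataset as a coordinate, and we want this new coordinate to be the one that indexes the variable obs. set_index seems like the right function; however, each of our coordinates indexes itself (unlike the example in the set_index docs, where x indexes both the x and a coordinates), which means set_index cannot be used in this particular case. The method to use is reset_index, which removes the coordinate row without removing the dimension row.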

In the following output it can be seen that row is now a dimension without coordinates:

<xarray.Dataset>
Dimensions:  (dim: 24, draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * dim      (dim) MultiIndex
  - dim_0    (dim) object 'a' 'a' 'a' 'a' 'a' 'a' ... 'c' 'c' 'c' 'c' 'c' 'c'
  - dim_1    (dim) int64 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
  - dim_2    (dim) int64 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Dimensions without coordinates: row
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47
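For contrast with the remark about set_index: it does apply when the level labels are defined along row itself, one label per row, instead of as separate dimensions. The sketch below shows that variant on a hand-built copy of the toy Dataset. It is an alternative route, not the chain used in this answer, and the per-row labels are generated with pandas purely for convenience.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy Dataset mirroring the example above (values are placeholders).
ds = xr.Dataset(
    {"obs": (("draw", "row"), np.arange(48).reshape(2, 24))},
    coords={"draw": [0, 1], "row": np.arange(24)},
)

# One label per row for each level, in the same order as the rows.
levels = pd.MultiIndex.from_product(
    [["a", "b", "c"], [0, 1], range(4)], names=["dim_0", "dim_1", "dim_2"]
)

# Attach the labels as ordinary (non-index) coordinates along "row".
labelled = ds.assign_coords(
    {name: ("row", levels.get_level_values(name).to_numpy())
     for name in levels.names}
)

# set_index combines coordinates lying along "row" into a MultiIndex on "row".
multi = labelled.set_index(row=["dim_0", "dim_1", "dim_2"])

# Unstacking the MultiIndex reshapes obs to (draw, dim_0, dim_1, dim_2).
reshaped = multi.unstack("row")
```

The difference from the chain above is only in how the labels enter the Dataset: here every label already shares the row dimension with obs, so there is nothing to absorb or rename afterwards.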

rename

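The current Dataset is nearly the final one; the only issue is that the obs variable still has the row dimension instead of the desired one, dim. This does not really look like the intended usage of rename, but it can be used to get dim to absorb row, yielding the desired final result (called multiindex_ds above).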

Here too, set_index seems like the method of choice; however, if set_index(row="dim") is used instead of rename(row="dim"), the multiindex is collapsed into an index made of tuples:

<xarray.Dataset>
Dimensions:  (draw: 2, row: 24)
Coordinates:
  * draw     (draw) int32 0 1
  * row      (row) object ('a', 0, 0) ('a', 0, 1) ... ('c', 1, 2) ('c', 1, 3)
Data variables:
    obs      (draw, row) int32 0 1 2 3 4 5 6 7 8 ... 39 40 41 42 43 44 45 46 47

Source: the original question, "How do I subdivide/refine a dimension in an xarray Dataset?", is on Stack Overflow: https://stackoverflow.com/questions/59504320/