When to use multiindexing vs. xarray in pandas

Question

The pandas pivot tables documentation seems to recommend dealing with more than two dimensions of data by using multiindexing:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:         N, K = frame.shape
   ...:         data = {'value' : frame.values.ravel('F'),
   ...:                 'variable' : np.asarray(frame.columns).repeat(N),
   ...:                 'date' : np.tile(np.asarray(frame.index), K)}
   ...:         return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...: 

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df['value2'] = df['value'] * 2

In [7]: df
Out[7]: 
         date variable     value    value2
0  2000-01-03        A  0.462461  0.924921
1  2000-01-04        A -0.517911 -1.035823
2  2000-01-05        A  0.831014  1.662027
3  2000-01-03        B -0.492679 -0.985358
4  2000-01-04        B -1.234068 -2.468135
5  2000-01-05        B  1.725218  3.450437
6  2000-01-03        C  0.453859  0.907718
7  2000-01-04        C -0.763706 -1.527412
8  2000-01-05        C  0.839706  1.679413
9  2000-01-03        D -0.048108 -0.096216
10 2000-01-04        D  0.184461  0.368922
11 2000-01-05        D -0.349496 -0.698993

In [8]: df.pivot(index='date', columns='variable')
Out[8]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463   
2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191   
2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035   


variable           C         D  
date                            
2000-01-03 -0.469259 -2.504964  
2000-01-04  0.940506 -2.362012  
2000-01-05 -5.250171  2.550861  

I thought that xarray was made for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop('variable', 1)) for var in np.unique(df.variable)]))
Out[10]: 
<xarray.DataArray ()>
array({'A':         date     value    value2
0 2000-01-03  0.462461  0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
6 2000-01-03  0.453859  0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
9  2000-01-03 -0.048108 -0.096216
10 2000-01-04  0.184461  0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multiindexing?

Recommended answer

There does seem to be a transition to xarray for work on multi-dimensional arrays. pandas is deprecating support for the 3D Panel data structure, and its documentation even suggests using xarray for working with multidimensional arrays:

"In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel's main use cases. Here is a link to the xarray panel-transition documentation."

The xarray documentation states its aims and goals:

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF

The main advantage of xarray over using straight numpy is that it makes use of labels, in the same way pandas does, across multiple dimensions. If you are working with 3-dimensional data, multi-indexing and xarray might be interchangeable. As the number of dimensions in your data set grows, xarray becomes much more manageable. I cannot comment on how each performs in terms of efficiency or speed.
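
To make the comparison concrete, here is a minimal sketch of the same labelled data held both ways. It is not taken from the question's session: the frame `long_df` and its values are made up for illustration, and it assumes xarray is installed so that `DataFrame.to_xarray()` can be used.

```python
import numpy as np
import pandas as pd
import xarray as xr  # to_xarray() below requires xarray to be installed

# Hypothetical data shaped like the question's frame (values are made up).
dates = pd.date_range("2000-01-03", periods=3, freq="D")
variables = ["A", "B", "C", "D"]
index = pd.MultiIndex.from_product(
    [dates, variables], names=["date", "variable"]
)
values = np.arange(len(index), dtype=float)
long_df = pd.DataFrame({"value": values, "value2": values * 2}, index=index)

# pandas: both labels live in a single MultiIndex on the rows,
# so selecting one variable means taking a cross-section of a level.
b_rows = long_df.xs("B", level="variable")

# xarray: each index level becomes its own named dimension,
# so the same selection is a label lookup along that dimension.
ds = long_df.to_xarray()
b_slice = ds["value"].sel(variable="B")

print(b_rows)
print(ds)
print(b_slice)
```

With two labelled dimensions the two forms are roughly equally convenient. The xarray form tends to pay off as more dimensions are added, because each one stays a separate named axis instead of another level in the index.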


09-21 06:14