Problem description
Is there a good way to transform a DataFrame with an n-level index into an n-D NumPy array (a.k.a. an n-tensor)?
Suppose I set up a DataFrame like this:
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
# build the full 2 x 3 index, then drop one row so the frame is incomplete
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
Output
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D NumPy array from the data using
print(frame.unstack().values)
Output
[[  0.   1.   2.]
 [ nan   4.   5.]]
How does this generalize to an n-level index?
From playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require the frame to contain exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the example above.
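To make the reshape() problem concrete, here is a minimal sketch using the same 2-level frame as above. The flag name `reshaped` is only for illustration; the point is that five values cannot fill a 2 x 3 array.

```python
from pandas import DataFrame, MultiIndex

index = range(2), range(3)
frame = DataFrame({'value': list(range(2 * 3))},
                  index=MultiIndex.from_product(index)).drop((1, 0))

n_rows = len(frame)  # one row fewer than 2 * 3 because of the drop()

try:
    frame.values.reshape(2, 3)
    reshaped = True
except ValueError:
    # NumPy refuses: the number of values does not match the target shape
    reshaped = False
```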
Any suggestions are highly appreciated.
Recommended answer
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
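import numpy as np
# create an array of NaN with one axis per index level
# (in Python 3, map() returns an iterator, so convert it to a tuple)
shape = tuple(map(len, frame.index.levels))
arr = np.full(shape, np.nan)
# fill it using NumPy's advanced indexing
# (MultiIndex.labels was renamed to MultiIndex.codes in newer pandas versions)
arr[tuple(frame.index.codes)] = frame.values.flat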
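For reference, the same idea can be wrapped in a small function and run on the 2-level frame from the question. This is only a sketch: the helper name `frame_to_ndarray` and its `column` parameter are made up here, and it assumes a pandas version that exposes `MultiIndex.codes` and `Series.to_numpy()`.

```python
import numpy as np
from pandas import DataFrame, MultiIndex

def frame_to_ndarray(frame, column='value'):
    """Hypothetical helper: scatter one column into an n-D array of NaN."""
    idx = frame.index
    # one axis per index level, sized by the number of labels in that level
    shape = tuple(len(level) for level in idx.levels)
    arr = np.full(shape, np.nan)
    # idx.codes holds, per level, the integer position of each row's label;
    # as a tuple of arrays it addresses one cell per row
    arr[tuple(idx.codes)] = frame[column].to_numpy()
    return arr

index = range(2), range(3)
frame = DataFrame({'value': list(range(2 * 3))},
                  index=MultiIndex.from_product(index)).drop((1, 0))
arr = frame_to_ndarray(frame)
print(arr)
```

Cells whose index combination is missing from the frame, here (1, 0), simply keep their initial NaN.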
Original solution. Given a setup similar to the one above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension is consistent.
First, reindex the data frame with the full Cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be slow and can consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
Output
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
# in Python 3, map() returns an iterator, so convert it to a tuple first
shape = tuple(map(len, frame.index.levels))
print(frame.values.reshape(shape))
Output
[[[  0.   1.]
  [  2.   3.]]
 [[  4.  nan]
  [  6.   7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
    .reshape(tuple(map(len, frame.index.levels)))
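A somewhat more readable variant of the reindex-then-reshape route might look like the sketch below. The helper name `frame_to_ndarray_reindex` and its `column` parameter are made up for illustration, and it swaps itertools.product for `MultiIndex.from_product` to build the full index; it assumes a pandas version that provides `Series.to_numpy()`.

```python
import numpy as np
from pandas import DataFrame, MultiIndex

def frame_to_ndarray_reindex(frame, column='value'):
    """Hypothetical helper: reindex to the full Cartesian product, then reshape."""
    levels = frame.index.levels
    # full product of all level values, in the same order reshape() expects
    full_index = MultiIndex.from_product(levels)
    shape = tuple(len(level) for level in levels)
    # missing combinations become NaN during reindexing
    return frame[column].reindex(full_index).to_numpy().reshape(shape)

index = range(2), range(2), range(2)
frame = DataFrame({'value': list(range(2 * 2 * 2))},
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
arr = frame_to_ndarray_reindex(frame)
print(arr)
```

As noted above, this materializes the full product index, so it is expected to cost more time and memory than the advanced-indexing approach.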