问题描述
假设我们从
开始import numpy as np
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
如何有效地将其制作为等效于pandas的数据框
import pandas as pd
>>> pd.DataFrame({'a': [0, 0, 1, 1], 'b': [1, 3, 5, 7], 'c': [2, 4, 6, 8]})
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
这个想法是让a
列在原始数组的第一个维度上具有索引,而其余列则是二维数组在原始数组的后两个维度中的垂直串联. /p>
(这很容易使用循环;问题是在没有循环的情况下该如何做.)
更多示例
使用@Divakar的出色建议:
>>> np.random.randint(0,9,(4,3,2))
array([[[0, 6],
[6, 4],
[3, 4]],
[[5, 1],
[1, 3],
[6, 4]],
[[8, 0],
[2, 3],
[3, 1]],
[[2, 2],
[0, 0],
[6, 3]]])
应该做成类似这样的东西:
>>> pd.DataFrame({
'a': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'b': [0, 6, 3, 5, 1, 6, 8, 2, 3, 2, 0, 6],
'c': [6, 4, 4, 1, 3, 4, 0, 3, 1, 2, 0, 3]})
a b c
0 0 0 6
1 0 6 4
2 0 3 4
3 1 5 1
4 1 1 3
5 1 6 4
6 2 8 0
7 2 2 3
8 2 3 1
9 3 2 2
10 3 0 0
11 3 6 3
这是在最终将其作为DataFrame投放之前,对NumPy进行大部分处理的一种方法,
m,n,r = a.shape
out_arr = np.column_stack((np.repeat(np.arange(m),n),a.reshape(m*n,-1)))
out_df = pd.DataFrame(out_arr)
如果您确切地知道列数为2
,这样我们将以b
和c
作为最后两列,以a
作为第一列,则可以添加如下列名所以-
out_df = pd.DataFrame(out_arr,columns=['a', 'b', 'c'])
样品运行-
>>> a
array([[[2, 0],
[1, 7],
[3, 8]],
[[5, 0],
[0, 7],
[8, 0]],
[[2, 5],
[8, 2],
[1, 2]],
[[5, 3],
[1, 6],
[3, 2]]])
>>> out_df
a b c
0 0 2 0
1 0 1 7
2 0 3 8
3 1 5 0
4 1 0 7
5 1 8 0
6 2 2 5
7 2 8 2
8 2 1 2
9 3 5 3
10 3 1 6
11 3 3 2
Suppose we start with
import numpy as np
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
How can this be efficiently be made into a pandas DataFrame equivalent to
import pandas as pd
>>> pd.DataFrame({'a': [0, 0, 1, 1], 'b': [1, 3, 5, 7], 'c': [2, 4, 6, 8]})
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
The idea is to have the a
column have the index in the first dimension in the original array, and the rest of the columns be a vertical concatenation of the 2d arrays in the latter two dimensions in the original array.
(This is easy to do with loops; the question is how to do it without them.)
Longer Example
Using @Divakar's excellent suggestion:
>>> np.random.randint(0,9,(4,3,2))
array([[[0, 6],
[6, 4],
[3, 4]],
[[5, 1],
[1, 3],
[6, 4]],
[[8, 0],
[2, 3],
[3, 1]],
[[2, 2],
[0, 0],
[6, 3]]])
Should be made to something like:
>>> pd.DataFrame({
'a': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'b': [0, 6, 3, 5, 1, 6, 8, 2, 3, 2, 0, 6],
'c': [6, 4, 4, 1, 3, 4, 0, 3, 1, 2, 0, 3]})
a b c
0 0 0 6
1 0 6 4
2 0 3 4
3 1 5 1
4 1 1 3
5 1 6 4
6 2 8 0
7 2 2 3
8 2 3 1
9 3 2 2
10 3 0 0
11 3 6 3
Here's one approach that does most of the processing on NumPy before finally putting it out as a DataFrame, like so -
m,n,r = a.shape
out_arr = np.column_stack((np.repeat(np.arange(m),n),a.reshape(m*n,-1)))
out_df = pd.DataFrame(out_arr)
If you precisely know that the number of columns would be 2
, such that we would have b
and c
as the last two columns and a
as the first one, you can add column names like so -
out_df = pd.DataFrame(out_arr,columns=['a', 'b', 'c'])
Sample run -
>>> a
array([[[2, 0],
[1, 7],
[3, 8]],
[[5, 0],
[0, 7],
[8, 0]],
[[2, 5],
[8, 2],
[1, 2]],
[[5, 3],
[1, 6],
[3, 2]]])
>>> out_df
a b c
0 0 2 0
1 0 1 7
2 0 3 8
3 1 5 0
4 1 0 7
5 1 8 0
6 2 2 5
7 2 8 2
8 2 1 2
9 3 5 3
10 3 1 6
11 3 3 2
这篇关于从一个Numpy 3d数组高效创建Pandas DataFrame的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!