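I have a large DataFrame from which I can get the data I need with groupby. I need to extract several separate columns from the index of the resulting DataFrame.
Part of the original DataFrame looks like this: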
code place vl year week
0 111.0002.0056 region1 1 2017 29
1 112.6500.2285 region2 1 2017 31
2 112.5600.6325 region2 1 2017 30
3 112.5600.6325 region2 1 2017 30
4 112.5600.8159 region2 1 2017 30
5 111.0002.0056 region2 1 2017 29
6 111.0002.0056 region2 1 2017 30
7 111.0002.0056 region2 1 2017 28
8 112.5600.8159 region3 1 2017 31
9 112.5600.8159 region3 1 2017 28
10 111.0002.0114 region3 1 2017 31
....
After applying groupby, it looks like this (code: df_test1 = df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum().unstack(fill_value=0)):
place region1 region2 region3 region4 index1
code year week
111.0002.0006 2017 29 0 3 0 0 (111.0002.0006, 2017, 29)
30 0 7 0 0 (111.0002.0006, 2017, 30)
111.0002.0018 2017 29 0 0 0 0 (111.0002.0018, 2017, 29)
111.0002.0029 2017 30 0 0 0 0 (111.0002.0029, 2017, 30)
111.0002.0055 2017 28 0 33 0 8 (111.0002.0055, 2017, 28)
29 1 155 2 41 (111.0002.0055, 2017, 29)
30 0 142 1 39 (111.0002.0055, 2017, 30)
31 0 31 0 13 (111.0002.0055, 2017, 31)
111.0002.0056 2017 28 9 36 0 4 (111.0002.0056, 2017, 28)
29 20 75 2 37 (111.0002.0056, 2017, 29)
30 17 81 2 33 (111.0002.0056, 2017, 30)
....
I saved the index in a separate column, index1 (code: df_test1['index1'] = df_test1.index). I need to split the index1 column into three separate columns: code, year and week. The result should look like this:
region1 region2 region3 region4 code year week
0 3 0 0 111.0002.0006 2017 29
0 7 0 0 111.0002.0006 2017 30
0 0 0 0 111.0002.0018 2017 29
0 0 0 0 111.0002.0029 2017 30
0 33 0 8 111.0002.0055 2017 28
1 155 2 41 111.0002.0055 2017 29
0 142 1 39 111.0002.0055 2017 30
0 31 0 13 111.0002.0055 2017 31
....
I would appreciate any suggestions!
Best answer
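Use reset_index instead of df_test1['index1'] = df_test1.index. It turns the index levels code, year and week into ordinary columns. For a cleaner df, also add rename_axis, which removes the columns name place: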
df_test1 = df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum() \
                  .unstack(fill_value=0) \
                  .reset_index() \
                  .rename_axis(None, axis=1)
print(df_test1)
code year week region1 region2 region3
0 111.0002.0056 2017 28 0 1 0
1 111.0002.0056 2017 29 1 1 0
2 111.0002.0056 2017 30 0 1 0
3 111.0002.0114 2017 31 0 0 1
4 112.5600.6325 2017 30 0 2 0
5 112.5600.8159 2017 28 0 0 1
6 112.5600.8159 2017 30 0 1 0
7 112.5600.8159 2017 31 0 0 1
8 112.6500.2285 2017 31 0 1 0
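To try the chain above end to end, here is a minimal self-contained sketch. It assumes the eleven sample rows shown in the question are the whole input (the real data is larger), and it swaps the backslash continuations for parentheses; the calls themselves are the same.

```python
import pandas as pd

# Sample rows copied from the excerpt in the question (rows 0-10 only).
df_test = pd.DataFrame({
    "code": ["111.0002.0056", "112.6500.2285", "112.5600.6325", "112.5600.6325",
             "112.5600.8159", "111.0002.0056", "111.0002.0056", "111.0002.0056",
             "112.5600.8159", "112.5600.8159", "111.0002.0114"],
    "place": ["region1", "region2", "region2", "region2", "region2", "region2",
              "region2", "region2", "region3", "region3", "region3"],
    "vl": [1] * 11,
    "year": [2017] * 11,
    "week": [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

df_test1 = (
    df_test.groupby(["code", "year", "week", "place"])["vl"].sum()
    .unstack(fill_value=0)       # one column per place, missing combinations become 0
    .reset_index()               # code, year, week move from the index into columns
    .rename_axis(None, axis=1)   # drop the leftover columns name 'place'
)
print(df_test1)
```

With these sample rows the printed frame should match the output shown above.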
Finally, if necessary, change the order of the columns:
# every name in cols must already be a column of df_test1
cols = ['code', 'year', 'week']
# keep the remaining columns first, then append code, year, week
df_test1 = df_test1[[x for x in df_test1.columns if x not in cols] + cols]
print(df_test1)
region1 region2 region3 code year week
0 0 1 0 111.0002.0056 2017 28
1 1 1 0 111.0002.0056 2017 29
2 0 1 0 111.0002.0056 2017 30
3 0 0 1 111.0002.0114 2017 31
4 0 2 0 112.5600.6325 2017 30
5 0 0 1 112.5600.8159 2017 28
6 0 1 0 112.5600.8159 2017 30
7 0 0 1 112.5600.8159 2017 31
8 0 1 0 112.6500.2285 2017 31
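If the index1 column of tuples already exists, as in the question, one alternative is to expand the tuples into columns directly. This is not part of the answer above, only a sketch under stated assumptions: the small frame below is a hypothetical stand-in for the grouped data, and the names parts and out are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the grouped frame from the question:
# a three-level index plus an 'index1' column holding the index tuples.
idx = pd.MultiIndex.from_tuples(
    [("111.0002.0006", 2017, 29), ("111.0002.0006", 2017, 30)],
    names=["code", "year", "week"],
)
df = pd.DataFrame({"region1": [0, 0], "region2": [3, 7]}, index=idx)
df["index1"] = df.index.tolist()  # list of (code, year, week) tuples

# Expand each tuple into three columns, aligned on the existing index.
parts = pd.DataFrame(
    df["index1"].tolist(), index=df.index, columns=["code", "year", "week"]
)

# Drop the tuple column, attach the new columns, and discard the old index.
out = pd.concat([df.drop(columns="index1"), parts], axis=1).reset_index(drop=True)
print(out)
```

When the index is still intact, reset_index as shown in the answer is the shorter route; expanding the tuples is mainly useful when only the tuple column is left.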