I have a dataset of world indicators in this format:
country year indicatorName value
USA 1970 Agricultural Land ...
USA 1970 Crop production ...
...
USA 2000 Agricultural Land ...
USA 2000 Crop production ...
...
Mexico 1970 Agricultural Land ...
Mexico 1970 Crop production ...
...
Mexico 2000 Agricultural Land ...
Mexico 2000 Crop production ...
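There are other indicators that I have not included here, but these two are the ones I am interested in. For each country and year, I want to divide the value of Crop production by the value of Agricultural Land. Let's call the result crop_prod_density. I don't know how to get from
df.groupby(['country', 'year'])
to any of the following outputs: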
Add a new indicator row:
country year indicatorName value
USA 1970 Agricultural Land ...
USA 1970 Crop production ...
USA 1970 crop_prod_density ...
Add a new column that holds the same value for every row of a (country, year) group:
country year indicatorName value crop_prod_density
USA 1970 Agricultural Land ... us_value_1970
USA 1970 Crop production ... us_value_1970
...
Mexico 2000 Agricultural Land ... mx_value_2000
Mexico 2000 Crop production ... mx_value_2000
A new DataFrame with only this column:
country year crop_prod_density
USA 1970 us_value_1970
...
USA 2000 us_value_2000
...
Mexico 1970 mx_value_1970
...
Mexico 2000 mx_value_2000
Best answer
You can first reshape with set_index and unstack, and then divide with div:
print (df)
country year indicatorName value
0 USA 1970 Agricultural Land 10
1 USA 1970 Crop production 2
2 USA 2000 Agricultural Land 10
3 USA 2000 Crop production 3
4 Mexico 1970 Agricultural Land 10
5 Mexico 1970 Crop production 5
6 Mexico 2000 Agricultural Land 10
7 Mexico 2000 Crop production 4
df = (df.set_index(['country','year','indicatorName'])['value']
.unstack()
.assign(crop_prod_density=lambda x: x['Crop production'].div(x['Agricultural Land'])))
print (df)
indicatorName Agricultural Land Crop production crop_prod_density
country year
Mexico 1970 10 5 0.5
2000 10 4 0.4
USA 1970 10 2 0.2
2000 10 3 0.3
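One caveat with the reshape above: set_index followed by unstack raises a ValueError when the same (country, year, indicatorName) combination appears more than once. A minimal sketch of a pivot_table variant that tolerates duplicates is shown here. The names long_df and wide are hypothetical, the sample data simply mirrors the frame printed above, and keeping the first duplicate (aggfunc='first') is an assumption that may not suit your data.

```python
import pandas as pd

# Sample data mirroring the example frame in the answer (hypothetical name: long_df).
long_df = pd.DataFrame({
    'country': ['USA'] * 4 + ['Mexico'] * 4,
    'year': [1970, 1970, 2000, 2000] * 2,
    'indicatorName': ['Agricultural Land', 'Crop production'] * 4,
    'value': [10, 2, 10, 3, 10, 5, 10, 4],
})

# pivot_table aggregates duplicate (country, year, indicatorName) entries
# instead of raising; 'first' keeps the first value seen (an assumption).
wide = long_df.pivot_table(index=['country', 'year'],
                           columns='indicatorName',
                           values='value',
                           aggfunc='first')

# Same division as in the answer, written with the / operator.
wide['crop_prod_density'] = wide['Crop production'] / wide['Agricultural Land']
print(wide)
```

With the sample data there are no duplicates, so this should give the same densities as the set_index/unstack version.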
Then reshape back with stack:
df1 = df.stack().reset_index(name='value')
print (df1)
country year indicatorName value
0 Mexico 1970 Agricultural Land 10.0
1 Mexico 1970 Crop production 5.0
2 Mexico 1970 crop_prod_density 0.5
3 Mexico 2000 Agricultural Land 10.0
4 Mexico 2000 Crop production 4.0
5 Mexico 2000 crop_prod_density 0.4
6 USA 1970 Agricultural Land 10.0
7 USA 1970 Crop production 2.0
8 USA 1970 crop_prod_density 0.2
9 USA 2000 Agricultural Land 10.0
10 USA 2000 Crop production 3.0
11 USA 2000 crop_prod_density 0.3
For the second output, append the new column to the index and then stack; at the end the column order has to be restored with reindex:
df2 = (df.set_index(['crop_prod_density'], append=True)
.stack()
.reset_index(name='value')
.reindex(columns=['country','year','indicatorName','value','crop_prod_density']))
print (df2)
country year indicatorName value crop_prod_density
0 Mexico 1970 Agricultural Land 10 0.5
1 Mexico 1970 Crop production 5 0.5
2 Mexico 2000 Agricultural Land 10 0.4
3 Mexico 2000 Crop production 4 0.4
4 USA 1970 Agricultural Land 10 0.2
5 USA 1970 Crop production 2 0.2
6 USA 2000 Agricultural Land 10 0.3
7 USA 2000 Crop production 3 0.3
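If the original row order of the long table matters, another option is to compute the density as a Series and join it back onto the untouched long frame. This is only a sketch, not the answer's method: long_df, density and with_density are hypothetical names, and the sample data mirrors the frame printed at the top of the answer.

```python
import pandas as pd

# Sample data mirroring the example frame in the answer (hypothetical name: long_df).
long_df = pd.DataFrame({
    'country': ['USA'] * 4 + ['Mexico'] * 4,
    'year': [1970, 1970, 2000, 2000] * 2,
    'indicatorName': ['Agricultural Land', 'Crop production'] * 4,
    'value': [10, 2, 10, 3, 10, 5, 10, 4],
})

# One density value per (country, year), indexed by those two keys.
wide = long_df.set_index(['country', 'year', 'indicatorName'])['value'].unstack()
density = (wide['Crop production'] / wide['Agricultural Land']).rename('crop_prod_density')

# A left join on the key columns broadcasts the group value to every row
# of the group and keeps the original row order of long_df.
with_density = long_df.join(density, on=['country', 'year'])
print(with_density)
```

Because this is a left join, the value column keeps its original dtype and no reindex is needed to reorder the columns.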
Finally, drop the columns that are no longer needed and turn the MultiIndex into columns:
df3 = (df.drop(['Crop production','Agricultural Land'], axis=1)
.reset_index()
.rename_axis(None, axis=1))
print (df3)
country year crop_prod_density
0 Mexico 1970 0.5
1 Mexico 2000 0.4
2 USA 1970 0.2
3 USA 2000 0.3
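Since the question started from df.groupby(['country', 'year']), here is a rough sketch of how the third output could also be reached with groupby and apply. It is likely slower than the reshape on large data and is shown only for comparison. The helper density_of_group and the names long_df and result are hypothetical, and the sketch assumes each group contains exactly one row per indicator.

```python
import pandas as pd

# Sample data mirroring the example frame in the answer (hypothetical name: long_df).
long_df = pd.DataFrame({
    'country': ['USA'] * 4 + ['Mexico'] * 4,
    'year': [1970, 1970, 2000, 2000] * 2,
    'indicatorName': ['Agricultural Land', 'Crop production'] * 4,
    'value': [10, 2, 10, 3, 10, 5, 10, 4],
})

def density_of_group(group):
    """Return Crop production / Agricultural Land for one (country, year) group."""
    values = group.set_index('indicatorName')['value']
    return values['Crop production'] / values['Agricultural Land']

# Select the needed columns explicitly so the grouping keys are not passed to apply.
result = (long_df.groupby(['country', 'year'])[['indicatorName', 'value']]
                 .apply(density_of_group)
                 .reset_index(name='crop_prod_density'))
print(result)
```

groupby sorts the keys by default, so the rows come out ordered by country and then year, as in df3 above.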
For more on python - row-wise operations in a Pandas DataFrame, see the similar question on Stack Overflow: https://stackoverflow.com/questions/47546355/