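For the following dataframe, I want to fill in the missing years (from 2015 to 2017) within each group of city and district; then calculate pct by grouping on the columns city, district and year; and as the final step, display the value and pct columns horizontally?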

  city district  value  year
0   sh        a      2  2015
1   sh        a      3  2016
2   sh        b      5  2015
3   sh        b      3  2016
4   bj        c      4  2015
5   bj        c      3  2017


What I have done so far:

1. Fill in the missing years, but this does not work yet:

rng = pd.date_range('2015', '2017', freq='YS').dt.year
df = df.apply(lambda x: x.reindex(rng, fill_value = 0))


2. Calculate pct by grouping on city and district:

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()


3. Display the value and pct columns horizontally, but the order is not what I want:

df.pivot_table(columns='year', index=['city','district'], values=['value', 'pct'], fill_value='NaN').reset_index()


My output so far:

      city   district       pct            value
year                  2015 2016  2017  2015 2016 2017
0      bj        c     NaN  NaN -0.25   4.0  NaN    3
1      sh        a     NaN  0.5   NaN   2.0    3  NaN
2      sh        b     NaN -0.4   NaN   5.0    3  NaN


How can I get the expected result?

city  district      2015         2016         2017
                value  pct    value  pct  value   pct
bj     c          4                         3
sh     a          2             3    0.5
sh     b          5             3   -0.4



Thanks.

Best answer

Use DataFrame.swaplevel together with DataFrame.sort_index; a solution based on reindex is also added for filling in the missing years:

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])

df = df.set_index(['city','district','year']).reindex(mux)

df['pct'] = df.sort_values('year').groupby(['city', 'district']).value.pct_change()

df = df.pivot_table(columns='year',
                    index=['city','district'],
                    values=['value', 'pct'],
                    fill_value='NaN')

df = df.swaplevel(0,1, axis=1).sort_index(axis=1, level=0)
print (df)
year          2015       2016        2017
               pct value  pct value   pct value
city district
bj   c         NaN   4.0  0.0   NaN -0.25     3
sh   a         NaN   2.0  0.5     3  0.00   NaN
     b         NaN   5.0 -0.4     3  0.00   NaN
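
A possible variation, not part of the answer above: the product of all cities and all districts also produces pairs that never occur in the data (for example sh/c), and the default forward fill in pct_change produces values such as -0.25 for bj/c in 2017 even though 2016 is missing. The sketch below assumes the sample data from the question, builds the index only from the observed (city, district) pairs, computes the change with a plain shift instead of pct_change, and orders the columns as value then pct within each year. The names years, pairs, out and wide are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'sh', 'sh', 'sh', 'bj', 'bj'],
    'district': ['a', 'a', 'b', 'b', 'c', 'c'],
    'value': [2, 3, 5, 3, 4, 3],
    'year': [2015, 2016, 2015, 2016, 2015, 2017],
})

years = range(2015, 2018)

# Only the (city, district) pairs present in the data, crossed with every year
pairs = df[['city', 'district']].drop_duplicates()
mux = pd.MultiIndex.from_tuples(
    [(c, d, y) for c, d in pairs.itertuples(index=False) for y in years],
    names=['city', 'district', 'year'],
)

# Missing years become NaN rows; sorting keeps the years in order per group
out = df.set_index(['city', 'district', 'year']).reindex(mux).sort_index()

# Year-over-year change relative to the previous row of the same group,
# without forward-filling gaps (a missing previous year gives NaN)
prev = out.groupby(level=['city', 'district'])['value'].shift()
out['pct'] = out['value'] / prev - 1

# Move years to the columns, put year on the outer column level,
# then force the order value, pct inside each year
wide = out.unstack('year').swaplevel(0, 1, axis=1)
cols = pd.MultiIndex.from_product([years, ['value', 'pct']], names=['year', None])
wide = wide.reindex(columns=cols)
print(wide)
```

Reindexing the columns with an explicit MultiIndex is used here because sort_index would place pct before value alphabetically.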


EDIT: The error:

  ValueError: cannot handle a non-unique multi-index!


means that there are duplicates across the key columns, here ['city','district','year']. The solution is to make the keys unique, for example by aggregating with mean:

print (df)
#  city district  value  year
#0   sh        a      2  2015
#0   sh        a     20  2015
#1   sh        a      3  2016
#2   sh        b      5  2015
#3   sh        b      3  2016
#4   bj        c      4  2015
#5   bj        c      3  2017

rng = pd.date_range('2015', '2017', freq='YS').year
c = df['city'].unique()
d = df['district'].unique()
mux = pd.MultiIndex.from_product([c, d, rng], names=['city','district','year'])




print (df.groupby(['city','district','year'])['value'].mean())
city  district  year
bj    c         2015     4.0
                2017     3.0
sh    a         2015    11.0
                2016     3.0
      b         2015     5.0
                2016     3.0
Name: value, dtype: float64

df = df.groupby(['city','district','year'])['value'].mean().reindex(mux)

print (df)
#city  district  year
#sh    a         2015    11.0
#                2016     3.0
#                2017     NaN
#      b         2015     5.0
#                2016     3.0
#                2017     NaN
#      c         2015     NaN
#                2016     NaN
#                2017     NaN
#bj    a         2015     NaN
#                2016     NaN
#                2017     NaN
#      b         2015     NaN
#                2016     NaN
#                2017     NaN
#      c         2015     4.0
#                2016     NaN
#                2017     3.0
#Name: value, dtype: float64
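
To see which rows cause the error before reindexing, one option is to look for repeated key combinations first. This is a minimal sketch assuming the sample data with the duplicated sh/a/2015 row shown above; the names keys, dupes and unique are illustrative, and mean is only one possible aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['sh', 'sh', 'sh', 'sh', 'sh', 'bj', 'bj'],
    'district': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'value': [2, 20, 3, 5, 3, 4, 3],
    'year': [2015, 2015, 2016, 2015, 2016, 2015, 2017],
})

keys = ['city', 'district', 'year']

# keep=False marks every row of a repeated key combination, not only the later ones
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)

# Collapse the duplicates to a single row per key by averaging the values
unique = df.groupby(keys, as_index=False)['value'].mean()
print(unique)
```

If dupes is empty, the keys are already unique and set_index followed by reindex can be applied directly.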

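Regarding python - Fill missing years group by multiple columns and display multiple columns horizontally in order in Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58446571/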
