Pandas: Apply an operation to repeated columns in a MultiIndex

I have MultiIndex columns: the second level repeatedly contains Job openings and Hires. I would like to subtract one from the other for each of the top-level columns, but everything I try ends in index errors or slicing errors.
How can I compute it?

Sample data:

    >>> df.head()
    Out[25]:
               Total nonfarm              Total private
                       Hires Job openings         Hires Job openings
    date
    2001-01-01          5777         5385          5419         4887
    2002-01-01          4849         3759          4539         3381
    2003-01-01          4971         3824          4645         3424
    2004-01-01          4827         3459          4552         3153
    2005-01-01          5207         3670          4876         3358

Expected output:

    Out[25]:
               Total nonfarm Total private
                  difference    difference
    date
    2001-01-01          1234          5678
    2002-01-01          1234          5678
    2003-01-01          1234          5678
    2004-01-01          1234          5678
    2005-01-01          1234          5678

where the numbers are obviously placeholders, not the correct values.

Specifically within an apply()

To get a generally applicable approach, I tried to set up:

    def apply(group):
        result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
        result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
        return result.values

    foo = df.groupby(axis=1, level=0).apply(apply)

This suffers from two problems:

- I need to cheat with .values to get the division to work properly.
- foo is not a proper dataframe:

    Accommodation and food services        [[0.76], [0.480349344978], [0.501388888889], [...
    Arts, entertainment, and recreation    [[0.558139534884], [0.46017699115], [0.2483221...
    Construction                           [[0.35], [0.274881516588], [0.267260579065], [...

I first tried to return result instead of result.values, but that just led to a data frame full of NaN.

Specifically with using the column names

What I don't like about the highest-voted answer is that it relies on .diff() or .div() hacks, which make the code hard to read and are hard to implement when there are more than two columns at the sub-level.
Solution

Setup

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        [
            [5777, 5385, 5419, 4887],
            [4849, 3759, 4539, 3381],
            [4971, 3824, 4645, 3424],
            [4827, 3459, 4552, 3153],
            [5207, 3670, 4876, 3358],
        ],
        index=pd.to_datetime(['2001-01-01', '2002-01-01', '2003-01-01',
                              '2004-01-01', '2005-01-01']),
        columns=pd.MultiIndex.from_tuples(
            [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
             ('Total private', 'Hires'), ('Total private', 'Job Openings')]
        )
    )

    print(df)

               Total nonfarm              Total private
                       Hires Job Openings         Hires Job Openings
    2001-01-01          5777         5385          5419         4887
    2002-01-01          4849         3759          4539         3381
    2003-01-01          4971         3824          4645         3424
    2004-01-01          4827         3459          4552         3153
    2005-01-01          5207         3670          4876         3358

Try:

    # Transpose so the column MultiIndex becomes the row index, group by the
    # top level, subtract the next row (Job Openings) from each row (Hires),
    # drop the all-NaN last row of each group, and transpose back.
    df.T.groupby(level=0).diff(-1).dropna().T

               Total nonfarm Total private
                       Hires         Hires
    2001-01-01         392.0         532.0
    2002-01-01        1090.0        1158.0
    2003-01-01        1147.0        1221.0
    2004-01-01        1368.0        1399.0
    2005-01-01        1537.0        1518.0

To apply other transforms, say a ratio, you could do:

    # group_keys=False keeps the original row labels in recent pandas versions,
    # which would otherwise prepend the group key as an extra index level.
    print(df.T.groupby(level=0, group_keys=False)
            .apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T)

               Total nonfarm Total private
                       Hires         Hires
    2001-01-01      1.072795      1.108860
    2002-01-01      1.289971      1.342502
    2003-01-01      1.299948      1.356600
    2004-01-01      1.395490      1.443704
    2005-01-01      1.418801      1.452055

Or:

    print(df.T.groupby(level=0, group_keys=False)
            .apply(lambda x: x.div(x.shift(-1))).dropna().T)

               Total nonfarm Total private
                       Hires         Hires
    2001-01-01      1.072795      1.108860
    2002-01-01      1.289971      1.342502
    2003-01-01      1.299948      1.356600
    2004-01-01      1.395490      1.443704
    2005-01-01      1.418801      1.452055

To rename the columns and combine the result with the original dataframe, you can:

    df2 = df.T.groupby(level=0).diff(-1).dropna().T
    df2.columns = pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'difference'), ('Total private', 'difference')])
    pd.concat([df, df2], axis=1).sort_index(axis=1)

Looks like:

               Total nonfarm                         Total private
                       Hires Job Openings difference         Hires Job Openings difference
    2001-01-01          5777         5385      392.0          5419         4887      532.0
    2002-01-01          4849         3759     1090.0          4539         3381     1158.0
    2003-01-01          4971         3824     1147.0          4645         3424     1221.0
    2004-01-01          4827         3459     1368.0          4552         3153     1399.0
    2005-01-01          5207         3670     1537.0          4876         3358     1518.0
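The question also asks for something that works by column name and scales beyond two sub-level columns. This is not part of the answer above; it is a minimal sketch of one alternative, assuming the sub-level labels are exactly 'Hires' and 'Job Openings' as in the setup. It takes one cross-section per label with DataFrame.xs, so ordinary arithmetic aligns on the top-level labels. The names hires, openings, derived and combined are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01', '2002-01-01', '2003-01-01',
                          '2004-01-01', '2005-01-01']),
    columns=pd.MultiIndex.from_product(
        [['Total nonfarm', 'Total private'], ['Hires', 'Job Openings']]),
)

# One cross-section per sub-level label. Each result has only the
# top-level labels ('Total nonfarm', 'Total private') as its columns.
hires = df.xs('Hires', axis=1, level=1)
openings = df.xs('Job Openings', axis=1, level=1)

# Plain arithmetic aligns on the shared top-level labels. The dict keys
# become the outer column level of the concatenated frame.
derived = pd.concat(
    {'difference': hires - openings, 'ratio': hires / openings},
    axis=1,
)

# Move the top-level label back to the front, then merge with the original.
derived = derived.swaplevel(axis=1)
combined = pd.concat([df, derived], axis=1).sort_index(axis=1)

print(combined)
```

Because every operand is selected by name, adding a third sub-level column only means taking one more cross-section; no assumption about row order within a group is needed, unlike with .diff(-1).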