I have two input dataframes.
df1 (note that this DF can contain more data columns):
   Sample  Animal  Time     Sex
0       1       A   one    male
1       2       A   two    male
2       3       B   one  female
3       4       C   one    male
4       5       D   one  female
and df2:
          a    b    c
Sample
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4
I want to combine them so that I get the following:
        one_a  one_b  one_c  two_a  two_b  two_c     Sex
Animal
A         0.2    0.4    0.3    0.5    0.7    0.2    male
B         0.4    0.1    0.9    NaN    NaN    NaN  female
C         0.4    0.2    0.3    NaN    NaN    NaN    male
D         0.6    0.2    0.4    NaN    NaN    NaN  female
This is how I am doing it:
cols = list(df2.columns)  # value columns of df2: a, b, c
df2 = df2.reset_index()
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=cols)
df4 = pd.merge(df3, df1, on='Sample')
# df1 as shown has a 'Time' column (no 'Group'), so build the new names from it
df4['moo'] = df4['Time'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop(columns=['Sample', 'Time'])
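This works fine, but it can be slow for large datasets. I wonder whether any pandas pros can see a better way (read: faster, more efficient)? I am new to pandas and can imagine there are some shortcuts here that I don't know about.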
Best answer
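There are a few steps here. The key point is that, to create columns like one_a one_b .... two_c, we need to add the Time column to the Sample index to build a multi-level index, and then unstack to get the required form. After that, a groupby on the Animal index level is needed to aggregate the rows and reduce the number of NaN values. The rest is just some formatting.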
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
        Animal  Time     Sex
Sample
1            A   one    male
2            A   two    male
3            B   one  female
4            C   one    male
5            D   one  female
print(df2)
          a    b    c
Sample
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
                      a         b         c
Time                  one  two  one  two  one  two
Sample Animal Sex
1      A      male    0.2  NaN  0.4  NaN  0.3  NaN
2      A      male    NaN  0.5  NaN  0.7  NaN  0.2
3      B      female  0.4  NaN  0.1  NaN  0.9  NaN
4      C      male    0.4  NaN  0.2  NaN  0.3  NaN
5      D      female  0.6  NaN  0.2  NaN  0.4  NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]
print(df_temp)
                      one_a  two_a  one_b  two_b  one_c  two_c
Sample Animal Sex
1      A      male      0.2    NaN    0.4    NaN    0.3    NaN
2      A      male      NaN    0.5    NaN    0.7    NaN    0.2
3      B      female    0.4    NaN    0.1    NaN    0.9    NaN
4      C      male      0.4    NaN    0.2    NaN    0.3    NaN
5      D      female    0.6    NaN    0.2    NaN    0.4    NaN
# collapse the per-sample rows into one row per animal; max() skips NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').max().sort_index(axis=1)
print(result)
           Sex  one_a  one_b  one_c  two_a  two_b  two_c
Animal
A         male    0.2    0.4    0.3    0.5    0.7    0.2
B       female    0.4    0.1    0.9    NaN    NaN    NaN
C         male    0.4    0.2    0.3    NaN    NaN    NaN
D       female    0.6    0.2    0.4    NaN    NaN    NaN
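For anyone who wants to run the steps end to end, here is a minimal self-contained sketch. It assumes the two inputs are rebuilt by hand from the printed tables, since the post does not show how df1 and df2 were created. The name wide is introduced only for illustration and plays the role of df_temp.

```python
import pandas as pd

# Sample inputs rebuilt by hand from the printed tables (an assumption:
# the post does not show how the frames were constructed).
df1 = pd.DataFrame({
    'Sample': [1, 2, 3, 4, 5],
    'Animal': ['A', 'A', 'B', 'C', 'D'],
    'Time': ['one', 'two', 'one', 'one', 'one'],
    'Sex': ['male', 'male', 'female', 'male', 'female'],
})
df2 = pd.DataFrame(
    {'a': [0.2, 0.5, 0.4, 0.4, 0.6],
     'b': [0.4, 0.7, 0.1, 0.2, 0.2],
     'c': [0.3, 0.2, 0.9, 0.3, 0.4]},
    index=pd.Index([1, 2, 3, 4, 5], name='Sample'),
)

# Join on Sample, then pivot the Time index level into the columns.
df = df1.set_index('Sample').join(df2)
wide = df.set_index(['Animal', 'Sex', 'Time'], append=True).unstack('Time')

# Flatten the (value column, Time) pairs into names such as 'one_a'.
wide.columns = [f'{time}_{col}' for col, time in wide.columns]

# Collapse the per-sample rows into one row per animal; max() skips NaN.
result = (
    wide.reset_index('Sex')
        .groupby(level='Animal')
        .max()
        .sort_index(axis=1)
)
print(result)
```

Taking max() of the Sex column only makes sense because each animal has a single Sex value here; if that did not hold in real data, a different aggregation for that column would be needed.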
Regarding python - Pandas merge dataframes and pivot creating new columns, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31418456/