问题描述
我有2个数据帧,它们都有一个可能有重复的键列,但数据帧大部分具有相同的重复键。我想在这个关键字上合并这些数据框,但是当两者都具有相同的重复时,这些重复的数据帧被分别合并。另外如果一个数据框比另一个数据帧重复一个,我希望将其值填充为NaN。例如: df1 = pd.DataFrame({'key':['K0','K1','K2' ,'K2','K2','K3'],
'A':['A0','A1','A2','A3','A4' b $ b columns = ['key','A'])
df2 = pd.DataFrame({'B':['B0','B1','B2','B3','B4' ,'B5','B6'],
'key':['K0','K1','K2','K2','K3','K3','K4']},
columns = ['key','B'])
键A
0 K0 A0
1 K1 A1
2 K2 A2
3 K2 A3
4 K2 A4
5 K3 A5
键B
0 K0 B0
1 K1 B1
2 K2 B2
3 K2 B3
4 K3 B4
5 K3 B5
6 K4 B6
我试图获得以下输出
键AB
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K2 A3 B3
6 K2 A4 NaN
8 K3 A5 B4
9 K3 NaN B5
10 K4 NaN B6
所以基本上,我想将重复的K2键视为K2_1,K2_2,然后在数据框上进行how ='outer'合并。
任何想法如何才能实现这一点?
再次更快
%% cython
#在jupyter笔记本中使用cython
#在另一个单元格中运行`%load_ext Cython`
from collections import defaultdict
import numpy as np
def cg(x):
cnt = defaultdict(lambda:0)
for x in x.tolist() :
cnt [j] + = 1
yield cnt [j]
def fastcount(x):
return [i for c in cg (x)]
df1 ['cc'] = fastcount(df1.key.values)
df2 ['cc'] = fastcount(df2.key.values)
df1.merge(df2,how ='outer')。drop('cc',1)
更快的回答;不可缩放
def fastcount(x):
unq,inv = np。 unique(x,return_inverse = 1)
m = np.arange(len(unq))[:, None] == inv
return(m.cumsum(1)* m).sum(0)
df1 ['cc'] = fastcount(df1.key.values)
df2 ['cc'] = fastcount(df2.key.values)
df1.merge(df2,how ='outer')。drop('cc',1)
旧答案
df1 ['cc'] = df1.groupby ('key')。cumcount()
df2 ['cc'] = df2.groupby('key')。cumcount()
df1.merge(df2,how ='outer ').drop('cc',1)
I have 2 dataframes, both have a key column which could have duplicates, but the dataframes mostly have the same duplicated keys. I'd like to merge these dataframes on that key, but in such a way that when both have the same duplicate those duplicates are merged respectively. In addition if one dataframe has more duplicates of a key than the other, I'd like it's values to be filled as NaN. For example:
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
columns=['key', 'A'])
df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
columns=['key', 'B'])
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K2 A3
4 K2 A4
5 K3 A5
key B
0 K0 B0
1 K1 B1
2 K2 B2
3 K2 B3
4 K3 B4
5 K3 B5
6 K4 B6
I'm trying to get the following output
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K2 A3 B3
6 K2 A4 NaN
8 K3 A5 B4
9 K3 NaN B5
10 K4 NaN B6
So basically, I'd like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do the how='outer' merge on the dataframes.Any ideas how I can accomplish this?
faster again
%%cython
# using cython in jupyter notebook
# in another cell run `%load_ext Cython`
from collections import defaultdict
import numpy as np
def cg(x):
cnt = defaultdict(lambda: 0)
for j in x.tolist():
cnt[j] += 1
yield cnt[j]
def fastcount(x):
return [i for i in cg(x)]
df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
df1.merge(df2, how='outer').drop('cc', 1)
faster answer; not scalable
def fastcount(x):
unq, inv = np.unique(x, return_inverse=1)
m = np.arange(len(unq))[:, None] == inv
return (m.cumsum(1) * m).sum(0)
df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)
df1.merge(df2, how='outer').drop('cc', 1)
old answer
df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()
df1.merge(df2, how='outer').drop('cc', 1)
这篇关于将大 pandas 数据框与重复的重复文件进行合并的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!