将大 pandas 数据框与重复的重复文件进行合并

本文介绍了将大 pandas 数据框与重复的重复文件进行合并的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有2个数据帧，它们都有一个可能有重复的键列，但数据帧大部分具有相同的重复键。我想在这个关键字上合并这些数据框，但是当两者都具有相同的重复时，这些重复的数据帧被分别合并。另外如果一个数据框比另一个数据帧重复一个，我希望将其值填充为NaN。例如：

  df1 = pd.DataFrame（{'key'：['K0'，'K1'，'K2' ，'K2'，'K2'，'K3']，
'A'：['A0'，'A1'，'A2'，'A3'，'A4' b $ b columns = ['key'，'A']）
 df2 = pd.DataFrame（{'B'：['B0'，'B1'，'B2'，'B3'，'B4' ，'B5'，'B6']，
'key'：['K0'，'K1'，'K2'，'K2'，'K3'，'K3'，'K4']}，
 columns = ['key'，'B']）
 
键A 
 0 K0 A0 
 1 K1 A1 
 2 K2 A2 
 3 K2 A3 
 4 K2 A4 
 5 K3 A5 
 
键B 
 0 K0 B0 
 1 K1 B1 
 2 K2 B2 
 3 K2 B3 
 4 K3 B4 
 5 K3 B5 
 6 K4 B6

我试图获得以下输出

 键AB 
 0 K0 A0 B0 
 1 K1 A1 B1 
 2 K2 A2 B2 
 3 K2 A3 B3 
 6 K2 A4 NaN 
 8 K3 A5 B4 
 9 K3 NaN B5 
 10 K4 NaN B6

所以基本上，我想将重复的K2键视为K2_1，K2_2，然后在数据框上进行how ='outer'合并。
任何想法如何才能实现这一点？

解决方案

再次更快

  %% cython 
＃在jupyter笔记本中使用cython 
＃在另一个单元格中运行`％load_ext Cython` 
 from collections import defaultdict 
 import numpy as np 
 
 def cg（x）：
 cnt = defaultdict（lambda：0）
 
 for x in x.tolist（） ：
 cnt [j] + = 1 
 yield cnt [j] 
 
 
 def fastcount（x）：
 return [i for c in cg （x）] 
 
 df1 ['cc'] = fastcount（df1.key.values）
 df2 ['cc'] = fastcount（df2.key.values）
 
 df1.merge（df2，how ='outer'）。drop（'cc'，1）

更快的回答;不可缩放

  def fastcount（x）：
 unq，inv = np。 unique（x，return_inverse = 1）
m = np.arange（len（unq））[:, None] == inv 
 return（m.cumsum（1）* m）.sum（0） 
 
 df1 ['cc'] = fastcount（df1.key.values）
 df2 ['cc'] = fastcount（df2.key.values）
 
 df1.merge（df2，how ='outer'）。drop（'cc'，1）

旧答案

  df1 ['cc'] = df1.groupby （'key'）。cumcount（）
 df2 ['cc'] = df2.groupby（'key'）。cumcount（）
 
 df1.merge（df2，how ='outer '）.drop（'cc'，1）

I have 2 dataframes, both have a key column which could have duplicates, but the dataframes mostly have the same duplicated keys. I'd like to merge these dataframes on that key, but in such a way that when both have the same duplicate those duplicates are merged respectively. In addition if one dataframe has more duplicates of a key than the other, I'd like it's values to be filled as NaN. For example:

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A':   ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']}, 
                   columns=['key', 'A'])
df2 = pd.DataFrame({'B':   ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
                    'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']}, 
                   columns=['key', 'B'])

  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K2  A3
4  K2  A4
5  K3  A5

  key   B
0  K0  B0
1  K1  B1
2  K2  B2
3  K2  B3
4  K3  B4
5  K3  B5
6  K4  B6

I'm trying to get the following output

   key    A   B
0   K0   A0  B0
1   K1   A1  B1
2   K2   A2  B2
3   K2   A3  B3
6   K2   A4  NaN
8   K3   A5  B4
9   K3  NaN  B5
10  K4  NaN  B6

So basically, I'd like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do the how='outer' merge on the dataframes.Any ideas how I can accomplish this?

解决方案

faster again

%%cython
# using cython in jupyter notebook
# in another cell run `%load_ext Cython`
from collections import defaultdict
import numpy as np

def cg(x):
    cnt = defaultdict(lambda: 0)

    for j in x.tolist():
        cnt[j] += 1
        yield cnt[j]


def fastcount(x):
    return [i for i in cg(x)]

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop('cc', 1)

faster answer; not scalable

def fastcount(x):
    unq, inv = np.unique(x, return_inverse=1)
    m = np.arange(len(unq))[:, None] == inv
    return (m.cumsum(1) * m).sum(0)

df1['cc'] = fastcount(df1.key.values)
df2['cc'] = fastcount(df2.key.values)

df1.merge(df2, how='outer').drop('cc', 1)

old answer

df1['cc'] = df1.groupby('key').cumcount()
df2['cc'] = df2.groupby('key').cumcount()

df1.merge(df2, how='outer').drop('cc', 1)

这篇关于将大 pandas 数据框与重复的重复文件进行合并的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！