How to identify the first occurrence of duplicate rows in a Python Pandas DataFrame

Problem description

I have a pandas DataFrame with duplicate values for a set of columns. For example:

import pandas as pd

df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})

In [2]: df
Out[2]:
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

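Rows (1) and (3), at index 0 and 2, have the same values in Column2, Column3 and Column4. Essentially, row (3) is a duplicate of row (1).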

I am looking for the following output:

Is_Duplicate, containing whether the row is a duplicate or not [can be accomplished by using "duplicated" method on dataframe columns (Column2, Column3 and Column4)]

Dup_Index, containing the index of the original row, i.e. the first occurrence that the row duplicates.

In [3]: df
Out[3]:
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0

Recommended answer

There is a DataFrame method duplicated for the first column:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]:
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])
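
As a small aside, duplicated also takes a keep argument, which decides which member of each group of repeats is left unflagged. The sketch below rebuilds the question's frame and compares keep='first' (the default used above) with keep=False. The variable names key_cols, later_repeats, any_repeat and first_of_repeated are illustrative and do not come from the original answer.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})

# Illustrative name: the columns that define "the same row".
key_cols = ['Column2', 'Column3', 'Column4']

# keep='first' (the default): the first occurrence stays False,
# only the later repeats are flagged True.
later_repeats = df.duplicated(subset=key_cols, keep='first')

# keep=False: every row that belongs to a repeated group is flagged.
any_repeat = df.duplicated(subset=key_cols, keep=False)

# Rows that are the first occurrence of a group that repeats later.
first_of_repeated = any_repeat & ~later_repeats

print(pd.DataFrame({
    'later_repeats': later_repeats,
    'any_repeat': any_repeat,
    'first_of_repeated': first_of_repeated,
}))
```

For the Is_Duplicate column requested in the question, the default keep='first' is the variant that applies.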

To do the second you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]:
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
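
One caveat about the approach above: g.indices stores integer positions, not index labels, so the result matches the row labels only when the frame has a default 0..n-1 index. A possible alternative, sketched here and not part of the original answer, copies the index labels into a temporary column and broadcasts the first label of each group with transform('first'). The names key_cols and _row_label are made up for this illustration.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'Column1': [1, 2, 3],
    'Column2': ['ABC', 'XYZ', 'ABC'],
    'Column3': ['DEF', 'DEF', 'DEF'],
    'Column4': [10, 40, 10],
})

# Illustrative name: the columns that define "the same row".
key_cols = ['Column2', 'Column3', 'Column4']

# True for every repeat after the first occurrence.
df['Is_Duplicate'] = df.duplicated(key_cols)

# Copy the index labels into a temporary column (hypothetical name
# '_row_label'), then give every row the first label seen in its group.
df['Dup_Index'] = (
    df.assign(_row_label=df.index)
      .groupby(key_cols)['_row_label']
      .transform('first')
)

print(df)
```

Because the labels come from df.index itself, this sketch should also work when the index is not a plain 0..n-1 range. Rows with missing values in the key columns are a separate case, since groupby drops such groups by default.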