Problem description
Is there a way in pandas to check if a dataframe column has duplicate values, without actually dropping rows? I have a function that will remove duplicate rows, however, I only want it to run if there are actually duplicates in a specific column.
Currently I compare the number of unique values in the column to the number of rows: if there are fewer unique values than rows then there are duplicates and the code runs.
if len(df['Student'].unique()) < len(df.index):
# Code to remove duplicates based on Date column runs
Is there an easier or more efficient way to check if duplicate values exist in a specific column, using pandas?
Here is some of the sample data I am working with (only two columns shown). If duplicates are found, another function identifies which row to keep (the row with the oldest date):
Student Date
0 Joe December 2017
1 James January 2018
2 Bob April 2018
3 Joe December 2017
4 Jack February 2018
5 Jack March 2018
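The question mentions a separate function that decides which row to keep (the one with the oldest date). That step is not shown in the post; below is a minimal sketch of how it might look, assuming the Date strings parse with pd.to_datetime using the '%B %Y' format. The keep_oldest helper name is made up for illustration:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
James,January 2018
Bob,April 2018
Joe,December 2017
Jack,February 2018
Jack,March 2018'''
df = pd.read_csv(io.StringIO(data))

def keep_oldest(df):
    # Sort by the parsed date so the oldest row per student comes first,
    # then keep only the first occurrence of each student.
    parsed = pd.to_datetime(df['Date'], format='%B %Y')
    return (df.assign(_parsed=parsed)
              .sort_values('_parsed')
              .drop_duplicates(subset=['Student'], keep='first')
              .drop(columns='_parsed')
              .sort_index())

print(keep_oldest(df))
```

Sorting before drop_duplicates is what makes "first occurrence" mean "oldest date"; the final sort_index restores the original row order.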
Main question
Are there duplicate values in a column, True/False?
╔═════════╦═══════════════╗
║ Student ║ Date ║
╠═════════╬═══════════════╣
║ Joe ║ December 2017 ║
╠═════════╬═══════════════╣
║ Bob ║ April 2018 ║
╠═════════╬═══════════════╣
║ Joe ║ December 2018 ║
╚═════════╩═══════════════╝
Assuming the above dataframe (df), we can do a quick check for duplicates in the Student column by:
boolean = not df["Student"].is_unique # True (credit to @Carsten)
boolean = df['Student'].duplicated().any() # True
Further reading and references
Above we are using one of the Pandas Series methods. The pandas DataFrame has several useful methods, two of which are:
- drop_duplicates(self[, subset, keep, inplace]) - Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- duplicated(self[, subset, keep]) - Return boolean Series denoting duplicate rows, optionally only considering certain columns.
These methods can be applied to the DataFrame as a whole, and not just to a single Series (column) as above. The equivalent would be:
boolean = df.duplicated(subset=['Student']).any() # True
# We were expecting True, as Joe can be seen twice.
However, if we are interested in the whole frame we could go ahead and do:
boolean = df.duplicated().any() # False
boolean = df.duplicated(subset=['Student','Date']).any() # False
# We were expecting False here - no duplicates row-wise
# ie. Joe Dec 2017, Joe Dec 2018
And a final useful tip: by using the keep parameter we can often skip a step or two and directly access what we need:

keep : {'first', 'last', False}, default 'first'
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
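As a quick illustration on the three-row frame above, duplicated accepts the same keep values and flags rows accordingly:

```python
import io
import pandas as pd

data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data))

# keep='first' (default): only the later Joe is flagged as a duplicate
print(df.duplicated(subset=['Student'], keep='first').tolist())  # [False, False, True]

# keep='last': the earlier Joe is flagged instead
print(df.duplicated(subset=['Student'], keep='last').tolist())   # [True, False, False]

# keep=False: every occurrence of a duplicated value is flagged,
# so ~mask keeps only students that appear exactly once
mask = df.duplicated(subset=['Student'], keep=False)
print(df.loc[~mask])                                             # only Bob remains
```

keep=False is handy when you want to discard all conflicting rows rather than arbitrate between them.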
Example to play around with
import pandas as pd
import io
data = '''\
Student,Date
Joe,December 2017
Bob,April 2018
Joe,December 2018'''
df = pd.read_csv(io.StringIO(data), sep=',')
# Approach 1: Simple True/False
boolean = df.duplicated(subset=['Student']).any()
print(boolean, end='\n\n') # True
# Approach 2: First store boolean array, check then remove
duplicate_in_student = df.duplicated(subset=['Student'])
if duplicate_in_student.any():
    print(df.loc[~duplicate_in_student], end='\n\n')
# Approach 3: Use drop_duplicates method
df.drop_duplicates(subset=['Student'], inplace=True)
print(df)
Returns
True
Student Date
0 Joe December 2017
1 Bob April 2018
Student Date
0 Joe December 2017
1 Bob April 2018