I have a CSV file. It looks like this:

name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,


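I want to find out whether the id column contains duplicates and, if so, which values are duplicated. In this case the answer is 2222.

I have code that checks whether any duplicates exist. Here it is: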

import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()


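The question is: how do I find out which values are duplicated?

I am using Python 2.7 and pandas.

Best answer

I think you can use duplicated (keep can be omitted, because the default is keep='first'). If you need the values as a list, add tolist: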

print df['id'][df.duplicated(subset=['id'])]
3    2222
Name: id, dtype: int64

print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
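
As a self-contained sketch of the same idea, the snippet below builds the sample data inline instead of reading C:/test.csv, and drops the trailing commas. The helper name duplicate_values is hypothetical and not part of pandas. It adds unique() so that an id occurring three or more times is reported only once.

```python
import io
import pandas as pd

# Sample data from the question, inlined so no file is needed.
# Assumption: trailing commas are dropped so only the name and id columns exist.
CSV_TEXT = u"""name,id
AAA,1111
BBB,2222
CCC,3333
DDD,2222
"""

def duplicate_values(df, column):
    """Return each value that occurs more than once in `column`, listed once."""
    mask = df[column].duplicated(keep='first')
    return df.loc[mask, column].unique().tolist()

df = pd.read_csv(io.StringIO(CSV_TEXT))
print(duplicate_values(df, 'id'))
```

Without unique(), a value that appears n times would show up n - 1 times in the result, because keep='first' flags every occurrence after the first.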


You can check what duplicated returns for each keep option:

print df.duplicated(subset=['id'], keep='first')
0    False
1    False
2    False
3     True
dtype: bool

print df.duplicated(subset=['id'], keep='last')
0    False
1     True
2    False
3    False
dtype: bool

print df.duplicated(subset=['id'], keep=False)
0    False
1     True
2    False
3     True
dtype: bool
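
The difference between the keep options may be easier to see when a value occurs more than twice. This is a minimal sketch, assuming a hypothetical id column in which 2222 appears three times:

```python
import pandas as pd

# Hypothetical data: 2222 occurs three times (positions 1, 2 and 3).
ids = pd.Series([1111, 2222, 2222, 2222], name='id')

first = ids.duplicated(keep='first')   # flags every occurrence except the first
last = ids.duplicated(keep='last')     # flags every occurrence except the last
every = ids.duplicated(keep=False)     # flags all occurrences of a repeated value

print(first.tolist())
print(last.tolist())
print(every.tolist())
```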


If you need the duplicated rows themselves, filter the DataFrame with the same mask:

print df[df.duplicated(subset=['id'], keep='first')]
  name    id
3  DDD  2222

print df[df.duplicated(subset=['id'], keep='last')]
  name    id
1  BBB  2222

print df[df.duplicated(subset=['id'], keep=False)]
  name    id
1  BBB  2222
3  DDD  2222
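
If it helps to see which rows share each duplicated id, the keep=False selection can be grouped by id. This is a sketch under the assumption that the sample data is built inline. The variable names dupes and names_by_id are illustrative only.

```python
import pandas as pd

# Sample data from the question, built inline for a self-contained example.
df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# Keep every row whose id occurs more than once.
dupes = df[df.duplicated(subset=['id'], keep=False)]

# Map each duplicated id to the list of names that share it.
names_by_id = dupes.groupby('id')['name'].apply(list).to_dict()
print(names_by_id)
```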


Use drop_duplicates to remove them:

print df.drop_duplicates(subset=['id'], keep='first')
  name    id
0  AAA  1111
1  BBB  2222
2  CCC  3333

print df.drop_duplicates(subset=['id'], keep='last')
  name    id
0  AAA  1111
2  CCC  3333
3  DDD  2222

print df.drop_duplicates(subset=['id'], keep=False)
  name    id
0  AAA  1111
2  CCC  3333
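
The two methods are complementary: dropping duplicates should give the same rows as selecting those that duplicated does not flag. A small sketch that checks this on the sample data, built inline here as an assumption:

```python
import pandas as pd

# Sample data from the question, built inline for a self-contained example.
df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

results = {}
for keep in ('first', 'last', False):
    dropped = df.drop_duplicates(subset=['id'], keep=keep)
    # Inverting the duplicated mask selects the rows that are kept.
    masked = df[~df.duplicated(subset=['id'], keep=keep)]
    results[keep] = dropped.equals(masked)

print(results)
```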

About python - finding duplicates in a pandas data structure: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35376954/
