I have a CSV file. It looks like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
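I want to find out whether the id column contains any duplicates and, if it does, which values are duplicated. In this case the answer is 2222. I already have code that checks whether duplicates exist. Here it is: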
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()
The question is: how do I find out which values are duplicated?
I am using Python 2.7 and pandas.
Best answer
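I think you can use duplicated with the subset parameter (keep is omitted here because the default is keep='first'). If you need the values as a plain list, append tolist: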
print df['id'][df.duplicated(subset=['id'])]
3 2222
Name: id, dtype: int64
print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
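With keep='first', an id that occurred three or more times would be listed more than once in the result above. The following is a minimal sketch of one way to get each duplicated value exactly once. It assumes the sample data from the question, loaded from an in-memory string instead of C:/test.csv, and it passes usecols so that the trailing commas in the file do not produce an extra empty column. The names csv_text, mask and duplicate_ids are illustrative only.

```python
import io

import pandas as pd

# Sample data from the question, kept in memory for a self-contained example.
csv_text = u"""name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
"""

# usecols keeps only the two real columns; the trailing commas would
# otherwise yield a third, empty "Unnamed" column.
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'id'])

# keep=False marks every occurrence of a repeated id, not only the later ones.
mask = df['id'].duplicated(keep=False)

# unique() collapses the repeated occurrences to one entry per value.
duplicate_ids = df.loc[mask, 'id'].unique().tolist()
print(duplicate_ids)  # [2222]
```

An empty list here would mean the column has no duplicates, so the same expression also answers the original yes/no question.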
You can check how duplicated behaves with each value of keep:
print df.duplicated(subset=['id'], keep='first')
0 False
1 False
2 False
3 True
dtype: bool
print df.duplicated(subset=['id'], keep='last')
0 False
1 True
2 False
3 False
dtype: bool
print df.duplicated(subset=['id'], keep=False)
0 False
1 True
2 False
3 True
dtype: bool
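To compare the three keep settings at a glance, the masks can be placed side by side. This is only a sketch: the DataFrame is built by hand from the question's sample rows, and the column labels first, last and all are names chosen for this illustration, not pandas terminology.

```python
import pandas as pd

# Sample rows from the question, built directly instead of read from a file.
df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# One boolean column per keep setting, aligned on the original row index.
comparison = pd.DataFrame({
    'first': df.duplicated(subset=['id'], keep='first'),
    'last': df.duplicated(subset=['id'], keep='last'),
    'all': df.duplicated(subset=['id'], keep=False),
})
print(comparison)
```

Rows 1 and 3 both carry id 2222: keep='first' flags only row 3, keep='last' flags only row 1, and keep=False flags both.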
If you need the duplicated rows themselves, use the boolean mask to subset the DataFrame:
print df[df.duplicated(subset=['id'], keep='first')]
name id
3 DDD 2222
print df[df.duplicated(subset=['id'], keep='last')]
name id
1 BBB 2222
print df[df.duplicated(subset=['id'], keep=False)]
name id
1 BBB 2222
3 DDD 2222
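If it also matters how many times each duplicated id occurs, value_counts is one alternative to duplicated. The sketch below again assumes the question's sample rows, built by hand; counts and repeated are illustrative names.

```python
import pandas as pd

# Sample rows from the question, built directly instead of read from a file.
df = pd.DataFrame({
    'name': ['AAA', 'BBB', 'CCC', 'DDD'],
    'id': [1111, 2222, 3333, 2222],
})

# Count the occurrences of every id, then keep only ids seen more than once.
counts = df['id'].value_counts()
repeated = counts[counts > 1]

# The index holds the duplicated ids, the values hold their counts.
print(repeated.index.tolist())  # [2222]
print(repeated.tolist())        # [2]
```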
Use drop_duplicates to remove the duplicates:
print df.drop_duplicates(subset=['id'], keep='first')
name id
0 AAA 1111
1 BBB 2222
2 CCC 3333
print df.drop_duplicates(subset=['id'], keep='last')
name id
0 AAA 1111
2 CCC 3333
3 DDD 2222
print df.drop_duplicates(subset=['id'], keep=False)
name id
0 AAA 1111
2 CCC 3333
Source: the original question on finding duplicates in a pandas data structure is on Stack Overflow: https://stackoverflow.com/questions/35376954/