python - Flagging duplicate values in a column with Pandas

I have a df that looks like this:

email      is_new   col_n
a@a        1           z
a@a        1           x
b@b        1           y

I want to update the is_new column for the first instance of each email address, setting it to 0. The new df should look like this:

  email      is_new      col_n
    a@a        0           z
    a@a        1           x
    b@b        0           y

I have tried writing IF statements that check the count of each email address, but they don't work:

   1.  if df[df["email"].groupby().unique()> 1] ==True:
        print('ook')

   2. df.loc[df.groupby('email').groupby().unique(), 'is_new']=1

Best answer

Let's try groupby and cumcount:

df['is_new'] = df.groupby('email').cumcount().astype(bool).astype(int)

Or,

df['is_new'] = df.groupby('email').cumcount().ne(0).astype(int)

df
  email  is_new col_n
0   a@a       0     z
1   a@a       1     x
2   b@b       0     y
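
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies the first one-liner. The frame is typed in by hand from the table in the question; the real data presumably has more columns.

```python
import pandas as pd

# Sample frame copied by hand from the question (an assumption about
# the real data, which likely has more rows and columns).
df = pd.DataFrame({
    "email": ["a@a", "a@a", "b@b"],
    "is_new": [1, 1, 1],
    "col_n": ["z", "x", "y"],
})

# cumcount() is 0 for the first row of each email group, so casting to
# bool and back to int gives 0 for first occurrences and 1 for repeats.
df["is_new"] = df.groupby("email").cumcount().astype(bool).astype(int)

print(df)
```

The assignment lines up on the index, so the rows do not need to be sorted by email first.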

Details
cumcount numbers the rows within each group in ascending order, starting from 0:

df2 = pd.concat([df] * 2, ignore_index=True).sort_values('email')

df2.groupby('email').cumcount()

0    0
1    1
3    2
4    3
2    0
5    1
dtype: int64

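This is only a representative example, but it shows that the count can be greater than 1. I can use either of the two options above to convert every count greater than 0 to 1: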

df2.groupby('email').cumcount().ne(0).astype(int)
# df2.groupby('email').cumcount().astype(bool).astype(int)

0    0
1    1
3    1
4    1
2    0
5    1
dtype: int64
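
As a side note, and not part of the approach above: if all you need is the 0/1 flag, Series.duplicated should give the same result without a groupby, since with its default keep='first' it marks every occurrence after the first as True. A rough sketch, again assuming the sample frame from the question:

```python
import pandas as pd

# Sample frame copied by hand from the question (illustrative only).
df = pd.DataFrame({
    "email": ["a@a", "a@a", "b@b"],
    "is_new": [1, 1, 1],
    "col_n": ["z", "x", "y"],
})

# duplicated() defaults to keep="first": the first occurrence of each
# email is False and every later repeat is True.
flags = df["email"].duplicated().astype(int)

# The cumcount-based flag from the answer, computed for comparison.
cum_flags = df.groupby("email").cumcount().ne(0).astype(int)

df["is_new"] = flags
print(df)
```

cumcount is still the one to reach for when the position within the group matters (2nd, 3rd, ...), because duplicated only tells you whether a row is a repeat.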

Regarding python - Flagging duplicate values in a column with Pandas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54116918/