pandas: keep only one of the duplicates in col [A], chosen by a condition on col [B]

Problem description

Given the DataFrame:

import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': ['type1', 'type2', 'type1', 'type2', 'type1'],
    'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31'],
})


col1 col2   hour
A   type1   18:03:30 # Drop this row as (A type1) already present
A   type2   18:00:48
A   type1   18:13:46 # keep this row: it has the latest hour for (A type1)
B   type2   18:11:29
B   type1   18:06:31

I want to drop duplicates based on col1 and col2, e.g. (row(0): A type1, row(2): A type1), keeping only the row that has the latest hour, e.g. (18:13:46).

I tried using groupby to return a subset based on col1, and drop_duplicates to drop the duplicates in col2. I need to find a way to pass the condition (latest hour).

Sample code:

for key, grp in df.groupby('col1'):
  grp.drop_duplicates(subset='col2', keep="LATEST OF HOUR")

Expected result:

col1 col2   hour
A   type1   18:13:46
A   type2   18:00:48
B   type2   18:11:29
B   type1   18:06:31


EDIT adding context

My original dataframe is larger; the solution also needs to work for:


col1 col2   other  hour
A   type1   h  18:03:30 # Drop this row as (A type1) already present
A   type2   ss 18:00:48
A   type1   ll 18:13:46 # keep this row: it has the latest hour for (A type1)
B   type2   mm 18:11:29
B   type1   jj 18:06:31

It would still need to drop the duplicate row based on the hour.

Recommended answer

Following anky_91's comment, I solved it like this:

df.sort_values('hour').drop_duplicates(['col1','col2'] , keep = 'last')

This sorts the rows by the 'hour' column first, so keep='last' is guaranteed to keep the row with the latest hour for each (col1, col2) pair.
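For reference, here is a minimal runnable sketch of that approach on the larger frame from the edit. The names `latest`, `seconds`, and `alt` are only illustrative, and the trailing `sort_index()` is an optional addition (not part of the answer) that restores the original row order. Sorting the 'hour' strings directly works here because they are zero-padded HH:MM:SS values; the cross-check at the end assumes they can be parsed by `pd.to_timedelta`.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': ['type1', 'type2', 'type1', 'type2', 'type1'],
    'other': ['h', 'ss', 'll', 'mm', 'jj'],
    'hour': ['18:03:30', '18:00:48', '18:13:46', '18:11:29', '18:06:31'],
})

# Sort by hour so the latest row of each (col1, col2) pair comes last,
# keep that last row, then (optionally) restore the original row order.
latest = (
    df.sort_values('hour')
      .drop_duplicates(['col1', 'col2'], keep='last')
      .sort_index()
)
print(latest)

# Cross-check with a different technique: convert the hour to seconds and
# pick the index label of the maximum within each (col1, col2) group.
seconds = pd.to_timedelta(df['hour']).dt.total_seconds()
alt = df.loc[seconds.groupby([df['col1'], df['col2']]).idxmax()].sort_index()
print(alt.equals(latest))
```

Both variants keep every other column (such as 'other') of the surviving row, which is what the edit to the question asks for.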

