I have the following data frame, df_address, containing student addresses:

student_id  address_type  Address           City
1           R             6th street        MPLS
1           P             10th street SE    Chicago
1           E             10th street SE    Chicago
2           P             Washington ST     Boston
2           E             Essex St          NYC
3           E             1040 Taft Blvd    Dallas
4           R             24th street       NYC
4           P             8th street SE     Chicago
5           T             10 Riverside Ave  Boston
6                         20th St           NYC
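For reproducibility, the sample frame can be rebuilt roughly as follows. This is a minimal sketch: the blank address_type for student 6 is assumed to be an empty string rather than NaN, which the table does not make explicit.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: the blank address_type for student 6 is '' (not NaN).
df_address = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    "address_type": ["R", "P", "E", "P", "E", "E", "R", "P", "T", ""],
    "Address": [
        "6th street", "10th street SE", "10th street SE",
        "Washington ST", "Essex St", "1040 Taft Blvd",
        "24th street", "8th street SE", "10 Riverside Ave", "20th St",
    ],
    "City": [
        "MPLS", "Chicago", "Chicago", "Boston", "NYC",
        "Dallas", "NYC", "Chicago", "Boston", "NYC",
    ],
})
```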


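Each student can have several address types:

R stands for "Residential", P for "Permanent", E for "Emergency", and T for "Temporary"; address_type can also be blank.

I want to populate an "IsPrimaryAddress" column according to the following logic:

If a student has an address of type R, IsPrimaryAddress should be "Yes" for the R row and "No" for that student's other address types.

If R does not exist but P does, IsPrimaryAddress = 'Yes' for P and 'No' for the remaining types.

If neither P nor R exists but E does, IsPrimaryAddress = 'Yes' for E.
If none of P, R, or E exists but T does, IsPrimaryAddress = 'Yes' for T.
The resulting data frame should look like this: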

student_id  address_type  Address           City     IsPrimaryAddress
1           R             6th street        MPLS     Yes
1           P             10th street SE    Chicago  No
1           E             10th street SE    Chicago  No
2           P             Washington ST     Boston   Yes
2           E             Essex St          NYC      No
3           E             1040 Taft Blvd    Dallas   Yes
4           R             24th street       NYC      Yes
4           P             8th street SE     Chicago  No
5           T             10 Riverside Ave  Boston   Yes
6                         20th St           NYC      Yes


How can I achieve this? I tried the rank and cumulative functions on address_type, but could not get them to work.

Best answer

First, use Categorical so that address_type can be sorted in a custom order.
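To see what the ordered categorical does on its own, here is a minimal sketch using a few hypothetical toy values (not the question's data). Listing the categories from highest to lowest priority makes sorting follow that order instead of alphabetical order.

```python
import pandas as pd

# Hypothetical toy values, deliberately out of priority order.
types = pd.Series(["E", "T", "R", "", "P"])

# Categories are listed from highest to lowest priority; ordered=True makes
# sorting and comparisons follow this order rather than alphabetical order.
priority = pd.Categorical(types, categories=["R", "P", "E", "T", ""], ordered=True)

sorted_types = pd.Series(priority).sort_values().tolist()
print(sorted_types)  # ['R', 'P', 'E', 'T', '']
```

The code that follows applies the same idea to the address_type column.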

import pandas as pd

# df is the df_address frame from the question; a blank address_type is assumed to be the empty string ''
df.address_type = pd.Categorical(df.address_type, ['R', 'P', 'E', 'T', ''], ordered=True)

df = df.sort_values('address_type')  # sort rows by the custom priority order
# After sorting, the first address_type in each student_id group is the highest-priority one, so mark it 'Yes'
df['new'] = (df.groupby('student_id').address_type.transform('first') == df.address_type).map({True: 'Yes', False: 'No'})
df = df.sort_index()  # restore the original row order


   student_id address_type  new
0           1            R  Yes
1           1            P   No
2           1            E   No
3           2            P  Yes
4           2            E   No
5           3            E  Yes
6           4            R  Yes
7           4            P   No
8           5            T  Yes
9           6               Yes
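If the blank address_type is stored as NaN rather than an empty string, the categorical comparison above may not mark that row as 'Yes'. One possible variant, not part of the original answer, maps each type to a numeric rank and compares it with the best rank per student. The `rank` mapping and helper column below are illustrative names, and the result is written directly to IsPrimaryAddress instead of 'new'.

```python
import pandas as pd

# Reduced version of the question's data; the blank type is assumed to be None/NaN here.
df = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    "address_type": ["R", "P", "E", "P", "E", "E", "R", "P", "T", None],
})

# Lower rank = higher priority; anything not listed (including NaN) ranks last.
rank = {"R": 0, "P": 1, "E": 2, "T": 3}
df["rank"] = df["address_type"].map(rank).fillna(len(rank))

# A row is primary when its rank equals the best (lowest) rank of its student.
best = df.groupby("student_id")["rank"].transform("min")
df["IsPrimaryAddress"] = (df["rank"] == best).map({True: "Yes", False: "No"})
df = df.drop(columns="rank")  # remove the helper column
```

As with the sorting approach, if a student has two rows of the same top-priority type, both are marked 'Yes'. This version needs no sort, so the original row order is never disturbed.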
