I have the following DataFrame, df_address, containing student addresses:
student_id  address_type  Address           City
1           R             6th street        MPLS
1           P             10th street SE    Chicago
1           E             10th street SE    Chicago
2           P             Washington ST     Boston
2           E             Essex St          NYC
3           E             1040 Taft Blvd    Dallas
4           R             24th street       NYC
4           P             8th street SE     Chicago
5           T             10 Riverside Ave  Boston
6                         20th St           NYC
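For anyone who wants to reproduce this, the frame can be built roughly as follows. This is a sketch: storing the blank address_type of student 6 as an empty string (rather than NaN) is an assumption about how the real data is loaded.

```python
import pandas as pd

# Sample data copied from the table in the question. The blank address_type
# for student 6 is stored as "" here, which is an assumption.
df_address = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    "address_type": ["R", "P", "E", "P", "E", "E", "R", "P", "T", ""],
    "Address": [
        "6th street", "10th street SE", "10th street SE", "Washington ST",
        "Essex St", "1040 Taft Blvd", "24th street", "8th street SE",
        "10 Riverside Ave", "20th St",
    ],
    "City": [
        "MPLS", "Chicago", "Chicago", "Boston", "NYC",
        "Dallas", "NYC", "Chicago", "Boston", "NYC",
    ],
})
```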
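Each student can have several address types:
R means "Residential", P means "Permanent", E means "Emergency", and T means "Temporary"; address_type can also be blank.
I want to populate an "IsPrimaryAddress" column using the following logic:
If a student has an address with address_type R, that row should get "Yes" in the IsPrimaryAddress column, and the other address types for that student_id should get "No".
If address_type R does not exist but P does, then IsPrimaryAddress = 'Yes' for P and 'No' for the remaining types.
If neither R nor P exists but E does, then IsPrimaryAddress = 'Yes' for E.
If none of R, P, or E exists but T does, then IsPrimaryAddress = 'Yes' for T.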
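To make the priority rule concrete, here is a small plain-Python sketch for a single student. It is only a reference for checking results, not a pandas solution. The names PRIORITY and primary_flags are made up for illustration, and placing the blank type last is an assumption based on student 6 in the expected result.

```python
# Priority order taken from the rules in the question. "" (blank) is placed
# last, which is an assumption based on student 6 in the expected result.
PRIORITY = ["R", "P", "E", "T", ""]

def primary_flags(address_types):
    """Return 'Yes'/'No' flags for one student's list of address types."""
    # The best type is the one that appears earliest in PRIORITY.
    best = min(address_types, key=PRIORITY.index)
    return ["Yes" if t == best else "No" for t in address_types]

print(primary_flags(["R", "P", "E"]))  # ['Yes', 'No', 'No']
print(primary_flags(["P", "E"]))       # ['Yes', 'No']
print(primary_flags([""]))             # ['Yes']
```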
The resulting DataFrame should look like this:
student_id  address_type  Address           City     IsPrimaryAddress
1           R             6th street        MPLS     Yes
1           P             10th street SE    Chicago  No
1           E             10th street SE    Chicago  No
2           P             Washington ST     Boston   Yes
2           E             Essex St          NYC      No
3           E             1040 Taft Blvd    Dallas   Yes
4           R             24th street       NYC      Yes
4           P             8th street SE     Chicago  No
5           T             10 Riverside Ave  Boston   Yes
6                         20th St           NYC      Yes
How can I achieve this? I tried rank and cumsum on address_type, but could not get them to work.
Best answer
First, convert address_type to an ordered Categorical so that it sorts in a custom priority order:
import pandas as pd
# Custom order R > P > E > T > blank; fillna('') so a missing type is treated as blank
df['address_type'] = pd.Categorical(df['address_type'].fillna(''), ['R', 'P', 'E', 'T', ''], ordered=True)
df = df.sort_values('address_type')  # highest-priority type first
# After sorting, the first value in each student_id group is the one to mark 'Yes'
df['new'] = (df.groupby('student_id')['address_type'].transform('first') == df['address_type']).map({True: 'Yes', False: 'No'})
df = df.sort_index()  # restore the original row order
   student_id address_type  new
0           1            R  Yes
1           1            P   No
2           1            E   No
3           2            P  Yes
4           2            E   No
5           3            E  Yes
6           4            R  Yes
7           4            P   No
8           5            T  Yes
9           6               Yes
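If you would rather not sort and re-sort the frame, the same idea can be sketched with a numeric rank per address type and a group-wise minimum. This is an alternative to the Categorical approach, not part of it. The names rank, prio and is_best are made up for illustration, and treating a missing value as a blank type is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2, 3, 4, 4, 5, 6],
    "address_type": ["R", "P", "E", "P", "E", "E", "R", "P", "T", None],
})

# Hypothetical rank table: lower number = higher priority, blank is last.
rank = {"R": 0, "P": 1, "E": 2, "T": 3, "": 4}

# Treat missing values as blank, then translate each type to its rank.
prio = df["address_type"].fillna("").map(rank)

# A row is primary when its rank equals the smallest rank of its student.
is_best = prio == prio.groupby(df["student_id"]).transform("min")
df["IsPrimaryAddress"] = is_best.map({True: "Yes", False: "No"})
```

Note that if a student has two rows with the same best type, both rows are marked 'Yes'; the question does not say how such ties should be handled.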