问题描述
我有一个包含患者,日期,药物和诊断的数据框.每个患者都有一个唯一的编号('pid'),并且可能会或可能不会使用不同的药物进行治疗.
I have a dataframe with patients, date, medications, and diagnosis.Each patient has a unique id ('pid'), and may or may not be treated with different drugs.
选择在某个时间点已接受某种药物治疗的所有患者的最佳实践是什么?由于我的数据集非常庞大,因此for循环和if语句是最后的选择.
What is best practice to select all patients that at some point have been treated with a certain drug?Since my dataset is so huge, for-loops and if-statement is the last resort.
示例:
IN:
pid drug
1 A
1 B
1 C
2 A
2 C
2 E
3 B
3 C
3 D
4 D
4 E
4 F
选择所有在某时已接受过药物"B"治疗的患者.请注意,必须包括该患者的所有条目,这不仅意味着用药物B进行治疗,而且还包括所有治疗:
Select all patient who has at some point been treated with drug 'B'. Note that all entries of that patient must to be included, meaning not just treatments with drug B, but all treatments:
OUT:
1 A
1 B
1 C
3 B
3 C
3 D
我当前的解决方案:
1)获取包含药物"B"的行的所有pid
2)从步骤1中获取所有包含pid的行.
这种解决方案的问题是,我需要使用所有pid(百万)创建一个很长的if语句
My current solution:
1) Get all pid for rows that includes drug 'B'
2) Get all rows that include pid from step 1.
Problem with this solution is that I need to make a loooong if-statement with all pid's (millions)
推荐答案
这是一种方法.
s = df.groupby('drug')['pid'].apply(set)
result = df[df['pid'].isin(s['B'])]
# pid drug
# 0 1 A
# 1 1 B
# 2 1 C
# 6 3 B
# 7 3 C
# 8 3 D
说明
- 创建一个映射系列
s
作为单独的初始步骤,以便它不需要为每个结果重新计算. - 为进行比较,请使用
set
进行O(1)复杂度查找.
- Create a mapping series
s
as a separate initial step so that itdoes not need recalculating for each result. - For the comparisons, use
set
for O(1) complexity lookup.
这篇关于 pandas 数据框-如果用户的任何行包含特定值,则选择所有用户的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!