Replacing specific column values after dropping duplicates in a pandas DataFrame
Problem description
I'm a beginner at pandas (I apologize if I'm using the wrong terminology) and I am currently working on a genomics project. I'm having trouble manipulating DataFrame columns after using drop_duplicates(). For each id that is kept after dropping duplicates, I want to change the value in the 'mutation' column to indicate that this id has multiple mutations.
import pandas as pd

df = pd.DataFrame([
    ('MYC', 'nonsense', 's1'),
    ('MYC', 'missense', 's1'),
    ('MYCL', 'nonsense', 's1'),
    ('MYCL', 'missense', 's2'),
    ('MYCN', 'missense', 's3'),
    ('MYCN', 'UTR', 's1'),
    ('MYCN', 'nonsense', 's1')
], columns=['id', 'mutation', 'sample'])
print(df)
Result:
     id  mutation sample
0   MYC  nonsense     s1
1   MYC  missense     s1
2  MYCL  nonsense     s1
3  MYCL  missense     s2
4  MYCN  missense     s3
5  MYCN       UTR     s1
6  MYCN  nonsense     s1
I tried using drop_duplicates() and I am getting close to what I want. But how do I change the value in the 'mutation' column to 'multi' for the ids that had more than one row?
print(df.drop_duplicates(subset=('sample','id')))
     id  mutation sample
0   MYC  nonsense     s1
2  MYCL  nonsense     s1
3  MYCL  missense     s2
4  MYCN  missense     s3
5  MYCN       UTR     s1
What I want:
     id  mutation sample
0   MYC     multi     s1
2  MYCL  nonsense     s1
3  MYCL  missense     s2
4  MYCN  missense     s3
5  MYCN     multi     s1
Recommended answer
duplicated
mask = df.duplicated(['id', 'sample'], keep=False)
df.assign(mutation=df.mutation.mask(mask, 'multi')).drop_duplicates()
     id  mutation sample
0   MYC     multi     s1
2  MYCL  nonsense     s1
3  MYCL  missense     s2
4  MYCN  missense     s3
5  MYCN     multi     s1
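To see why this works, it can help to look at the intermediate mask. The following is a minimal sketch that rebuilds the example frame from the question and prints the mask before applying it. The variable name `result` is only for illustration and does not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame([
    ('MYC', 'nonsense', 's1'),
    ('MYC', 'missense', 's1'),
    ('MYCL', 'nonsense', 's1'),
    ('MYCL', 'missense', 's2'),
    ('MYCN', 'missense', 's3'),
    ('MYCN', 'UTR', 's1'),
    ('MYCN', 'nonsense', 's1'),
], columns=['id', 'mutation', 'sample'])

# keep=False flags every member of a repeated (id, sample) pair,
# not only the second and later occurrences.
mask = df.duplicated(['id', 'sample'], keep=False)
print(mask.tolist())

# Overwrite 'mutation' where the mask is True. The flagged rows of a pair
# then become identical, so a plain drop_duplicates() collapses them and
# keeps the first one.
result = df.assign(mutation=df['mutation'].mask(mask, 'multi')).drop_duplicates()
print(result)
```

Note that the final drop_duplicates() compares all columns. If the frame carried additional columns that differ within an (id, sample) pair, passing subset=['id', 'sample'] would probably be needed to collapse those rows.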
groupby
df.groupby(['id', 'sample'], sort=False).mutation.pipe(
    lambda g: g.first().mask(g.size() > 1, 'multi')
).reset_index().reindex(df.columns, axis=1)
     id  mutation sample
0   MYC     multi     s1
1  MYCL  nonsense     s1
2  MYCL  missense     s2
3  MYCN  missense     s3
4  MYCN     multi     s1
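The one-liner above packs several steps into a single pipe call. As a rough sketch, the same logic can be unrolled into named intermediate steps. The names `grouped`, `first`, `sizes`, `collapsed` and `result` are illustrative assumptions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame([
    ('MYC', 'nonsense', 's1'),
    ('MYC', 'missense', 's1'),
    ('MYCL', 'nonsense', 's1'),
    ('MYCL', 'missense', 's2'),
    ('MYCN', 'missense', 's3'),
    ('MYCN', 'UTR', 's1'),
    ('MYCN', 'nonsense', 's1'),
], columns=['id', 'mutation', 'sample'])

# sort=False keeps the groups in order of first appearance.
grouped = df.groupby(['id', 'sample'], sort=False)['mutation']

first = grouped.first()  # first mutation seen for each (id, sample) pair
sizes = grouped.size()   # number of rows in each (id, sample) pair

# Both Series share the same (id, sample) index, so the boolean
# condition lines up with `first` label by label.
collapsed = first.mask(sizes > 1, 'multi')

# Move id and sample back into columns and restore the original column order.
result = collapsed.reset_index().reindex(df.columns, axis=1)
print(result)
```

One difference from the duplicated approach: the result here gets a fresh 0..n-1 index, whereas the duplicated version keeps the row labels of the original frame.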