This article shows how to spread a column in a pandas data frame.
Problem description
I have the following pandas data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'fc': [100,100,112,1.3,14,125],
'sample_id': ['S1','S1','S1','S2','S2','S2'],
'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})
df = df[['gene_symbol', 'sample_id', 'fc']]
df
Which produces this:
Out[11]:
gene_symbol sample_id fc
0 a S1 100.0
1 b S1 100.0
2 c S1 112.0
3 a S2 1.3
4 b S2 14.0
5 c S2 125.0
How can I spread sample_id so that in the end I get this:
gene_symbol S1 S2
a 100 1.3
b 100 14.0
c 112 125.0
Solution
#df = df[['gene_symbol', 'sample_id', 'fc']]
df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')
print (df)
sample_id S1 S2
gene_symbol
a 100.0 1.3
b 100.0 14.0
c 112.0 125.0
# Alternative, starting again from the original df:
df = df.set_index(['gene_symbol','sample_id'])['fc'].unstack(fill_value=0)
print (df)
sample_id S1 S2
gene_symbol
a 100.0 1.3
b 100.0 14.0
c 112.0 125.0
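The two approaches above can be checked against each other. The following is a minimal sketch that rebuilds the question's data and stores each result under its own name, so the original `df` is not overwritten. The names `wide_pivot` and `wide_unstack` are illustrative and do not come from the original answer.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
})

# Approach 1: pivot (requires unique gene_symbol/sample_id pairs)
wide_pivot = df.pivot(index='gene_symbol', columns='sample_id', values='fc')

# Approach 2: move both keys into the index, then unstack sample_id into columns
wide_unstack = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```

Because every gene_symbol/sample_id pair is present here, `fill_value=0` has nothing to fill; it only matters when some combination is missing from the data.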
But if there are duplicates, you need pivot_table, or aggregation with groupby. mean can be changed to sum, median, ...:
df = pd.DataFrame({
'fc': [100,100,112,1.3,14,125, 100],
'sample_id': ['S1','S1','S1','S2','S2','S2', 'S2'],
'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})
print (df)
fc gene_symbol sample_id
0 100.0 a S1
1 100.0 b S1
2 112.0 c S1
3 1.3 a S2
4 14.0 b S2
5 125.0 c S2 <- same c, S2, different fc
6 100.0 c S2 <- same c, S2, different fc
df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')
ValueError: Index contains duplicate entries, cannot reshape
df = df.pivot_table(index='gene_symbol',columns='sample_id',values='fc', aggfunc='mean')
print (df)
sample_id S1 S2
gene_symbol
a 100.0 1.3
b 100.0 14.0
c 112.0 112.5
# Alternative with groupby, starting again from the original df:
df = df.groupby(['gene_symbol','sample_id'])['fc'].mean().unstack(fill_value=0)
print (df)
sample_id S1 S2
gene_symbol
a 100.0 1.3
b 100.0 14.0
c 112.0 112.5
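As a rough sketch of the point that mean can be swapped for another aggregation, the example below uses sum on the same duplicated data, and first lists the rows that make a plain pivot fail. The variable names `dupes`, `wide_sum` and `wide_sum_gb` are illustrative and not part of the original answer.

```python
import pandas as pd

# Data with a duplicated (c, S2) pair, as in the answer
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# Rows whose (gene_symbol, sample_id) pair occurs more than once
dupes = df[df.duplicated(subset=['gene_symbol', 'sample_id'], keep=False)]
print(dupes)

# Aggregate duplicates with sum instead of mean
wide_sum = df.pivot_table(index='gene_symbol', columns='sample_id',
                          values='fc', aggfunc='sum')

# The same aggregation expressed with groupby + unstack
wide_sum_gb = df.groupby(['gene_symbol', 'sample_id'])['fc'].sum().unstack(fill_value=0)

print(wide_sum)
```

Which aggregation is appropriate depends on what the duplicated measurements mean; that choice is left to the reader.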
EDIT:
To clean up the result, set the columns name to None and call reset_index:
df.columns.name = None
df = df.reset_index()
print (df)
gene_symbol S1 S2
0 a 100.0 1.3
1 b 100.0 14.0
2 c 112.0 112.5
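The reshaping and cleanup steps can also be written as one chain. This is a sketch under the assumption that `rename_axis(None, axis=1)` is an acceptable substitute for assigning `df.columns.name = None`; the original answer uses the assignment form, and the name `tidy` is illustrative.

```python
import pandas as pd

# Data with a duplicated (c, S2) pair, as in the answer
df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

tidy = (
    df.pivot_table(index='gene_symbol', columns='sample_id',
                   values='fc', aggfunc='mean')
      .rename_axis(None, axis=1)   # drop the 'sample_id' label on the columns axis
      .reset_index()               # turn gene_symbol back into a regular column
)

print(tidy)
```

Chaining avoids reassigning `df` several times, which makes it easier to rerun a single step while experimenting.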