Problem description
I have the following two DataFrames:
import pandas as pd
import scipy.stats
import numpy as np
df_a = pd.DataFrame({
's1': [10,10,12,13,14,15],
's2': [100,100,112,1.3,14,125],  # overwritten by the duplicate 's2' key on the next line
's2': [13,200,10,13,14.5,10.5],
'gene_symbol': ['a', 'b', 'c', 'd', 'e', 'f'],
})
df_b = pd.DataFrame({
's1': [15,20,123,13,14,15,1],
's2': [130,100,72,1.3,14,125,2],  # overwritten by the duplicate 's2' key on the next line
's2': [213,200,35.4,13,414.5,130.5,3],
'gene_symbol': ['a', 'b', 'c', 'd', 'e', 'f','g'],
})
df_a.set_index('gene_symbol', inplace=True)
df_b.set_index('gene_symbol', inplace=True)
They look like this (df_a first):
s1 s2
gene_symbol
a 10 13.0
b 10 200.0
c 12 10.0
d 13 13.0
e 14 14.5
f 15 10.5
In [51]: df_b
Out[51]:
s1 s2
gene_symbol
a 15 213.0
b 20 200.0
c 123 35.4
d 13 13.0
e 14 414.5
f 15 130.5
g 1 3.0
What I want to do is calculate the t-test p-value gene by gene. For example, for gene a we will have:
In [47]: scipy.stats.ttest_ind([ 10,13.0],[15,213.0])
Out[47]: Ttest_indResult(statistic=-1.0352347135782713, pvalue=0.4093249100598676)
How can I apply that to all rows that share common genes between the two DataFrames (e.g. ignoring gene g in df_b)?
I tried this, but it failed:
scipy.stats.ttest_ind(df_a, df_b,axis=1)
Recommended answer
You can drop the g row by matching the two DataFrames, or their indexes, on your gene_symbol index.
You can use pandas.merge() to join the two DataFrames on matching columns or indexes, and then pass the columns of the merged DataFrame to ttest_ind:
# default join is inner
df_m = pd.merge(df_a, df_b, left_index=True, right_index=True)
# .ix was removed in pandas 1.0; use positional .iloc instead
scipy.stats.ttest_ind(df_m.iloc[:, :2], df_m.iloc[:, 2:], axis=1)
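A self-contained sketch of the merge approach, using the data from the question. The suffixes `_a` and `_b` and the names `cols_a`, `cols_b` and `pvalues` are illustrative choices, not part of the original answer. Selecting columns by suffix avoids hard-coding how many sample columns each frame has.

```python
import pandas as pd
from scipy import stats

df_a = pd.DataFrame(
    {"s1": [10, 10, 12, 13, 14, 15], "s2": [13, 200, 10, 13, 14.5, 10.5]},
    index=pd.Index(list("abcdef"), name="gene_symbol"),
)
df_b = pd.DataFrame(
    {"s1": [15, 20, 123, 13, 14, 15, 1], "s2": [213, 200, 35.4, 13, 414.5, 130.5, 3]},
    index=pd.Index(list("abcdefg"), name="gene_symbol"),
)

# Inner join on the index drops gene g, which exists only in df_b.
# The suffixes record which frame each column came from (assumed names).
df_m = pd.merge(df_a, df_b, left_index=True, right_index=True, suffixes=("_a", "_b"))

cols_a = [c for c in df_m.columns if c.endswith("_a")]
cols_b = [c for c in df_m.columns if c.endswith("_b")]

# axis=1 runs one test per row, i.e. one test per gene.
result = stats.ttest_ind(df_m[cols_a].to_numpy(), df_m[cols_b].to_numpy(), axis=1)

# Attach the gene labels back to the p-values.
pvalues = pd.Series(result.pvalue, index=df_m.index, name="pvalue")
print(pvalues)
```

The value for gene a should match the single-gene call shown in the question. A row whose samples are identical in both frames (gene d here) has zero variance, so its p-value is not meaningful.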
Or you can find the intersection of the two indexes and use it to slice both datasets:
idx = df_a.index.intersection(df_b.index)
scipy.stats.ttest_ind(df_a.loc[idx], df_b.loc[idx], axis=1)
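If you want the result labelled by gene, the intersection approach can be wrapped in a small helper. This is a minimal sketch: the function name `ttest_by_gene` and the output column names are hypothetical, and it assumes both frames are indexed by gene_symbol as in the question.

```python
import pandas as pd
from scipy import stats

def ttest_by_gene(left, right):
    """Row-wise independent t-test over the genes present in both frames."""
    common = left.index.intersection(right.index)
    res = stats.ttest_ind(
        left.loc[common].to_numpy(),
        right.loc[common].to_numpy(),
        axis=1,  # one test per row (gene)
    )
    return pd.DataFrame(
        {"statistic": res.statistic, "pvalue": res.pvalue}, index=common
    )

df_a = pd.DataFrame(
    {"s1": [10, 10, 12, 13, 14, 15], "s2": [13, 200, 10, 13, 14.5, 10.5]},
    index=pd.Index(list("abcdef"), name="gene_symbol"),
)
df_b = pd.DataFrame(
    {"s1": [15, 20, 123, 13, 14, 15, 1], "s2": [213, 200, 35.4, 13, 414.5, 130.5, 3]},
    index=pd.Index(list("abcdefg"), name="gene_symbol"),
)

out = ttest_by_gene(df_a, df_b)
print(out)
```

Unlike the merge version, this keeps the two frames separate, so overlapping column names never need suffixes.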