I have 3 pandas DataFrames like these:
#0
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_1_50 47.0 47.0 54.0 52.0 101.829787 101.680851 99.092593 99.692308 5279.0 5256.0 4864.0 4953.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 47.0 47.0 54.0 52.0 100.212766 99.680851 100.870370 101.115385 5030.0 4952.0 5131.0 5169.0
seq_3_50 46.0 47.0 53.0 54.0 100.173913 100.978723 100.924528 99.944444 5026.0 5148.0 5139.0 4990.0
seq_4_0 45.0 47.0 54.0 54.0 99.044444 99.000000 101.407407 102.111111 4856.0 4851.0 5214.0 5323.0
seq_4_50 47.0 47.0 53.0 53.0 101.872340 104.382979 97.849057 98.490566 5285.0 5686.0 4684.0 4776.0
seq_5_0 54.0 34.0 37.0 75.0 90.462963 91.647059 90.756757 116.546667 3700.0 3848.0 3737.0 7915.0
seq_5_50 48.0 33.0 37.0 82.0 94.937500 113.636364 113.162162 92.756098 4277.0 7337.0 7245.0 3990.0
seq_6_0 60.0 50.0 48.0 42.0 98.500000 93.900000 106.125000 104.785714 4777.0 4139.0 5976.0 5752.0
seq_6_50 59.0 46.0 52.0 43.0 98.338983 98.826087 102.615385 102.697674 4754.0 4825.0 5402.0 5415.0
#1
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_1_50 47.0 47.0 54.0 52.0 101.829787 101.680851 99.092593 99.692308 5279.0 5256.0 4864.0 4953.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 47.0 47.0 54.0 52.0 100.212766 99.680851 100.870370 101.115385 5030.0 4952.0 5131.0 5169.0
seq_3_50 46.0 47.0 53.0 54.0 100.173913 100.978723 100.924528 99.944444 5026.0 5148.0 5139.0 4990.0
seq_4_0 45.0 47.0 54.0 54.0 99.044444 99.000000 101.407407 102.111111 4856.0 4851.0 5214.0 5323.0
seq_4_50 47.0 47.0 53.0 53.0 101.872340 104.382979 97.849057 98.490566 5285.0 5686.0 4684.0 4776.0
seq_5_0 54.0 34.0 37.0 75.0 90.462963 91.647059 90.756757 116.546667 3700.0 3848.0 3737.0 7915.0
seq_5_50 48.0 33.0 37.0 82.0 94.937500 113.636364 113.162162 92.756098 4277.0 7337.0 7245.0 3990.0
#2
A C G T uA uC uG uT cmA cmC cmG cmT
seq_1_0 48.0 48.0 53.0 51.0 100.291667 99.208333 101.943396 100.411765 5042.0 4882.0 5297.0 5062.0
seq_1_50 48.0 47.0 54.0 51.0 100.083333 101.680851 99.092593 101.294118 5012.0 5256.0 4864.0 5196.0
seq_2_0 47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
seq_2_50 47.0 47.0 54.0 52.0 101.468085 101.425532 99.000000 100.346154 5223.0 5216.0 4850.0 5052.0
seq_3_0 50.0 47.0 53.0 50.0 98.980000 99.680851 101.490566 101.740000 4847.0 4952.0 5226.0 5265.0
seq_3_50 49.0 47.0 52.0 52.0 95.857143 100.978723 102.519231 102.423077 4403.0 5148.0 5387.0 5371.0
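I want to compare all columns of the first DataFrame (#0) against the other two DataFrames (#1 and #2) to identify which indexes have different column values (for example, the indexes seq_6_0 and seq_6_50 exist in DataFrame #0 but not in the other two). I also want to apply a tolerance to each column when deciding whether values from different DataFrames count as equal, for example: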
Index seq_1_0 of DataFrame #0 has the following values:
A C G T uA uC uG uT cmA cmC cmG cmT
47.0 47.0 54.0 52.0 100.978723 100.957447 100.370370 99.788462 5147.0 5144.0 5055.0 4968.0
Index seq_1_0 of DataFrame #2 has:
A C G T uA uC uG uT cmA cmC cmG cmT
48.0 48.0 53.0 51.0 100.291667 99.208333 101.943396 100.411765 5042.0 4882.0 5297.0 5062.0
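So I want to set a difference tolerance for each column. For example, for the ["A","C","T","G"] columns I need a tolerance of 90% between the compared values, but for the other columns I need to use different percentages. Is there a pandas function I can use for this?
Best,
Best answer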
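With np.isclose you can control both the absolute and the relative tolerance of the comparison precisely.
I assume you only want to compare rows whose labels exist in both DataFrames. Rows that exist in one DataFrame but not in the other are ignored. Also, because you use a relative criterion for A, C, G, T, compare(df0, df1) is not the same as compare(df1, df0). The second argument is taken as the reference value, which is consistent with how np.isclose works.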
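To illustrate why the argument order matters: np.isclose(a, b) tests |a - b| <= atol + rtol * |b|, so the relative tolerance is scaled by the second argument only. The values below are made up for illustration and are not taken from the question's data.

```python
import numpy as np

# np.isclose(a, b) checks |a - b| <= atol + rtol * |b|,
# so the relative tolerance is scaled by the second argument only.
value, reference = 100.0, 40.0  # illustrative values, not from the question

# |100 - 40| = 60 is compared against 0.9 * 40 = 36
forward = bool(np.isclose(value, reference, atol=0, rtol=0.9))

# |40 - 100| = 60 is compared against 0.9 * 100 = 90
swapped = bool(np.isclose(reference, value, atol=0, rtol=0.9))

print(forward, swapped)
```

The same pair of numbers passes or fails depending on which one is treated as the reference, which is why compare(df0, df1) and compare(df1, df0) can disagree.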
import numpy as np
import pandas as pd

def compare(dfa, dfb):
    # Base column names; 'u' + s and 'cm' + s build the other two groups
    s = pd.Series(['A','C','G','T'])
    # Keep only the row labels present in both frames
    tmp = dfa.join(dfb, how='inner', lsuffix='_a', rsuffix='_b')
    # The A, C, G, T columns: within 90% of dfb
    lhs = tmp[s + '_a'].values
    rhs = tmp[s + '_b'].values
    compare1 = np.isclose(lhs, rhs, atol=0, rtol=0.9)
    # The uA, uC, uG, uT columns: within 1e-5
    lhs = tmp['u' + s + '_a'].values
    rhs = tmp['u' + s + '_b'].values
    compare2 = np.isclose(lhs, rhs, atol=1e-5, rtol=0)
    # The cmA, cmC, cmG, cmT columns: within 1e-3
    lhs = tmp['cm' + s + '_a'].values
    rhs = tmp['cm' + s + '_b'].values
    compare3 = np.isclose(lhs, rhs, atol=1e-3, rtol=0)
    # Assemble the result: one boolean per (row, column), True where the values match
    data = np.concatenate([compare1, compare2, compare3], axis=1)
    cols = pd.concat([s, 'u'+s, 'cm'+s])
    result = pd.DataFrame(data, columns=cols, index=tmp.index)
    return result
compare(df0, df2)
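Since the question asks for a different percentage per column, the same idea can be sketched with a dictionary of relative tolerances. This is only one possible variant, not part of the original answer: the helper name compare_with_tolerances, its parameters, and the tolerance values for uA and cmA are assumptions chosen for illustration, and only three columns of the question's data are used.

```python
import numpy as np
import pandas as pd

def compare_with_tolerances(dfa, dfb, rtol_by_col, default_rtol=1e-5):
    """Hypothetical helper: per-column relative comparison, with dfb as the reference."""
    # Only row labels present in both frames are compared
    common = dfa.index.intersection(dfb.index)
    a, b = dfa.loc[common], dfb.loc[common]
    checks = {
        col: np.isclose(a[col].to_numpy(), b[col].to_numpy(),
                        atol=0, rtol=rtol_by_col.get(col, default_rtol))
        for col in a.columns
    }
    return pd.DataFrame(checks, index=common)

# Three columns of three rows from DataFrame #0 and two rows from DataFrame #2
cols = ['A', 'uA', 'cmA']
sample0 = pd.DataFrame(
    [[47.0, 100.978723, 5147.0], [47.0, 100.978723, 5147.0], [60.0, 98.5, 4777.0]],
    index=['seq_1_0', 'seq_2_0', 'seq_6_0'], columns=cols)
sample2 = pd.DataFrame(
    [[48.0, 100.291667, 5042.0], [47.0, 100.978723, 5147.0]],
    index=['seq_1_0', 'seq_2_0'], columns=cols)

# 90% for A as in the question; the other two percentages are assumed
tolerances = {'A': 0.9, 'uA': 0.001, 'cmA': 0.05}
per_column = compare_with_tolerances(sample0, sample2, tolerances)
print(per_column)
```

With these assumed tolerances, seq_1_0 matches on A and cmA but not on uA, seq_2_0 matches everywhere, and seq_6_0 is left out because it has no counterpart in the second frame.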
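To make the result easy to visualize:
def highlight_false(cell):
    # Highlight cells where the comparison failed
    return '' if cell else 'background-color: yellow'

result = compare(df0,df2)
# In pandas 2.1 and later, Styler.applymap is deprecated in favor of Styler.map
result.style.applymap(highlight_false)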
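Because the inner join drops rows that appear in only one DataFrame, labels such as seq_6_0 and seq_6_50 have to be found separately. A minimal sketch, assuming Index.difference is an acceptable way to list them; the frames full and partial are small stand-ins for DataFrames #0 and #1 with a single column:

```python
import pandas as pd

# Stand-ins for DataFrame #0 and #1, reduced to column A of four rows
full = pd.DataFrame({'A': [54.0, 48.0, 60.0, 59.0]},
                    index=['seq_5_0', 'seq_5_50', 'seq_6_0', 'seq_6_50'])
partial = pd.DataFrame({'A': [54.0, 48.0]},
                       index=['seq_5_0', 'seq_5_50'])

# Row labels present in the first frame but missing from the second
only_in_full = full.index.difference(partial.index)
print(list(only_in_full))
```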
Regarding python - comparing different pandas DataFrames column by column with a tolerance, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59733859/