问题描述
首先,很抱歉,如果这有点冗长,但是我想完整地描述一下我遇到的问题以及已经尝试过的事情.
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I have having problems with and what I have tried already.
我试图在多个条件下将两个数据框对象连接(合并)在一起.如果要满足的条件都是等于"运算符,我知道该怎么做,但是,我需要使用LESS THAN和MORE THAN.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
数据框代表遗传信息:一个是基因组中的突变列表(称为SNP),另一个是有关人类基因组中基因位置的信息.在这些文件上执行df.head()会返回以下内容:
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
这显示SNP参考ID及其位置. "BP"代表基本对"位置.
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
此数据框显示所有感兴趣基因的位置.
This dataframe shows the locations of all the genes of interest.
我想找出所有落在基因组基因区域内的SNP,并丢弃那些不在这些区域内的SNP.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
如果我想基于多个(相等)条件将两个数据框合并在一起,我将执行以下操作:
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
但是,在这种情况下,我需要找到染色体值与基因"数据框中的染色体值相匹配且BP值介于"chr_start"和"chr_stop"之间的SNP.造成挑战的原因是这些数据帧非常大.在当前数据集中,snp_df具有6795021行,gene_df具有34362.
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
我试图通过分别查看染色体或基因来解决这个问题.由于不使用性染色体,因此共有22个不同的染色体值(第1-22个整数).两种方法都花费很长时间.一种使用pandasql
模块,另一种方法是遍历单独的基因.
I have tried to tackle this by either looking at chromosomes or genes seperately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql
module, while the other approach is to loop through the separate genes.
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
genic_snps = pysqldf(q)
all_dfs.append(genic_snps)
all_genic_snps = pd.concat(all_dfs)
基因迭代法
all_dfs = []
for line in gene_df.iterrows():
info = line[1] # Getting the Series object
this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
(snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
任何人都可以提出更有效的方法建议吗?
Can anyone give any suggestions of a more effective way of doing this?
推荐答案
我刚刚想到了一种解决方法-通过结合两种方法:
I've just thought of a way to solve this - by combining my two methods:
首先,关注单个染色体,然后在这些较小的数据框中遍历基因.这也不必使用任何SQL查询.我还提供了一个部分来立即识别没有任何SNP落在其范围内的任何冗余基因.这利用了我通常尝试避免的双for循环-但在这种情况下,效果很好.
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
# Getting rid of redundant genes
min_bp = this_chr_snp['BP'].min()
max_bp = this_chr_snp['BP'].max()
this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
~(this_genes['chr_stop'] <= min_bp)]
for line in this_genes.iterrows():
info = line[1]
this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
(this_chr_snp['BP'] <= info['chr_stop'])]
if this_snp.shape[0] != 0:
this_snp = this_snp[['SNP']]
this_snp.insert(1, 'feature_id', info['feature_id'])
all_dfs.append(this_snp)
all_genic_snps = pd.concat(all_dfs)
虽然运行速度不是很快,但确实可以运行,因此我实际上可以得到一些答案.我仍然想知道是否有人能使它更有效地运行.
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
这篇关于在多种条件下合并DataFrames-并非专门针对相等的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!