python - Merging DataFrames on multiple conditions - not specifically on equal values

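First, sorry if this is a bit lengthy, but I want to describe fully the problem I am having and what I have already tried.

I am trying to join (merge) two DataFrame objects on multiple conditions. I know how to do this when every condition is an "equals" operator, but here I need to use LESS THAN and MORE THAN.

The dataframes hold genetic information: one is a list of mutations in the genome (referred to as SNPs), and the other gives the locations of genes on the human genome. Calling df.head() on each returns the following:

SNP DataFrame (snp_df):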

   chromosome        SNP      BP
0           1  rs3094315  752566
1           1  rs3131972  752721
2           1  rs2073814  753474
3           1  rs3115859  754503
4           1  rs3131956  758144

This shows each SNP's reference ID and its location. "BP" stands for the base-pair position.

Gene DataFrame (gene_df):

   chromosome  chr_start  chr_stop        feature_id
0           1      10954     11507  GeneID:100506145
1           1      12190     13639  GeneID:100652771
2           1      14362     29370     GeneID:653635
3           1      30366     30503  GeneID:100302278
4           1      34611     36081     GeneID:645520

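This dataframe shows the locations of all the genes of interest.

What I want to find is every SNP that falls within a gene region of the genome, discarding the SNPs that fall outside those regions.

If I wanted to merge two dataframes on multiple (equality) conditions, I would do the following: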

merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])

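In this case, however, I need to find the SNPs whose chromosome value matches the chromosome value in the gene dataframe and whose BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that the dataframes are large: in the current dataset, snp_df has 6795021 rows and gene_df has 34362.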
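
To make the required join semantics concrete, here is a minimal sketch on toy data: merge on the equality condition first, then filter on the range condition. The function name genic_snps_by_merge and the toy values are made up for illustration and are not from the original post. The intermediate frame holds one row per SNP/gene pair on the same chromosome, so on the full dataset it would probably be far too large to keep in memory; this is only meant to show the result being asked for.

```python
import pandas as pd


def genic_snps_by_merge(snp_df, gene_df):
    """Equality merge on chromosome, then keep rows where BP is inside the gene."""
    # One row per (SNP, gene) pair that shares a chromosome.
    pairs = snp_df.merge(gene_df, on='chromosome')
    # Series.between is inclusive on both ends by default.
    inside = pairs['BP'].between(pairs['chr_start'], pairs['chr_stop'])
    return pairs.loc[inside, ['SNP', 'feature_id']].reset_index(drop=True)


# Toy data (hypothetical values, same column layout as snp_df / gene_df).
toy_snps = pd.DataFrame({
    'chromosome': [1, 1, 1, 1, 2],
    'SNP': ['a', 'b', 'c', 'd', 'e'],
    'BP': [5, 10, 15, 20, 10],
})
toy_genes = pd.DataFrame({
    'chromosome': [1, 1, 1, 2],
    'chr_start': [10, 14, 100, 1],
    'chr_stop': [15, 30, 200, 9],
    'feature_id': ['G1', 'G2', 'G3', 'G4'],
})

print(genic_snps_by_merge(toy_snps, toy_genes))
```

Note that a SNP lying inside two overlapping genes appears once per gene, which matches what the SQL and loop versions in this question produce.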

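I have tried to tackle this by looking at either the chromosomes or the genes separately. There are 22 different chromosome values (the integers 1-22), since the sex chromosomes are not used. Both methods take an extremely long time. One uses the pandasql module; the other loops over the individual genes.

SQL method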

import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q           = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes  = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps  = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps  = pd.concat(all_dfs)
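
As a variation on the SQL method above, the chromosome equality can go into the join condition itself, so a single query replaces the Python loop. The sketch below uses the standard-library sqlite3 module with pandas' to_sql and read_sql_query instead of pandasql. The function name genic_snps_sqlite, the table names and the index name are made up for illustration. The index on (chromosome, BP) is there so SQLite can look up each gene's range rather than scan every SNP; whether this is actually faster on the full dataset has not been measured here.

```python
import sqlite3

import pandas as pd


def genic_snps_sqlite(snp_df, gene_df):
    """Range join in an in-memory SQLite database (a sketch, not benchmarked)."""
    con = sqlite3.connect(':memory:')
    try:
        # Copy both frames into SQLite tables (table names are arbitrary).
        snp_df.to_sql('snp', con, index=False)
        gene_df.to_sql('gene', con, index=False)
        # Composite index so the range lookup per gene can use an index search.
        con.execute('CREATE INDEX snp_chr_bp ON snp (chromosome, BP)')
        query = """
            SELECT s.SNP, g.feature_id
            FROM gene AS g
            INNER JOIN snp AS s
                ON s.chromosome = g.chromosome
               AND s.BP BETWEEN g.chr_start AND g.chr_stop
        """
        return pd.read_sql_query(query, con)
    finally:
        con.close()
```

Called as genic_snps_sqlite(snp_df, gene_df), it returns a DataFrame with the same two columns, SNP and feature_id, that the loop builds up in all_genic_snps.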

Gene iteration method

all_dfs = []
for line in gene_df.iterrows():
    info    = line[1] # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
            (snp_df['BP'] >= info['chr_start']) & (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)


all_genic_snps = pd.concat(all_dfs)

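Can anyone suggest a more efficient way of doing this?

Best answer

I have just thought of a way to solve this, by combining my two methods:

First, focus on one chromosome at a time, then loop over the genes within these smaller dataframes. This also avoids any SQL queries. I have also included a step that immediately filters out redundant genes, meaning genes that cannot have any SNP within their range. This uses a double for loop, which I normally try to avoid, but in this case it works quite well.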

all_dfs = []

for chromosome in snp_df['chromosome'].unique():
    this_chr_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes      = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Drop genes that cannot contain any SNP on this chromosome.
    # Bounds are inclusive, to match the BP comparison in the loop below.
    min_bp      = this_chr_snp['BP'].min()
    max_bp      = this_chr_snp['BP'].max()
    this_genes  = this_genes.loc[(this_genes['chr_start'] <= max_bp) &
            (this_genes['chr_stop'] >= min_bp)]

    for line in this_genes.iterrows():
        info     = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp    = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps  = pd.concat(all_dfs)

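Although this does not run especially quickly, it does run, so I can actually get some answers. I would still like to know whether anyone can make it more efficient.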
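
One possible way to remove the inner loop over genes, which is not part of the original answer, is to sort each chromosome's SNP positions once and use numpy.searchsorted to find, for every gene at the same time, the slice of sorted SNPs inside [chr_start, chr_stop]. The sketch below assumes the column names shown in the question. The function name snps_in_genes is made up for illustration, and the approach has not been timed against the real dataset.

```python
import numpy as np
import pandas as pd


def snps_in_genes(snp_df, gene_df):
    """Return one (SNP, feature_id) row per SNP that lies inside a gene."""
    genes_by_chr = {key: grp for key, grp in gene_df.groupby('chromosome')}
    pieces = []
    for chromosome, snps in snp_df.groupby('chromosome'):
        genes = genes_by_chr.get(chromosome)
        if genes is None:
            continue  # no genes listed for this chromosome
        snps = snps.sort_values('BP')
        bp = snps['BP'].to_numpy()
        # For gene i, the sorted SNPs at positions lo[i] .. hi[i]-1 satisfy
        # chr_start <= BP <= chr_stop (both ends inclusive).
        lo = np.searchsorted(bp, genes['chr_start'].to_numpy(), side='left')
        hi = np.searchsorted(bp, genes['chr_stop'].to_numpy(), side='right')
        counts = np.clip(hi - lo, 0, None)  # guard against stop < start
        total = int(counts.sum())
        if total == 0:
            continue
        # Expand every [lo, hi) slice into explicit SNP positions, keeping
        # track of which gene each position belongs to.
        gene_idx = np.repeat(np.arange(len(genes)), counts)
        run_starts = np.cumsum(counts) - counts
        within_run = np.arange(total) - np.repeat(run_starts, counts)
        snp_idx = np.repeat(lo, counts) + within_run
        pieces.append(pd.DataFrame({
            'SNP': snps['SNP'].to_numpy()[snp_idx],
            'feature_id': genes['feature_id'].to_numpy()[gene_idx],
        }))
    if not pieces:
        return pd.DataFrame(columns=['SNP', 'feature_id'])
    return pd.concat(pieces, ignore_index=True)
```

Calling all_genic_snps = snps_in_genes(snp_df, gene_df) should give the same SNP/feature_id pairs as the loop above, possibly in a different row order. Overlapping genes are handled because each gene gets its own slice, so a SNP inside two genes appears twice.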

For "python - Merging DataFrames on multiple conditions - not specifically on equal values", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31410356/