How to filter a DataFrame using partial matches from another DataFrame


Problem description

I have two DataFrames, and I want to use one of them to filter the other and make a new DataFrame. Both DataFrames have a column with similar information, but the values are not an exact match. I have been trying to use str.contains, but so far I keep getting TypeError: 'Series' objects are mutable, thus they cannot be hashed. Here is a sample of my DataFrames and the code I have tried.

import pandas as pd

promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())

AssociatedGeneName            B      C    D E                                   F
            plexB_1  NC_004353.3  64381  - Drosophila melanogaster (Fruit fly)  region
               ci_1  NC_004353.3  76925  - Drosophila melanogaster (Fruit fly)  region
             RS3A_1  NC_004353.3  87829  - Drosophila melanogaster (Fruit fly)  region
              pan_1  NC_004353.3  89986  + Drosophila melanogaster (Fruit fly)  region
              pan_2  NC_004353.3  90281  + Drosophila melanogaster (Fruit fly)  region

data = pd.read_csv('FBgn with gene name.csv')
print(data.head())
Gene AssociatedGeneName   FBgn Number     timepoint
CG10002        fkh        FBgn0000659          2
CG10002        fkh        FBgn0000659          2
CG10002        fkh        FBgn0000659          2
CG10002        fkh        FBgn0000659          2
CG10006    CG10006        FBgn0036461          2

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

The heads of the two DataFrames shown above don't contain a match, but the ideal outcome would be something like the following, where the two columns named 'AssociatedGeneName' are compared.

AssociatedGeneName            B      C    D  E                                    F
             fkh_1  NT_033777.2  24410805 -  Drosophila melanogaster (Fruit fly)  region

Essentially, I want a DataFrame with all of the rows in promoter whose 'AssociatedGeneName' partially matches one of the values in data['AssociatedGeneName']. If someone could point me in the right direction, I would be grateful. I am relatively new to coding. I have been using Python and pandas and would prefer to keep using Python to solve this problem. Here is the error I keep getting.

x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]

Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    x = promoter[promoter['AssociatedGeneName'].str.contains(data['Associated Gene Name'])]
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 1226, in contains
    na=na, regex=regex)
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 203, in str_contains
    regex = re.compile(pat, flags=flags)
  File "C:\Python34\lib\re.py", line 219, in compile
    return _compile(pattern, flags)
  File "C:\Python34\lib\re.py", line 278, in _compile
    return _cache[type(pattern), pattern, flags]
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 663, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed


Recommended answer

First, create a function that checks whether a value from promoter contains a partial match from data. For a given promoter value, it tests every value in data['AssociatedGeneName'].

def contain_partial(x, y=data.AssociatedGeneName):
    # For each gene name z in data, record whether z occurs as a substring of x.
    res = []
    for z in y:
        res.append(z in x)
    return res

Apply the function to each value in promoter. Each result is a list of booleans, one per value in data:

contains = promoter.AssociatedGeneName.apply(contain_partial)
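To make the shape of that intermediate result concrete, here is a minimal sketch on toy data. The two small DataFrames are hypothetical stand-ins for the CSV files in the question, not the real data.

```python
import pandas as pd

# Hypothetical toy data standing in for the two CSV files in the question.
promoter = pd.DataFrame({"AssociatedGeneName": ["plexB_1", "fkh_1", "pan_2"]})
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "CG10006"]})

def contain_partial(x, y=data.AssociatedGeneName):
    # One boolean per name in y: is that name a substring of x?
    return [z in x for z in y]

# A Series holding one list of booleans per promoter row.
contains = promoter.AssociatedGeneName.apply(contain_partial)
print(contains.tolist())
# [[False, False, False], [True, True, False], [False, False, False]]
```

Only the row for fkh_1 contains a True, so it is the only row that survives the filter.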



Finally, check whether at least one value in each list is True, and use the result to filter promoter:

promoter[contains.apply(any)]
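The error in the question arises because str.contains expects a single string (or regex) pattern rather than a Series. As an alternative to the loop-based function above, one possible sketch joins the unique gene names into a single regular expression and passes that to str.contains. This is not part of the original answer; the toy DataFrames and the names `names`, `pattern`, `mask`, and `filtered` are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical toy data standing in for the two CSV files in the question.
promoter = pd.DataFrame(
    {"AssociatedGeneName": ["plexB_1", "fkh_1", "pan_2", "CG10006_3"]}
)
data = pd.DataFrame({"AssociatedGeneName": ["fkh", "fkh", "CG10006", None]})

# Drop missing values and duplicates before building the pattern.
names = data["AssociatedGeneName"].dropna().unique()

# Escape each name so characters such as "." or "(" are matched literally,
# then join the names with "|" to form one alternation pattern.
pattern = "|".join(re.escape(str(name)) for name in names)

# na=False treats missing promoter names as non-matches.
mask = promoter["AssociatedGeneName"].str.contains(pattern, regex=True, na=False)
filtered = promoter[mask]
print(filtered)
```

On this toy data the filter keeps the fkh_1 and CG10006_3 rows. Note that if `names` were empty, the pattern would be an empty string, which matches every row, so that case would need a separate check.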


07-31 03:27