Problem description
I have two dataframes, and I want to use one of them to filter the other and create a new dataframe. Both dataframes have a column with similar information, but the values are not an exact match. I have been trying to use str.contains, but so far I keep getting TypeError: 'Series' objects are mutable, thus they cannot be hashed. Here is a sample of my dataframes and the code I have tried.
promoter = pd.read_csv('promoter_coordinate.csv')
print(promoter.head())
AssociatedGeneName  B            C      D  E                                    F
plexB_1             NC_004353.3  64381  -  Drosophila melanogaster (Fruit fly)  region
ci_1                NC_004353.3  76925  -  Drosophila melanogaster (Fruit fly)  region
RS3A_1              NC_004353.3  87829  -  Drosophila melanogaster (Fruit fly)  region
pan_1               NC_004353.3  89986  +  Drosophila melanogaster (Fruit fly)  region
pan_2               NC_004353.3  90281  +  Drosophila melanogaster (Fruit fly)  region
data = pd.read_csv('FBgn with gene name.csv')
print(data.head())
Gene     AssociatedGeneName  FBgn Number  timepoint
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10002  fkh                 FBgn0000659  2
CG10006  CG10006             FBgn0036461  2
x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]
The heads of the two dataframes shown here don't contain a match, but the ideal outcome would look something like the following, where the two columns named 'AssociatedGeneName' are compared.
AssociatedGeneName  B            C         D  E                                    F
fkh_1               NT_033777.2  24410805  -  Drosophila melanogaster (Fruit fly)  region
Essentially, I want a dataframe with all of the rows in promoter whose AssociatedGeneName partially matches a value in data['AssociatedGeneName'].
If someone could point me in the right direction I would be grateful. I am relatively new to coding; I have been using Python and pandas and would prefer to keep using Python to solve this problem. Here is the error I keep getting.
x = promoter[promoter['AssociatedGeneName'].str.contains(data['AssociatedGeneName'])]
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    x = promoter[promoter['AssociatedGeneName'].str.contains(data['Associated Gene Name'])]
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 1226, in contains
    na=na, regex=regex)
  File "C:\Python34\lib\site-packages\pandas\core\strings.py", line 203, in str_contains
    regex = re.compile(pat, flags=flags)
  File "C:\Python34\lib\re.py", line 219, in compile
    return _compile(pattern, flags)
  File "C:\Python34\lib\re.py", line 278, in _compile
    return _cache[type(pattern), pattern, flags]
  File "C:\Python34\lib\site-packages\pandas\core\generic.py", line 663, in __hash__
    ' hashed'.format(self.__class__.__name__))
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Recommended answer
First, create a function that checks whether a value from promoter has a partial match in data. It tests the promoter value against each value in data['AssociatedGeneName'] and returns one boolean per value:
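def contain_partial(x, y=data.AssociatedGeneName):
    # x: one promoter name (a string); y: the gene names to look for
    res = []
    for z in y:
        # True if the gene name z occurs as a substring of x
        res.append(z in x)
    return res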
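To see what the function returns for a single value, here is a small self-contained sketch. The gene_names Series is a made-up stand-in for data.AssociatedGeneName, not the real file:

```python
import pandas as pd

# Toy stand-in for data.AssociatedGeneName (made-up values for illustration).
gene_names = pd.Series(["fkh", "pan", "CG10006"])

def contain_partial(x, y=gene_names):
    # One boolean per gene name: is that name a substring of x?
    return [z in x for z in y]

flags = contain_partial("pan_1")
print(flags)       # [False, True, False]
print(any(flags))  # True
```

Each promoter value therefore maps to a list as long as the list of gene names, which is why the last step reduces each list with any.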
Applying the function to the column gives, for each row of promoter, a list of booleans:
contains = promoter.AssociatedGeneName.apply(contain_partial)
Finally, check whether at least one value in each list is True, and use the resulting boolean Series to filter promoter:
promoter[contains.apply(any)]
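As a possible alternative, the same filter can be written without a Python-level loop. str.contains expects a single string or regex pattern rather than a Series, which is why the original call raised the TypeError. The sketch below uses small made-up dataframes in place of the real CSV files. Option 2 additionally assumes that every promoter name is a gene name followed by "_<number>", which is only inferred from the sample shown in the question:

```python
import re
import pandas as pd

# Toy stand-ins for the two dataframes (values are illustrative only).
promoter = pd.DataFrame({
    "AssociatedGeneName": ["plexB_1", "ci_1", "pan_1", "pan_2", "fkh_1"],
    "C": [64381, 76925, 89986, 90281, 24410805],
})
data = pd.DataFrame({
    "AssociatedGeneName": ["fkh", "fkh", "CG10006", "pan"],
})

# Option 1: join the unique names into one regex alternation.
# re.escape guards against names containing regex metacharacters.
names = data["AssociatedGeneName"].dropna().unique()
pattern = "|".join(re.escape(n) for n in names)
by_regex = promoter[promoter["AssociatedGeneName"].str.contains(pattern, na=False)]

# Option 2 (assumes a "_<number>" suffix): strip the suffix and require an
# exact match, which avoids accidental substring hits such as "pan" in "span_1".
base = promoter["AssociatedGeneName"].str.replace(r"_\d+$", "", regex=True)
by_exact = promoter[base.isin(names)]

print(by_regex["AssociatedGeneName"].tolist())
print(by_exact["AssociatedGeneName"].tolist())
```

On this toy data both options keep the rows pan_1, pan_2 and fkh_1. Which one fits the real files depends on whether substring matches or exact matches on the base name are wanted.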