Problem description
I want to be able to search a Seq object for a subsequence Seq object, accounting for ambiguity codes. For example, the following should be true:
from Bio.Seq import Seq
from Bio.Alphabet.IUPAC import IUPACAmbiguousDNA
amb = IUPACAmbiguousDNA()
s1 = Seq("GGAAAAGG", amb)
s2 = Seq("ARAA", amb) # R = A or G
print(s1.find(s2))
If ambiguity codes were taken into account, the answer should be
>>> 2
But the answer I get is that no match is found, that is,
>>> -1
Looking at the Biopython source code, it doesn't appear that ambiguity codes are taken into account: the subsequence is converted to a string using the private _get_seq_str_and_check_alphabet method, and then the built-in string method find() is used. If that is the case, the "R" ambiguity code will be treated as a literal "R", not as A or G.
I could figure out how to do this with a homemade method, but it seems like something that should be handled in the Biopython package using its Seq objects. Is there something I am missing here?
Is there a way to search for subsequence membership that accounts for ambiguity codes?
Recommended answer
From what I can read in the documentation for Seq.find here:
http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html#find
It appears that this method works like the str.find method in that it looks for an exact match. So, while the DNA sequence can contain ambiguity codes, the Seq.find() method will only return a match when the exact subsequence is present.
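Assuming Seq.find really does fall back to plain string searching as described, the behaviour from the question can be reproduced with the built-in str type alone. This is only an illustration with ordinary strings, not a call into Biopython:

```python
# Plain Python strings standing in for the sequences from the question.
target = "GGAAAAGG"

# "R" is compared as a literal character, so nothing matches.
print(target.find("ARAA"))  # -1

# An exact substring is found at its first index.
print(target.find("AAAA"))  # 2
```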
To do what you want, maybe the nt_search function (Bio.SeqUtils.nt_search) will work.
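For comparison, here is a minimal homemade sketch of the same idea that does not depend on Biopython at all: each IUPAC code in the query is expanded into a regular-expression character class, and the target is then searched with re. The table IUPAC_DNA and the helper names ambiguous_pattern and find_ambiguous are hypothetical, written for this example, and are not part of Biopython. The sketch only expands ambiguity codes in the query; ambiguity codes in the target sequence are still compared literally.

```python
import re

# IUPAC nucleotide ambiguity codes, written out by hand for this sketch
# (hypothetical table name; not imported from Biopython).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def ambiguous_pattern(subseq):
    """Turn a query such as 'ARAA' into a regex such as 'A[AG]AA'."""
    parts = []
    for code in str(subseq).upper():
        bases = IUPAC_DNA[code]  # raises KeyError for non-IUPAC characters
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_ambiguous(seq, subseq, start=0):
    """Return the first index where subseq matches seq, or -1, like str.find."""
    pattern = re.compile(ambiguous_pattern(subseq))
    match = pattern.search(str(seq).upper(), start)
    return match.start() if match else -1


print(find_ambiguous("GGAAAAGG", "ARAA"))  # 2
```

Because both arguments are passed through str(), the helper should accept Seq objects as well as plain strings.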