Problem description
Let's say you have a DNA sequence like this:
AATCRVTAA
where R and V are ambiguity codes for DNA nucleotides: R represents either A or G, and V represents A, C or G.
Is there a Biopython method to generate all the different sequences that could be represented by the ambiguous sequence above?
Here, for instance, the output would be:
AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA
Recommended answer
Perhaps a slightly shorter and faster way, since in all likelihood this function is going to be used on very large data:
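from itertools import product
# Seq.IUPAC is unavailable in recent Biopython releases; the table lives in Bio.Data.
from Bio.Data import IUPACData

def extend_ambiguous_dna(seq):
    """Return a list of all possible sequences given an ambiguous DNA input."""
    # Maps each IUPAC letter to the bases it stands for, e.g. "R" -> "AG".
    d = IUPACData.ambiguous_dna_values
    return list(map("".join, product(*map(d.get, seq))))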
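As a quick check, the function can be run on the sequence from the question. The sketch below swaps in a hand-written table, IUPAC_DNA (a hypothetical name, not part of Biopython), holding the standard IUPAC nucleotide codes so that it runs without Biopython installed; with Biopython available, the library's ambiguous_dna_values dict would play the same role. It also indexes the table directly rather than using dict.get, so that an unknown symbol raises a KeyError.

```python
from itertools import product

# Hand-written IUPAC nucleotide codes (assumption: stands in for Biopython's table).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def extend_ambiguous_dna(seq, table=IUPAC_DNA):
    """Return a list of all sequences an ambiguous DNA string can represent."""
    # table[base] raises KeyError on an unknown symbol instead of passing None on.
    return list(map("".join, product(*(table[base] for base in seq))))

# Prints the six sequences listed in the question, in the same order.
for s in extend_ambiguous_dna("AATCRVTAA"):
    print(s)
```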
Using map allows your loops to be executed in C rather than in Python. This should prove faster than using plain loops or even list comprehensions.
With a simple dict as d instead of the one returned by ambiguous_dna_values:
from itertools import product
import time

# Dummy table for timing only: four options per symbol (not the real IUPAC meaning of R).
d = {"N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"]}
seq = "RNRN"

# using a list comprehension
lst_start = time.time()
["".join(i) for i in product(*[d[j] for j in seq])]
lst_end = time.time()

# using map
map_start = time.time()
list(map("".join, product(*map(d.get, seq))))
map_end = time.time()

# convert seconds to milliseconds
lst_delay = (lst_end - lst_start) * 1000
map_delay = (map_end - map_start) * 1000
print("List delay: {} ms".format(round(lst_delay, 2)))
print("Map delay: {} ms".format(round(map_delay, 2)))
Output:
# len(seq) = 2:
List delay: 0.02 ms
Map delay: 0.01 ms
# len(seq) = 3:
List delay: 0.04 ms
Map delay: 0.02 ms
# len(seq) = 4
List delay: 0.08 ms
Map delay: 0.06 ms
# len(seq) = 5
List delay: 0.43 ms
Map delay: 0.17 ms
# len(seq) = 10
List delay: 126.68 ms
Map delay: 77.15 ms
# len(seq) = 12
List delay: 1887.53 ms
Map delay: 1320.49 ms
Clearly map is better, but only by a factor of roughly 1.3 to 2.5 in these runs. It could certainly be optimised further.
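One possible direction for very large inputs, offered as a sketch and not as part of the original answer: the number of expansions grows multiplicatively with each ambiguous position, so it can help to count the results first and to iterate lazily instead of building the whole list. The names CODES, count_expansions and iter_expansions are hypothetical, the table is a small hand-written subset of the IUPAC codes, and math.prod assumes Python 3.8 or later.

```python
from itertools import islice, product
from math import prod

# Small hand-written subset of the IUPAC codes (assumption: illustration only).
CODES = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "V": "ACG", "N": "ACGT"}

def count_expansions(seq, table=CODES):
    """Number of concrete sequences, computed without enumerating them."""
    return prod(len(table[base]) for base in seq)

def iter_expansions(seq, table=CODES):
    """Return a lazy iterator over the concrete sequences (no list is built)."""
    return map("".join, product(*(table[base] for base in seq)))

print(count_expansions("N" * 12))                  # 4**12 = 16777216
print(list(islice(iter_expansions("N" * 12), 3)))  # only the first three are generated
```

Counting up front makes it possible to refuse or batch a sequence whose expansion would not fit in memory, while the iterator lets the caller stream results to a file or filter them one at a time.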