我第一次使用biopython。如果这是一个基本问题,请原谅我。
我想输入序列,然后对它们进行比对,并能够引用原始序列的索引位置(空位)和对齐的序列(空位)。
我的真实示例是烯醇酶(Uniprot P37869和P0A6P9)。底物结合赖氨酸在大肠杆菌中为392,在枯草芽孢杆菌中为389。如果一个人对两个人进行肌肉对齐,则该赖氨酸在对齐中的索引为394。
实施例1:大肠杆菌残基#392的比对指数是多少? (按对齐顺序回答394)。
实施例2我在394处的比对中发现了一个保守的残基。原始(无缺口)序列中的残基在哪里? (在大肠杆菌中为392,在枯草芽孢杆菌中为389)。
>>>cline = MuscleCommandline(input="enolase.txt", out="enolase.aln")
>>>cline()
>>> alignment = AlignIO.read("enolase.aln","fasta")
>>> print(alignment[:,390:])
SingleLetterAlphabet() alignment with 2 rows and 45 columns
AGQIKTGAPSRTDRVAKYNQLLRIEDQLAETAQYHGINSFYNLNK sp|P37869|ENO_BACSU
AGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYNGRKEIKGQA- sp|P0A6P9|ENO_ECOLI
>>> print(alignment[:,394])
KK
最佳答案
好玩的问题!据我所知,BioPython
中没有内置的东西。这是我要解决的方法。
让我们从示例2开始。如果您将两个文件enolase.txt
和enolase.aln
分别使用原始未插接序列和FASTA格式的对齐序列,那么我们可以遍历压缩记录,计算对齐序列中的间隔数序列并计算残基在未去序列中的索引:
from Bio import SeqIO, AlignIO
def find_in_original(index, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')
for original_record, alignment_record in zip(original, alignment):
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)
gaps = alignment_seq[:index].count('-')
original_index = len(alignment_seq[:index]) - gaps
assert alignment_seq[index] == original_seq[original_index]
yield ("The conserved residue {} at location {} in the alignment can be"
" found at location {} in {}.".format(alignment_seq[index],
index, original_index, original_record.id.split('|')[-1]))
结果是:
>>> for result in find_in_original(394, 'enolase.txt', 'enolase.aln'):
... print result
The conserved residue K at location 394 in the alignment can be found at location 392 in ENO_ECOLI.
The conserved residue K at location 394 in the alignment can be found at location 389 in ENO_BACSU.
对于相反的操作,我们查看比对中所有可能的索引,并减去差值,看看哪一个等于未取代的序列:
def find_in_alignment(index, organism, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')
for original_record, alignment_record in zip(original, alignment):
if organism in original_record.id:
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)
residue = original_seq[index]
for i in range(index, len(alignment_seq)):
if alignment_seq[i] == residue and \
alignment_seq[:i].replace('-', '') == original_seq[:index]:
return ("The residue {} at location {} in {} is at location"
" {} in the alignment.".format(residue, index,
organism, i))
结果是:
>>> print find_in_alignment(392, 'ENO_ECOLI', 'enolase.txt', 'enolase.aln')
The residue K at location 392 in ENO_ECOLI is at location 394 in the alignment.
>>> print find_in_alignment(389, 'ENO_BACSU', ungapped_path, alignment_path)
The residue K at location 389 in ENO_BACSU is at location 394 in the alignment.
关于python - 生物python比对的无索引,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/46535260/