我试图根据另一个包含外显子所属基因信息的文件来折叠外显子。
一个文件包含外显子的名称和它们对应的基因。
例如:
Exon1 Gene4
Exon2 Gene5
Exon3 Gene8
Exon4 Gene8
另一个文件是包含外显子名称和序列的快速文件,例如:
>Exon1
ACGTCGATTTCGATCA
>Exon2
ACGTCGATTTCGATCAACGTAA
>Exon3
ACGTCGATTTCAGCGGATCA
>Exon4
CCCACGTCGATTTCGATCAACGTAA
所以我的目标是输出一个带有基因名和相应外显子折叠成一个序列(CDS)的aFASTA file。例如:
>Gene4
ACGTCGATTTCGATCA
>Gene5
ACGTCGATTTCGATCAACGTAA
>Gene8
ACGTCGATTTCAGCGGATCACCCACGTCGATTTCGATCAACGTAA
到目前为止,我已经创建了一个dict,其中基因名作为密钥,外显子名作为密钥列表,如下所示:
exon_to_gene_dict = {}
with open('gene_to_exon', 'r') as file_gene, open('exon.fasta', 'r') as file_exons:
for line in file_gene:
line = line.rstrip().split()
scaffold = line[0]
gene = line[1]
exon_to_gene_dict.setdefault(gene, []).append(scaffold)
然后我用外显子id和相应的序列创建了一个dict:
exon_sequence_dict = {}
with open(exons.fasta,'r') as file_exons:
for line in file_exons:
line = line.rstrip()
if line.startswith('>'):
exon_id = line[1:]
else:
sequence = line
exon_sequence_dict[exon_id] = sequence
gene_to_cds_dict = {}
for exon, seq in exon_sequence_dict.items():
for gene_id, exon1 in gene_to_exon_dict.items():
if exon in exon1:
gene_to_cds_dict.setdefault(gene_id, []).append(seq)
gene_id_to_sequence =[]
for gene_id, sequence in sorted(gene_to_cds_dict.items()):
print(gene_id, file=f_out)
print(''.join(sequence), file=f_out)
所以最后我还是按照上面的方法解决了
示例的输出为:
Gene4
ACGTCGATTTCGATCA
Gene5
ACGTCGATTTCGATCAACGTAA
Gene8
ACGTCGATTTCAGCGGATCACCCACGTCGATTTCGATCAACGTAA
谢谢你的帮助
最佳答案
这是一个完整的解决方案,使用BioPython。BioPython会自动处理the issue that Cory Weller raised(如果FASTA的序列被多行分割)。这个解决方案还将来自同一基因的序列连接到CDS中。
from Bio import SeqIO
from Bio import Seq
from collections import defaultdict
with open('gene_to_exon.txt') as f:
lines = f.read().splitlines()
exon_to_gene = dict(tuple(line.split()) for line in lines)
exon_records = SeqIO.to_dict(SeqIO.parse('exon.fasta', 'fasta'))
gene_records = defaultdict(list)
for exon_id, record in exon_records.items():
gene_id = exon_to_gene[exon_id]
gene_records[gene_id].append(record)
cds_records = []
for gene_id in gene_records:
gene_record = gene_records[gene_id][0]
if len(gene_records[gene_id]) > 1:
sequence = ''.join([str(gene_record.seq) for gene_record in gene_records[gene_id]])
gene_record.seq = Seq.Seq(sequence)
gene_record.id = gene_id
gene_record.description = gene_id # see https://www.biostars.org/p/156428/
cds_records.append(gene_record)
with open('output.fasta', 'w') as g:
SeqIO.write(cds_records, g, 'fasta')
在您的输入文件中,这将给出结果:
>Gene8
ACGTCGATTTCAGCGGATCACCCACGTCGATTTCGATCAACGTAA
>Gene5
ACGTCGATTTCGATCAACGTAA
>Gene4
ACGTCGATTTCGATCA
编辑:这里还有一个纯Python实现。FASTA解析代码来自this code monk blog post(并适用于Python 3),在这里作者详细描述了使用
itertools.groupby
比使用for
循环的优势。import itertools
from collections import defaultdict
def isheader(line):
return line.startswith('>')
def aspairs(filehandle):
for header,group in itertools.groupby(filehandle, isheader):
if header:
line = next(group)
ensembl_id = line[1:].split()[0]
else:
sequence = ''.join(line.strip() for line in group)
yield ensembl_id, sequence
with open('exon.fasta') as f:
exon_records = dict(aspairs(f))
with open('gene_to_exon.txt') as f:
lines = f.read().splitlines()
exon_to_gene = dict(tuple(line.split()) for line in lines)
gene_records = defaultdict(str)
for exon_id, seq in exon_records.items():
gene_id = exon_to_gene[exon_id]
gene_records[gene_id] += seq
with open('output.fasta', 'w') as f:
for gene_id, seq in gene_records.items():
f.write('>{}\n{}\n'.format(gene_id, seq))
关于python - 使用python折叠外显子,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/42999643/