首先,对于我的pythonic无知我深表歉意。我需要将我的多序列fasta文件分解为相等大小的块,以供下游管道使用。我没有遇到任何容易做到这一点或以我正在寻找的格式进行的事情。
快速文件输入示例:
fas
> contig1
交流协会
> contig2
GGGATAGTCA
> contig3
GACTACTTTT
上面的示例fasta具有25bp。如果我将“块数”参数设置为“ 4”,那么我希望我的输出文件全部具有7个碱基对,最后一个文件除外,该文件具有其余的4bp。它看起来像这样:
chunk1.fas
> contig1:0-4
交流协会
> contig2:0-1
GG
chunk2.fas
> contig2:2-7
加泰格
chunk3.fas
> contig2:9-9
一种
> contig3:0-5
加塔克
chunk4.fas
> contig3:6-9
TTTT
请注意,除chunk4.fas中剩余的碱基对外,每个生成的chunk * .fas均包含7个碱基对。而且,块文件中的每个结果序列头均已与原始序列有所不同,因此它们包括从原始序列派生的“:”以及“开始”和“停止”位置。
biopython食谱有一个非常不错的批处理大小迭代器工具,我想我的答案就在操纵此代码内,但我不知道如何进行此操作。
任何帮助表示赞赏。干杯。
def batch_iterator(iterator, batch_size):
"""Returns lists of length batch_size.
This can be used on any iterator, for example to batch up
SeqRecord objects from Bio.SeqIO.parse(...), or to batch
Alignment objects from Bio.AlignIO.parse(...), or simply
lines from a file handle.
This is a generator function, and it returns lists of the
entries from the supplied iterator. Each list will have
batch_size entries, although the final list may be shorter.
"""
entry = True # Make sure we loop once
while entry:
batch = []
while len(batch) < batch_size:
try:
entry = next(iterator)
except StopIteration:
entry = False
if not entry:
# End of file
break
batch.append(entry)
if batch:
yield batch
record_iter = SeqIO.parse('aVan.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 1000), start=1):
filename = 'group_{}.fasta'.format(i)
count = SeqIO.write(batch, filename, 'fasta')
print('Wrote {} records to {}'.format(count, filename))
最佳答案
这不是一件容易的事,但请看一下此实现:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
chunk_number = 4
records = list(SeqIO.parse("input.fasta", "fasta"))
chunk_size = sum(len(r) for r in records) // chunk_number + 1
def create_batch(records, chunk_size):
record_it = iter(records)
record = next(record_it)
current_base = 0
batch = []
batch_size = 0
# While there are new records, keep creating new batches.
while record:
# Loop over records untill the batch is full. (or no new records)
while batch_size != chunk_size and record:
end = current_base + chunk_size - batch_size
seq = record[current_base:end]
end_of_slice = current_base + len(seq) - 1
fasta_header = record.id + ":{}-{}".format(current_base, end_of_slice)
seq.id = seq.name = fasta_header
seq.description = ''
batch.append(seq)
current_base += len(seq)
batch_size += len(seq)
# Current record is exhausted, get a new one.
if current_base >= len(record):
record = next(record_it, None)
current_base = 0
# We have a batch with the correct size (or no new bathces)
yield batch
batch = []
batch_size = 0
for i, batch in enumerate(create_batch(records, chunk_size)):
filename = "chunk{}.fasta".format(i)
SeqIO.write(batch, filename, "fasta")
关于python - 如何使用biopython将多fasta文件拆分为相等序列长度的块并更改标题,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/58305093/