我有一个ID为1,786,916记录的数据文件,我想从另一个包含约480万条记录的文件(在这种情况下为DNA序列,但基本上只是纯文本)中检索相应的记录。我写了一个python脚本来做到这一点,但是要花很多时间才能运行(第3天,它只完成了12%)。由于我是python的相对新手,所以我想知道是否有人建议加快此操作。
这是带有ID的数据文件的示例(示例中的ID为ANICH889-10):
ANICH889-10 k__Animalia; p__Arthropoda; c__Insecta; o__Lepidoptera; f__Psychidae; g__Ardiosteres; s__Ardiosteres sp. ANIC9
ARONW984-15 k__Animalia; p__Arthropoda; c__Arachnida; o__Araneae; f__Clubionidae; g__Clubiona; s__Clubiona abboti
这是包含记录的第二个文件的示例:
>ASHYE2081-10|Creagrura nigripesDHJ01|COI-5P|HM420985
ATTTTATACTTTTTATTAGGAATATGATCAGGAATAATTGGTCTTTCAATAAGAATCATTATCCGTATTGAATTAAGAAATCCAGGATCTATTATTAATAATGACCAAATTTATAATTCATTAATTACTATACACGCACTATTAATAATTTTTTTTTTAGTTATACCTGTAATAATTGGAGGATTTGGAAATTGATTAATTCCTATTATAATTGGAGCCCCAGATATAGCATTTCCACGAATAAACAATCTTAGATTTTGATTATTAATCCCATCAATTTTCATATTAATATTAAGATCAATTACTAATCAAGGTGTAGGAACAGGATGAACAATATATCCCCCATTATCATTAAATATAAATCAAGAAGGAATATCAATAGATATATCAATTTTTTCTTTACATTTAGCAGGAATATCCTCAATTTTAGGATCAATTAATTTCATTTCAACTATTTTAAATATAAAATTTATTAATTCTAATTATGATCAATTAACTTTATTTTCATGATCAATTCTAATTACTACTATTTTATTATTACTAGCAGTCCCTGTATTAGCAGGAGCAATTACTATAATTTTAACTGATCGAAATTTAAATACTTCTTTTTTTGATCCTAGAGGAGGAGGAGATCCAATTT-----------------
>BCISA145-10|Hemiptera|COI-5P
AACTCTATACTTTTTACTAGGATCCTGGGCAGGAATAGTAGGAACATCATTAAGATGAATAATCCGAATTGAACTAGGACAACCTGGATCTTTTATTGGAGATGACCAAACTTATAATGTAATTGTAACTGCCCACGCATTTGTAATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAATTCCCTTAATAATTGGAGCACCCGATATAGCATTCCCACGAATGAATAACATAAGATTTTGATTGCTACCACCGTCCCTAACACTTCTAATCATAAGTAGAATTACAGAAAGAGGAGCAGGAACAGGATGAACAGTATACCCTCCATTATCCAGAAACATCGCCCATAGAGGAGCATCTGTAGATTTAGCAATCTTTTCCCTACATCTAGCAGGAGTATCATCAATTTTAGGAGCAGTTAACTTCATTTCAACAATTATTAATATACGACCAGCAGGAATAACCCCAGAACGAATCCCATTATTTGTATGATCTGTAGGAATTACAGCACTACTACTCCTACTTTCATTACCCGTACTAGCAGGAGCCATTACCATACTCTTAACTGACCGAAACTTCAATACTTCTTTTTTTGACCCTGCTGGAGGAGGAGATCCCATCCTATATCAACATCTATTC
但是,在第二个文件中,DNA序列分为多行而不是单行,并且它们的长度并不总是相同。
编辑
这是我想要的输出:
>ANICH889-10
GGGATTTGGTAATTGATTAGTTCCTTTAATA---TTGGGGGCCCCTGACATAGCTTTTCCTCGTATAAATAATATAAGATTTTGATTATTACCTCCCTCTCTTACATTATTAATTTCAAGAAGAATTGTAGAAAATGGAGCTGGGACTGGATGAACTGTTTACCCTCCTTTATCTTCTAATATCGCCCATAGAGGAAGCTCTGTAGATTTA---GCAATTTTCTCTTTACATTTAGCAGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACAATTATTAATATACGTTTAAATAATTTATCTTTCGATCAAATACCTTTATTTGTTTGAGCAGTAGGAATTACAGCATTTTTACTATTACTTTCTTTACCTGTATTAGCTGGA---GCTATTACTATATTATTAACT---------------------------------------------------------------------------
>ARONW984-15
TGGTAACTGATTAGTTCCATTAATACTAGGAGCCCCTGATATAGCCTTCCCCCGAATAAATAATATAAGATTTTGACTTTTACCTCCTTCTCTAATTCTTCTTTTATCAAGGTCTATTATNGAAAATGGAGCA---------GGAACTGGCTGAACAGTTTACCCTCCCCTTTCTTNTAATATTTCCCATGCTGGAGCTTCTGTAGATCTTGCAATCTTTTCCCTACACCTAGCAGGTATTTCCTCAATCCTAGGGGCAGTTAAT------TTTATCACAACCGTAATTAACATACGCTCTAGAGGAATTACATTTGATCGAATGCCTTTATTTGTATGATCTGTATTAATTACAGCTATTCTTCTACTACTCTCCCTCCCAGTATTAGCAGGGGCTATTACAATACTACTCACAGACCGAAATTTAAAT-----------------------------------
这是我为此编写的python脚本:
from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys
#Name of the datafile
Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
#Name of the original sequence file
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"
#Name of the output sequence file
f4 = open("02_Arthropoda_specimen_less.fasta", 'w')
#Reading the datafile and extracting record IDs
TaxaKeep = []
with open(Taxonomyfile, 'r') as f1:
datareader = csv.reader(f1, delimiter='\t')
for item in datareader:
TaxaKeep.append(item[0])
print(len(TaxaKeep))
#Filtering sequence file to keep only those sequences with the desired IDs
datareader = SeqIO.parse(OrigTaxonSeqsfile, "fasta")
for seq in datareader:
for item in TaxaKeep:
if item in seq.id:
f4.write('>' + str(item) + '\n')
f4.write(str(seq.seq) + '\n')
我认为这里的麻烦在于,我正在遍历480万条记录中的每条170万条记录的名称列表。我考虑过要为480万条记录制作字典或类似的东西,但我不知道该怎么做。有什么建议(包括非python建议)吗?
谢谢!
最佳答案
我认为您可以通过改进外观来大幅提高性能。
使用set()
可以帮助您。集合被设计为可以非常快速地查找数据,并且它们不存储重复的值,这使它们成为筛选数据的理想选择。因此,让我们将输入文件中的所有分类ID都存储在一个集中。
from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys
taxonomy_file = "02_Arthropoda_specimen_data_less.txt"
orig_taxon_sequence_file = "00_Arthropoda_specimen.fasta"
output_sequence_file = "02_Arthropoda_specimen_less.fasta"
# build a set for fast look-up of IDs
with open(taxonomy_file, 'r', newline='') as fp:
datareader = csv.reader(fp, delimiter='\t')
first_column = (row[0] for row in datareader)
taxonomy_ids = set(first_column)
# use the set to speed up filtering the input FASTA file
with open(output_sequence_file, 'w') as fp:
for seq in SeqIO.parse(orig_taxon_sequence_file, "fasta"):
if seq.id in taxonomy_ids:
fp.write('>')
fp.write(seq.id)
fp.write(seq.seq)
fp.write('\n')
我已经重命名了您的一些变量。仅在上面的注释中写入“ #Name of output sequence file”只是命名变量
f4
是完全没有意义的。为什么不删除注释并直接将变量命名为output_sequence_file
?(row[0] for row in datareader)
是生成器理解。生成器是一个可迭代的对象,这意味着它还不计算ID列表-它只知道要做什么。通过不建立临时列表来节省时间和内存。一行之后,接受可迭代对象的set()
构造函数将根据第一列中的所有ID构建一个集合。在第二个块中,我们使用
if seq.id in taxonomy_ids
检查是否应该输出序列ID。 in
在设置上非常快。我要四次调用
.write()
而不是从四个项目中构建一个临时字符串。我假设seq.id
和seq.seq
已经是字符串,所以实际上没有必要在它们上调用str()
。我对FASTA文件格式了解不多,但是快速浏览the BioPython documentation表示使用
SeqIO.write()
是创建格式的更好方法。关于python - 从大型文件检索行的更有效方法,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/49474214/