Problem Description
I have a FASTA file that can easily be parsed by SeqIO.parse.
I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's waaaay too heavy (two iterations, conversions, etc.):
from Bio import SeqIO
import pandas as pd
# parse sequence fasta file
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
# converting lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# gathering Series into a pandas DataFrame, with the ID column as index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but I get a dict:
records = SeqIO.parse(fastaFile, 'fasta')
and I somehow can't get DataFrame.from_dict to work...
My goal is to iterate the FASTA file, and get IDs and sequence lengths into a DataFrame through each iteration.
Here is a short FASTA file for those who want to help.
Recommended Answer
You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.
SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:
from Bio import SeqIO

with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
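Once the two lists are built in that single pass, the DataFrame construction from the question works unchanged. A minimal sketch (the identifiers and lengths are made-up stand-ins for real parse results, so it runs without Biopython):

```python
import pandas as pd

# Hypothetical results standing in for a real single-pass FASTA parse
identifiers = ['seq1', 'seq2', 'seq3']
lengths = [120, 45, 300]

# Same DataFrame shape as in the question: lengths indexed by sequence ID
Qfasta = pd.DataFrame({'ID': identifiers, 'length': lengths}).set_index('ID')
print(Qfasta)
```

Passing both columns through one dict literal avoids building the two intermediate Series by hand.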
See Peter Cock's answer for a more efficient way of parsing just IDs and sequences from a FASTA file.
The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:
Consulting the source of pandas.Series, we can see that data is stored internally as a numpy ndarray:
class Series(np.ndarray, Picklable, Groupable):
"""Generic indexed series (time series or otherwise) object.
Parameters
----------
data: array-like
Underlying values of Series, preferably as numpy ndarray
If you make identifiers an ndarray, it can be used directly in Series without constructing a new array: the copy parameter (default False) prevents a new ndarray from being created when one isn't needed. By storing your sequences in a list, you force Series to coerce that list to an ndarray.
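To illustrate the difference, a small sketch (the values are made up for demonstration): a Series built from a plain list always coerces it into a brand-new ndarray, while a Series built from an existing ndarray with copy=False can reuse its buffer:

```python
import numpy as np
import pandas as pd

vals = np.array([10, 20], dtype=np.int64)

s_from_list = pd.Series([10, 20])         # the list is coerced into a new ndarray
s_from_arr = pd.Series(vals, copy=False)  # reuses the existing ndarray's buffer
```

Here np.shares_memory(vals, s_from_arr.values) comes back True, confirming no copy was made for the ndarray-backed Series.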
If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold the identifiers like so:
import numpy

num_seqs = 50
max_id_len = 60
identifiers = numpy.empty((num_seqs, 1), dtype='S{:d}'.format(max_id_len))
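As a sketch of how such preallocated arrays could be filled and handed to pandas (the records list is a made-up stand-in for the output of SeqIO.parse, so this runs without Biopython):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for parsed records: (id, length) pairs
records = [('seq1', 120), ('seq2', 45), ('seq3', 300)]

num_seqs = len(records)
identifiers = np.empty(num_seqs, dtype='S60')   # fixed-width byte strings
lengths = np.empty(num_seqs, dtype=np.int64)

# Fill the preallocated arrays in a single pass
for i, (seq_id, seq_len) in enumerate(records):
    identifiers[i] = seq_id
    lengths[i] = seq_len

# The ndarrays go into pandas without being re-coerced from lists
df = pd.DataFrame({'length': lengths},
                  index=identifiers.astype('U60')).rename_axis('ID')
```

The .astype('U60') step decodes the byte strings so the index holds ordinary text IDs.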
Of course, it's pretty hard to know exactly how many sequences you'll have, or what the largest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.