

Biopython SeqIO to Pandas DataFrame



I have a FASTA file that can easily be parsed by SeqIO.parse.


I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.).

from Bio import SeqIO
import pandas as pd

# parse sequence fasta file (note: this reads the file twice)
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
# converting lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')
# Gathering Series into a pandas DataFrame and rename index as ID column
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])


I could do it with only one iteration, but then I get a dict:

records = SeqIO.parse(fastaFile, 'fasta')


and I somehow can't get DataFrame.from_dict to work...


My goal is to iterate over the FASTA file once and collect the IDs and sequence lengths into a DataFrame as I go.


Here is a short FASTA file for those who want to help.



You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later.


SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:

with open('sequences.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    lengths = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))
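As a minimal sketch of the single-pass idea, the ID and length of each record can be collected together and handed to pandas in one step. The function name `fasta_lengths_frame` and the two-record FASTA text are made up for illustration; an in-memory handle stands in for the real file.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO


def fasta_lengths_frame(handle):
    """Return a DataFrame of sequence lengths indexed by record ID."""
    # One pass over the parser: collect (id, length) pairs as we go.
    rows = [(rec.id, len(rec.seq)) for rec in SeqIO.parse(handle, "fasta")]
    return pd.DataFrame(rows, columns=["ID", "length"]).set_index("ID")


# Hypothetical two-record FASTA text standing in for sequences.fasta.
example_fasta = StringIO(">seq1 first\nACGT\nAC\n>seq2 second\nGGGTTTAAA\n")
qfasta = fasta_lengths_frame(example_fasta)
print(qfasta)
```

With a real file, the same function can be called inside `with open('sequences.fasta') as fasta_file:` so the handle is closed cleanly.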


See Peter Cock's answer for a more efficient way of parsing just the IDs and sequences from a FASTA file.
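The referenced answer is not reproduced here. One lower-overhead option that Biopython does provide is `SimpleFastaParser`, which yields plain `(title, sequence)` string tuples instead of building `SeqRecord` objects. The sketch below assumes that the first whitespace-delimited word of the title is the ID you want; the function name and sample data are hypothetical.

```python
from io import StringIO

import pandas as pd
from Bio.SeqIO.FastaIO import SimpleFastaParser


def fasta_lengths_simple(handle):
    """Build an ID -> length DataFrame from (title, sequence) string tuples."""
    rows = []
    for title, sequence in SimpleFastaParser(handle):
        # title is the header line without the leading ">".
        # Assumption: the ID is the first whitespace-delimited word.
        parts = title.split(None, 1)
        seq_id = parts[0] if parts else ""
        rows.append((seq_id, len(sequence)))
    return pd.DataFrame(rows, columns=["ID", "length"]).set_index("ID")


# Hypothetical two-record FASTA text used only for the demonstration.
simple_frame = fasta_lengths_simple(
    StringIO(">seq1 first\nACGT\nAC\n>seq2 second\nGGGTTTAAA\n")
)
print(simple_frame)
```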


The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:


Consulting the source of pandas.Series (this excerpt is from an older pandas release, in which Series subclassed ndarray), we can see that the data is stored internally as a numpy ndarray:

class Series(np.ndarray, Picklable, Groupable):
    """Generic indexed series (time series or otherwise) object.

    data:  array-like
        Underlying values of Series, preferably as numpy ndarray


If you make identifiers an ndarray, it can be used directly in Series without constructing a new array (the parameter copy, which defaults to False here, prevents a new ndarray from being created when one is not needed). If you store your identifiers in a list instead, you force Series to coerce that list to an ndarray.
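A small sketch of that difference, using numeric lengths rather than IDs to keep the dtype simple. `copy=False` is passed explicitly here because the default can differ between pandas versions; the variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sequence lengths already held in an ndarray.
lengths = np.array([6, 9, 12], dtype=np.int64)

# Wrap the existing array; copy=False asks pandas not to duplicate the buffer.
s_from_array = pd.Series(lengths, name="length", copy=False)
shares = np.shares_memory(s_from_array.to_numpy(), lengths)

# A plain list has no buffer to reuse, so pandas must build a new array from it.
s_from_list = pd.Series([6, 9, 12], name="length")

print("array-backed Series shares memory with input:", shares)
```

For a few thousand records the saving is unlikely to be noticeable; the list-based version is usually fine.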


If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold identifiers like so:

import numpy

num_seqs = 50
max_id_len = 60
# 1-D array of fixed-width text strings, one slot per sequence ID
identifiers = numpy.empty(num_seqs, dtype='U{:d}'.format(max_id_len))


Of course, it's pretty hard to know exactly how many sequences you'll have, or how long the longest ID is, so it's easiest to just let numpy convert from an existing list. However, preallocating the array is technically the fastest way to store your data for use in pandas.
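To make the two options concrete, here is a sketch that builds the same ID array both ways. The IDs are made up, and the maximum ID length is computed from the list purely for the demonstration; in practice you would have to know it before parsing.

```python
import numpy as np

# Hypothetical IDs, as they might be collected while parsing.
ids = ["seq1", "seq22", "s3"]

# Easiest: let numpy infer the fixed-width string dtype from the list.
from_list = np.array(ids)

# Preallocated: size and maximum ID length must be known up front.
max_id_len = max(len(seq_id) for seq_id in ids)
prealloc = np.empty(len(ids), dtype="U{:d}".format(max_id_len))
for i, seq_id in enumerate(ids):
    prealloc[i] = seq_id  # every slot is overwritten, so no garbage remains

print(from_list.dtype, prealloc.dtype)
```

Note that an ID longer than `max_id_len` would be silently truncated on assignment, which is one more reason the list-based conversion is the safer default.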
