我实际上有一个巨大的multifasta seq文件,例如:

>Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042  (busco)
AAATGAGATA
>EOGATTT_0042_0042  (busco)
ATGGAAT
>EOGATTA_0042_0042  (busco)
ATAGGAGAT


我实际上想计算文件中有多少Busco基因(它们都以>EOG开头)。为此,我有一个脚本:

count=1
for record in SeqIO.parse("concatenate_with_busco_names_0035_0042_aa.fa", "fasta"):
    count+=1
print(count)

set_of_labels = set()

with open("concatenate_with_busco_names_0035_0042_aa.fa") as f:
  for line in f:
    if line.startswith('>EOG'):
      label = line[4:].split('_')[0]
      set_of_labels.add(label)

print("Total number of Busco genes: " + str(len(set_of_labels)))


但是我也想知道每个对应的组件之间我有多少个基因。我会更好地解释;

如您所见,每个seqID such _number_number中都有两个数字
这些数字是特定的,第一个_number对应于该序列所属的物种,第二个_number对应其特定编号。
无论如何,我想是否可以像我一样将第一个数字_0035_0042的seq获得多少个Busco基因
并且
seq ID多少个:

_0035_0042
_0035_0042
_0042_0042
_0042_0035


在以上示例中,它将是:

Total busco: 5 (I count only once if the >busco is present even if _number are different)
Total busco for the specie _0035 (_0035_0042 and _0035_0035) : 3
Total busco for the specie _0042 (_0042_0042 and _0042_0035) : 4
Total busco for the specific specie  _0035_0042 : 3
Total busco for the specific specie  _0042_0035 : 0
Total busco for the specific specie  _0042_0042 : 4
Total busco for the specific specie  _0035_0035 : 0


您好希望很清楚,实际上第一部分(total busco:)已经由我的脚本完成,我只需要数一下其他7种方式。

这是真实数据data

最佳答案

除了busco计数器外,您还可以使用多个计数器来获取物种和特定物种的单独计数,例如:

import collections

busco = collections.defaultdict(int)  # busco counter
species = collections.defaultdict(int)  # species counter
specific_species = collections.defaultdict(int)  # specific species counter

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco[entry[0]] += 1
            species[entry[1]] += 1
            specific_species[entry[1] + "_" + entry[2]] += 1

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie, total in specific_species.items():
    print("Total busco for the specific specie {}: {}".format(specie, total))


哪个应该产生:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0042: 4

Non-listed (specific) species won't show up but if you really want to print them out you can combine them from the species counter and print their values (default is 0):

import itertools

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, specific_species[s]))


产生:

Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0035: 0
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0035: 0
Total busco for the specific specie 0042_0042: 4

UPDATE: If you're after unique counts per busco then you need to inverse the count to index on specie / specific specie and collect busco values in a set as their value. Then all you need is to get the length of each set, something like:

import collections
import itertools

busco = set()
species = collections.defaultdict(set)
specific_species = collections.defaultdict(set)

with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco.add(entry[0])
            species[entry[1]].add(entry[0])
            specific_species[entry[1] + "_" + entry[2]].add(entry[0])

print("Total busco: {}".format(len(busco)))
for specie, buscos in species.items():
    print("Total busco for the specie {}: {}".format(specie, len(buscos)))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, len(specific_species[s])))


为您的完整数据打印:

巴士总数:421
现金0035的总busco:402
现金0042的总busco:397
特定品种的总busco 0035_0035:392
特定品种的总巴士总费用0035_0042:262
特定品种的总巴士总票数0042_0035:305
特定品种的总巴士总票数0042_0042:383

关于python - 计数seqID python中的特定模式,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/50349288/

10-12 14:27
查看更多