问题描述
基本上,GenBank文件由基因条目(由"gene"宣布,然后是其对应的"CDS"条目(每个基因只有一个))组成,就像我在下面显示的两个一样.制表符分隔的两列文件,'gene'和'CDS'始终在其前后,如果使用可以使用的工具可以轻松地执行此任务,请告诉我.
Basically, a GenBank file consists on gene entries (announced by 'gene' followed by its corresponding 'CDS' entry (only one per gene) like the two I show here below. I would like to get locus_tag vs product in a tab-delimited two column file. 'gene' and 'CDS' are always preceded and followed by spaces. If this task can be easily performed using an already available tool, please let me know.
输入文件:
gene complement(8972..9094)
/locus_tag="HAPS_0004"
/db_xref="GeneID:7278619"
CDS complement(8972..9094)
/locus_tag="HAPS_0004"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/protein_id="YP_002474657.1"
/db_xref="GI:219870282"
/db_xref="GeneID:7278619"
/translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
gene 9632..11416
/gene="frdA"
/locus_tag="HAPS_0005"
/db_xref="GeneID:7278620"
CDS 9632..11416
/gene="frdA"
/locus_tag="HAPS_0005"
/note="part of four member fumarate reductase enzyme
complex FrdABCD which catalyzes the reduction of fumarate
to succinate during anaerobic respiration; FrdAB are the
catalytic subcomplex consisting of a flavoprotein subunit
and an iron-sulfur subunit, respectively; FrdCD are the
membrane components which interact with quinone and are
involved in electron transfer; the catalytic subunits are
similar to succinate dehydrogenase SdhAB"
/codon_start=1
/transl_table=11
/product="fumarate reductase flavoprotein subunit"
/protein_id="YP_002474658.1"
/db_xref="GI:219870283"
/db_xref="GeneID:7278620"
/translation="MQTVNVDVAIVGAGGGGLRAAIAAAEANPNLKIALISKVYPMRS
HTVAAEGGAAAVAKEEDSYDKHFHDTVAGGDWLCEQDVVEYFVEHSPVEMTQLERWGC
PWSRKADGDVNVRRFGGMKIERTWFAADKTGFHLLHTLFQTSIKYPQIIRFDEHFVVD
ILVDDGQVRGCVAMNMMEGTFVQINANAVVIATGGGCRAYRFNTNGGIVTGDGLSMAY
RHGVPLRDMEFVQYHPTGLPNTGILMTEGCRGEGGILVNKDGYRYLQDYGLGPETPVG
KPENKYMELGPRDKVSQAFWQEWRKGNTLKTAKGVDVVHLDLRHLGEKYLHERLPFIC
ELAQAYEGVDPAKAPIPVRPVVHYTMGGIEVDQHAETCIKGLFAVGECASSGLHGANR
LGSNSLAELVVFGKVAGEMAAKRAVEATARNQAVIDAQAKDVLERVYALARQEGEESW
SQIRNEMGDSMEEGCGIYRTQESMEKTVAKIAELKERYKRIKVKDSSSVFNTDLLYKI
ELGYILDVAQSISSSAVERKESRGAHQRLDYVERDDVNYLKHTLAFYNADGTPTIKYS
DVKITKSQPAKRVYGAEAEAQEAAAKKE"
所需的输出(在制表符分隔的两个列文件中,locus_tag与产品):
Desired output (locus_tag vs product in a tab-delimited two columnfile):
HAPS_0004 hypothetical protein
HAPS_0005 fumarate reductase flavoprotein subunit
实际上,具有此输出将是理想的,每个基因一行(仅显示一个基因):
In fact, having this output would be ideal, one line for each gene (shown for only one gene):
locus_tag="HAPS_0004" db_xref="GeneID:7278619" complement(8972..9094) codon_start=1 transl_table=11 product="hypothetical protein" protein_id="YP_002474657.1" db_xref="GI:219870282" db_xref="GeneID:7278619" translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
推荐答案
perl -nE'
BEGIN{ ($/, $") = ("CDS", "\t") }
say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file
输出
HAPS_0004 hypothetical protein
HAPS_0005 fumarate reductase flavoprotein subunit
这篇关于解析GenBank文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!