Question
I tagged this python and perl only because those are what I've used so far. If anyone knows a better way to go about this, I'd certainly be willing to try it. Anyway, my problem:
I need to create an input file for a gene prediction program that uses the following format:
seq1 5 15
seq1 20 34
seq2 50 48
seq2 45 36
seq3 17 20
Here seq# is the gene ID and the numbers to the right are the positions of exons within an open reading frame. I have this information in a .gff3 file that also contains a lot of other information. I can open it in Excel and easily delete the columns with non-relevant data. Here's how it's arranged now:
PITG_00002 . gene 2 397 . + . ID=g.1;Name=ORF%
PITG_00002 . mRNA 2 397 . + . ID=m.1;
**PITG_00002** . exon **2 397** . + . ID=m.1.exon1;
PITG_00002 . CDS 2 397 . + . ID=cds.m.1;
PITG_00004 . gene 1 1275 . + . ID=g.3;Name=ORF%20g
PITG_00004 . mRNA 1 1275 . + . ID=m.3;
**PITG_00004** . exon **1 1275** . + . ID=m.3.exon1;P
PITG_00004 . CDS 1 1275 . + . ID=cds.m.3;P
PITG_00004 . gene 1397 1969 . + . ID=g.4;Name=
PITG_00004 . mRNA 1397 1969 . + . ID=m.4;
**PITG_00004** . exon **1397 1969** . + . ID=m.4.exon1;
PITG_00004 . CDS 1397 1969 . + . ID=cds.m.4;
So I need only the data that is in bold. For example:
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
Any help you could give would be greatly appreciated, thanks!
Well, I messed up the formatting. Everything between the ** markers is what I need.
Recommended answer
It looks like your data is tab-separated.
This Perl program prints columns 1, 4 and 5 of every record that has exon in the third column. You need to change the file name in the open statement to your actual file name.
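use strict;
use warnings;

# Change 'genes.gff3' to the name of your actual file
open my $fh, '<', 'genes.gff3' or die $!;

while (<$fh>) {
    chomp;
    my @fields = split /\t/;
    # Keep only exon records (the feature type is the third column)
    next unless @fields >= 5 and $fields[2] eq 'exon';
    # Print sequence ID, start and end, separated by tabs
    print join("\t", @fields[0,3,4]), "\n";
}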
Output
PITG_00002 2 397
PITG_00004 1 1275
PITG_00004 1397 1969
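Since the question was also tagged python, here is a rough Python sketch of the same filter that writes the result to a second file instead of printing it. It assumes, as above, that the file is tab-separated. The function names extract_exons and write_exons and the output name exons.txt are made up for illustration; genes.gff3 is the same placeholder file name used in the Perl program.

```python
def extract_exons(lines):
    """Yield (seqid, start, end) for each exon record in tab-separated GFF3 lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and lines starting with '#' (GFF3 comments/directives)
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Column 3 is the feature type; columns 1, 4 and 5 are ID, start and end
        if len(fields) >= 5 and fields[2] == "exon":
            yield fields[0], fields[3], fields[4]


def write_exons(in_path, out_path):
    """Copy the exon coordinates from in_path to out_path, one record per line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for seqid, start, end in extract_exons(src):
            dst.write(f"{seqid}\t{start}\t{end}\n")


# Small in-memory check using two records from the question
sample = [
    "PITG_00002\t.\tgene\t2\t397\t.\t+\t.\tID=g.1\n",
    "PITG_00002\t.\texon\t2\t397\t.\t+\t.\tID=m.1.exon1\n",
]
rows = list(extract_exons(sample))
```

To process a real file you would call something like write_exons("genes.gff3", "exons.txt"), with both names replaced by your own paths.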