Extracting specific data from a file and writing it to another file

Problem description

I tagged this python and perl only because those are what I have used so far. If anyone knows a better way to go about this, I would certainly be willing to try it. Anyway, my problem:

I need to create an input file for a gene prediction program, in the following format:

seq1 5 15
seq1 20 34

seq2 50 48
seq2 45 36

seq3 17 20

Here seq# is the gene ID and the numbers to the right are the positions of exons within an open reading frame. I already have this information in a .gff3 file, which also contains a lot of other data. I can open the file in Excel and easily delete the columns of irrelevant data. Here is how it is arranged now:

PITG_00002  .   gene    2   397 .   +   .   ID=g.1;Name=ORF%
PITG_00002  .   mRNA    2   397 .   +   .   ID=m.1;
**PITG_00002**  .   exon    **2 397**   .   +   .   ID=m.1.exon1;
PITG_00002  .   CDS 2   397 .   +   .   ID=cds.m.1;

PITG_00004  .   gene    1   1275    .   +   .   ID=g.3;Name=ORF%20g
PITG_00004  .   mRNA    1   1275    .   +   .   ID=m.3;
**PITG_00004**  .   exon    **1 1275**  .   +   .   ID=m.3.exon1;P
PITG_00004  .   CDS 1   1275    .   +   .   ID=cds.m.3;P

PITG_00004  .   gene    1397    1969    .   +   .   ID=g.4;Name=
PITG_00004  .   mRNA    1397    1969    .   +   .   ID=m.4;
**PITG_00004**  .   exon    **1397  1969**  .   +   .   ID=m.4.exon1;
PITG_00004  .   CDS 1397    1969    .   +   .   ID=cds.m.4;


So I need only the data that is in bold. For example,

PITG_00002 2 397

PITG_00004 1 1275
PITG_00004 1397 1969

Any help you could give would be greatly appreciated, thanks!

Well, I messed up the formatting. Anything between the **'s is what I need.

Recommended answer

It looks like your data is tab-separated.

This Perl program prints columns 1, 4 and 5 from every record that has exon in the third column. You need to change the file name in the open statement to your actual file name.

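use strict;
use warnings;

# Change 'genes.gff3' to the name of your actual file.
open my $fh, '<', 'genes.gff3' or die $!;

while (<$fh>) {
  chomp;
  my @fields = split /\t/;    # the columns are tab-separated
  # Skip short lines (such as blank lines) and anything that is not an exon record
  next unless @fields >= 5 and $fields[2] eq 'exon';
  # Print columns 1, 4 and 5: sequence ID, start and end
  print join("\t", @fields[0,3,4]), "\n";
}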

Output

PITG_00002  2 397
PITG_00004  1 1275
PITG_00004  1397  1969
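
Since the question is also tagged python, here is a rough Python sketch of the same filter that writes the result to a second file instead of printing it. It is not part of the original answer. The function names extract_features and convert are made up for illustration, and the blank line between different sequence IDs and the space-separated columns are assumptions based on the input format shown in the question.

```python
def extract_features(lines, feature="exon"):
    """Yield 'seqid start end' strings for rows whose third column matches.

    An empty string is yielded whenever the sequence ID changes, so the
    caller can write a blank line between groups (an assumption based on
    the format shown in the question).
    """
    previous_id = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or short lines and rows of any other feature type.
        if len(fields) < 5 or fields[2] != feature:
            continue
        seqid, start, end = fields[0], fields[3], fields[4]
        if previous_id is not None and seqid != previous_id:
            yield ""
        previous_id = seqid
        yield f"{seqid} {start} {end}"


def convert(in_path, out_path, feature="exon"):
    """Read a tab-separated GFF3 file and write the filtered rows to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in extract_features(src, feature):
            dst.write(out_line + "\n")
```

It could be called as convert("genes.gff3", "exons.txt"), where exons.txt is a placeholder output name. If the gene prediction program does not need the blank separator lines, the two lines that yield the empty string can simply be dropped.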


08-20 10:15