How to extract FASTA sequences from a file using sequence IDs in a different file?

Problem description

I have two files:

sequence.fasta - a big file with multiple FASTA sequences

ids.txt - a file of sequence IDs in tab-delimited format.

I want to extract the sequences in sequence.fasta whose IDs match those in ids.txt and write them to another file.

sequence.fasta

>AUP4056.1
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI

>XIM5213.2
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKD

ids.txt

AUP4056.1 GUP5213.2 ARD5364.5 HAE6893.7
JIK6023.5 YUP7086.9

I need output like the following:

>AUP4056.1
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI

>GUP5213.2
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE

I have tried a Perl one-liner, but it is not working: it gives neither an error nor any output.

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sequence.fasta

Could anybody help me correct this code, or is there another Perl script that would do this?

Recommended answer

The problem here is that one-liners are very hard to follow, understand and untangle.
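One plausible reason the one-liner prints nothing with the ids.txt shown in the question: it stores each whole line of ids.txt as a single hash key, while the file holds several IDs per line, so no FASTA header ever matches a key. The small sketch below illustrates this with one sample line; the variable names are illustrative and not taken from the question.

```perl
use strict;
use warnings;

# One line of ids.txt as described in the question (IDs separated by tabs).
my $line = "AUP4056.1\tGUP5213.2\tARD5364.5\tHAE6893.7\n";

# What the one-liner effectively does: chomp the line, then use all of it as one key.
chomp( my $whole_line = $line );
my %seen_by_line = ( $whole_line => 1 );

# Splitting on whitespace instead gives one key per ID.
my %seen_by_id;
$seen_by_id{$_} = 1 for split ' ', $line;

my $id = 'AUP4056.1';
print "whole-line key: ", ( exists $seen_by_line{$id} ? 'found' : 'missing' ), "\n";    # missing
print "per-ID key:     ", ( exists $seen_by_id{$id}   ? 'found' : 'missing' ), "\n";    # found
```

This is also why the script in this answer splits each line of ids.txt on whitespace before using the IDs.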

So, writing it out "long hand":

#!/usr/bin/env perl

use strict;
use warnings;

open ( my $id_file, '<', 'ids.txt' ) or die $!;
#use split here, to split any lines on whitespace.
chomp ( my @ids = map { split } <$id_file> );
close ( $id_file );

my %sequences;

open ( my $input, '<', 'sequence.fasta' ) or die $!;
{
   local $/ = '';    #paragraph mode; Read until blank line

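   while ( <$input> ) {
      chomp;    #in paragraph mode this strips all trailing newlines
      #skip any paragraph that does not look like a FASTA record
      my ( $id, $sequence ) = m/>\s*(\S+)\n(.*)/ms or next;
      $sequences{$id} = $sequence;
   }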
}

foreach my $id (@ids) {
   if ( $sequences{$id} ) {
      print ">$id\n";
      print "$sequences{$id}\n";
   }
}

If you want to read the filenames from @ARGV:

my ( $ids_file, $sequence_file ) = @ARGV;
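As a rough sketch of how the pieces could fit together with the filenames taken from @ARGV: the version below reads the FASTA file line by line instead of in paragraph mode, so it does not assume blank lines between records. The helper names read_ids and extract_records are hypothetical, and unlike the script above it prints matching records in the order they appear in the FASTA file, without blank lines between them.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: collect whitespace-separated IDs from a file into a hash ref.
sub read_ids {
    my ($path) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    my %wanted;
    while ( my $line = <$fh> ) {
        $wanted{$_} = 1 for split ' ', $line;
    }
    close($fh);
    return \%wanted;
}

# Hypothetical helper: return the text of every FASTA record whose ID is wanted.
sub extract_records {
    my ( $wanted, $path ) = @_;
    open( my $fh, '<', $path ) or die "Cannot open $path: $!";
    my $out  = '';
    my $keep = 0;
    while ( my $line = <$fh> ) {
        # A header line starts a new record; decide whether to keep it.
        if ( $line =~ /^>\s*(\S+)/ ) {
            $keep = exists $wanted->{$1} ? 1 : 0;
        }
        # Copy header and sequence lines of kept records, skipping blank lines.
        $out .= $line if $keep && $line =~ /\S/;
    }
    close($fh);
    return $out;
}

# Runs only when filenames are given on the command line.
if ( @ARGV == 2 ) {
    my ( $ids_file, $sequence_file ) = @ARGV;
    print extract_records( read_ids($ids_file), $sequence_file );
}
elsif (@ARGV) {
    die "Usage: $0 ids.txt sequence.fasta\n";
}
```

Saved under a name of your choice (extract.pl here is only an example), it could be run as: perl extract.pl ids.txt sequence.fasta > matched.fasta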

I wouldn't try to compress this back into a one-liner. You probably could, but it would be quite hard to understand when you come back to it.


05-26 09:41