perl - Finding motifs in a protein sequence?

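I have written the following script to search for a motif (substring) in a protein sequence (string). I am a beginner and writing this was tough for me. I have two questions about it:
1. Errors: the following script has a few errors. I have been at it for quite some time now but have not figured out what they are or why they occur.
2. The following script was written to search for one motif (substring) in a protein sequence (string). My next task involves searching for multiple motifs in a specific order in the same protein sequence (for example: motif1, motif2, motif3; this order cannot be changed).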

        use strict;
        use warnings;

        my @file_data=();
        my $motif ='';
        my $protein_seq='';
        my $h= '[VLIM]';
        my $s= '[AG]';
        my $x= '[ARNDCEQGHILKMFPSTWYV]';
        my $regexp = "($h){4}D($x){4}D"; #motif to be searched is hhhhDxxxxD
        my @locations=();

        @file_data= get_file_data("seq.txt");

        $protein_seq= extract_sequence(@file_data);

    #searching for a motif hhhhDxxxxD in each protein sequence in the give file

        foreach my $line(@file_data){
        if ($motif=~ /$regexp/){
        print "found motif \n\n";
        }
        else {
        print "not found \n\n";
        }
        }
#recording the location/position of motif to be outputed

        @locations= match_position($regexp,$seq);
        if (@locations){
        print "Searching for motifs $regexp \n";
        print "Catalytic site is at location:\n";
        }
        else{
        print "motif not found \n\n";
        }
        exit;

        sub get_file_data{
        my ($filename)=@_;
        use strict;
        use warnings;
        my $sequence='';

        foreach my $line(@file_data){

        if ($line=~ /^\s*$/){
        next;
                }
        elsif ($line=~ /^\s*#/){
        next;
        }
        elsif ($line=~ /^>/){
        next;
        }
        else {
        $sequence.=$line;
        }
        }
        $sequence=~ s/\s//g;
        return $sequence;
        }

        sub(match_positions) {
        my ($regexp, $sequence)=@_;
        use strict;
        my @position=();
        while ($sequence=~ /$regexp/ig){
        push (@position, $-[0]);
        }
        return @position;
        }

Best answer

First of all, the keyword is elsif. Second, you don't need it. You can condense the code in the get_file_data loop to:

next if $line =~ /^\s*$|^>/;
$sequence .= $line;

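As long as you are using a regex anyway (unless it becomes too unwieldy), you might as well match all the cases you want to ignore in one pattern. If you find a genuine additional case, add it as another alternative. Say you want to exclude lines starting with #-. Then you can add it like so: /^\s*$|^>|^#-/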
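
As a minimal sketch of that idea, the snippet below filters an in-memory list of lines with a single alternation. The sample lines and the name @lines are made up for illustration; the ^\s*# alternative is carried over from the comment-line check in the question's loop.

```perl
use strict;
use warnings;

# Hypothetical sample input: a FASTA-style header, a comment, a blank line,
# and two sequence lines.
my @lines = (
    ">sp|example header\n",
    "# a comment line\n",
    "\n",
    "MKVLIMD\n",
    "ARNDD\n",
);

my $sequence = '';
for my $line (@lines) {
    # Skip blank lines, FASTA headers, and comment lines in one test.
    next if $line =~ /^\s*$|^>|^\s*#/;
    $sequence .= $line;
}
$sequence =~ s/\s//g;    # strip newlines and any other whitespace

print "$sequence\n";     # prints MKVLIMDARNDD
```
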
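Another thing: my position=(); needs an @ sigil before position. Otherwise Perl thinks you are trying to do something tricky by calling position().
You also need to make the following change: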

 my $h= '[VLIM]';
 my $s= '[AG]';
 my $x= '[ARNDCEQGHILKMFPSTWYV]';

Without the quotes, you are just assigning $h a reference to an array with a single slot, filled with whatever a sub named VLIM returns.
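
To illustrate why the quotes matter, here is a small sketch that builds the hhhhDxxxxD pattern from quoted character classes and compiles it once with qr//. The non-capturing groups and the test sequence are choices made for this example, not part of the original script.

```perl
use strict;
use warnings;

# Quoted strings: each one holds a regex character class as plain text.
# Unquoted, [ ... ] is the anonymous array constructor, not a character class.
my $h = '[VLIM]';                    # hydrophobic residues
my $x = '[ARNDCEQGHILKMFPSTWYV]';    # any standard residue

# ref() returns the empty string for a plain (non-reference) scalar.
print "h is a plain string\n" if ref($h) eq '';

# Interpolate the classes into one compiled pattern: hhhhDxxxxD.
my $regexp = qr/(?:$h){4}D(?:$x){4}D/;

# Made-up test sequence containing VLIMDAAAAD.
my $protein_seq = 'GGGVLIMDAAAADGG';
print "found motif\n" if $protein_seq =~ $regexp;
```
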
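Third, don't use $&. Replace pos($sequence)-length($&)+1 with the following (note that $-[0] is a 0-based offset, so add 1 if you need the same 1-based position):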

push @positions, $-[0];

Or better yet, use English:

use English qw<-no_match_vars>;
...
push @positions, $LAST_MATCH_START[0];
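
Putting these pieces together, a match_positions helper could look like the sketch below. It assumes 1-based positions are wanted, as in the pos($sequence)-length($&)+1 expression; drop the + 1 to get 0-based offsets. The test sequence is made up.

```perl
use strict;
use warnings;

# Return the 1-based start position of every match of $regexp in $sequence.
sub match_positions {
    my ($regexp, $sequence) = @_;
    my @positions;
    while ($sequence =~ /$regexp/ig) {
        # $-[0] is the 0-based offset where the whole match starts.
        push @positions, $-[0] + 1;
    }
    return @positions;
}

my $regexp = '[VLIM]{4}D[ARNDCEQGHILKMFPSTWYV]{4}D';

# Made-up sequence with the motif at 1-based positions 3 and 17.
my $seq       = 'GGVLIMDAAAADGGGGIIIIDKKKKDGG';
my @locations = match_positions($regexp, $seq);
print "Motif found at: @locations\n";
```

Because the loop uses /g in scalar context, each search resumes where the previous match ended, so overlapping occurrences are not reported.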

I would suggest reading the file like this:

use IO::File;
...
# Use real file handles
my $fh = IO::File->new( "<seq.txt" );
get_file_data( $fh ); # They can be passed
...
sub get_file_data{
    my $file_handle = shift;
    ...
    # while loop conserves resources
    while ( my $line = <$file_handle> ) {
        next if $line =~ /^\s*$|^>/;
        $sequence .= $line;
    }
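
For reference, a complete version of such a sub might look like the sketch below. To stay self-contained it reads from an in-memory filehandle instead of seq.txt; the FASTA-style sample text is invented, and the final whitespace stripping is carried over from the question's code.

```perl
use strict;
use warnings;

# Read sequence lines from an open filehandle and return them as one string.
sub get_file_data {
    my $file_handle = shift;
    my $sequence    = '';
    # Reading line by line avoids holding the whole file in an array.
    while ( my $line = <$file_handle> ) {
        next if $line =~ /^\s*$|^>/;    # skip blank lines and FASTA headers
        $sequence .= $line;
    }
    $sequence =~ s/\s//g;               # remove newlines and spaces
    return $sequence;
}

# Invented sample data standing in for the contents of seq.txt.
my $text = ">example protein\nGGVLIMD\n\nAAAADGG\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $protein_seq = get_file_data($fh);
close $fh;

print "$protein_seq\n";    # prints GGVLIMDAAAADGG
```
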

A suggestion going forward that has helped me tremendously:

A. Install Smart::Comments

B. Put this at the top of your script:

 use Smart::Comments;

C. Whenever you are unsure what you have so far, for example if you want to see the current contents of $sequence, put the following in your code:

### $sequence
exit 0;

This just displays the value and exits. If the output becomes too noisy, remove those comments again.
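
The second part of the question, finding several motifs in a fixed order, is not addressed above. One possible approach, sketched here under assumptions, is to join the individual patterns with a lazy gap (.*?) so that they must appear in the given order. Only the first pattern comes from the question; the second and third patterns and the test sequence are hypothetical placeholders.

```perl
use strict;
use warnings;

my $h = '[VLIM]';
my $x = '[ARNDCEQGHILKMFPSTWYV]';

# Motifs in the order they must occur. Only the first comes from the
# question; the other two are placeholders for illustration.
my @motifs = (
    "(?:$h){4}D(?:$x){4}D",    # hhhhDxxxxD
    'GG[AG]',                  # hypothetical motif2
    'KK',                      # hypothetical motif3
);

# Capture each motif and allow any residues between consecutive motifs.
my $ordered = join '.*?', map( "($_)", @motifs );

my $seq = 'AVLIMDAAAADTTGGASSKKT';    # made-up sequence
if ( $seq =~ /$ordered/ ) {
    # @- holds match offsets: $-[1] is motif1, $-[2] is motif2, and so on.
    for my $i ( 1 .. @motifs ) {
        printf "motif%d starts at offset %d\n", $i, $-[$i];
    }
}
else {
    print "motifs not found in this order\n";
}
```

The offsets printed are 0-based. If the motifs may not overlap the gap in some specific way, or a maximum distance between them is required, the .*? separator would need to be replaced by a bounded pattern.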

Regarding "perl - Finding motifs in a protein sequence?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/831640/