perl - Find and print all overlapping k-mers

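I am trying to write a Perl program that reads a FASTA file and writes a text file containing every (overlapping) k-mer of length 15 from the sequence file. The program works fine when I search for non-overlapping k-mers, but once I coded it to find overlapping k-mers it takes a very long time to run, and Cygwin finally killed it after 12 hours. (I left match_count in there to count the total; feel free to ignore that line.)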

#!/usr/bin/perl
use strict;
use warnings;

my $k = 15;
my $input = 'fasta.fasta';
my $output = 'text.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/=">";
#Separate header and sequence
#Remove spaces
unless (open(OUTPUT, ">", $output)){
    die "Unable to open file", $!;
}

while (my $line = <FASTA>){
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    while (length($sequence) >= $k){
        $sequence =~ m/(.{$k})/;
        print OUTPUT "$1\n";
        $sequence = substr($sequence, 1, length($sequence)-1);
    }
}

The result I am looking for is:

A total of 20938309 k-mers printed in the text file when I use the wc -l command.

Thanks in advance!

Best answer

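I am not sure why you are not getting the result you want.

I thought I would post the 2 programs I used, following the description in your question.

The first one only counts the kmers in the file I used for testing (fasta_dat.txt). It does not print them out; it just checks how many kmers there are.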

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "fasta_dat.txt" ,
                           -format => 'fasta');

my $count_kmers;
my $k = 15;
while ( my $seq = $in->next_seq) {
    $count_kmers += $seq->length - $k + 1;
}

print $count_kmers;

__END__
C:\Old_Data\perlp>perl t9.pl
18657

You can see the count (after the __END__ token): 18657. That count agrees with the number of kmers I got when I used your code to print them out.
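The count in the first program rests on the fact that a sequence of length n contains n - k + 1 overlapping k-mers. Here is a minimal sketch of that arithmetic without BioPerl. The sub name count_kmers and the sequence lengths are made up for illustration, and the guard for sequences shorter than k is an addition that the program above does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of overlapping k-mers in a sequence of
# length $len. Sequences shorter than $k contribute nothing.
sub count_kmers {
    my ($len, $k) = @_;
    return $len >= $k ? $len - $k + 1 : 0;
}

# Made-up sequence lengths, for illustration only.
my @lengths = (18, 15, 10);
my $total = 0;
$total += count_kmers($_, 15) for @lengths;
print "$total\n";    # 4 + 1 + 0 = 5
```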

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
use Devel::Size 'total_size';

my $k = 15;
my $input = 'fasta_dat.txt';
my $output = 'kmers.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/=">";
#Separate header and sequence
#Remove spaces
unless (open(OUTPUT, ">", $output)){
    die "Unable to open file", $!;
}

<FASTA>; # discard 'first' 'empty' record

my %seen;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);
        print OUTPUT $kmer, "\n" unless $seen{$kmer}++;
    }
}
print total_size(\%seen);
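
The inner loop above walks the sequence by offset and never changes the sequence itself. The loop in the question instead reassigns $sequence from substr on every pass, which most likely copies the rest of the string each time, and that may be part of why it ran so long on a large input. Below is a minimal sketch of the offset-based window on its own. The sub name kmers_of and the sample sequence are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every overlapping k-mer of $sequence, in order.
# The sequence is never shortened or reassigned; only the offset moves.
sub kmers_of {
    my ($sequence, $k) = @_;
    my @kmers;
    for my $i (0 .. length($sequence) - $k) {
        push @kmers, substr($sequence, $i, $k);
    }
    return @kmers;
}

print join(' ', kmers_of('ACGTAC', 3)), "\n";   # ACG CGT GTA TAC
```

For a very large record it is probably better to print each k-mer inside the loop, as the program above does, than to build the whole list in memory first.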

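Update: The test I ran shows that the hash takes roughly 100 times as many bytes of memory as there are kmers. In my test the number of kmers was about 18500, which resulted in a hash size of 1.8MB.

For your data, with about 22M kmers, that would mean a hash size of ~2.2GB. I don't know whether that would exceed your memory capacity.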
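
The estimate can be reproduced with a few lines of arithmetic. This sketch uses only the approximate figures quoted in this answer, and the variable names are hypothetical. It assumes every k-mer is distinct, so it is an upper bound: the hash stores only unique k-mers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approximate figures quoted in the answer.
my $test_kmers      = 18_500;   # k-mers in the test file
my $test_hash_bytes = 1.8e6;    # total_size of %seen, about 1.8MB
my $target_kmers    = 22e6;     # k-mers expected in the asker's data

# Scale the measured per-k-mer cost up to the larger data set.
my $bytes_per_kmer = $test_hash_bytes / $test_kmers;
my $estimate_gb    = $bytes_per_kmer * $target_kmers / 1e9;

printf "about %.0f bytes per k-mer, about %.1f GB in total\n",
    $bytes_per_kmer, $estimate_gb;
```

With these inputs the per-k-mer cost comes out slightly under 100 bytes, so the total lands a little below the ~2.2GB obtained by rounding up to 100 bytes per k-mer.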