Problem description
I have millions of pairs of strings of the same length, and I want to compare each pair and find the positions where they mismatch.
For example, for each of $str1 and $str2 we want to find the positions where it mismatches $str_source:
$str_source = "ATTCCGGG";
$str1 = "ATTGCGGG"; # 1 mismatch with source at position 3 (0-based)
$str2 = "ATACCGGC"; # 2 mismatches with source at positions 2 and 7
Is there a fast way to do this? Currently I use a C-style loop that visits every position in both strings with the 'substr' function, but this approach is horribly slow.
my @mism_pos;
for my $i (0 .. length($str_source) - 1) {    # last valid index is length - 1
    my $source_base = substr($str_source, $i, 1);
    my $str_base    = substr($str2, $i, 1);   # length argument is 1, not $1
    if ($source_base ne $str_base) {
        push @mism_pos, $i;
    }
}
Recommended answer
Inline::C
The computation is simple, so do it with Inline::C (see perldoc Inline::C-Cookbook and perldoc Inline::C for documentation):
use Inline C => << '...';
void find_diffs(char* x, char* y) {
    int i;
    Inline_Stack_Vars;
    Inline_Stack_Reset;
    for(i=0; x[i] && y[i]; ++i) {
        if(x[i] != y[i]) {
            Inline_Stack_Push(sv_2mortal(newSViv(i)));
        }
    }
    Inline_Stack_Done;
}
...
@diffs= find_diffs("ATTCCGGG","ATTGCGGG"); print "@diffs\n";
@diffs= find_diffs("ATTCCGGG","ATACCGGC"); print "@diffs\n";
Here is the output of this script:
> script.pl
3
2 7
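Where no C compiler is available, a similar result can be sketched in pure Perl with the string XOR operator. This is an alternative technique, not part of the answer above, and the name find_diffs_xor is hypothetical. It assumes both arguments are plain byte strings of equal length; XOR yields a "\0" byte wherever the two characters agree, so every non-null byte marks a mismatch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 0-based positions where $x and $y differ.
# Assumes both strings have the same length and are used only as strings,
# so that ^ performs a bytewise string XOR.
sub find_diffs_xor {
    my ($x, $y) = @_;
    my $mask = $x ^ $y;              # "\0" where the characters are equal
    my @pos;
    while ($mask =~ /[^\0]/g) {      # each non-null byte is a mismatch
        push @pos, $-[0];            # $-[0] is the offset where the match starts
    }
    return @pos;
}

my @diffs;
@diffs = find_diffs_xor("ATTCCGGG", "ATTGCGGG"); print "@diffs\n";   # 3
@diffs = find_diffs_xor("ATTCCGGG", "ATACCGGC"); print "@diffs\n";   # 2 7
```

The per-character work happens inside the XOR operator and the regex engine rather than in a Perl-level loop over substr calls.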
PDL
If you want to process large amounts of data quickly in Perl, learn PDL (documentation):
use PDL;
use PDL::Char;
$PDL::SHARE=$PDL::SHARE; # keep stray warning quiet
my $source = PDL::Char->new("ATTCCGGG");
for my $str ("ATTGCGGG", "ATACCGGC") {
    my $match = PDL::Char->new($str);
    my @diff  = which($match != $source)->list;
    print "@diff\n";
}
(Same output as the first script.)
Notes: I have used PDL very happily for genomic data processing. Combined with memory-mapped access to data stored on disk, it can process huge amounts of data quickly, because all processing is done in highly optimized C loops. You can also easily access the same data through Inline::C for any features missing in PDL.
Note, however, that creating a single PDL vector is quite slow (a constant per-object overhead, which is acceptable for large data structures). So you are better off creating one large PDL object holding all your input data in one go, rather than looping over individual data elements.
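As a minimal sketch of that advice, the strings can be packed into a single two-dimensional PDL::Char object and compared against the source in one operation. The variable names here are illustrative, and the sketch assumes every string has the same length as the source. which() returns flat indices with the character position varying fastest, so each index is split back into a string number and a position.

```perl
use strict;
use warnings;
use PDL;
use PDL::Char;

my $source  = PDL::Char->new("ATTCCGGG");
my @strings = ("ATTGCGGG", "ATACCGGC");
my $len     = length $strings[0];

# One PDL object for all inputs: dims are (string length, number of strings).
my $all = PDL::Char->new(\@strings);

# A single comparison; $source is broadcast across every row of $all.
my @flat = which($all != $source)->list;

# Flat index = position + $len * string number (first dimension varies fastest).
my @mismatches = map { [] } @strings;
for my $idx (@flat) {
    push @{ $mismatches[ int($idx / $len) ] }, $idx % $len;
}

print "@$_\n" for @mismatches;   # one line of mismatch positions per string
```

Only one PDL object is built for the whole batch, so the per-object construction cost is paid once instead of once per string.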