Question
I want to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like:
\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b
etc.
My current idea is to extract some kind of longest substring out of each regexp, use Aho-Corasick to match these substrings and eliminate most of the regexps, and then match all the remaining regexps combined. Can anyone think of something better?
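The prefilter described in the question can be prototyped directly. Below is a minimal sketch in Python (not the asker's actual code): a small Aho-Corasick automaton searches for one hand-picked literal "anchor" per regexp, and only the regexps whose anchor was seen are run in full. Automatically extracting the longest guaranteed-literal substring from an arbitrary regexp is the hard part; here the anchors are chosen by hand for illustration.

```python
# Sketch of "extract a literal, Aho-Corasick prefilter, then run the
# surviving regexps". Anchors below are hand-picked, not extracted.
import re
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton over a set of literal strings."""
    def __init__(self, words):
        self.goto = [{}]    # per-state character transitions
        self.out = [[]]     # words recognised at each state
        self.fail = [0]     # failure links
        for w in words:
            s = 0
            for ch in w:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.out.append([])
                    self.fail.append(0)
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s].append(w)
        # breadth-first construction of failure links
        q = deque(self.goto[0].values())
        while q:
            r = q.popleft()
            for ch, s in self.goto[r].items():
                q.append(s)
                f = self.fail[r]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[s] = self.goto[f].get(ch, 0)
                self.out[s] = self.out[s] + self.out[self.fail[s]]

    def find(self, text):
        """Return the set of words occurring anywhere in text."""
        s, hits = 0, set()
        for ch in text:
            while s and ch not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(ch, 0)
            hits.update(self.out[s])
        return hits

# one hand-picked literal anchor per regexp (illustrative only)
ANCHORS = {
    "Obama": r"\bBarack\s(Hussein\s)?Obama\b",
    "Boehner": r"\b(John|J\.)\sBoehner\b",
}
AC = AhoCorasick(ANCHORS)

def search(text):
    """Run only the regexps whose literal anchor appears in text."""
    matches = []
    for anchor in AC.find(text):
        m = re.search(ANCHORS[anchor], text)
        if m:
            matches.append(m.group(0))
    return matches
```

The payoff is that the Aho-Corasick pass is a single linear scan regardless of how many anchors there are, and the (much slower) regexp engine only runs on the small set of patterns whose anchor actually occurred.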
Answer
You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for up to about 100 literals (for a 4-letter alphabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line and column numbers of the matches.
[ I assume grep -F does about the same ]
%{
/* C code to be copied verbatim */
#include <stdio.h>
/* flex tracks yylineno via the %option below, but not columns:
   keep a running counter by hand. YY_USER_ACTION runs before each
   rule's action, so yycolumn is the column just past the match. */
int yycolumn = 1;
#define YY_USER_ACTION yycolumn += yyleng;
%}
%option yylineno noyywrap
%%
"TTGATTCACCAGCGCGTATTGTC" { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again" ); }
"AGGTATCTGCTTCAATCAGCG" { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!" ); }
...
more lines
...
[bd-fh-su-z]+ {;}
\n { yycolumn = 1; }
[ \t\r]+ {;}
. {;}
%%
int main(void)
{
/* Call the lexer, then quit. */
yylex();
return 0;
}
A script like the one above can be generated from text input with awk or any other scripting language.
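As a sketch of that generation step (in Python rather than awk; reading one literal per line is an assumption about the input format), the rule section of the lexer could be emitted like this:

```python
# Sketch: emit the flex rule section from a list of literal patterns,
# one rule per literal. The input format (one literal per line) is an
# assumption, not part of the original answer.
def make_rules(literals):
    rules = []
    for i, lit in enumerate(literals):
        # escape characters that are special inside a quoted flex pattern
        esc = lit.replace("\\", "\\\\").replace('"', '\\"')
        rules.append(
            f'"{esc}" {{ printf("@%d: %d: pattern {i}\\n", yylineno, yycolumn); }}'
        )
    return "\n".join(rules)
```

Feeding the output of `make_rules` into the skeleton above, then running `flex` and `cc` on the result, gives one scanner that checks every literal in a single pass over the input.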