I'm not sure I completely understand what is going on with the following regular expression search:
>>> import re
>>> template = re.compile("(\w+)+\.")
>>> target = "a" * 30
>>> template.search(target)
The search() call takes minutes to complete and CPU usage goes to 100%. The behavior is reproducible on both Python 2.7.5 and 3.3.3.
An interesting fact: if the string is less than 20-25 characters long, search() returns almost instantly.
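The blow-up is easy to measure. Below is a minimal sketch; the specific lengths (16 to 22) are my own choice to keep the demo fast, and absolute timings are machine-dependent:

```python
import re
import time

pattern = re.compile(r"(\w+)+\.")  # raw string avoids an invalid-escape warning

def failing_search_time(n):
    """Time a search over n a's with no trailing period (guaranteed to fail)."""
    start = time.perf_counter()
    pattern.search("a" * n)
    return time.perf_counter() - start

# Each extra character roughly doubles the work, so the time explodes quickly.
for n in (16, 18, 20, 22):
    print(n, failing_search_time(n))
```

Each additional character roughly doubles the runtime, which is why 30 characters takes minutes while 20 feels instant.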
What is happening?
Understanding this problem requires understanding how the NFA underlying the regex engine works.
Elaborating the formal definition of an NFA would be too heavy a task for this answer; the Wikipedia article on nondeterministic finite automata gives a better explanation. Here, just think of the NFA as a robot searching for the pattern you give it.
A crudely implemented NFA is somewhat dumb: it only looks ahead one or two tokens at a time. So in your synthetic example, the NFA at first looks only at \w+ (not at the parentheses used for grouping).
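You can actually see the structure the engine works with: Python's re.DEBUG flag prints the compiled parse tree. The exact formatting of the dump varies across Python versions, but the two nested MAX_REPEAT nodes (the two stacked + quantifiers) are the source of the trouble:

```python
import io
import re
from contextlib import redirect_stdout

# re.DEBUG makes the compiler print its parse tree to stdout; capture it.
buf = io.StringIO()
with redirect_stdout(buf):
    re.compile(r"(\w+)+\.", re.DEBUG)
tree = buf.getvalue()
print(tree)
```

The dump shows an outer MAX_REPEAT wrapping a SUBPATTERN that itself contains another MAX_REPEAT over the word-character class, followed by the literal period.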
Because + is a greedy quantifier, i.e. it matches as many characters as possible, the NFA dumbly keeps consuming characters from target. After 30 a's, the NFA hits the end of the string. Only then does it realize that it still needs to match the other tokens in template. The next token is the outer +, which is already satisfied, so the NFA proceeds to \.. This time it fails.
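That first greedy gulp can be observed in isolation by matching \w+ on its own:

```python
import re

target = "a" * 30

# \w+ alone greedily consumes the whole string in a single pass.
m = re.match(r"\w+", target)
print(len(m.group()))  # the quantifier swallowed all 30 characters
```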
What the NFA does next is step one character back, trying to match the pattern by truncating the submatch of \w+. So the NFA splits target into two groups: 29 a's for one \w+, and one trailing a. It first tries to consume the trailing a by matching it against the second +, but it still fails when it reaches \.. The NFA continues this process until it either gets a full match or has tried all possible partitions.
So (\w+)+\. instructs the NFA to group target in this manner: target is partitioned into one or more groups, each with at least one character, and target must end with a period '.'. As long as the period fails to match, the NFA tries every possible partition. So how many partitions are there? 2^(n-1), exponential in n (just think of each of the n-1 gaps between the a's as either holding a group separator or not). Like below:
aaaaaaa a
aaaaaa aa
aaaaaa a a
...
aa a a ... a
a a a a a .... a
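To make the count concrete: a string of n characters has n-1 gaps, and each gap independently either starts a new group or does not, giving 2^(n-1) partitions. A quick check:

```python
def partition_count(n):
    """Number of ways to split n characters into contiguous non-empty groups."""
    return 2 ** (n - 1)

print(partition_count(30))  # 536870912 -- over half a billion partitions
```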
If the NFA matches \., it doesn't hurt much. But when the match is bound to fail, the expression is doomed to never-ending exponential backtracking.
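The practical fix is to remove the redundant nesting: (\w+)+ matches exactly the same strings as \w+, so the pattern can be rewritten without the exponential partition search. A sketch:

```python
import re

slow = re.compile(r"(\w+)+\.")  # exponential backtracking on failure
fast = re.compile(r"\w+\.")     # matches the same strings, fails in linear time

target = "a" * 1000  # hopeless for `slow`, instant for `fast`
print(fast.search(target))        # no match, returned immediately
print(fast.search(target + "."))  # matches the whole string
```

The rewritten pattern has only one quantifier, so on failure the engine backtracks at most once per starting position instead of exploring every partition.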
I'm not advertising, but Mastering Regular Expressions is a good book for understanding the mechanism behind regular expressions.