
Problem description

In Python regular expressions,

re.compile("x"*50000)

gives me OverflowError: regular expression code size limit exceeded

but the following one raises no error, although it pushes the CPU to 100% and took about 1 minute on my PC:

>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>

Is that normal?

Should I assume that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is somehow shorter than "x"*50000?

Tested on Python 2.6, Win32

UPDATE 1:

It looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduced to .*?

So, how about this one?

re.compile(".*?x"*50000)

It does compile. If that one could also be reduced to ".*?x", it should match the string "abcx", or "x" alone, but it matches neither.

So, am I missing something?

UPDATE 2:

My point is not to find the maximum length of a regex source string. I would like to understand why "x"*50000 is caught by the overflow check while ".*?x"*50000 is not.

It does not make sense to me, which is why I am asking.

Is something missing from the overflow check, is this behaviour fine, or is something really overflowing?

Any hints or opinions would be appreciated.

Solution

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).
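The reducibility claim can be checked with small repeat counts. The sketch below is not from the original post: the count of 3, the sample strings, and the helper name matched_text are chosen here for illustration only. It shows that ".*?" repeated matches the same text as a single ".*?", while ".*?x" repeated N times needs N occurrences of "x" and so cannot be reduced to ".*?x". It says nothing about whether the engine performs such a simplification internally.

```python
import re

# Small repeat count, chosen for illustration only (the post uses 50000).
N = 3

lazy_any = re.compile(".*?" * N)      # ".*?.*?.*?"
lazy_any_single = re.compile(".*?")
lazy_x = re.compile(".*?x" * N)       # ".*?x.*?x.*?x"
lazy_x_single = re.compile(".*?x")


def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = pattern.match(text)
    return None if m is None else m.group(0)


for text in ["", "x", "abcx", "axbxcx"]:
    print(
        repr(text),
        "| .*? x%d:" % N, repr(matched_text(lazy_any, text)),
        "| .*?:", repr(matched_text(lazy_any_single, text)),
        "| .*?x x%d:" % N, repr(matched_text(lazy_x, text)),
        "| .*?x:", repr(matched_text(lazy_x_single, text)),
    )
```

With three repetitions, "abcx" and "x" each contain only one "x", so the repeated ".*?x" pattern does not match them, whereas "axbxcx" is matched in full.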

EDIT: OK, I was wrong. It's not that smart. The reason "x"*50000 fails but ".*?x"*50000 doesn't is that there is a limit on the size of one "code item". "x"*50000 generates one long item, while ".*?x"*50000 generates many small items. If you could somehow split the string literal without changing the meaning of the regex, it would work, but I can't think of a way to do that.
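To see how a particular interpreter behaves, both patterns can be compiled inside a try/except and timed. This is only a probe, not an explanation of the engine's internals: the helper name try_compile is hypothetical, and the outcome depends on the Python version. The question was tested on Python 2.6 (Win32), and newer interpreters may well compile both patterns without error.

```python
import re
import time

REPEAT = 50000  # same repeat count as in the question


def try_compile(source):
    """Compile `source`; return ("compiled", seconds) or ("failed: <type>", seconds)."""
    start = time.time()
    try:
        re.compile(source)
        status = "compiled"
    except (OverflowError, re.error, RecursionError) as exc:
        # OverflowError is the error reported in the question; the other
        # two are caught only so the probe keeps running on any pattern.
        status = "failed: " + type(exc).__name__
    return status, time.time() - start


for label, source in [('"x"*50000', "x" * REPEAT),
                      ('".*?x"*50000', ".*?x" * REPEAT)]:
    status, seconds = try_compile(source)
    print("%-14s %-24s %.2fs" % (label, status, seconds))
```

The probe never runs a match with the large patterns, since matching a pattern with tens of thousands of lazy quantifiers can be very slow.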


10-19 02:59