Question
I'm working through Doug Hellman's "The Python Standard Library by Example" and came across this:
"1.3.2 Compiling Expressions: re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."
I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."
I'd greatly appreciate it if someone could explain, or point me to an explanation of, why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.
Answer
Hm. This is strange. My knowledge so far (gained, among other sources, from this question) suggested my initial answer:
Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.
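To make the distinction concrete, here is a minimal sketch of the two call styles (the pattern and test string are just placeholders):

```python
import re

# Module-level call: re looks the pattern string up in its internal
# cache on every call, compiling it only on a cache miss.
m1 = re.search(r"\w+", " jkdhf ")

# Explicit compilation: the compiled pattern object is reused directly,
# so no per-call cache lookup is needed.
pattern = re.compile(r"\w+")
m2 = pattern.search(" jkdhf ")

print(m1.group(), m2.group())  # both match "jkdhf"
```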
However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.
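The cache can actually be observed, though only through the re module's internals; a rough sketch (note that `_cache` is a private implementation detail, and newer CPython versions use a much larger limit and evict only the least recently used entry instead of clearing everything):

```python
import re

re.purge()  # re.purge() empties the internal cache (documented API)

# _cache is private; in the Python versions discussed here it held at
# most 100 entries and was cleared wholesale when full. Modern CPython
# raises the limit and drops only the least recently used entry.
re.search(r"\d+", "abc 123")  # compiled and cached on first use
print(len(re._cache))         # at least 1 entry now
```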
Second, in order to find out whether a regex has already been compiled, the interpreter needs to look the regex up in the cache every time, which takes a little extra time (though not much, since dictionary lookups are very fast).
So, if you explicitly compile your regexes, you avoid this extra lookup step.
I just did some testing (Python 3.3):
>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+"," jkdhf ")''')
106.47892003890324
So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?
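One way to probe this yourself is to time a single search per call with a fixed iteration count; a sketch along these lines (the absolute numbers depend entirely on the machine and Python version):

```python
import re
import timeit

number = 100_000

# Precompiled pattern: no per-call cache lookup.
t_compiled = timeit.timeit("pat.search(' jkdhf ')",
                           setup="import re; pat = re.compile(r'\\w+')",
                           number=number)

# Module-level function: a cache lookup on every call.
t_module = timeit.timeit("re.search(r'\\w+', ' jkdhf ')",
                         setup="import re",
                         number=number)

# The compiled version is typically (though not guaranteed to be)
# faster, since it skips the lookup in re's internal cache.
print(t_compiled, t_module)
```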
On the other hand, in Python 2.7, the difference is not as noticeable:
>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+"," jkdhf ")''')
18.26713670282241