Problem description
If I have a substring (or "subpattern") of another string or pattern in a regex alternation, like so:
r'abcd|bc'
what is the expected behaviour of re.compile(r'abcd|bc').findall('abcd bcd bc ab')?
Trying it out, I get (as expected)
['abcd', 'bc', 'bc']
So I thought re.compile(r'bc|abcd').findall('abcd bcd bc ab') might yield ['bc', 'bc', 'bc'], but instead it again returns
['abcd', 'bc', 'bc']
Can someone explain this? I was under the impression that findall would greedily return matches, but apparently it backtracks and tries to match alternative patterns that would yield longer tokens.
Recommended answer
No backtracking takes place at all. Your pattern matches two different kinds of strings; | means "or". The pattern is tried at each position of the input in turn, scanning from left to right.
So when the expression finds abcd at the start of your input, that text matches your pattern just fine: it fits the abcd part of the (bc or abcd) pattern you gave it.
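To see where each match starts, here is a minimal sketch that uses re.finditer instead of findall and records the start offset of every match for both orderings. The helper name match_positions is made up for this illustration.

```python
import re

text = 'abcd bcd bc ab'

def match_positions(pattern, string):
    """Return (start_offset, matched_text) for every non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, string)]

# The first match starts at offset 0 in both cases: 'bc' cannot match
# there (the character is 'a'), but 'abcd' can.
first_order = match_positions(r'abcd|bc', text)
second_order = match_positions(r'bc|abcd', text)

print(first_order)   # [(0, 'abcd'), (5, 'bc'), (9, 'bc')]
print(second_order)  # [(0, 'abcd'), (5, 'bc'), (9, 'bc')]
```

After abcd is consumed at offset 0, scanning resumes at offset 4, so the bc inside that first abcd is never reported on its own.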
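The ordering of the alternatives plays no role here: for this input, abcd|bc behaves the same as bc|abcd. The two alternatives begin with different characters, so they can never both match at the same starting position, and at the start of the input only abcd can match. abcd is not disregarded just because bc might match later on in the string. (Order only matters when two alternatives can match at the same starting position; Python's re then takes the first alternative that succeeds.)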
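The following sketch contrasts the pattern from the question with a case where the order of alternatives does change the result. The ab|abcd pair is not from the question; it is added here only to illustrate alternatives that can match at the same starting position.

```python
import re

text = 'abcd bcd bc ab'

# The alternatives start with different characters, so they never compete
# at the same position and swapping them changes nothing.
same_a = re.findall(r'abcd|bc', text)
same_b = re.findall(r'bc|abcd', text)
print(same_a == same_b)  # True

# Here one alternative is a prefix of the other, so both can match at
# offset 0. Python's re takes the first alternative that succeeds.
short_first = re.findall(r'ab|abcd', 'abcd')
long_first = re.findall(r'abcd|ab', 'abcd')
print(short_first)  # ['ab']
print(long_first)   # ['abcd']
```

When alternatives can overlap like this, listing the longer one first is a common way to make it win.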