Matching a repeated character in regex is simple with a backreference:
(.)\1
However, I would like to match the character after the pair of characters, so I thought I could simply put this in a lookbehind:
(?<=(.)\1).
Unfortunately, this doesn't match anything.
Why is that? In other flavours I wouldn't be surprised because there are strong restrictions on lookbehinds, but .NET usually supports arbitrarily complicated patterns inside lookbehinds.
The short version: Lookbehinds are matched from right to left. That means when the regex engine encounters the \1 it hasn't captured anything into that group yet, so the regex always fails. The solution is quite simple:
(?<=\1(.)).
Unfortunately, the full story once you start using more complex patterns is a lot more subtle. So here is...
A guide to reading regular expressions in .NET
First, some important acknowledgements. The person who taught me that lookbehinds are matched from right to left (and figured this out on his own through a lot of experimentation), was Kobi in this answer. Unfortunately, the question I asked back then was a very convoluted example which doesn't make for a great reference for such a simple problem. So we figured it would make sense to make a new and more canonical post for future reference and as a suitable dupe target. But please consider giving Kobi an upvote for figuring out a very important aspect of .NET's regex engine that is virtually undocumented (as far as I know, MSDN mentions it in a single sentence on a non-obvious page).
Note that rexegg.com explains the inner workings of .NET's lookbehinds differently (in terms of reversing the string, the regex, and any potential captures). Although that wouldn't make a difference to the result of the match, I find that approach much harder to reason about, and from looking at the code it's fairly clear that this is not what the implementation actually does.
So. The first question is, why is it actually more subtle than the bolded sentence above? Let's try matching a character that is preceded by either a or A using a local case-insensitive modifier. Given the right-to-left matching behaviour, one might expect this to work:
(?<=a(?i)).
However, this doesn't seem to use the modifier at all. Indeed, if we put the modifier in front:
(?<=(?i)a).
...it works.
Another example, that might be surprising with right-to-left matching in mind is the following:
(?<=\2(.)(.)).
Does the \2 refer to the left or right capturing group? It refers to the right one, as this example shows.
A final example: when matched against abc, does this capture b or ab?
(?<=(b|a.))c
It captures b. (You can see the captures on the "Table" tab.) Once again "lookbehinds are applied from right to left" isn't the full story.
Hence, this post tries to be a comprehensive reference on all things regarding directionality of regex in .NET, as I'm not aware of any such resource. The trick to reading a complicated regex in .NET is doing so in three or four passes. All but the last pass are left-to-right, regardless of lookbehinds or RegexOptions.RightToLeft. I believe this is the case, because .NET processes these when parsing and compiling the regex.
First pass: inline modifiers
This is basically what the above example shows. If anywhere in your regex, you had this snippet:
...a(b(?i)c)d...
Regardless of where in the pattern that is or whether you're using the RTL option, c will be case-insensitive while a, b and d will not (provided they aren't affected by some other preceding or global modifier). That is probably the simplest rule.
Second pass: group numbers [unnamed groups]
For this pass you should completely ignore any named groups in the pattern, i.e. those of the form (?<a>...). Note this does not include groups with explicit numbers like (?<2>...) (which are a thing in .NET).
Capturing groups are numbered from left to right. It doesn't matter how complicated your regex is, whether you're using the RTL option or whether you nest dozens of lookbehinds and lookaheads. When you're only using unnamed capturing groups, they are numbered from left to right depending on the position of their opening parenthesis. An example:
(a)(?<=(b)(?=(.)).((c).(d)))(e)
└1┘    └2┘   └3┘  │└5┘ └6┘│ └7┘
                  └───4───┘
This gets a bit trickier when mixing unlabelled groups with explicitly numbered groups. You should still read all of these from left to right, but the rules are a bit trickier. You can determine the number of a group as follows:
- If the group has an explicit number, its number is obviously that (and only that) number. Note that this may either add an additional capture to an already existing group number, or it may create a new group number. Also note that when you're giving explicit group numbers, they don't have to be consecutive. (?<1>.)(?<5>.) is a perfectly valid regex with group numbers 2 to 4 unused.
- If the group is unlabelled, it takes the first unused number. Due to the gaps I just mentioned, this may be smaller than the maximum number that has already been used.
Here is an example (without nesting, for simplicity; remember to order them by their opening parentheses when they are nested):
(a)(?<1>b)(?<2>c)(d)(e)(?<6>f)(g)(h)
└1┘└──1──┘└──2──┘└3┘└4┘└──6──┘└5┘└7┘
Notice how the explicit group 6 creates a gap, then the group capturing g takes that unused gap between groups 4 and 6, whereas the group capturing h takes 7 because 6 is already used. Remember that there might be named groups anywhere in between these, which we're completely ignoring for now.
If you're wondering what the purpose of repeated groups like group 1 in this example is, you might want to read about balancing groups.
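The two numbering rules above can be sketched in a few lines of Python. This only models the rules as stated in this post, not .NET's actual parser. The function name and the input encoding (None for an unlabelled group, an int for an explicitly numbered one, listed in order of their opening parentheses) are made up for illustration.

```python
def number_unnamed_groups(groups):
    """Assign group numbers for the second pass.

    `groups` lists the groups in order of their opening parenthesis:
    None for an unlabelled group, an int for an explicit number such as
    (?<6>...). Named groups are simply left out, as this pass ignores them.
    """
    used = set()
    numbers = []
    for explicit in groups:
        if explicit is not None:
            number = explicit          # explicit number: taken as-is, may repeat
        else:
            number = 1
            while number in used:      # unlabelled: first number not used so far
                number += 1
        used.add(number)
        numbers.append(number)
    return numbers


# (a)(?<1>b)(?<2>c)(d)(e)(?<6>f)(g)(h)
print(number_unnamed_groups([None, 1, 2, None, None, 6, None, None]))
# prints [1, 1, 2, 3, 4, 6, 5, 7]
```

The printed list agrees with the annotation under the example regex: g fills the gap at 5 and h moves on to 7.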
Third pass: group numbers [named groups]
Of course, you can skip this pass entirely if there are no named groups in the regex.
It's a little known feature that named groups also have (implicit) group numbers in .NET, which can be used in backreferences and substitution patterns for Regex.Replace. These get their numbers in a separate pass, once all the unnamed groups have been processed. The rules for giving them numbers are as follows:
- When a name appears for the first time, the group gets the first unused number. Again, this might be a gap in the used numbers if the regex uses explicit numbers, or it might be one greater than the greatest group number so far. This permanently associates this new number with the current name.
- Consequently, when a name appears again in the regex, the group will have the same number that was used for that name the last time.
A more complete example with all three types of groups, explicitly showing passes two and three:
        (?<a>.)(.)(.)(?<b>.)(?<a>.)(?<5>.)(.)(?<c>.)
Pass 2: │     │└1┘└2┘│     ││     │└──5──┘└3┘│     │
Pass 3: └──4──┘      └──6──┘└──4──┘          └──7──┘
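As a rough sketch, passes two and three can be combined into one small Python function. Again, this models only the rules described in this post. The function name and the encoding of groups (None for unlabelled, an int for an explicit number, a str for a name) are assumptions made for the example.

```python
from itertools import count


def number_all_groups(groups):
    """Return the group number of every group, in pattern order.

    `groups` lists groups by opening parenthesis: None = unlabelled,
    int = explicit number like (?<5>...), str = name like (?<a>...).
    """
    numbers = [None] * len(groups)
    used = set()

    # Pass 2: unlabelled and explicitly numbered groups; named ones are skipped.
    for i, group in enumerate(groups):
        if isinstance(group, str):
            continue
        if group is None:
            group = next(n for n in count(1) if n not in used)
        used.add(group)
        numbers[i] = group

    # Pass 3: each name gets the first unused number on its first appearance
    # and keeps that number for every later appearance.
    by_name = {}
    for i, group in enumerate(groups):
        if isinstance(group, str):
            if group not in by_name:
                by_name[group] = next(n for n in count(1) if n not in used)
                used.add(by_name[group])
            numbers[i] = by_name[group]
    return numbers


# (?<a>.)(.)(.)(?<b>.)(?<a>.)(?<5>.)(.)(?<c>.)
print(number_all_groups(["a", None, None, "b", "a", 5, None, "c"]))
# prints [4, 1, 2, 6, 4, 5, 3, 7]
```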
Final pass: following the regex engine
Now that we know which modifiers apply to which tokens and which groups have which numbers, we finally get to the part that actually corresponds to the execution of the regex engine, and where we start going back and forth.
.NET's regex engine can process regex and string in two directions: the usual left-to-right mode (LTR) and its unique right-to-left mode (RTL). You can activate RTL mode for the entire regex with RegexOptions.RightToLeft. In that case, the engine will start trying to find a match at the end of the string and will go left through the regex and the string. For example, the simple regex
a.*b
would match a b, then it would try to match .* to the left of that (backtracking as necessary) such that there's an a somewhere to the left of it. Of course, in this simple example, the result between LTR and RTL mode is identical, but it helps to make a conscious effort to follow the engine in its backtracking. It can make a difference for something as simple as ungreedy modifiers. Consider the regex
a.*?b
instead. We're trying to match axxbxxb. In LTR mode, you get the match axxb as expected, because the ungreedy quantifier is satisfied with the xx. However, in RTL mode, you'd actually match the entire string, since the first b is found at the end of the string, but then .*? needs to match all of xxbxx for a to match.
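Python's re module has no right-to-left option, so the following is only an emulation under a stated assumption: for a pattern this simple (single-character tokens, no lookarounds or backreferences), matching a hand-reversed pattern against the reversed string gives the same result as an RTL match. The helper name rtl_search is made up for this sketch.

```python
import re


def rtl_search(reversed_pattern, text):
    """Emulate an RTL search: run a hand-reversed pattern on the reversed text."""
    match = re.search(reversed_pattern, text[::-1])
    return match.group(0)[::-1] if match else None


text = "axxbxxb"

# LTR: the lazy quantifier stops at the first b it can reach.
print(re.search(r"a.*?b", text).group(0))   # prints axxb

# RTL (emulated): a.*?b read from the right becomes b.*?a on the reversed
# string. Matching starts at the final b and has to expand up to the a.
print(rtl_search(r"b.*?a", text))           # prints axxbxxb
```

This reversal trick is not how .NET implements RTL matching. It is used here only because it reproduces the outcome for this particular pattern.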
And clearly it also makes a difference for backreferences, as the example in the question and at the top of this answer shows. In LTR mode we use (.)\1 to match repeated characters and in RTL mode we use \1(.), since we need to make sure that the regex engine encounters the capture before it tries to reference it.
With that in mind, we can view lookarounds in a new light. When the regex engine encounters a lookbehind, it processes it as follows:
- It remembers its current position x in the target string as well as its current processing direction.
- Now it enforces RTL mode, regardless of the mode it's currently in.
- Then the contents of the lookbehind are matched from right to left, starting from the current position x.
- Once the lookbehind is processed completely, if it passed, the position of the regex engine resets to position x and the original processing direction is restored.
While a lookahead seems a lot more innocuous (since we almost never encounter problems like the one in the question with them), its behaviour is actually virtually the same, except that it enforces LTR mode. Of course in most patterns which are LTR only, this is never noticed. But if the regex itself is matched in RTL mode, or we're doing something as crazy as putting a lookahead inside a lookbehind, then the lookahead will change the processing direction just like the lookbehind does.
So how should you actually read a regex that does funny stuff like this? The first step is to split it into separate components, which are usually individual tokens together with their relevant quantifiers. Then depending on whether the regex is LTR or RTL, start going from top to bottom or bottom to top, respectively. Whenever you encounter a lookaround in the process, check which way it's facing, skip to the correct end and read the lookaround from there. When you're done with the lookaround, continue with the surrounding pattern.
Of course there's another catch... when you encounter an alternation (..|..|..), the alternatives are always tried from left to right, even during RTL matching. Of course, within each alternative, the engine proceeds from right to left.
Here is a somewhat contrived example to show this:
.+(?=.(?<=a.+).).(?<=.(?<=b.|c.)..(?=d.|.+(?<=ab*?))).
And here is how we can split this up. The numbers on the left show the reading order if the regex is in LTR mode. The numbers on the right show the reading order in RTL mode:
LTR             RTL

 1  .+          18
    (?=
 2    .         14
      (?<=
 4      a       16
 3      .+      15
      )
 5    .         17
    )
 6  .           13
    (?<=
17    .         12
      (?<=
14      b        9
13      .        8
      |
16      c       11
15      .       10
      )
12    ..         7
      (?=
 7      d        2
 8      .        3
      |
 9      .+       4
        (?<=
11        a      6
10        b*?    5
        )
      )
    )
18  .            1
I sincerely hope that you'll never use something as crazy as this in production code, but maybe one day a friendly colleague will leave some crazy write-only regex in your company's code base before being fired, and on that day I hope that this guide might help you figure out what the hell is going on.
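The reading procedure can also be written down as a short recursive walk. The Python sketch below applies only the rules from this section (sequences follow the current direction, alternatives go left to right, lookaheads force LTR, lookbehinds force RTL) to a hand-built tree of the contrived regex. The tree encoding and all names are invented for this example, tokens are identified by their top-to-bottom position in the split (1 to 18), and backtracking is ignored.

```python
def seq(*parts): return ("seq", list(parts))
def alt(*parts): return ("alt", list(parts))
def ahead(body): return ("ahead", body)
def behind(body): return ("behind", body)


def visit(node, rtl, order):
    """Append token ids to `order` in the sequence they would be read."""
    if isinstance(node, int):              # plain token
        order.append(node)
        return
    kind, body = node
    if kind == "seq":                      # concatenation follows the direction
        for child in (reversed(body) if rtl else body):
            visit(child, rtl, order)
    elif kind == "alt":                    # alternatives: always left to right
        for child in body:
            visit(child, rtl, order)
    elif kind == "ahead":                  # lookahead forces LTR inside
        visit(body, False, order)
    elif kind == "behind":                 # lookbehind forces RTL inside
        visit(body, True, order)


# .+(?=.(?<=a.+).).(?<=.(?<=b.|c.)..(?=d.|.+(?<=ab*?))).
PATTERN = seq(
    1,
    ahead(seq(2, behind(seq(3, 4)), 5)),
    6,
    behind(seq(
        7,
        behind(alt(seq(8, 9), seq(10, 11))),
        12,
        ahead(alt(seq(13, 14), seq(15, behind(seq(16, 17))))),
    )),
    18,
)


def reading_order(rtl):
    order = []
    visit(PATTERN, rtl, order)
    return order


print(reading_order(rtl=False))
# prints [1, 2, 4, 3, 5, 6, 13, 14, 15, 17, 16, 12, 9, 8, 11, 10, 7, 18]
print(reading_order(rtl=True))
# prints [18, 13, 14, 15, 17, 16, 12, 9, 8, 11, 10, 7, 6, 2, 4, 3, 5, 1]
```

Each list gives the tokens in the order they are read. The position of a token in the list is its reading number for that mode.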
Advanced section: balancing groups
For the sake of completeness, this section explains how balancing groups are affected by the directionality of the regex engine. If you don't know what balancing groups are, you can safely ignore this. If you want to know what balancing groups are, I've written about it here, and this section assumes that you know at least that much about them.
There are three types of group syntax that are relevant for balancing groups.
- Explicitly named or numbered groups like (?<a>...) or (?<2>...) (or even implicitly numbered groups), which we've dealt with above.
- Groups that pop from one of the capture stacks like (?<-a>...) and (?<-2>...). These behave as you'd expect them to. When they're encountered (in the correct processing order described above), they simply pop from the corresponding capture stack. It might be worth noting that these don't get implicit group numbers.
- The "proper" balancing groups (?<b-a>...) which are usually used to capture the string since the last of b. Their behaviour gets weird when mixed with right-to-left mode, and that's what this section is about.
The takeaway is, the (?<b-a>...) feature is effectively unusable with right-to-left mode. However, after a lot of experimentation, the (weird) behaviour actually appears to follow some rules, which I'm outlining here.
First, let's look at an example which shows why lookarounds complicate the situation. We're matching the string abcde...vwxyz. Consider the following regex:
(?<a>fgh).{8}(?<=(?<b-a>.{3}).{2})
Reading the regex in the order I presented above, we can see that:
- The regex captures fgh into group a.
- The engine then moves 8 characters to the right.
- The lookbehind switches to RTL mode.
- .{2} moves two characters to the left.
- Finally, (?<b-a>.{3}) is the balancing group which pops the capture off group a and pushes something onto group b. In this case, the group matches lmn and we push ijk onto group b as expected.
However, it should be clear from this example that by changing the numerical parameters, we can change the relative position of the substrings matched by the two groups. We can even make those substrings intersect, or have one contained completely inside the other by making the 3 smaller or larger. In this case it's no longer clear what it means to push everything between the two matched substrings.
It turns out that there are three cases to distinguish.
Case 1: (?<a>...) matches left of (?<b-a>...)
This is the normal case. The top capture is popped from a and everything between the substrings matched by the two groups is pushed onto b. Consider the following two substrings for the two groups:
abcdefghijklmnopqrstuvwxyz
   └──<a>──┘  └──<b-a>──┘
Which you might get with the regex
(?<a>d.{8}).+$(?<=(?<b-a>.{11}).)
Then mn would be pushed onto b.
Case 2: (?<a>...) and (?<b-a>...) intersect
This includes the case where the two substrings touch, but don't contain any common characters (only a common boundary between characters). This can happen if one of the groups is inside a lookaround and the other one is not or is inside a different lookaround. In this case the intersection of both substrings will be pushed onto b. This is still true when one substring is completely contained inside the other.
Here are several examples to show this:
Example:                    Pushes onto <b>:  Possible regex:

abcdefghijklmnopqrstuvwxyz  ""                (?<a>d.{8}).+$(?<=(?<b-a>.{11})...)
   └──<a>──┘└──<b-a>──┘

abcdefghijklmnopqrstuvwxyz  "jkl"             (?<a>d.{8}).+$(?<=(?<b-a>.{11}).{6})
   └──<a>┼─┘       │
         └──<b-a>──┘

abcdefghijklmnopqrstuvwxyz  "klmnopq"         (?<a>k.{8})(?<=(?<b-a>.{11})..)
      │   └──<a>┼─┘
      └──<b-a>──┘

abcdefghijklmnopqrstuvwxyz  ""                (?<=(?<b-a>.{7})(?<a>.{4}o))
   └<b-a>┘└<a>┘

abcdefghijklmnopqrstuvwxyz  "fghijklmn"       (?<a>d.{12})(?<=(?<b-a>.{9})..)
   └─┼──<a>──┼─┘
     └─<b-a>─┘

abcdefghijklmnopqrstuvwxyz  "cdefg"           (?<a>c.{4})..(?<=(?<b-a>.{9}))
│ └<a>┘ │
└─<b-a>─┘
Case 3: (?<a>...) matches right of (?<b-a>...)
This case I don't really understand and would consider a bug: when the substring matched by (?<b-a>...) is properly left of the substring matched by (?<a>...) (with at least one character between them, such that they don't share a common boundary), nothing is pushed onto b. By that I really mean nothing, not even an empty string: the capture stack itself remains empty. However, matching the group still succeeds, and the corresponding capture is popped off the a group.
What's particularly annoying about this is that this case would likely be a lot more common than case 2, since this is what happens if you try to use balancing groups the way they were meant to be used, but in a plain right-to-left regex.
Update on case 3: After some more testing done by Kobi it turns out that something happens on stack b. It appears that nothing is pushed, because m.Groups["b"].Success will be False and m.Groups["b"].Captures.Count will be 0. However, within the regex, the conditional (?(b)true|false) will now use the true branch. Also in .NET it seems to be possible to do (?<-b>) afterwards (after which accessing m.Groups["b"] will throw an exception), whereas Mono throws an exception immediately while matching the regex. Bug indeed.
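To summarise the three cases, here is a small Python model of what ends up on b. It encodes only the behaviour observed in this section, works on character offsets rather than running any regex, and ignores the odd internal state described in the update on case 3. The function name and the (start, end) span convention with an exclusive end are assumptions made for this sketch.

```python
import string


def pushed_onto_b(text, a_span, ba_span):
    """Model of what (?<b-a>...) pushes onto b.

    a_span is the capture popped from group a, ba_span is what the
    balancing group itself matched; both are (start, end) offsets with
    end exclusive. Returns the pushed string, or None if nothing is pushed.
    """
    a_start, a_end = a_span
    ba_start, ba_end = ba_span
    if a_end < ba_start:          # case 1: a lies left of b-a
        return text[a_end:ba_start]
    if ba_end < a_start:          # case 3: a lies right of b-a
        return None
    # case 2: the spans touch or overlap, so their intersection is pushed
    return text[max(a_start, ba_start):min(a_end, ba_end)]


alphabet = string.ascii_lowercase

print(pushed_onto_b(alphabet, (3, 12), (14, 25)))   # case 1, prints mn
print(pushed_onto_b(alphabet, (3, 12), (9, 20)))    # case 2, prints jkl
print(pushed_onto_b(alphabet, (10, 15), (2, 6)))    # case 3, prints None
```

The offsets in the first two calls correspond to the d.{8} examples above, where a spans defghijkl. The third call uses made-up offsets purely to trigger case 3.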