Question
Common sense and a sanity check using gregexpr() indicate that the lookbehind and lookahead assertions below should each match at exactly one location in testString:
testString <- "text XX text"
BB <- "(?<= XX )"
FF <- "(?= XX )"
as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
# [1] 9
as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
# [1] 5
strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations (the second of which seems incorrect) when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE)
# [[1]]
# [1] "text XX " "text"
strsplit(testString, FF, perl=TRUE)
# [[1]]
# [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
Recommended answer
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:
The algorithm applied to each input string is
    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
The problem is that lookahead (and lookbehind) assertions are zero-length. So, for example, in this case:
FF <- "(?=funky)"
testString <- "take me to funky town"
gregexpr(FF,testString,perl=TRUE)
# [[1]]
# [1] 12
# attr(,"match.length")
# [1] 0
# attr(,"useBytes")
# [1] TRUE
strsplit(testString,FF,perl=TRUE)
# [[1]]
# [1] "take me to " "f" "unky town"
What happens is that the lone lookahead (?=funky) matches at position 12. So the first split consists of the string up to position 11 (left of the match), and it is removed from the string together with the match, which, however, has zero length.
Now the remaining string is funky town, and the lookahead matches at position 1. However, there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length. Taken literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when strsplitting with an empty regex (when argument split=""). After this, the remaining string is unky town, which is returned as the last split since there is no match.
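The loop described above can be sketched in plain R. This is only a rough emulation, not R's actual implementation: split_sketch is a hypothetical helper, and the one-character fallback is an assumption inferred from the outputs shown in this post.

```r
# Hypothetical helper: emulates the loop from ?strsplit, plus the assumed
# one-character fallback for an empty match at the start of the string.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)              # 1-based match position, -1 if none
    len <- attr(m, "match.length")      # 0 for a bare lookaround assertion
    if (start == -1L) {
      out <- c(out, x)                  # no match: emit the rest and stop
      break
    }
    end <- start + len - 1L             # last character consumed by this step
    if (end > 0L) {
      out <- c(out, substr(x, 1L, start - 1L))  # text to the left of the match
      x <- substr(x, end + 1L, nchar(x))        # drop that text and the match
    } else {
      out <- c(out, substr(x, 1L, 1L))  # nothing to consume: peel one character
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_sketch("take me to funky town", "(?=funky)")
# [1] "take me to " "f"           "unky town"
split_sketch("text XX text", "(?<= XX )")
# [1] "text XX " "text"
```

The lookahead case hits the fallback branch once (yielding the stray "f"), while the lookbehind case never does, because the text to the left of its match is never empty.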
Lookbehinds are no problem, because the text to the left of each match is split off and removed from the remaining string, so the algorithm is never stuck.
Admittedly, this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
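As for (Q2), one possible workaround that follows from this explanation is to prefix the lookahead with a lookbehind for any single character, so that the pattern cannot match at the very start of the remaining string. This is a sketch under the assumption that the goal is to split immediately before " XX "; the name FF2 is introduced here for illustration only.

```r
testString <- "text XX text"

# (?<=.) requires one character before the split point, so after the first
# split the pattern no longer matches at position 1 of the remaining string.
FF2 <- "(?<=.)(?= XX )"

pieces <- strsplit(testString, FF2, perl = TRUE)[[1]]
print(pieces)
# [1] "text"     " XX text"
```

This relies on strsplit() restarting the search on the remaining string, as described above, so it may be worth re-checking the result on the R version in use.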