regex - strsplit inconsistent with gregexpr

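A comment on my answer to this question points out that the regex, which should give the desired result with strsplit, does not, even though it appears to correctly match only the first and last commas in the character vector. This can be demonstrated using gregexpr and regmatches.

So why does strsplit split on every comma in this example, even though regmatches returns only two matches for the same regex?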

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90"


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

??! What is going on?

Best answer

@aprillion's theory is exactly right. From the R documentation for strsplit:

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

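In other words, at each iteration ^ matches the beginning of the new, shortened string (with the previously emitted pieces removed). In the question's example, the first alternative of the pattern (^\\w+\\K,) therefore matches the first comma of whatever remains on every pass, so the string ends up split on every comma.

To illustrate this behavior simply: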

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
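
The loop quoted from the documentation can be sketched in plain R to make the effect visible. This is only an illustration, not how strsplit is implemented internally: split_like_strsplit is a hypothetical helper name, it handles a single string, and it assumes the pattern never produces a zero-length match (it stops if one occurs).

```r
# Hypothetical helper: mimics the documented strsplit loop for one string.
# Assumes every match has length >= 1; zero-length matches are not handled.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  repeat {
    # if the string is empty: break
    if (!nzchar(s)) break
    # search the *remaining* string, so ^ anchors at its start
    m <- regexpr(pattern, s, perl = TRUE)
    if (m[1] > 0) {
      len <- attr(m, "match.length")
      if (len == 0L) stop("zero-length matches are not covered by this sketch")
      # add the string to the left of the match to the output
      out <- c(out, substr(s, 1, m[1] - 1))
      # remove the match and everything to the left of it
      s <- substr(s, m[1] + len, nchar(s))
    } else {
      # no match: add what is left and stop
      out <- c(out, s)
      break
    }
  }
  out
}

pat <- '^\\w+\\K,|,(?=\\w+$)'
res_commas <- split_like_strsplit("123,34,56,78,90", pat)
res_anchor <- split_like_strsplit("12345", "^.")
print(res_commas)
print(res_anchor)
```

Because the search restarts on the shortened string each time, both calls should reproduce what strsplit returns for the same inputs: a split on every comma, and five empty strings.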

Here, you can see a consequence of this behavior when a lookahead assertion is used as the delimiter (thanks to @JoshO'Brien for the link).
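
If the goal is still to split only on the first and last comma, one possible workaround (not part of the original answer) is to skip strsplit and reuse the match positions that gregexpr already finds correctly on the full string. The sketch below relies on the invert argument of regmatches, which returns the pieces between the matches; the variable names are illustrative.

```r
x <- "123,34,56,78,90"

# gregexpr searches the whole string once, so ^ and $ anchor as intended
m <- gregexpr('^\\w+\\K,|,(?=\\w+$)', x, perl = TRUE)

# invert = TRUE returns the substrings between the matches
# instead of the matches themselves
pieces <- regmatches(x, m, invert = TRUE)[[1]]
print(pieces)
# [1] "123"      "34,56,78" "90"
```

This works here because the pattern is evaluated against the original string rather than against a progressively shortened copy.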