r - Regex to replace parts/groups of a string in R

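I am trying to post-process the LaTeX (pdf_book output) of a bookdown document to collapse biblatex citations, so that they can later be sorted chronologically using \usepackage[sortcites]{biblatex}. To do that, I need to find every }{ that follows \\autocites and replace it with a comma. I am experimenting with gsub() but cannot find the right incantation.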

# example input
testcase <- "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

# desired output
"text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

A naive approach is to replace every }{:

> gsub('\\}\\{', ',', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep,separate}"

But this also collapses {keep}{separate}.

I then tried to replace }{ only inside a 'word' (a string without whitespace) that starts with \\autocites, using separate groups, but that failed as well:

> gsub('(\\\\autocites)([^ \f\n\r\t\v}{}]+)((\\}\\{})+)', '\\1\\2\\3', testcase, perl=TRUE)
[1] "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}"

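Addendum:
The actual document contains many more lines/elements than the test case above. Not all elements contain \\autocites, and in rare cases a single element contains more than one \\autocites. I did not originally think this was relevant. A more realistic test case: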

testcase2 <- c("some text",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
"text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}")

Best answer

A single gsub call is enough:

gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase, perl=TRUE)
## => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

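Here, (?:\G(?!^)|\\autocites) matches either the end of the previous successful match (\G, with (?!^) excluding the start of the string) or the literal string \autocites. \S*? then matches zero or more non-whitespace characters, as few as possible. \K discards the text matched so far from the match buffer, so only the }{ substring that follows is consumed and replaced with a comma. Because \S cannot cross whitespace, the chain of \G matches ends at the first space after the citation, which is why {keep}{separate} is left alone.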
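To check the addendum's case (elements without \autocites, and elements with more than one), the same pattern can be run over a character vector. This is a small sketch that repeats testcase2 from the question so it runs on its own; the variable name collapsed is only illustrative.

```r
# Input copied from the question's testcase2 so the snippet is self-contained
testcase2 <- c(
  "some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate}",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}{runkleGap1990} text {keep}{separate} \\autocites[cf.~][]{foxMapping2000}{wattPattern1947}"
)

# gsub is vectorised, so the same call handles every element
collapsed <- gsub("(?:\\G(?!^)|\\\\autocites)\\S*?\\K}{", ",", testcase2, perl = TRUE)

# writeLines shows the strings without R's escaped backslashes
writeLines(collapsed)
```

Elements without \autocites should come back unchanged, and each \autocites run in the third element is collapsed independently, since every run starts a fresh \G chain.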

There is also a very readable solution that combines one regex with one fixed-text replacement, using stringr::str_replace_all:

library(stringr)
str_replace_all(testcase, "\\\\autocites\\S+", function(x) gsub("}{", ",", x, fixed=TRUE))
# => [1] "text \\autocites[cf.~][]{foxMapping2000,wattPattern1947,runkleGap1990} text {keep}{separate}"

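Here, \\autocites\S+ matches \autocites followed by one or more non-whitespace characters, and gsub("}{", ",", x, fixed=TRUE) replaces each }{ inside the matched text with a comma (fixed-string replacement is very fast).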
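If adding a stringr dependency is not desirable, the same match-then-fixed-replace idea can probably be expressed in base R with gregexpr() and `regmatches<-`. This is a sketch rather than part of the original answer; the helper name collapse_autocites is hypothetical.

```r
# Hypothetical helper: collapse }{ into , only inside \autocites... runs
collapse_autocites <- function(x) {
  # Locate every \autocites followed by non-whitespace characters
  m <- gregexpr("\\\\autocites\\S+", x, perl = TRUE)
  # Rewrite each matched run with a fixed-string replacement
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(s) gsub("}{", ",", s, fixed = TRUE)
  )
  x
}

# Inputs modelled on the question's test cases
input <- c(
  "some text",
  "text \\autocites[cf.~][]{foxMapping2000}{wattPattern1947} text {keep}{separate}"
)
result <- collapse_autocites(input)
writeLines(result)
```

Elements with no match yield an empty match list and are returned unchanged, so the helper can be applied to the whole document vector at once.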