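I'd like to be able to use grepl() and gsub() only outside a given set of delimiters. For example, I'd like to be able to ignore text between quotes.

Here is my desired output: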

grepl2("banana", "'banana' banana \"banana\"", escaped =c('""', "''"))
#> [1] TRUE
grepl2("banana", "'banana' apple \"banana\"", escaped =c('""', "''"))
#> [1] FALSE
grepl2("banana", "{banana} banana {banana}", escaped = "{}")
#> [1] TRUE
grepl2("banana", "{banana} apple {banana}", escaped = "{}")
#> [1] FALSE

gsub2("banana", "potatoe", "'banana' banana \"banana\"")
#> [1] "'banana' potatoe \"banana\""
gsub2("banana", "potatoe", "'banana' apple \"banana\"")
#> [1] "'banana' apple \"banana\""
gsub2("banana", "potatoe", "{banana} banana {banana}", escaped = "{}")
#> [1] "{banana} potatoe {banana}"
gsub2("banana", "potatoe", "{banana} apple {banana}", escaped = "{}")
#> [1] "{banana} apple {banana}"


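In real cases the quoted substrings may appear in any number and in any order.

I've written the following functions, which work for these cases, but they're clunky. gsub2() in particular isn't robust at all: it temporarily replaces the delimited content with placeholders, and those placeholders can be affected by the subsequent operations.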

regex_escape <-
function(string,n = 1) {
  for(i in seq_len(n)){
    string <- gsub("([][{}().+*^$|\\?])", "\\\\\\1", string)
  }
  string
}

grepl2 <-
  function(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE,
           useBytes = FALSE, escaped =c('""', "''")){
    escaped <- strsplit(escaped,"")
    # TODO check that "escaped" delimiters are balanced and don't cross each other
    for(i in seq_along(escaped)){
      close <- regex_escape(escaped[[i]][[2]])
      open <- regex_escape(escaped[[i]][[1]])
      pattern_i <- sprintf("%s.*?%s", open, close)
      x <- gsub(pattern_i,"",x)
    }
    grepl(pattern, x, ignore.case, perl, fixed, useBytes)
  }

gsub2 <- function(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
                   fixed = FALSE, useBytes = FALSE, escaped =c('""', "''")){
  escaped <- strsplit(escaped,"")
  # TODO check that "escaped" delimiters are balanced and don't cross each other
  matches <- character()
  for(i in seq_along(escaped)){
    close <- regex_escape(escaped[[i]][[2]])
    open <- regex_escape(escaped[[i]][[1]])
    pattern_i <- sprintf("%s.*?%s", open, close)
    ind <- gregexpr(pattern_i,x)
    matches_i <- regmatches(x, ind)[[1]]
    regmatches(x, ind)[[1]] <- paste0("((",length(matches) + seq_along(matches_i),"))")
    matches <- c(matches, matches_i)
  }
  x <- gsub(pattern, replacement, x, ignore.case, perl, fixed, useBytes)
  for(i in seq_along(matches)){
    pattern <- sprintf("\\(\\(%s\\)\\)", i)
    x <- gsub(pattern, matches[[i]], x)
  }
  x
}
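
To make the robustness concern concrete, here is a minimal sketch of the placeholder idea in isolation. It is not the gsub2() function above, only the same three steps on a single string; the variable names (`saved`, `after_protect`, `after_restore`) are illustrative.

```r
# Minimal sketch of the placeholder approach (base R only).
x <- "'one' 1"

# Step 1: swap each quoted chunk for a numbered placeholder.
m <- gregexpr("'.*?'", x)
saved <- regmatches(x, m)[[1]]
regmatches(x, m)[[1]] <- paste0("((", seq_along(saved), "))")
after_protect <- x

# Step 2: the real replacement also rewrites the digit inside the placeholder.
x <- gsub("1", "2", x)

# Step 3: restoring searches for "((1))", which no longer exists,
# so the quoted chunk is never put back.
x <- gsub("\\(\\(1\\)\\)", saved[[1]], x)
after_restore <- x

after_protect
#> [1] "((1)) 1"
after_restore
#> [1] "((2)) 2"
```

Any pattern that can match part of the placeholder text triggers this, which is why a solution without placeholders is preferable.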


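Is there a solution that uses a regular expression and no placeholders? Note that my current functions support multiple pairs of delimiters, but I'd be satisfied with a solution that supports only one pair and that, for instance, won't try to match substrings between single quotes.

It would also be fine to impose different delimiters, for example { and }, instead of two " or two ', if that helps.

I'm also fine with imposing perl = TRUE.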

Best answer

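You can use the start_escape and end_escape arguments to supply the left-hand and right-hand sides of a matched pair of delimiters (such as {}), so that a delimiter is not matched in the wrong position (for example } treated as a left-hand delimiter).

perl = TRUE enables lookaround assertions. They test whether the statement inside them holds without making it part of the match. This post covers them well.

With perl = FALSE you will get an error, because R's default regex engine, TRE, does not support them.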
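
As a small illustration of what the lookarounds do, the sketch below applies a hard-coded lookbehind/lookahead pair for { and } directly with gsub(). The variable names are illustrative. It also shows a caveat that follows from how lookarounds work: only the characters directly adjacent to the match are tested, so a match deeper inside the delimiters is still replaced.

```r
x <- "{banana} banana"

# Lookbehind and lookahead test the neighbouring characters without
# consuming them, so only the match itself is replaced.
adjacent <- gsub("(?<![{])banana(?![}])", "apple", x, perl = TRUE)
adjacent
#> [1] "{banana} apple"

# Caveat: only the characters right next to the match are inspected,
# so a match that sits deeper inside the braces is still replaced.
inside <- gsub("(?<![{])banana(?![}])", "apple", "{a banana b}", perl = TRUE)
inside
#> [1] "{a apple b}"

# With the default engine the lookbehind is not supported; tryCatch()
# converts the resulting error into a value so the script keeps running.
default_engine <- tryCatch(
  gsub("(?<![{])banana", "apple", x),
  error = function(e) "error"
)
```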

gsub3 <- function(pattern, replacement, x, escape = NULL, start_escape = NULL, end_escape = NULL) {
  # Default to no assertion, so the call also works when no delimiter is given
  left_escape <- ""
  right_escape <- ""
  if (!is.null(escape) || !is.null(start_escape)) {
    # Negative lookbehind: the match must not directly follow a delimiter
    left_escape <- paste0("(?<![", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "])")
  }
  if (!is.null(escape) || !is.null(end_escape)) {
    # Negative lookahead: the match must not be directly followed by a delimiter
    right_escape <- paste0("(?![", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "])")
  }
  patt <- paste0(left_escape, "(", pattern, ")", right_escape)
  gsub(patt, replacement, x, perl = TRUE)
}
gsub3("banana", "potatoe", "'banana' banana \"banana\"", escape = "'\"")
#> [1] "'banana' potatoe \"banana\""
gsub3("banana", "potatoe", "'banana' apple \"banana\"", escape = '"\'')
#> [1] "'banana' apple \"banana\""
gsub3("banana", "potatoe", "{banana} banana {banana}", escape = "{}")
#> [1] "{banana} potatoe {banana}"
gsub3("banana", "potatoe", "{banana} apple {banana}", escape = "{}")
#> [1] "{banana} apple {banana}"


Below is grepl3. Note that it does not need perl = TRUE, because we don't care what the pattern captures, only whether it matches.
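
The idea can be tried out with a hard-coded pattern first. This is only a sketch with { and } fixed as delimiters and illustrative variable names. A negated bracket expression works in the default engine, but unlike a lookaround it has to consume one character on each side, so offering the string anchors as alternatives is one way to also catch a hit at the very start or end of the string.

```r
x <- "{banana} banana {banana}"

# A negated bracket expression needs no lookaround, so perl = TRUE is optional.
outside <- grepl("[^{}]banana[^{}]", x)
outside
#> [1] TRUE

# The bracket expression must consume a character on each side, so a hit at
# the start or end of the string is missed ...
plain <- grepl("[^{}]banana[^{}]", "banana")
plain
#> [1] FALSE

# ... unless "^" and "$" are accepted as alternatives.
anchored <- grepl("(^|[^{}])banana($|[^{}])", "banana")
anchored
#> [1] TRUE
```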

grepl3 <- function(pattern, x, escape = "'", start_escape = NULL, end_escape = NULL) {
  left <- ""
  right <- ""
  if (!is.null(escape) || !is.null(start_escape)) {
    # Start of string, or any character that is not a delimiter
    left <- paste0("(^|[^", paste0(escape, paste0(start_escape, collapse = ""), collapse = ""), "])")
  }
  if (!is.null(escape) || !is.null(end_escape)) {
    # End of string, or any character that is not a delimiter
    right <- paste0("($|[^", paste0(escape, paste0(end_escape, collapse = ""), collapse = ""), "])")
  }
  patt <- paste0(left, "(", pattern, ")", right)
  grepl(patt, x)
}

grepl3("banana", "'banana' banana \"banana\"", escape =c('"', "'"))
#> [1] TRUE
grepl3("banana", "'banana' apple \"banana\"", escape =c('""', "''"))
#> [1] FALSE
grepl3("banana", "{banana} banana {banana}", escape = "{}")
#> [1] TRUE
grepl3("banana", "{banana} apple {banana}", escape = "{}")
#> [1] FALSE


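Edit:

As long as you can work with a single set of paired operators, the gsub case can be solved without running into the problem Andrew mentioned. I think you could modify it to allow multiple delimiters. Thanks for the fascinating question, it led me to a new gem in regmatches!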
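
The function below relies on a recursive PCRE pattern. As a standalone sketch, with { and } hard-coded and the illustrative name `balanced`, this is how such a pattern picks up a whole balanced block, including nested delimiters:

```r
# "{", then any run of non-brace characters or nested balanced groups
# (via the recursion (?R)), then "}". Requires perl = TRUE.
balanced <- "\\{(?>[^{}]|(?R))*\\}"

x <- "{outer {inner banana} banana} banana"
hits <- regmatches(x, gregexpr(balanced, x, perl = TRUE))[[1]]
hits
#> [1] "{outer {inner banana} banana}"
```

The trailing banana is outside every block, so it is the only occurrence that a replacement restricted to the unmatched text would touch.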

gsub4 <-
  function(pattern,
           replacement,
           x,
           left_escape = "{",
           right_escape = "}") {
    # Recursive PCRE pattern that matches one balanced delimited block
    delimited <- paste0(
      "\\Q", # \Q starts a literal run: the delimiter is matched as fixed text
      left_escape,
      "\\E(?>[^", # \E ends the literal run
      left_escape,
      right_escape,
      "]|(?R))*", # (?R) recurses, which allows nested delimiters
      "\\Q",
      right_escape,
      "\\E",
      collapse = ""
    )
    # With invert = NA, `regmatches()` returns the unmatched and the matched
    # substrings alternately: odd indices lie outside the delimiters (these get
    # the replacement), even indices are the delimited blocks (left untouched).
    string_pieces <-
      regmatches(x, gregexpr(delimited, x, perl = TRUE), invert = NA)
    for (k in seq_along(string_pieces)) {
      n_pieces <- length(string_pieces[[k]])
      # The result always starts and ends with an unmatched piece, padded
      # with "" as needed, so the odd positions are the ones to edit.
      to_replace <- seq(from = 1, to = n_pieces, by = 2)
      string_pieces[[k]][to_replace] <-
        gsub(pattern, replacement, string_pieces[[k]][to_replace])
    }
    sapply(string_pieces, paste0, collapse = "")
  }
gsub4('banana', 'apples', "{banana's} potatoes {banana} banana", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes {banana} apples"
gsub4('banana', 'apples',  "banana's potatoes", left_escape = "{", right_escape = "}")
#> [1] "apples's potatoes"
gsub4('banana', 'apples', "{banana's} potatoes", left_escape = "{", right_escape = "}")
#> [1] "{banana's} potatoes"
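
The alternating structure that gsub4 depends on can be inspected directly. The sketch below uses a simpler, non-recursive pattern for { and } (an assumption made for brevity, so it does not handle nesting) and illustrative variable names:

```r
x <- "{banana's} potatoes {banana} banana"
m <- gregexpr("\\{[^{}]*\\}", x, perl = TRUE)

# invert = NA returns unmatched and matched pieces alternately, always
# starting and ending with an unmatched piece ("" if there is none).
pieces <- regmatches(x, m, invert = NA)[[1]]
before <- pieces
before
#> [1] ""           "{banana's}" " potatoes " "{banana}"   " banana"

# Odd positions are outside the delimiters, so only those are edited.
odd <- seq(from = 1, to = length(pieces), by = 2)
pieces[odd] <- gsub("banana", "apples", pieces[odd])
result <- paste0(pieces, collapse = "")
result
#> [1] "{banana's} potatoes {banana} apples"
```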

This question (r - extending gsub and grepl to ignore substrings between given delimiters) comes from Stack Overflow: https://stackoverflow.com/questions/58775471/
