r - ">" does not match "[[:punct:]]" when using `stringr::str_replace_all`?

This question already has answers here:

R/regex with stringi/ICU: why is a '+' considered a non-[:punct:] character?

(2 answers)

Closed 2 years ago.

I find this really strange:

pattern <- "[[:punct:][:digit:][:space:]]+"
string  <- "a . , > 1 b"

gsub(pattern, " ", string)
# [1] "a b"

library(stringr)
str_replace_all(string, pattern, " ")
# [1] "a > b"

str_replace_all(string, "[[:punct:][:digit:][:space:]>]+", " ")
# [1] "a b"

Is this expected?

Best answer

Still working through this, but ?"stringi-search-charclass" says:

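Beware of using POSIX character classes, e.g. '[:punct:]'. The ICU
User Guide (see below) states that in general they are not
well-defined, so you may end up with something different than you
expect.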

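In particular, in POSIX-like regex engines, '[:punct:]' stands for
the character class corresponding to the 'ispunct()'
classification function (check out 'man 3 ispunct' on UNIX-like
systems). According to ISO/IEC 9899:1990 (ISO C90), the
'ispunct()' function tests for any printing character except for
space or a character for which 'isalnum()' is true. However, in a
POSIX setting, the details of which characters belong to which
class depend on the current locale. So the '[:punct:]' class
does not lead to portable code (again, in POSIX-like regex engines).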

So a POSIX flavor of [[:punct:]] is more like [[\p{P}\p{S}]] in
'ICU'. You have been warned.
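To make the split described above concrete, here is a small sketch that asks the ICU engine (through stringr) how it classifies a handful of ASCII characters. The sample characters and the `classes` data frame are my own choices for illustration; the sketch assumes stringr is installed and that ICU treats [[:punct:]] as the Unicode punctuation category \p{P}.

```r
library(stringr)

# Sample characters chosen for illustration: three that Unicode files
# under punctuation (P) and three that it files under symbols (S).
chars <- c(".", ",", "!", ">", "+", "$")

classes <- data.frame(
  char      = chars,
  icu_punct = str_detect(chars, "[[:punct:]]"),  # POSIX-style class as ICU reads it
  unicode_P = str_detect(chars, "\\p{P}"),       # Unicode punctuation
  unicode_S = str_detect(chars, "\\p{S}")        # Unicode symbols
)

print(classes)
```

If the assumption holds, the icu_punct and unicode_P columns should agree row by row, and ">", "+" and "$" should show up only under unicode_S, which is why the original pattern leaves ">" in place.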

Reproducing the issue posted above,

string  <- "a . , > 1 b"
mypunct <- "[[\\p{P}][\\p{S}]]"
stringr::str_remove_all(string, mypunct)
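The same idea can be folded back into the pattern from the question. This is only a sketch: the name `icu_pattern` is mine, and it simply adds \p{S} to the original character class so that symbols such as ">" are covered alongside punctuation, digits and whitespace.

```r
library(stringr)

string <- "a . , > 1 b"

# Original class from the question, extended with Unicode symbols (\p{S})
# so the ICU engine also picks up ">".
icu_pattern <- "[[:punct:]\\p{S}[:digit:][:space:]]+"

result <- str_replace_all(string, icu_pattern, " ")
result
# [1] "a b"
```

This matches what gsub() returned in the question, without having to list ">" by hand.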

I can appreciate things being locale-specific, but it still surprises me that [:punct:] doesn't work even in a C locale ...
