Problem description
This is my sample dataset:
Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)
I would like to delete the rows whose Name contains non-English characters. For this sample, only "apple firm" should stay.
I tried the tm package, but it only helps me delete the non-English characters themselves, not the whole entries that contain them.
Recommended answer
I would check out this related Stack Overflow post about doing the same thing in JavaScript: "Regular expression to match non-English characters?"
To translate this into R, you could do the following (matching non-ASCII characters):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
And, per that same SO post, the same match written with Unicode escapes:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
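The same filter can also be wrapped in a small helper. The sketch below is only one possible variant, not the method from the linked post: it assumes PCRE's [:ascii:] class (via perl = TRUE) is acceptable in place of the explicit \x01-\x7F range, and the name is_ascii_only is hypothetical. The sample strings are written with \u escapes so the script does not depend on the encoding of the source file.

```r
# Sample data from the question, written with \u escapes.
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank)

# Hypothetical helper: TRUE where a string contains only ASCII characters.
# as.character() covers the case where Name was stored as a factor.
is_ascii_only <- function(x) {
  !grepl("[^[:ascii:]]", as.character(x), perl = TRUE)
}

# A logical vector can index rows directly, so which() is not needed here.
res <- data[is_ascii_only(data$Name), ]
res
```

Note that grepl() returns FALSE for NA input, so a row with a missing Name would be kept by this helper; filter those separately if that matters for your data.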
Note: we had to take out the NUL character for this to work. So instead of starting at \u0000 or \x00, we start at \u0001 and \x01.
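The reason behind the note above, as far as I can tell, is that R character strings cannot hold an embedded NUL, so a pattern literal containing \x00 is rejected when the code is parsed. The sketch below illustrates this by parsing both patterns from text; the variable names nul_result and ok_result are made up for the example.

```r
# Parsing a string literal that contains \x00 fails at parse time.
nul_result <- tryCatch(
  parse(text = '"[^\\x00-\\x7F]+"'),
  error = function(e) "parse error"
)
nul_result

# Starting the range at \x01 parses without error.
ok_result <- tryCatch(
  parse(text = '"[^\\x01-\\x7F]+"'),
  error = function(e) "parse error"
)
is.expression(ok_result)
```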