Removing text that contains non-English characters

Problem description

Here is my sample data set:

Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)

I would like to delete the rows whose Name contains non-English characters. For this sample, only "apple firm" should remain.

I tried the tm package, but it only helps me remove the non-English characters themselves, not the whole entries that contain them.

Recommended answer

I would check out this related Stack Overflow post on doing the same thing in JavaScript: "Regular expression to match non-English characters?"

To translate this into R, you could do the following (matching non-ASCII characters):

res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1
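
For contrast with the character-stripping behavior described in the question, here is a minimal base R sketch of the difference between removing the offending characters and dropping the whole entry. It does not use tm; the variable names `stripped` and `kept` are illustrative, and the sample names are written with \u escapes only so the script itself stays ASCII.

```r
# Sample names from the question, written with \u escapes.
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")

# Stripping: deletes only the non-ASCII characters, so all three
# elements survive (two of them in a mangled form).
stripped <- gsub("[^\x01-\x7F]", "", Name)

# Filtering: drops every element that contains a non-ASCII character,
# which is what the question asks for.
kept <- Name[!grepl("[^\x01-\x7F]", Name)]

print(stripped)
print(kept)
```

The only change between the two calls is `gsub()` versus a negated `grepl()` used as a logical index; the pattern is the same.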

And, per that same SO post, the same range written with Unicode escapes:

res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1
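
As a rough check that the two patterns select the same rows, the sketch below rebuilds the sample data and compares the results. The helper `keep_ascii_rows` is hypothetical (it is not part of the original answer), and `stringsAsFactors = FALSE` is an assumption made so that `Name` stays a character column on older R versions.

```r
# Hypothetical helper: keep only the rows whose `column` holds
# nothing outside the \x01-\x7F (ASCII) range.
keep_ascii_rows <- function(df, column) {
  values <- as.character(df[[column]])
  df[!grepl("[^\x01-\x7F]", values), , drop = FALSE]
}

# Sample data from the question, with \u escapes for the non-ASCII names.
data <- data.frame(
  Name = c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm"),
  Rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hex-escape version (via the helper) and Unicode-escape version (inline).
res_hex <- keep_ascii_rows(data, "Name")
res_uni <- data[!grepl("[^\u0001-\u007F]", data$Name), , drop = FALSE]

print(identical(res_hex, res_uni))
```

The `which()` call in the answer is omitted here because a logical vector can index rows directly; that is a stylistic choice, not a behavior change for this data.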

Note: we had to leave out the NUL character for this to work, because R does not allow an embedded NUL in a character string. So instead of starting at \u0000 or \x00, the ranges start at \u0001 and \x01.
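
The NUL restriction can be seen with a small sketch like the one below. It assumes that parsing a string literal containing \x00 raises an error, which is caught here with `tryCatch`; the names `nul_result` and `has_non_ascii` are illustrative.

```r
# Try to parse the R source text "\x00". The parser is expected to reject
# it, since an R string cannot contain an embedded NUL.
nul_result <- tryCatch(
  parse(text = "\"\\x00\""),
  error = function(e) "error"
)

# Starting the range at \x01 avoids the problem and still flags
# anything outside ASCII.
ok_pattern <- "[^\x01-\x7F]"
has_non_ascii <- grepl(ok_pattern, c("apple firm", "\u00c3pple firm"))

print(nul_result)
print(has_non_ascii)
```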


09-05 17:20