Problem Description
When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).
How can I identify invalid UTF-8 characters in a string and then delete them?
Recommended Answer
Another solution uses iconv and its sub argument, a character string: if it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.
x <- "faxE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
"faile"
Note that if we instead declare the correct encoding, the non-ASCII character is converted rather than dropped:
x <- "faxE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8",sub='')
facile
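In the setting of the question (a Stata file imported with foreign::read.dta and then serialised with rjson::toJSON), the same cleanup can be applied to every character column before conversion. A minimal sketch, assuming a hypothetical file dataset.dta and that the affected variables come in as character (not factor) columns:

library(foreign)
library(rjson)

dat <- read.dta("dataset.dta")   # hypothetical file name

## strip invalid UTF-8 bytes from every character column before serialising
chr_cols <- vapply(dat, is.character, logical(1))
dat[chr_cols] <- lapply(dat[chr_cols],
                        function(col) iconv(col, "UTF-8", "UTF-8", sub = ''))

json <- toJSON(dat)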