How to identify/delete non-UTF-8 characters in R


When I import a Stata dataset in R (using the foreign package), the imported data sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to convert the object to JSON (using the rjson package).


How can I identify characters in a string that are not valid UTF-8, and then delete them?


Another solution uses iconv and its argument sub, a character string. If sub is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.

x <- "fa\xE7ile"        ## \xE7 is a lone byte, which is not valid UTF-8
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8", sub = '') ## replace any non-UTF-8 byte by ''
## [1] "faile"
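The question also asks how to identify the offending strings before deleting anything. A minimal sketch, assuming base R's validUTF8() (available since R 3.3.0) and a made-up example vector; the variable names are illustrative only, and iconv behaviour can vary somewhat between platforms:

```r
# Example vector (made up): plain ASCII, a stray latin1 byte, and a proper UTF-8 "ç".
x <- c("facile", "fa\xE7ile", "fa\u00e7ile")

# validUTF8() returns FALSE for elements whose bytes are not valid UTF-8.
valid <- validUTF8(x)
which(!valid)   # positions of the problematic elements

# sub = "byte" shows each offending byte as <xx> instead of dropping it,
# which helps to see what would be removed.
iconv(x[!valid], "UTF-8", "UTF-8", sub = "byte")

# Strip the invalid bytes only from the elements that need it.
cleaned <- x
cleaned[!valid] <- iconv(x[!valid], "UTF-8", "UTF-8", sub = "")
```

Checking first and cleaning only the flagged elements leaves already valid strings untouched.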


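Note, however, that if the string is really latin1 and we declare that encoding, the same kind of call converts the byte to a proper UTF-8 character instead of deleting it: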

x <- "fa\xE7ile"
Encoding(x) <- "latin1"   ## in latin1, byte E7 is "ç"
xx <- iconv(x, "latin1", "UTF-8", sub = '')
xx
## [1] "façile"
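Since the original problem concerns a whole imported dataset rather than a single string, the same iconv call can be applied column by column before the JSON conversion. This is only a sketch: strip_invalid_utf8 is a hypothetical helper, not part of foreign or rjson, and the small data frame stands in for the result of an import.

```r
# Hypothetical helper: remove invalid UTF-8 bytes from every character
# column and from the levels of every factor column of a data frame.
strip_invalid_utf8 <- function(df) {
  fix <- function(v) iconv(v, "UTF-8", "UTF-8", sub = "")
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      df[[col]] <- fix(df[[col]])
    } else if (is.factor(df[[col]])) {
      levels(df[[col]]) <- fix(levels(df[[col]]))
    }
  }
  df
}

# Stand-in for an imported dataset (made-up values).
dat <- data.frame(id = 1:2,
                  name = c("facile", "fa\xE7ile"),
                  stringsAsFactors = FALSE)

clean <- strip_invalid_utf8(dat)
all(validUTF8(clean$name))   # TRUE once the stray byte is gone
# 'clean' can then be handed to the JSON conversion step.
```

If the data is known to be latin1, converting with iconv(v, "latin1", "UTF-8") inside fix would preserve the accented characters rather than dropping them.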
