Problem Description
When I import a Stata dataset in R (using the foreign package), the import sometimes contains characters that are not valid UTF-8. This is unpleasant enough by itself, but it breaks everything as soon as I try to transform the object to JSON (using the rjson package).
How can I identify invalid UTF-8 characters in a string and then delete them?
Recommended Answer
Another solution uses iconv and its sub argument, a character string: if it is not NA (here I set it to ''), it is used to replace any non-convertible bytes in the input.
x <- "faxE7ile"
Encoding(x) <- "UTF-8"
iconv(x, "UTF-8", "UTF-8",sub='') ## replace any non UTF-8 by ''
"faile"
Note that if we instead declare the correct encoding, the non-ASCII character is converted rather than dropped:
x <- "faxE7ile"
Encoding(x) <- "latin1"
xx <- iconv(x, "latin1", "UTF-8",sub='')
facile
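In the setting of the question (a Stata file imported with foreign::read.dta and then serialised with rjson::toJSON), the same cleanup can be applied to every character column before conversion. A minimal sketch, assuming a hypothetical file dataset.dta and that the affected variables come in as character (not factor) columns:

library(foreign)
library(rjson)

dat <- read.dta("dataset.dta")   # hypothetical file name

## strip invalid UTF-8 bytes from every character column before serialising
chr_cols <- vapply(dat, is.character, logical(1))
dat[chr_cols] <- lapply(dat[chr_cols],
                        function(col) iconv(col, "UTF-8", "UTF-8", sub = ''))

json <- toJSON(dat)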