This question is related to a previous one about how to replace accented strings such as México with the equivalent LaTeX code (M\'{e}xico).

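My problem here is slightly different. I am using a third-party database whose string variables contain accented Spanish characters like the one above. However, the encoding seems odd, because this is the behavior I get: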

> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"


where temp$dest_nom_ent is a variable holding state names, México among them.

So my question is how to convert string variables from a third-party database into an encoding that standard R functions recognize. Note that:

> Encoding(temp$dest_nom_ent)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"


For further information, I am on 64-bit Windows 7:

> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f


According to the linked source, these bytes are consistent with the Windows Spanish (Traditional Sort) locale:

M=4d
é=e9
x=78
i=69
c=63
o=6f


Note also:

> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
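The two raw dumps above differ only in how é is stored: one byte (e9) in the database column versus two bytes (c3 a9) in the typed literal. A minimal sketch of that comparison follows. It rebuilds both strings from the bytes shown, so it does not depend on how the console encodes typed input. The names db_str and lit_str are illustrative only, and validUTF8() assumes R 3.3.0 or later.

```r
# Bytes reported for the database value: "M", 0xE9, "xico" (single-byte é).
db_bytes  <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))
# Bytes reported for the typed literal: "M", 0xC3 0xA9, "xico" (two-byte é).
lit_bytes <- as.raw(c(0x4d, 0xc3, 0xa9, 0x78, 0x69, 0x63, 0x6f))

db_str  <- rawToChar(db_bytes)
lit_str <- rawToChar(lit_bytes)

# Check whether each byte sequence is well-formed UTF-8.
print(validUTF8(db_str))
print(validUTF8(lit_str))

# Read the database bytes as Latin-1 and re-encode them as UTF-8,
# then compare the result byte for byte with the typed literal.
db_as_utf8 <- iconv(db_str, from = "latin1", to = "UTF-8")
print(identical(charToRaw(db_as_utf8), lit_bytes))
```

If the last comparison holds, the database values behave like Latin-1 (or Windows-1252) text while the typed literal behaves like UTF-8, even though Encoding() labels the literal "latin1". That would explain why a pattern typed at the console never matches the database bytes.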


I have tried the following without success (for example, grep("é",temp$dest_nom_ent) still returns an empty vector):

Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent  <- enc2utf8(temp$dest_nom_ent)
...


I checked the supported character sets with iconvlist(), and "WINDOWS-1252" is among them. However, the following does not work:

> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)


In contrast:

> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3


I also tried to find the encoding by brute force, for example:

try(for(i in 1:length(iconvlist())){
    temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
    Encoding(temp1)<-iconvlist()[i]
    temp1 <- iconv(temp1,iconvlist()[i],"latin1")
    print(grep("é",temp1))
    print(i)
        },silent=FALSE)


I am not familiar with the try function, but it still stops at the error instead of skipping it, so I cannot check the whole list:

...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") :
  unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
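One likely reason the loop stops is that try() wraps the entire for loop, so the first unsupported conversion ends the loop as a whole. A minimal sketch that catches the error inside each iteration is shown below. It is not the original code: temp1_raw is a stand-in built from the bytes reported by charToRaw(), target is written with a \u escape so that it does not depend on console input encoding, and hits is a hypothetical name for the collected results.

```r
# Stand-in for the three matching database values: "M", 0xE9, "xico".
temp1_raw <- rep(rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))), 3)

# UTF-8 "México", written with an escape instead of a typed é.
target <- "M\u00e9xico"

hits <- character(0)
for (enc in iconvlist()) {
  # Catch errors per candidate encoding so one failure does not end the loop.
  converted <- tryCatch(
    suppressWarnings(iconv(temp1_raw, from = enc, to = "UTF-8")),
    error = function(e) NULL
  )
  # Skip encodings that are unsupported or cannot decode the bytes.
  if (is.null(converted) || anyNA(converted)) next
  # Keep encodings under which every value decodes to "México".
  if (all(converted == target)) hits <- c(hits, enc)
}
print(hits)
```

Every name left in hits is a candidate source encoding for the column. Several single-byte encodings place é at 0xE9, so more than one candidate may remain.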


Finally:

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2


So it seems I will have to change my computer's locale, as suggested here. See also here.

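PS: In case you are wondering how I typed d<-c("México","México") under the English_United States.1252 locale: set up a secondary Spanish (Traditional Sort) keyboard via Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards, then under installed services click Add and navigate to Spanish Traditional Sort. You can then create a shortcut for switching keyboards under advanced key settings. In my case it is Shift+Alt. So to type ñ under the default English locale, press Shift+Alt, then ;, then Shift+Alt again to return to the English keyboard.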

Best answer

Use Encoding(x) to check the encoding of both temp$dest_nom_ent and "México". You may need to convert with enc2native or enc2utf8.
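As a rough illustration of that advice, the sketch below declares the database bytes as Latin-1, converts them with enc2utf8, and searches with a pattern written as a \u escape. It assumes the column really is Latin-1 or Windows-1252, which the e9 byte shown in the question suggests but does not prove. The vector x is a stand-in for temp$dest_nom_ent, not the real data.

```r
# Stand-in for one database value: "M", 0xE9, "xico".
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
print(Encoding(x))          # inspect the declared encoding before any change

# Assumption: the bytes are Latin-1, so declare them as such ...
Encoding(x) <- "latin1"
# ... and convert to UTF-8 so both sides of the match use one encoding.
x_utf8 <- enc2utf8(x)
print(Encoding(x_utf8))

# Write é as an escape so the pattern does not depend on console input.
pattern <- "\u00e9"
print(grepl(pattern, x_utf8, fixed = TRUE))
```

Writing the pattern as "\u00e9" sidesteps the case seen in the question, where the typed "México" carried the bytes c3 a9 while being labelled "latin1".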
