Question
I'm trying to parse statements from the Mexican Senate's website, but I'm having trouble with the UTF-8 encoding of the pages.
This HTML comes through clearly:
library(rvest)
Senate<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/19675-version-estenografica-de-la-reunion-ordinaria-de-las-comisiones-unidas-de-puntos-constitucionales-de-anticorrupcion-y-participacion-ciudadana-y-de-estudios-legislativos-segunda.html")
Here is an example of a bit of the page:
"CONTINÚA EL SENADOR CORRAL JURADO: Nosotros decimos. Entonces, bueno, el tema es que hay dos rutas señor presidente y también tratar, por ejemplo, de forzar ahora. Una decisión de pre dictamen a lo mejor lo único que va a hacer es complicar más las cosas."
As can be seen, both the accents and the "ñ" come through fine.
The issue arises with some other pages (from the same domain!). For example:
Senate2<-html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html")
I get:
"-EL C. DIPUTADO ADAME ALEMÃÂN: En consecuencia está a discusión la propuesta. Y para hablar sobre este asunto, se le concede el uso de la palabra a la senadora…….."
For this second page I've tried iconv() and coercing the encoding argument of html() to encoding = "UTF-8", but I keep getting the same result.
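For reference, the attempts looked roughly like the sketch below (the "p" selector and the way texto2 is built are only illustrative of my extraction step); neither the encoding argument nor iconv() changes the garbled output:
# Sketch of the attempted fixes (object names and selector are illustrative)
# 1. Force the encoding when parsing with rvest's html()
Senate2 <- html("http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html",
                encoding = "UTF-8")
# 2. Extract the text and try re-encoding it with iconv()
texto2 <- html_text(html_nodes(Senate2, "p"))
iconv(texto2, from = "latin1", to = "UTF-8")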
I've also checked the page's encoding with the W3 Validator; it reports UTF-8 with no issues.
Using gsub() does not seem feasible, because different characters are downloaded with the same garbled "code" (see the sketch after this list):
í - ÃÂ
á - ÃÂ
ó - ÃÂ
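To illustrate why: since the same garbled sequence stands in for several different vowels, any single gsub() replacement fixes some words while corrupting others, so a simple lookup table cannot disambiguate them.
# Illustrative only: the same sequence is observed for í, á and ó alike
garbled_seq <- "ÃÂ"
# Restores "í" where "í" was meant, but silently corrupts the positions
# where "á" or "ó" should appear:
texto2_fixed <- gsub(garbled_seq, "í", texto2, fixed = TRUE)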
Any fresh ideas would be much appreciated.
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] grDevices utils datasets graphics stats grid methods base
other attached packages:
[1] stringi_0.4-1 magrittr_1.5 selectr_0.2-3 rvest_0.2.0 ggplot2_1.0.0 geosphere_1.3-11 fields_7.1
[8] maps_2.3-9 spam_1.0-1 sp_1.0-17 SOAR_0.99-11 data.table_1.9.4 reshape2_1.4.1 xlsx_0.5.7
[15] xlsxjars_0.6.1 rJava_0.9-6
loaded via a namespace (and not attached):
[1] bitops_1.0-6 chron_2.3-45 colorspace_1.2-4 digest_0.6.8 evaluate_0.5.5 formatR_1.0 gtable_0.1.2
[8] httr_0.6.1 knitr_1.8 lattice_0.20-29 MASS_7.3-35 munsell_0.4.2 plotly_0.5.17 plyr_1.8.1
[15] proto_0.3-10 Rcpp_0.11.3 RCurl_1.95-4.5 RJSONIO_1.3-0 scales_0.2.4 stringr_0.6.2 tools_3.1.2
[22] XML_3.98-1.1
UPDATE: This seems to be the issue:
stri_enc_mark(Senate2)
[1] "ASCII" "latin1" "latin1" "ASCII" "ASCII" "latin1" "ASCII" "ASCII" "latin1"
... and so forth. Clearly, the issue lies in the latin1 elements:
stri_enc_isutf8(texto2)
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
How can I coerce the latin1 strings into correct UTF-8? When "translated" by stringi, the result appears to be wrong, giving me the issues described above.
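A sketch of the kind of conversion I mean (assuming texto2 holds the extracted paragraphs); this re-encoding step is what seems to come out wrong:
library(stringi)
# Convert only the elements stringi marks as latin1, leaving the rest alone
is_latin1 <- stri_enc_mark(texto2) == "latin1"
texto2[is_latin1] <- stri_encode(texto2[is_latin1], from = "latin1", to = "UTF-8")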
Answer
Encodings are one of the 21st century's worst headaches. But here's a solution for you:
# Set-up remote reading connection, specifying UTF-8 as encoding.
addr <- "http://comunicacion.senado.gob.mx/index.php/informacion/versiones/14694-version-estenografica-de-la-sesion-de-la-comision-permanente-celebrada-el-13-de-agosto-de-2014.html"
read.html.con <- file(description = addr, encoding = "UTF-8", open = "rt")
# Read in cycles of 1000 characters
html.text <- c()
i <- 0
while (length(html.text) == i) {
  html.text <- append(html.text, readChar(con = read.html.con, nchars = 1000))
  cat(i <- i + 1)
}
# close reading connection
close(read.html.con)
# Paste everything back together & at the same time, convert from UTF-8
# to... UTF-8 with iconv(). I know. It's crazy. Encodings are secretly
# meant to drive us insane.
content <- paste0(iconv(html.text, from="UTF-8", to = "UTF-8"), collapse="")
# Set-up local writing
outpath <- "~/htmlfile.html"
# Create file connection specifying "UTF-8" as encoding, once more
# (Although this one makes sense)
write.html.con <- file(description = outpath, open = "w", encoding = "UTF-8")
# Use capture.output to dump everything back into the html file
# Using cat inside it will prevent having [1]'s, quotes and such parasites
capture.output(cat(content), file = write.html.con)
# Close the output connection
close(write.html.con)
Then you're ready to open the newly created file in your favorite browser. You should see it intact, ready to be re-read with the tools of your choice!
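As a possible follow-up (an untested sketch; the "p" selector is only an assumption about where the statement paragraphs live), you can re-read the cleaned local copy with rvest and extract the text as usual:
library(rvest)
# Parse the locally saved, correctly encoded copy
Senate2.clean <- html(outpath, encoding = "UTF-8")
texto2.clean <- html_text(html_nodes(Senate2.clean, "p"))
head(texto2.clean)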