Question
I am using open-uri to read a webpage which claims to be encoded in iso-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.
open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
=> ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>]
I am guessing this is because the webpage contains the byte 0x92 (\x92), which does not map to a printable character in ISO-8859-1. See http://en.wikipedia.org/wiki/ISO/IEC_8859-1.
I need to store webpages as UTF-8 encoded files. Any ideas on how to deal with webpages whose declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.
Recommended answer
ASCII-8BIT is an alias for BINARY.
open-uri does a funny thing: if the response is smaller than about 10 KB it returns a StringIO, and if it is bigger it returns a Tempfile. That can be confusing if you're trying to deal with encoding issues.
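To make the BINARY tag concrete, here is a minimal sketch using only Ruby's built-in String and Encoding classes. The sample bytes and the helper name retag_and_transcode are made up for illustration; the point is that the same raw bytes produce different UTF-8 text depending on which source encoding you assume, and Windows-1252 is only a guess at what the page really uses.

```ruby
# ASCII-8BIT and BINARY are two names for the same Encoding object.
puts Encoding::ASCII_8BIT.equal?(Encoding::BINARY)

# Hypothetical sample: raw bytes in which 0x92 stands in for an apostrophe.
raw = "it\x92s".b

# Illustrative helper (not part of open-uri): tag a copy of the bytes with
# the assumed source encoding, then transcode that copy to UTF-8.
def retag_and_transcode(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# In ISO-8859-1, 0x92 is a C1 control code, so it becomes U+0092.
as_latin1 = retag_and_transcode(raw, Encoding::ISO_8859_1)

# In Windows-1252, 0x92 is the right single quotation mark, U+2019.
as_cp1252 = retag_and_transcode(raw, Encoding::Windows_1252)

p as_latin1
p as_cp1252
```

If the second result is the one that looks right for your page, the server is most likely sending Windows-1252 while labelling it iso-8859-1.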
If the files aren't huge, I would recommend manually loading them into strings:
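require 'uri'
require 'net/http'
require 'net/https'

# url_to_file is assumed to hold the page URL as a String
uri = URI.parse(url_to_file)
http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors (note: this disables certificate verification)
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
# fetch the page; body is the raw response body as a String
body = http.start { |session| session.get uri.request_uri }.body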
Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):
require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)
I have been pretty happy with ensure-encoding; we use it in production at http://data.brighterplanet.com
Note that you can also say :invalid_characters => :ignore instead of :transcode.
Also, if you know the encoding somehow, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
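If you would rather not add a gem, a rough standard-library alternative might look like the sketch below. This is not how ensure-encoding works internally; it is a different, simpler technique. The helper name to_utf8 and the rule of treating a declared iso-8859-1 as Windows-1252 are assumptions made for this example (browsers apply the same mapping), and it is only meant for pages that declare a single-byte charset.

```ruby
# Hypothetical helper: convert a raw response body to UTF-8 using the
# charset the server declared, with one deliberate override.
def to_utf8(body, declared_charset)
  source =
    if declared_charset.to_s.casecmp('iso-8859-1').zero?
      # Assumption: pages labelled iso-8859-1 are usually Windows-1252.
      Encoding::Windows_1252
    else
      begin
        Encoding.find(declared_charset.to_s)
      rescue ArgumentError
        # Unknown or missing charset name: fall back to the same guess.
        Encoding::Windows_1252
      end
    end

  # Work on a copy so the caller's string keeps its original encoding tag.
  # Bytes that cannot be converted are replaced instead of raising.
  body.dup.force_encoding(source)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Made-up sample bytes: 0xE9 is e-acute and 0x92 is a curly apostrophe
# in Windows-1252.
sample = "caf\xE9 it\x92s".b
utf8_body = to_utf8(sample, 'iso-8859-1')
p utf8_body
```

The result can then be written out with something like File.write(path, utf8_body), where path is whatever file name you choose.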