问题描述
我正在尝试进行一些爬网,但是WWW:Mechanize gem似乎不喜欢编码和崩溃.
发布请求会导致302重定向(随后会进行机械化,到目前为止效果很好),结果页面似乎将其崩溃.我用谷歌搜索了很多,但是到目前为止,如何解决这个问题都没有.你们有个主意吗?
I'm trying to do a little bit of webscraping, but the WWW:Mechanize gem doesn't seem to like the encoding and crashes.
The post request results in a 302 redirect (which mechanize follows, so far so good) and the resulting page seems to crash it.I googled quite a bit, but nothing came up so far how to solve this. Any of you got an idea?
代码:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body
错误:
D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7
推荐答案
该页面无疑是UTF-8,但是Mechanize使用NKF(核心Ruby库)来猜测编码,由于某种原因,它以Shift JIS的形式出现.解决该问题的最快方法是覆盖Mechanize的编码映射,以便当它尝试使用Iconv将主体转换为UTF-8时,也将源编码也传递为UTF-8.您可以这样做:
That page is most certainly UTF-8, however Mechanize uses NKF (a core Ruby library) to guess the encoding and for some reason it comes up as Shift JIS. The quickest way to work around the problem is to override the encoding mapping for Mechanize, so that when it attempts to convert the body to UTF-8 using Iconv it passes in the source encoding as UTF-8 as well. You can do it like this:
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
将其放置在require
Mechanize库所在的行之后.您可能希望在发现问题的根本原因后立即或更准确地重新设置该值,并在必要时提交补丁.
Place that just after the line where you require
the Mechanize library. You may want to set the value back immediately after, or even better, find the root cause of the problem and submit a patch if necessary.
注意:我解决此问题的方法是使用回溯调试Mechanize库. to_native_charset
方法调用问题所在的detect_charset
.
Note: The way I solved this was by debugging the Mechanize library by using the backtrace. The to_native_charset
method calls detect_charset
which is where the problem was.
这篇关于使用www :: mechanize时Iconv :: IllegalSequence的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!