1. Basic code
Add gem "hpricot" to the Gemfile and run bundle install, then add require "hpricot" and require "open-uri" to application.rb.
pp "===========begin============="
url = "http://www.xiaochuncnjp.com/search.php?mod=forum&searchid=552&orderby=lastpost&ascdesc=desc&searchsubmit=yes&kw=%E6%90%AC%E5%AE%B6"
doc = Hpricot(open(url))
# Detect the encoding of the returned page, using the rchardet gem.
cd = CharDet.detect(doc.to_s)
pp encoding = cd["encoding"]
# pp doc.search("ul/.pbw") # elements with class pbw under the ul tags of the returned page
doc.search("ul/.pbw").each do |item|
  # pp timeStr = item.inner_html
  pp titleStr = item.search("h3/a").inner_html
  pp urlStr = item.search("h3").inner_html.to_s.gsub(/href="/, 'href="http://www.xiaochuncnjp.com/')
  pp contentStr = item.search("p")[1].inner_html
end
pp "************end***********"
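The gsub in the loop above prefixes every href with the site root, which would also mangle links that are already absolute. A minimal sketch of a more careful variant, using URI.join from the standard library. The helper name absolutize_hrefs and the sample path thread-1.html are made up for illustration and do not come from the site.

```ruby
require "uri"

# Hypothetical helper: rewrite each href="..." in an HTML fragment so that
# relative links are resolved against base_url. Absolute links stay as they are.
def absolutize_hrefs(html, base_url)
  html.gsub(/href="([^"]*)"/) do |match|
    href = Regexp.last_match(1)
    begin
      %(href="#{URI.join(base_url, href)}")
    rescue URI::InvalidURIError
      match # leave hrefs that URI cannot parse untouched
    end
  end
end

base = "http://www.xiaochuncnjp.com/"
puts absolutize_hrefs('<a href="thread-1.html">title</a>', base)
```

Inside the loop this could stand in for the gsub call, for example absolutize_hrefs(item.search("h3").inner_html.to_s, "http://www.xiaochuncnjp.com/").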
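2. When the URL uses https, the request fails with a "certificate verify failed" error because the certificate cannot be verified.
https is a secure protocol, so the server certificate has to pass verification. One way around the error is to patch the open-uri source: add this ssl_verify option to the top of the file.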
FROM:
module OpenURI
  Options = {
    :proxy => true,
    :progress_proc => true,
    :content_length_proc => true,
    :http_basic_authentication => true,
  }
TO:
module OpenURI
  Options = {
    :proxy => true,
    :progress_proc => true,
    :content_length_proc => true,
    :http_basic_authentication => true,
    :ssl_verify => true
  }
Change the part where it enables verification FROM:
if target.class == URI::HTTPS
  require 'net/https'
  http.use_ssl = true
  http.enable_post_connection_check = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  store = OpenSSL::X509::Store.new
  store.set_default_paths
  http.cert_store = store
end
TO:
if target.class == URI::HTTPS
  require 'net/https'
  http.use_ssl = true
  http.enable_post_connection_check = true
  if options[:ssl_verify] == false
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  else
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  end
  store = OpenSSL::X509::Store.new
  store.set_default_paths
  http.cert_store = store
end
Run it like this:
open("https://someurl", :ssl_verify => false) {|f|
  print f.read
}
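On Ruby 1.9 and later, open-uri already ships a built-in :ssl_verify_mode option, so the same effect may be possible without editing the library. The following is a rough sketch under that assumption. The helper names ssl_options and fetch are invented here, and URI.open assumes Ruby 2.5 or newer. Turning verification off means the server's identity is no longer checked.

```ruby
require "open-uri"
require "openssl"

# Hypothetical helper: build the open-uri option hash for the desired
# verification behaviour. :ssl_verify_mode is a stock open-uri option.
def ssl_options(verify: true)
  mode = verify ? OpenSSL::SSL::VERIFY_PEER : OpenSSL::SSL::VERIFY_NONE
  { ssl_verify_mode: mode }
end

# Hypothetical helper: fetch a page body, optionally skipping certificate checks.
def fetch(url, verify: true)
  URI.open(url, ssl_options(verify: verify)) { |f| f.read }
end

# Usage (performs a network request, so it is left commented out):
# print fetch("https://someurl", verify: false)
```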
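3. Garbled pages
Web pages do not all use the same encoding, so the text you extract can easily come out garbled. You therefore need to convert it according to the encoding of the page. This step uses the rchardet gem.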
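Once the encoding name is known, the conversion itself can be done with Ruby's built-in String#encode. A small sketch, assuming the name comes from CharDet.detect(...)["encoding"]. The helper name to_utf8 is made up, and the sample bytes are the EUC-JP example used in the rchardet section.

```ruby
# Hypothetical helper: reinterpret raw bytes in their original encoding and
# transcode them to UTF-8, replacing any byte sequence that cannot be converted.
def to_utf8(raw, source_encoding)
  raw.dup
     .force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace)
end

euc_bytes = "\xA4\xCF".b            # the hiragana "は" encoded as EUC-JP
puts to_utf8(euc_bytes, "EUC-JP")   # prints は
```

The dup keeps the caller's string untouched, since force_encoding changes its receiver in place.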
4. Using rchardet
Add gem "rchardet" to the Gemfile and run bundle install, then add require "rchardet" to application.rb.
cd = CharDet.detect(some_data)
encoding = cd['encoding']
confidence = cd['confidence'] # 0.0 <= confidence <= 1.0
eg: CharDet.detect("\xA4\xCF") #=> {"encoding"=>"EUC-JP", "confidence"=>0.99}
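Because detection is a guess with a confidence score, it may be worth falling back to a default when the score is low or the name is one Ruby does not know. A sketch of that idea, operating on a hash shaped like the one shown above. The helper name pick_encoding, the 0.5 threshold, and the UTF-8 fallback are assumptions for illustration and are not part of rchardet.

```ruby
# Hypothetical helper: choose an encoding name from a detection result hash
# such as {"encoding"=>"EUC-JP", "confidence"=>0.99}.
# Falls back when the result is missing, weak, or unknown to Ruby.
def pick_encoding(detection, fallback: "UTF-8", min_confidence: 0.5)
  name = detection && detection["encoding"]
  confidence = detection ? detection["confidence"].to_f : 0.0
  return fallback if name.nil? || confidence < min_confidence

  Encoding.find(name).name # raises ArgumentError for names Ruby does not know
rescue ArgumentError
  fallback
end

puts pick_encoding({ "encoding" => "EUC-JP", "confidence" => 0.99 })
```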