Question
I am trying to understand Nokogiri. Does anyone have a link to a basic example of Nokogiri parsing/scraping that shows the resulting tree? I think it would really help my understanding.
Recommended answer
Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
@node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well-formed docs. Note that it added the DOCTYPE because I parsed the HTML as a document. It's possible to parse it as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document using CSS to find the first <p> node and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. Calling text or content on a NodeSet concatenates the text of all the returned nodes into a single String, which can make it very difficult to take apart again.
Using slightly different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning to the original HTML (parsed into doc again)...
Change the content of a node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit the parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on SO about using Nokogiri to tear apart HTML from sites. Searching StackOverflow for "nokogiri and open-uri" should help.