How to access multiple <p> tags one at a time
Problem description
I have the following HTML:
<div id="test_id">
<p>Some words.</p>
<p>Some more words.</p>
<p>Even more words.</p>
</div>
If I parse the HTML using:
doc = Nokogiri::HTML(URI.open("http://my_url")) # needs require 'nokogiri' and require 'open-uri'
and run
doc.css('#test_id').text
in the console I get:
=> "Some words.\nSome more words.\nEven more words"
How do I get only the first <p> element?
I think I figured it out with .children:
doc.css('#test_id').children[0].text
Is this the correct way to do this?
Solution
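The problem is that you're not calling the text method on the right type of object.
If you're looking at a NodeSet, the documentation for text says:
"Get the inner text of all contained Node objects"
If you're looking at a Node, AKA an Element, it says:
"Returns the content for this Node"
Here's the difference: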
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="test_id">
<p>Some words.</p>
<p>Some more words.</p>
<p>Even more words.</p>
</div>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "Some words.Some more words.Even more words."
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "Some words."
at is like search(...).first.
Typically, if we want the text of a NodeSet we'd use:
doc.search('p').map(&:text) # => ["Some words.", "Some more words.", "Even more words."]
which makes it easy to pick the text of a specific node.
See "How to avoid joining all text from Nodes when scraping" also.
As for your attempt:
doc.css('#test_id').children[0].text
Well, yeah, you can do that, but children isn't going to do the same thing:
doc.search('#test_id').children
# => [#<Nokogiri::XML::Text:0x3fc31580ca24 "\n ">, #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Text:0x3fc315107f44 "\n ">, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Text:0x3fc315107b20 "\n ">, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>, #<Nokogiri::XML::Text:0x3fc3151076fc "\n">]
doc.search('#test_id').children[0] # => #<Nokogiri::XML::Text:0x3fc31580ca24 "\n ">
doc.search('#test_id').children[1] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
versus:
doc.search('#test_id p')
# => [#<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>, #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>, #<Nokogiri::XML::Element:0x3fc3151036c4 name="p" children=[#<Nokogiri::XML::Text:0x3fc3151078a0 "Even more words.">]>]
doc.search('#test_id p')[0] # => #<Nokogiri::XML::Element:0x3fc315103714 name="p" children=[#<Nokogiri::XML::Text:0x3fc31580d5a0 "Some words.">]>
doc.search('#test_id p')[1] # => #<Nokogiri::XML::Element:0x3fc3151036ec name="p" children=[#<Nokogiri::XML::Text:0x3fc315107cc4 "Some more words.">]>
Notice how children also returns the whitespace text nodes that sit between the tags and exist only to format the HTML. You have to be aware that children returns every node directly below the selected tag, text nodes included. This is useful sometimes, but for general text retrieval it's probably not what you want.
Instead, use the more selective '#test_id p' selector and iterate over the returned NodeSet. That way you avoid the formatting text nodes and don't have to account for them when slicing or indexing into the NodeSet.