Problem description

I want Nokogiri to leave HTML entities untouched, but it seems to be converting the entities into the actual symbol. For example:

    Nokogiri::HTML.fragment('<p>&reg;</p>').to_s

results in:

    <p>®</p>

Nothing seems to return the original HTML back to me. The .inner_html, .text, and .content methods all return '®' instead of '&reg;'.

Is there a way for Nokogiri to leave these HTML entities untouched?

I've already searched Stack Overflow and found similar questions, but nothing exactly like this one.

Recommended answer

Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:

    # encoding: UTF-8
    require 'nokogiri'
    html = Nokogiri::HTML.fragment('<p>®</p>')
    puts html.to_html                        #=> <p>®</p>
    puts html.to_html( encoding:'US-ASCII' ) #=> <p>&#174;</p>

It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse numeric entity, but even that wouldn't be 'preserving' the original.

The root of the problem is that, in HTML, the following all describe the exact same content:

    <p>®</p>
    <p>&reg;</p>
    <p>&#xAE;</p>
    <p>&#174;</p>

If you wanted the .to_s representation of a text node to be actually &reg; then the markup describing that would really be: <p>&amp;reg;</p>.

If Nokogiri were to always return the same encoding per character as was used to enter the document, it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference):

    require 'nokogiri'
    html = Nokogiri::HTML.fragment("<p>Foo</p>")
    # Append an entity reference node by hand; it serializes as &reg;
    html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
    puts html
    #=> <p>Foo&reg;</p>

However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:

    require 'nokogiri'
    html = "<p>Foo&reg;</p>"
    # Try several parse options; each one yields a plain Text node
    [ Nokogiri::XML::ParseOptions::NOENT,
      Nokogiri::XML::ParseOptions::DEFAULT_HTML,
      Nokogiri::XML::ParseOptions::DEFAULT_XML,
      Nokogiri::XML::ParseOptions::STRICT
    ].each do |parse_option|
      p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
    end
    #=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
    #=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
    #=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
    #=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">
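Since the parsed text node only holds the character itself, one possible workaround for the wish for 'nice' entity names is to re-encode the serialized string afterwards. The following is only a rough sketch in plain Ruby, not something Nokogiri provides: the NAMED_ENTITIES table and the entitize helper are hypothetical names made up for illustration, and the table lists just a few entries. It also cannot know which spelling the input used, so it does not 'preserve' the original either.

```ruby
# Hypothetical lookup table (an assumption for illustration, not part of
# Nokogiri); extend it with whichever named entities you care about.
NAMED_ENTITIES = {
  "\u00AE" => '&reg;',   # registered sign
  "\u00A9" => '&copy;',  # copyright sign
  "\u2122" => '&trade;'  # trade mark sign
}.freeze

# Hypothetical helper: replace every non-ASCII character in an
# already-serialized HTML string with a named entity when the table
# has one, otherwise with a decimal numeric character reference.
def entitize(serialized_html)
  serialized_html.gsub(/[^[:ascii:]]/) do |char|
    NAMED_ENTITIES.fetch(char) { format('&#%d;', char.ord) }
  end
end

puts entitize("<p>Foo\u00AE</p>")  #=> <p>Foo&reg;</p>
puts entitize("<p>caf\u00E9</p>")  #=> <p>caf&#233;</p>
```

In practice the argument would be the string returned by html.to_html. Note that this sketch rewrites every non-ASCII character in the string, including any inside script or style content, where entities are not interpreted, so it is probably only suitable for simple fragments.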