Problem description
Is anyone familiar with the Sanitize RubyGem who can provide an example of building a "Transformer" to convert
"<ul><li>a</li><li>b</li><li>c</li></ul>"
into
"a, b, and c"
?
IMO transformers are not meant for pulling out data like this. They exist to filter and alter nodes, which is not what you're trying to do; you're trying to pull data out of nodes and transform it. In your example, you're not doing the same thing to each element: sometimes you append a comma, and sometimes you append a comma and the word "and".
In order to do that, you either need to save state and post-process, or look ahead in the node stream to see if you're visiting the last node. I don't know of a trivial way to do that with Sanitize's transformers, so this example saves state and post-processes.
require 'sanitize'

items = []
s = "<ul><li>some space</li><li>more stuff with spaces</li><li>last one</li></ul>"

# Transformer used only for its side effect: collect the text of each text node.
save_li = lambda do |env|
  node = env[:node]
  items << node.text.strip if node.text?
end

Sanitize.clean(s, :transformers => save_li)
# => " some space more stuff with spaces last one "

# Post-process the saved state into the final string.
output = "#{items[0..-2].join(", ")}, and #{items[-1]}"
# => "some space, more stuff with spaces, and last one"
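The interpolation above assumes at least two collected items: with a single item it yields ", and x", and with none it yields ", and ". A minimal sketch of a post-processing helper that covers those cases follows. The name join_with_and is hypothetical and not part of Sanitize, and dropping the comma for exactly two items is my own assumption about the desired output.

```ruby
# Hypothetical helper (not part of Sanitize): joins strings as an English list.
def join_with_and(items)
  case items.length
  when 0 then ""                                   # nothing collected
  when 1 then items.first                          # no separator needed
  when 2 then "#{items[0]} and #{items[1]}"        # assumption: no comma for two items
  else "#{items[0..-2].join(", ")}, and #{items[-1]}" # same expression as above
  end
end

puts join_with_and(["some space", "more stuff with spaces", "last one"])
# prints: some space, more stuff with spaces, and last one
```

With this in place, the final line of the example could become output = join_with_and(items).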
IMO this example is an abuse of transformers because it's being run only for its side effect; it does nothing other than look for text nodes.
If one of the list items has embedded HTML, the naive approach no longer works, and you need to start knowing more Nokogiri anyway:
items = []
s = "<ul><li>some space</li><li>item <b>with</b> html</li><li>c</li></ul>"

# Collect the full text content of each <li>, including text inside nested tags.
save_li = lambda do |env|
  node = env[:node]
  items << node.content if node.name == "li"
end

Sanitize.clean(s, :transformers => save_li)
# => " some space item with html c "

output = "#{items[0..-2].join(", ")}, and #{items[-1]}"
# => "some space, item with html, and c"
This approach relies on the default Sanitize behavior of nothing being whitelisted. The <b> tags are still visited by the save_li lambda, but they're stripped. This has the potential to cause issues under a variety of circumstances.
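The other option mentioned earlier, looking ahead in the stream to see whether the current item is the last one, can be sketched in plain Ruby with Enumerator#peek. This is only an illustration over an array of already-extracted strings, not something Sanitize's transformer API provides; join_with_lookahead is a hypothetical name.

```ruby
# Hypothetical helper: walks the items as a stream and peeks one element
# ahead, so the separator is chosen while each item is being visited.
def join_with_lookahead(items)
  stream = items.each          # Enumerator over the extracted texts
  out = +""                    # unfrozen string used as the output buffer
  visited = 0
  loop do                      # Kernel#loop stops when #next raises StopIteration
    item = stream.next
    visited += 1
    is_last = begin
      stream.peek              # succeeds only if another item follows
      false
    rescue StopIteration
      true
    end
    out << ", " if visited > 1
    out << "and " if is_last && visited > 1
    out << item
  end
  out
end

puts join_with_lookahead(["some space", "more stuff with spaces", "last one"])
# prints: some space, more stuff with spaces, and last one
```

Unlike the save-and-post-process version, this decides the separator per item, at the cost of needing a stream that supports peeking.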