XmlSlurper / NekoHTML: parsing a document fragment without HTML or BODY tags
Question
Dear all,
I am trying to parse the following HTML fragment, and I would like to get the same fragment back as output (without HTML and BODY tags). Is this possible? If so, how?
P.S. I have been reading http://nekohtml.sourceforge.net/faq.html#fragments and I believe I have added the correct options below. However, the output is still incorrect :(
Thank you,
Misha
import groovy.xml.MarkupBuilder
import groovy.xml.StreamingMarkupBuilder
import groovy.util.XmlNodePrinter
import groovy.util.slurpersupport.NodeChild

def text = """
<div><h2>Test</h2>
<div>Hi</div>
</div>
"""

// Parse
def config = new org.cyberneko.html.HTMLConfiguration()
config.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true)
def html = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(text)

// Output
def printNode(NodeChild node) {
    def writer = new StringWriter()
    writer << new StreamingMarkupBuilder().bind {
        mkp.declareNamespace('': node[0].namespaceURI())
        mkp.yield node
    }
    new XmlNodePrinter().print(new XmlParser().parseText(writer.toString()))
}

printNode(html)
Output:
<HTML>
  <tag0:HEAD xmlns:tag0="http://www.w3.org/1999/xhtml"/>
  <BODY>
    <DIV>
      <H2>
        Test
      </H2>
      <DIV>
        Hi
      </DIV>
    </DIV>
  </BODY>
</HTML>
Solution
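The HTMLConfiguration in the question is created and configured but never handed to the parser, so the feature has no effect on the SAXParser that XmlSlurper actually uses. Call setFeature on the parser object directly, like so: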
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import groovy.xml.MarkupBuilder
import groovy.xml.StreamingMarkupBuilder
import groovy.util.XmlNodePrinter
import groovy.util.slurpersupport.NodeChild

def text = """
<div><h2>Test</h2>
<div>Hi</div>
</div>
"""

// Parse: set the fragment feature on the same parser instance that XmlSlurper uses
def parser = new org.cyberneko.html.parsers.SAXParser()
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true)
def html = new XmlSlurper(parser).parseText(text)

// Output: serialize the slurped node, then re-parse it so XmlNodePrinter can pretty-print it
def printNode(NodeChild node) {
    def writer = new StringWriter()
    writer << new StreamingMarkupBuilder().bind {
        mkp.declareNamespace('': node[0].namespaceURI())
        mkp.yield node
    }
    new XmlNodePrinter().print(new XmlParser().parseText(writer.toString()))
}

printNode(html)
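As a follow-up to the answer above, here is a minimal sketch that wraps the same approach in a helper returning the fragment as a String instead of printing it. The helper name `parseFragment` is hypothetical and not part of the original answer. The `names/elems` property is an optional addition, assuming you prefer lower-case element names over NekoHTML's default upper-case ones. The sketch relies on the unqualified `XmlSlurper` used in the thread (Groovy 2.x or 3.x); on Groovy 4 you would likely need `import groovy.xml.XmlSlurper`.

```groovy
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.14')
import groovy.xml.StreamingMarkupBuilder

// Hypothetical helper (not from the original answer): parse an HTML fragment
// with NekoHTML and return its markup as a String.
String parseFragment(String fragment) {
    def parser = new org.cyberneko.html.parsers.SAXParser()
    // Fragment mode: ask NekoHTML not to wrap the input in HTML/HEAD/BODY elements.
    parser.setFeature('http://cyberneko.org/html/features/balance-tags/document-fragment', true)
    // Assumption: emit lower-case element names instead of the default upper-case.
    parser.setProperty('http://cyberneko.org/html/properties/names/elems', 'lower')
    def root = new XmlSlurper(parser).parseText(fragment)
    // Same serialization pattern as in the answer, collected into a String.
    new StreamingMarkupBuilder().bind {
        mkp.declareNamespace('': root[0].namespaceURI())
        mkp.yield root
    }.toString()
}

def result = parseFragment('<div><h2>Test</h2><div>Hi</div></div>')
println result
```

Because the feature and the property are both set on the parser instance that is passed to XmlSlurper, they take effect for that parse. Note that XmlSlurper still expects a single root element, so this sketch assumes the fragment has one top-level element, as the one in the question does.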