我很难使用nekohtml解析器。
它在url上运行良好,但是当我想在一个简单的xml测试中进行测试时,它没有正确地读取它。
以下是我的申报方式:
def createAndSetParser() {
SAXParser parser = new SAXParser() //Default Sax NekoHTML parser
def charset = "Windows-1252" // The encoding of the page
def tagFormat = "upper" // Ensures all the tags and consistently written, by putting all of them in upper-case. We can choose "lower", "upper" of "match"
def attrFormat = "lower" // Same thing for attributes. We can choose "upper", "lower" or "match"
Purifier purifier = new Purifier() //Creating a purifier, in order to clean the incoming HTML
XMLDocumentFilter[] filter = [purifier] //Creating a filter, and adding the purifier to this filter. (NekoHTML feature)
parser.setProperty("http://cyberneko.org/html/properties/filters", filter)
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset)
parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat)
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat)
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true) // Forces the parser to use the charset we provided to him.
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false) // To let the Doctype as it is.
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false) // To make sure no namespace is added or overridden.
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser) // A groovy parser that does not download the all tree structure, but rather supply only the information it is asked for.
}
同样的,当我在网站上使用它的时候,它工作得很好。
你猜我为什么不能对简单的xml文本样本这么做??
任何非常感谢的帮助:)
最佳答案
我让您的脚本在groovy控制台上可执行,以便使用grape从maven中央存储库中获取所需的nekohtml库。
@Grapes(
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
)
import groovy.xml.StreamingMarkupBuilder
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier
def createAndSetParser() {
SAXParser parser = new SAXParser()
parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser)
}
def printResult(def gPathResult) {
println new StreamingMarkupBuilder().bind { out << gPathResult }
}
def parser = createAndSetParser()
printResult parser.parseText('<html><body>Hello World</body></html>')
printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')
当以这种方式执行时,这两个
printResult
-语句的结果如下所示,并且可以解释解析xml字符串时的问题,因为它被包装成<html><body>...</body></html>
标记并释放了名为<house/>
的根标记:<HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
<HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>
所有这些都是由您在脚本中启用的
http://cyberneko.org/html/features/balance-tags
功能引起的。如果禁用此功能(它必须显式设置为false,因为它默认为true),则结果如下:<HTML><BODY>Hello World</BODY></HTML>
<HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>