Problem description
I want to do SAX parsing with Nokogiri, but when I try to parse XML elements that have long, complicated element names or attributes, everything goes wrong.
For instance, if I want to parse this XML file and grab all the title elements, how do I do that with Nokogiri's SAX parser?
<titles>
<title xml:lang="sv">Arkivvetenskap</title>
<title xml:lang="en">Archival science</title>
</titles>
Recommended answer
In your example, title is the name of the element and xml:lang="sv" is an attribute. The parser below assumes there are no elements nested inside title elements.
require 'rubygems'
require 'nokogiri'
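class MyDocument < Nokogiri::XML::SAX::Document
  # Called for every opening tag; attrs is a list of [name, value] pairs.
  def start_element(name, attrs)
    @attrs = attrs
    @content = ''
  end

  # Called for every closing tag; only title elements are reported.
  def end_element(name)
    if name == 'title'
      puts Hash[@attrs]['xml:lang']
      puts @content.inspect
      @content = nil
    end
  end

  # Text may arrive in several chunks, so append rather than assign.
  def characters(string)
    @content << string if @content
  end

  # Treat CDATA sections like ordinary text.
  def cdata_block(string)
    characters(string)
  end
end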
parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
parser.parse(DATA)
__END__
<titles>
<title xml:lang="sv">Arkivvetenskap</title>
<title xml:lang="en">Archival science</title>
</titles>
This prints:
sv
"Arkivvetenskap"
en
"Archival science"
SAX parsing is usually far more complex than the task requires. Because of that, I recommend Nokogiri's standard in-memory parser or, if you really need speed and memory efficiency, Nokogiri's Reader parser.
For comparison, here is the standard Nokogiri parser applied to the same document:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(DATA)
doc.css('title').each do |title|
  # xml:lang is a namespaced attribute; recent Nokogiri versions look it up by its prefixed name.
  puts title['xml:lang']
  puts title.text.inspect
end
__END__
<titles>
<title xml:lang="sv">Arkivvetenskap</title>
<title xml:lang="en">Archival science</title>
</titles>
And here is the Reader parser applied to the same document:
require 'rubygems'
require 'nokogiri'
reader = Nokogiri::XML::Reader(DATA)
while reader.read
  if reader.name == 'title' && reader.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    puts reader.attribute('xml:lang')
    puts reader.inner_xml.inspect # TODO xml decode this, if necessary.
  end
end
__END__
<titles>
<title xml:lang="sv">Arkivvetenskap</title>
<title xml:lang="en">Archival science</title>
</titles>