Problem description
I'm trying to use Ruby's Nokogiri to parse large (1 GB or more) XML files. I'm testing the code on a smaller file containing only 4 records, available here. I'm using Nokogiri 1.5.0 and Ruby 1.8.7 on Ubuntu 10.10. Since I don't understand SAX very well, I'm starting with Nokogiri::XML::Reader.
My first attempt, which tries to retrieve the content of the PMID tag, looks like this:
#!/usr/bin/ruby
require "rubygems"
require "nokogiri"

file = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
p = []
reader.each do |node|
  if node.name == "PMID"
    p << node.inner_xml
  end
end
puts p.inspect
This is what I expected to see:
["21714156", "21693734", "21692271", "21692260"]
This is what I actually see:
["21714156", "", "21693734", "", "21692271", "", "21692260", ""]
It seems that, for some reason, my code is finding or generating an extra, empty PMID tag for every instance of PMID. Either that, or inner_xml does not work the way I thought.
I'd be grateful if anyone could confirm that my code and data generate the result shown and suggest where I'm going wrong.
Recommended answer
Each element in the stream comes through as two events: one to open the element and one to close it. The opening event will have
node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
and the closing event will have
node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
The empty strings you're seeing are just the element closing events. Remember that with stream parsing (SAX or Reader), you're basically walking through a tree, so you need the second event to tell you when you're going back up and closing an element.
You probably want something more like this:
reader.each do |node|
  if node.name == "PMID" && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    p << node.inner_xml
  end
end
Or perhaps:
reader.each do |node|
  next if node.name != 'PMID'
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  p << node.inner_xml
end
Or some other variation.