Question
So I'm attempting to parse a 400k+ line XML file using Nokogiri.
The XML file has this basic format:
<?xml version="1.0" encoding="windows-1252"?>
<JDBOR date="2013-09-01 04:12:31" version="1.0.20 [2012-12-14]" copyright="Orphanet (c) 2013">
  <DisorderList count="6760">

    *** Repeated Many Times ***

    <Disorder id="17601">
      <OrphaNumber>166024</OrphaNumber>
      <Name lang="en">Multiple epiphyseal dysplasia, Al-Gazali type</Name>
      <DisorderSignList count="18">
        <DisorderSign>
          <ClinicalSign id="2040">
            <Name lang="en">Macrocephaly/macrocrania/megalocephaly/megacephaly</Name>
          </ClinicalSign>
          <SignFreq id="640">
            <Name lang="en">Very frequent</Name>
          </SignFreq>
        </DisorderSign>
      </DisorderSignList>
    </Disorder>

    *** Repeated Many Times ***

  </DisorderList>
</JDBOR>
Here is the code I've written to parse out each DisorderSign id and name and store them in a database:
require 'nokogiri'
sympFile = File.open("Temp.xml")
@doc = Nokogiri::XML(sympFile)
sympFile.close()
symptomsList = []
@doc.xpath("////DisorderSign").each do |x|
  signId = x.at('ClinicalSign').attribute('id').text()
  name = x.at('ClinicalSign').element_children().text()
  symptomsList.push([signId, name])
end

symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
This works perfectly on the test files I've used, although they were much smaller, around 10,000 lines.
When I attempt to run this on the large XML file, it simply does not finish. I left it on overnight and it seemed to just lock up. Is there any fundamental reason the code I've written would make this very memory intensive or inefficient? I realize I store every possible pair in a list, but that shouldn't be large enough to fill up memory.
Thanks for the help.
Answer
I see a few possible problems. First of all, this:
@doc = Nokogiri::XML(sympFile)
will slurp the whole XML file into memory as some sort of libxml2 data structure, and that will probably be larger than the raw XML file.
Then you do something like this:
@doc.xpath(...).each
That may not be smart enough to produce an enumerator that just maintains a pointer to the internal form of the XML; it might be producing a copy of everything when it builds the NodeSet that xpath returns. That would give you another copy of most of the expanded-in-memory version of the XML. I'm not sure how much copying and array construction happens here, but there is room for a fair bit of memory and CPU overhead even if it doesn't duplicate everything.
Then you make your copy of what you're interested in:
symptomsList.push([signId, name])
and finally you iterate over that array:
symptomsList.each do |x|
  Symptom.where(:name => x[1], :signid => Integer(x[0])).first_or_create
end
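Incidentally, even before switching to SAX you could cut out the intermediate array and the second pass by creating each row as you find it. A small sketch, assuming the same Symptom model and the structure above:

@doc.xpath('//DisorderSign').each do |x|
  sign = x.at('ClinicalSign')
  # Write the row immediately instead of buffering [id, name] pairs.
  Symptom.where(:name => sign.element_children.text,
                :signid => Integer(sign.attribute('id').text)).first_or_create
end

That trims one copy of the data, but the fully parsed document is still in memory, so it only goes so far.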
I find that SAX parsers work better with large data sets, but they are more cumbersome to work with. You could try creating your own SAX parser, something like this:
class D < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [ ])
    if(name == 'DisorderSign')
      @data = { }              # start collecting a new record
    elsif(name == 'ClinicalSign')
      @key = :sign
      @data[@key] = ''
    elsif(name == 'SignFreq')
      @key = :freq
      @data[@key] = ''
    elsif(name == 'Name')
      @in_name = true
    end
  end

  def characters(str)
    # Only accumulate text inside a <Name> within an element we care about.
    @data[@key] += str if(@key && @in_name)
  end

  def end_element(name)
    if(name == 'DisorderSign')
      # Dump @data into the database here.
      @data = nil
    elsif(name == 'ClinicalSign')
      @key = nil
    elsif(name == 'SignFreq')
      @key = nil
    elsif(name == 'Name')
      @in_name = false
    end
  end
end
The structure should be pretty clear: you watch for the opening of the elements you're interested in and do a bit of bookkeeping setup when they do, then cache the strings if you're inside an element you care about, and finally clean up and process the data as the elements close. Your database work would replace the
# Dump @data into the database here.
comment.
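To drive the handler, you hand an instance of it to Nokogiri's SAX parser. A minimal sketch, reusing the Temp.xml file name from the question:

require 'nokogiri'

parser = Nokogiri::XML::SAX::Parser.new(D.new)
# Passing an IO lets the parser read the file incrementally instead of slurping it.
File.open('Temp.xml') { |f| parser.parse(f) }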
This structure makes it pretty easy to watch for the <Disorder id="17601"> elements so that you can keep track of how far you've gone. That way you can stop and restart the import with some small modifications to your script.
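For example, a hedged sketch of that tracking: start_element receives the attributes as an array of [name, value] pairs, so you can stash the current Disorder id as you pass it (the checkpoint and skip logic itself is hypothetical and left out):

def start_element(name, attrs = [ ])
  if(name == 'Disorder')
    @current_disorder_id = Hash[attrs]['id']  # e.g. "17601"; checkpoint this somewhere
  elsif(name == 'DisorderSign')
    @data = { }
  # ... the rest as above ...
  end
end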