Problem description
I'm using lxml's etree.parse to parse a rather large XML file (around 65 MB to 300 MB). When I run a standalone Python script containing the function below, I get a memory allocation failure:
Error:
Memory allocation failed : xmlSAX2Characters, line 5350155, column 16
Partial function code:
def getID():
    try:
        from lxml import etree
        xml = etree.parse(<xml_file>)  # here is where the failure occurs
        for element in xml.iter():
            ...
        result = <formed by concatenating element texts>
        return result
    except Exception, ex:
        <handle exception>
The weird thing is that when I typed the same function into IDLE and tested it on the same XML file, I did not encounter any memory allocation error.
Any ideas on this issue? Thanks in advance.
Recommended answer
I would parse the document with the iterative parser instead, calling .clear() on any element you are done with; that way you avoid loading the whole document into memory in one go.
You can limit the iterative parser to only the tags you are interested in. If you only want to parse <person> tags, tell your parser so:
for _, element in etree.iterparse(input, tag='person'):
    # process your person data
    element.clear()
By clearing the element in the loop, you free it from memory.
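One caveat worth noting: element.clear() empties an element but leaves the (now empty) element itself attached to the root, so on very large files those stubs can still accumulate. A common refinement is to also delete already-processed preceding siblings from the root. Here is a minimal, self-contained sketch of that pattern; the `<people>`/`<person>`/`<name>` document and the `iter_persons` helper are hypothetical names chosen for illustration:

```python
import io
from lxml import etree

def iter_persons(source):
    """Stream the <name> text of each <person>, freeing memory as we go."""
    # iterparse builds the tree incrementally and fires an 'end' event
    # for every closing </person> tag.
    for _, elem in etree.iterparse(source, tag='person'):
        yield elem.findtext('name')
        elem.clear()  # drop this element's children, text, and attributes
        # Also delete already-processed siblings still referenced by the
        # root, so cleared stubs do not pile up over the whole file.
        while elem.getprevious() is not None:
            del elem.getparent()[0]

xml = (b"<people>"
       b"<person><name>Ann</name></person>"
       b"<person><name>Bo</name></person>"
       b"</people>")
names = list(iter_persons(io.BytesIO(xml)))
print(names)  # ['Ann', 'Bo']
```

Because the function is a generator, each person is handed to the caller before its memory is released, which is what keeps peak usage roughly constant regardless of file size.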