This article explains how to handle memory allocation errors when parsing large XML files with lxml's etree.parse.

Problem Description

I'm using lxml etree.parse to parse a fairly large XML file (around 65MB - 300MB). When I run my standalone Python script containing the function below, I get a memory allocation failure:

Error:

     Memory allocation failed : xmlSAX2Characters, line 5350155, column 16

Partial function code:

def getID():
    try:
        from lxml import etree
        xml = etree.parse(<xml_file>)  # here is where the failure occurs
        for element in xml.iter():
            ...
            result = <formed by concatenating element texts>
        return result
    except Exception, ex:
        <handle exception>

The weird thing is that when I typed the same function into IDLE and tested it on the same XML file, I did not encounter any memory allocation error.

Any ideas on this issue? Thanks in advance.

Recommended Answer

I would parse the document using the iterative parser (iterparse) instead, calling .clear() on any element you are done with; that way you avoid having to load the whole document into memory in one go.

You can limit the iterative parser to only those tags you are interested in. If you only want to parse <person> tags, tell your parser so:

from lxml import etree

for _, element in etree.iterparse(input, tag='person'):
    # process your person data
    element.clear()
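As a quick sanity check, the snippet above can be run against a small in-memory document. The `<person>` records and the `<name>` field here are invented purely for illustration:

```python
import io
from lxml import etree

# A tiny sample document standing in for the real, much larger file.
sample = b"""<people>
    <person><name>Alice</name></person>
    <person><name>Bob</name></person>
</people>"""

names = []
for _, element in etree.iterparse(io.BytesIO(sample), tag='person'):
    names.append(element.findtext('name'))
    element.clear()  # release the element's children once processed

print(names)  # ['Alice', 'Bob']
```

The same loop works unchanged on a file path or file object, which is how you would feed it the 65MB-300MB input from the question.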

By clearing each element inside the loop, you free its memory as you go.

