Question
I am using elementtree.ElementTree.iterparse to parse a large (371 MB) XML file.
My code is basically this:
    from elementtree.ElementTree import iterparse  # xml.etree.ElementTree in the standard library
    outf = open('out.txt', 'w')
    context = iterparse('copyright.xml')
    context = iter(context)
    dummy, root = next(context)  # first (event, element) pair from the parser
    for event, elem in context:
        if elem.tag == 'foo':
            author = elem.text
        elif elem.tag == 'bar':
            if elem.text is not None and 'bat' in elem.text.lower():
                outf.write(elem.text + '\n')
        elem.clear()  # line A
        root.clear()  # line B
My question is twofold:
First: do I need both A and B (see the comments in the code snippet)? I was told that root.clear() clears unnecessary children so that memory isn't devoured, but here are my observations: using B without A is the same as using neither in terms of memory consumption (as plotted with Task Manager). Using only A seems to be the same as using both.
Second: why is this still consuming so much memory? As the program runs, it uses about 100 MB of RAM near the end.
I assume it has something to do with outf, but why? Isn't it just writing to disk? And if it is storing that data before outf closes, how can I avoid that?
Other information: I am using Python 2.7.3 on Windows.
Recommended answer
(The code as posted, with the second line indented, should not run.) http://bugs.python.org/issue14762 was a similar issue, and the answer there is that you should clear each element (line A). Without seeing what outf is (or the code that created it), it is hard to answer the second question. If it were a StringIO object, the answer would be obvious. You might take a look at the tutorial linked in the second message of the tracker issue:
http://eli.thegreenplace.net/2012/03/15/processing-xml-in-python-with-elementtree/
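As an illustration of the advice to clear each element, here is a minimal sketch using the standard-library xml.etree.ElementTree. It is not code from the tracker issue or the tutorial. The function name extract_matching, its parameters, and the sample data are hypothetical. One detail is an addition of this sketch rather than part of the answer: when iterparse is called without an events argument it reports only "end" events, so the first item it yields is the first element to be closed, not the document root. That may be why line B appeared to have no effect in the question. Requesting "start" events as well makes the first item the real root.

```python
import io
import xml.etree.ElementTree as ET  # standard-library counterpart of elementtree.ElementTree


def extract_matching(source, out, tag="bar", needle="bat"):
    """Stream `source`, write matching <tag> texts to `out`, return the match count."""
    # Ask for "start" events too, so the first event delivers the real root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    count = 0
    for event, elem in context:
        if event != "end":
            continue  # an element is only complete once its "end" event arrives
        if elem.tag == tag and elem.text and needle in elem.text.lower():
            out.write(elem.text + "\n")
            count += 1
        elem.clear()  # line A: drop this element's text, attributes and children
        root.clear()  # line B: drop the emptied elements still attached to the root
    return count


# Hypothetical sample data standing in for copyright.xml.
sample = b"<root><foo>Ann</foo><bar>Batman</bar><bar>other</bar></root>"
out = io.StringIO()
matches = extract_matching(io.BytesIO(sample), out)
```

The StringIO object here only keeps the demo self-contained. As the answer hints, a StringIO holds everything written to it in memory, so for a large input a real file opened with open('out.txt', 'w') is the better target.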