Problem description
Memory leaks when using the XML package in R are nothing new. The subject has already been discussed:
- Serious memory leak when iteratively parsing XML files
- http://www.omegahat.org/RSXML/MemoryManagement.html
- http://r.789695.n4.nabble.com/memory-leak-using-XML-readHTMLTable-td4643332.html
However, after reading all of these documents, I still do not know a solution for my particular case. Consider the following code:
library(XML)
GetHref = function(x)
{
  subDoc = xmlChildren(x)
  hrefs = ifelse(is.null(subDoc$a), NA, xmlGetAttr(subDoc$a, 'href'))
  rm(subDoc)
  return(hrefs)
}
url = 'http://www.atpworldtour.com/Share/Event-Draws.aspx?e=338&y=2013'
parse = htmlParse(url)
print(.Call("R_getXMLRefCount", parse)) #prints 1
NodeList = xpathSApply(parse, "//td[@class='col_1']/div/div/div[@class='player']")
print(.Call("R_getXMLRefCount", parse)) #prints 33
PlNames = sapply(NodeList, xmlValue, trim = TRUE)
print(.Call("R_getXMLRefCount", parse)) #prints 33
hrefs = sapply(NodeList, GetHref)
print(.Call("R_getXMLRefCount", parse)) #prints 157
rm(NodeList)
gc()
print(.Call("R_getXMLRefCount", parse)) #prints 157
It seems that the internal XML nodes created during post-processing are never deleted. What would be a solution in this case?
Session Info:
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.98-1.1
loaded via a namespace (and not attached):
[1] tools_3.0.2
Recommended answer
I succeeded in correcting a problem very similar to yours.
My document is a simple XML document:
doc = xmlParse(file_path)
I applied Duncan Temple Lang's advice about bypassing the memory management when collecting subnodes. For that purpose, I first gather the subnodes with getNodeSet, with the finalizer deactivated:
nodeset = getNodeSet(doc, xml_path, addFinalizer = FALSE)
From this set, I can build a subdocument and free it without any memory leak:
subxml = subdoc(nodeset[[1]])
# ... do plenty of sapply
free(subxml)
At the end, I force the objects to be released, in this order:
free(doc)
rm(nodeset)
With all of this, I no longer have any memory leak. Hope this helps!
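For reference, here is a minimal, self-contained sketch of the same idea, not the exact code from either post. It assumes a small inline HTML string (html_text) in place of the real page, and a hypothetical helper extract_links that is not part of the XML package. The helper collects nodes with addFinalizer = FALSE, copies what it needs into plain R vectors, and then frees the document explicitly. The nodes must not be touched once the document has been freed.

```r
library(XML)

# Hypothetical inline document standing in for the real page (assumption).
html_text <- paste0(
  "<html><body>",
  "<div class='player'><a href='/p/1'>Player One</a></div>",
  "<div class='player'><a href='/p/2'>Player Two</a></div>",
  "</body></html>"
)

# Hypothetical helper: returns plain R data and keeps no XML references.
extract_links <- function(text) {
  doc <- htmlParse(text, asText = TRUE)
  # Release the C-level document when the function exits, even on error.
  on.exit(free(doc), add = TRUE)

  # addFinalizer = FALSE: the returned nodes are not tracked by the
  # package's reference counting, so they must not outlive `doc`.
  nodes <- getNodeSet(doc, "//div[@class='player']/a", addFinalizer = FALSE)

  # Copy everything needed into ordinary character vectors before freeing.
  data.frame(
    name = vapply(nodes, xmlValue, character(1)),
    href = vapply(nodes, xmlGetAttr, character(1), "href"),
    stringsAsFactors = FALSE
  )
}

links <- extract_links(html_text)
```

Keeping the node set local to the function means nothing that points into the freed document survives the call; only the data frame of strings is returned. Whether this removes the leak in a given setup is worth checking with gc() and the reference count, as in the question.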