Problem description
I have ~100 XML files of publication data, each larger than 10 GB, formatted like this:
<?xml version="1.0" encoding="UTF-8"?>
<records xmlns="http://website">
<REC rid="this is a test">
<UID>ABCD123</UID>
<data_1>
<fullrecord_metadata>
<references count="3">
<reference>
<uid>ABCD2345</uid>
</reference>
<reference>
<uid>ABCD3456</uid>
</reference>
<reference>
<uid>ABCD4567</uid>
</reference>
</references>
</fullrecord_metadata>
</data_1>
</REC>
<REC rid="this is a test">
<UID>XYZ0987</UID>
<data_1>
<fullrecord_metadata>
<references count="N">
</references>
</fullrecord_metadata>
</data_1>
</REC>
</records>
The number of references varies for each unique entry (indexed by UID), and some entries may have none.
The goal: create one simple data.frame per XML file, as follows:
UID reference
ABCD123 ABCD2345
ABCD123 ABCD3456
ABCD123 ABCD4567
XYZ0987 NULL
Because of the size of the files and the need to loop efficiently over many of them, I have been exploring xmlEventParse to limit memory usage. I can successfully extract the unique "UID" for each "REC" and create a data.frame using the following code from prior questions:
branchFunction <- function() {
store <- new.env()
func <- function(x, ...) {
ns <- getNodeSet(x, path = "//UID")
key <- xmlValue(ns[[1]])
value <- xmlValue(ns[[1]])
print(value)
store[[key]] <- value
}
getStore <- function() { as.list(store) }
list(UID = func, getStore=getStore)
}
myfunctions <- branchFunction()
xmlEventParse(
file = "test.xml",
handlers = NULL,
branches = myfunctions
)
DF <- do.call(rbind.data.frame, myfunctions$getStore())
But I cannot successfully store the reference data, nor handle the varying number of references per UID. Thanks for any suggestions!
Recommended answer
Set up a function that creates a temporary storage area for our element data, as well as a handler function that will be called every time a <REC> element is found.
library(XML)
uid_traverse <- function() {
# we'll store them as character vectors and then make a data frame out of them.
# this is likely one of the cheapest & fastest methods despite growing a vector
# inch by inch. You can pre-allocate space and modify this idiom accordingly
# for another speedup.
uids <- c()
refs <- c()
REC <- function(x) {
uid <- xpathSApply(x, "//UID", xmlValue)
ref <- xpathSApply(x, "//reference/uid", xmlValue)
if (length(uid) > 0) {
if (length(ref) == 0) {
uids <<- c(uids, uid)
refs <<- c(refs, NA_character_)
} else {
uids <<- c(uids, rep(uid, length(ref)))
refs <<- c(refs, ref)
}
}
}
# we return a named list with the element handler and another
# function that turns the vectors into a data frame
list(
REC = REC,
uid_df = function() {
data.frame(uid = uids, ref = refs, stringsAsFactors = FALSE)
}
)
}
We need one instance of this function.
uid_f <- uid_traverse()
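The reason one instance matters is that each call to a function like uid_traverse() creates its own private storage, which its inner functions update with <<-. The following is a minimal base-R sketch of that closure behaviour; make_counter, bump, and value are hypothetical names used only for illustration and are not part of the XML package or the answer's code.

```r
# Hypothetical illustration of closure state in base R.
# Each call to make_counter() gets its own `count` variable.
make_counter <- function() {
  count <- 0L
  list(
    # `<<-` updates `count` in the enclosing environment, not a local copy
    bump = function() {
      count <<- count + 1L
      invisible(count)
    },
    value = function() count
  )
}

a <- make_counter()
b <- make_counter()
a$bump()
a$bump()
b$bump()
a$value()  # state held by `a` only
b$value()  # state held by `b` only
```

By the same logic, creating a fresh instance for each input file should keep the results of different files from accumulating in the same vectors.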
Now we call xmlEventParse() and give it our handler, wrapping the call in invisible() since we don't need what xmlEventParse() returns and only want the side effects:
invisible(
xmlEventParse(
file = path.expand("~/data/so.xml"),
branches = uid_f["REC"])
)
Then we look at the result:
uid_f$uid_df()
## uid ref
## 1 ABCD123 ABCD2345
## 2 ABCD123 ABCD3456
## 3 ABCD123 ABCD4567
## 4 XYZ0987 <NA>
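The comments in uid_traverse() note that growing the vectors with c() could be replaced by another storage idiom. One possible variant, sketched here in base R only, appends one small list element per record and flattens everything once at the end. make_collector, add, and to_df are hypothetical names, and whether this is actually faster on 10 GB files is an assumption that would need to be measured.

```r
# Hypothetical storage helper: one list element per record,
# flattened into a data frame only when requested.
make_collector <- function() {
  chunks <- list()
  n <- 0L

  add <- function(uid, ref) {
    # A record with no references still yields one row with NA
    if (length(ref) == 0) ref <- NA_character_
    n <<- n + 1L
    chunks[[n]] <<- list(uid = rep(uid, length(ref)), ref = ref)
    invisible(NULL)
  }

  to_df <- function() {
    data.frame(
      uid = as.character(unlist(lapply(chunks, `[[`, "uid"))),
      ref = as.character(unlist(lapply(chunks, `[[`, "ref"))),
      stringsAsFactors = FALSE
    )
  }

  list(add = add, to_df = to_df)
}

collector <- make_collector()
collector$add("ABCD123", c("ABCD2345", "ABCD3456", "ABCD4567"))
collector$add("XYZ0987", character(0))
collector$to_df()
```

Inside a REC handler, the two xpathSApply() results would be passed to add() in place of the c() calls.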