Question
How to get a sample of a given size from a large XML file in R?
Unlike sampling random lines from a text file, which is simple, sampling here has to preserve the structure of the XML so that R can read the result into a proper data.frame.
One possible solution is to read the whole file and then sample rows, but is it possible to read only the necessary chunks?
A sample from the file:
<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <sku>967190</sku>
    <productId>98611</productId>
    ...
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
  ...
The number of lines per "product" varies, and the total number of records is not known before opening the file.
Recommended answer
Instead of reading the entire file in, it's possible to use event parsing with a closure that handles the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a file. Process records one at a time. If the index i of the current record is less than or equal to the number n of records to keep, store it; otherwise store it with probability n / i, overwriting a randomly chosen record among those already kept. This could be implemented as
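i <- 0L; n <- 10L
select <- function() {
    i <<- i + 1L                # count the record being processed
    if (i <= n)
        i                       # the first n records fill slots 1..n
    else {
        if (runif(1) < n / i)
            sample(n, 1)        # keep: overwrite a randomly chosen slot
        else
            0                   # drop this record
    }
}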
It behaves as follows:
> i <- 0L; n <- 10L; replicate(20, select())
[1] 1 2 3 4 5 6 7 8 9 10 1 5 7 0 1 9 0 2 1 0
This tells us to keep the first 10 elements, then replace element 1 with element 11, element 5 with element 12, and element 7 with element 13, then drop the 14th element, and so on. Replacements become less frequent as i becomes much larger than n.
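To see the same rule applied to actual data rather than slot indices, here is a minimal sketch that samples from an in-memory vector. The function name `reservoir_sample` is made up for illustration and is not part of the original answer; it uses base R only.

```r
# Hypothetical helper: apply the select() rule to a vector x, keeping n items.
reservoir_sample <- function(x, n) {
  # the first n records (or all of them, if x is shorter) fill the reservoir
  kept <- x[seq_len(min(n, length(x)))]
  if (length(x) > n) {
    for (i in (n + 1L):length(x)) {
      # keep record i with probability n / i, overwriting a random slot
      if (runif(1) < n / i)
        kept[[sample(n, 1)]] <- x[[i]]
    }
  }
  kept
}

set.seed(1)
s <- reservoir_sample(1:1000, 10L)
print(s)
```

Each element of the input should end up in the result with the same probability, n / length(x), even though the length is never needed up front. That is what makes the approach suitable for a file whose record count is unknown.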
We use this as part of a product handler. It pre-allocates space for the results we're interested in; then, each time a 'product' node is encountered, we test whether to select it and, if so, add it to our current results at the appropriate location.
sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}
The 'select' function and the 'product' handler are combined with a function (get) that allows us to retrieve the current values, and they're all placed in a closure so that we have a kind of factory pattern that encapsulates the variables n, i, and sku.
sampler <- function(n)
{
    force(n)   # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) {
            i
        } else {
            if (runif(1) < n / i)
                sample(n, 1)
            else
                0
        }
    }
    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }
    list(product=product, get=function() list(sku=sku))
}
And then we're ready to go:
library(XML)
products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
as.data.frame(products$get())
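Since the goal is a data.frame with more than one column, the same pattern can be extended to several child elements. The sketch below is an illustration and not part of the original answer: the name `sampler_fields`, its `fields` argument, and the generated test file are all assumptions. It assumes the XML package is installed and that every product contains each requested field.

```r
library(XML)

# Hypothetical variant of sampler() that keeps several child fields per product.
sampler_fields <- function(n, fields) {
  force(n)
  i <- 0L
  # one pre-allocated character vector per requested field
  store <- setNames(lapply(fields, function(f) rep(NA_character_, n)), fields)
  select <- function() {
    i <<- i + 1L
    if (i <= n) i
    else if (runif(1) < n / i) sample(n, 1)
    else 0L
  }
  product <- function(p) {
    slot <- select()
    if (slot > 0) {
      for (f in fields)
        store[[f]][[slot]] <<- xmlValue(p[[f]])
    }
    NULL   # return NULL, as the original handler does
  }
  list(product = product, get = function() store)
}

# Build a small made-up file with 50 products to try it on.
path <- tempfile(fileext = ".xml")
writeLines(c(
  "<products>",
  sprintf("<product><sku>%d</sku><productId>%d</productId></product>",
          1:50, 1001:1050),
  "</products>"
), path)

set.seed(42)
res <- xmlTreeParse(path, handlers = sampler_fields(5L, c("sku", "productId")))
df <- as.data.frame(res$get(), stringsAsFactors = FALSE)
print(df)
```

Because both fields are written to the same slot in one handler call, each row of the resulting data.frame comes from a single product. All columns come back as character and can be converted afterwards.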
Once the number of nodes processed, i, gets large relative to n, the running time will scale linearly with the size of the file, so you can get a sense of whether it performs well enough by starting with subsets of the original file.