Random sampling from an XML file into a data frame in R

Question

How to get a sample of a given size from a large XML file in R?

Reading random lines from a file is simple, but here the structure of the XML file must be preserved so that R can read the sample into a proper data.frame.

A possible solution is to read the whole file and then sample rows, but is it possible to read only necessary chunks?

Sample from the file:

<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <sku>967190</sku>
    <productId>98611</productId>
...
    <listingId/>
    <sellerId/>
    <shippingRestrictions/>
  </product>
...

The number of lines for each "product" is not equal. The final number of records is unknown before opening the file.

Recommended answer

Instead of reading the entire file in, it's possible to use event parsing with a closure that handles only the nodes you're interested in. To get there, I'll start with a strategy for random sampling from a file (a form of reservoir sampling). Process records one at a time. If the index i of the current record is less than or equal to the number n of records to keep, store it; otherwise store it with probability n / i, replacing a randomly chosen one of the n records already kept. This could be implemented as:

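i <- 0L; n <- 10L    # i: records seen so far; n: sample size to keep
# Returns the slot (1..n) in which to store the current record, or 0 to skip it.
select <- function() {
    i <<- i + 1L
    if (i <= n)
        i                       # still filling the first n slots
    else {
        if (runif(1) < n / i)
            sample(n, 1)        # keep: overwrite a randomly chosen slot
        else
            0                   # skip this record
    }
}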

It behaves as follows:

> i <- 0L; n <- 10L; replicate(20, select())
 [1]  1  2  3  4  5  6  7  8  9 10  1  5  7  0  1  9  0  2  1  0

This tells us to keep the first 10 elements, then we replace element 1 with element 11, element 5 with element 12, element 7 with element 13, then drop the 14th element, etc. Replacements become less frequent as i becomes much larger than n.
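To make the selection rule easier to check in isolation, here is a minimal base-R sketch that applies the same rule to an in-memory vector instead of XML nodes. The helper name `reservoir_sample` is hypothetical and not part of the answer's code; the simulation settings (20 items, samples of size 5, 4000 runs) are arbitrary choices for illustration.

```r
# Hypothetical helper: draw n items from x in a single pass,
# using the same rule as select() (keep record i with probability n / i).
reservoir_sample <- function(x, n) {
    keep <- x[seq_len(min(n, length(x)))]   # the first n records are always kept
    if (length(x) > n) {
        for (i in (n + 1L):length(x)) {
            if (runif(1) < n / i)
                keep[[sample(n, 1)]] <- x[[i]]   # overwrite a random slot
        }
    }
    keep
}

set.seed(42)
runs <- 4000
draws <- unlist(replicate(runs, reservoir_sample(1:20, 5), simplify = FALSE))
# Share of runs in which each of the 20 items was selected.
freq <- tabulate(draws, nbins = 20) / runs
round(freq, 2)   # each entry should be close to 5 / 20 = 0.25
```

If the rule is implemented correctly, every item should be selected about equally often, regardless of its position in the input.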

We use this as part of a 'product' handler. The handler pre-allocates space for the results we're interested in; each time a 'product' node is encountered, it tests whether to select the node and, if so, stores the value at the appropriate location in the current results:

sku <- character(n)
product <- function(p) {
    i <- select()
    if (i)
        sku[[i]] <<- xmlValue(p[["sku"]])
    NULL
}

The 'select' and 'product' handlers are combined with a function (get) that allows us to retrieve the current values, and they're all placed in a closure so that we have a kind of factory pattern that encapsulates the variables n, i, and sku

sampler <- function(n)
{
    force(n)    # otherwise lazy evaluation could lead to surprises
    i <- 0L
    select <- function() {
        i <<- i + 1L
        if (i <= n) {
            i
        } else {
            if (runif(1) < n / i)
                sample(n, 1)
            else
                0
        }
    }

    sku <- character(n)
    product <- function(p) {
        i <- select()
        if (i)
            sku[[i]] <<- xmlValue(p[["sku"]])
        NULL
    }

    list(product=product, get=function() list(sku=sku))
}
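The factory pattern used by `sampler()` can be seen in isolation in a stripped-down base-R sketch. The names `make_counter` and `step` are hypothetical and used only for illustration; the point is that each call to the factory creates its own private state, modified through `<<-`, just as each call to `sampler()` gets its own `i` and `sku`.

```r
# Hypothetical factory: each call creates a fresh, private counter i.
make_counter <- function() {
    i <- 0L
    list(
        step = function() { i <<- i + 1L; invisible(i) },  # updates the enclosing i
        get  = function() i                                # reads the current value
    )
}

a <- make_counter()
b <- make_counter()
a$step(); a$step()   # advance a twice
b$step()             # advance b once
c(a = a$get(), b = b$get())   # the two counters do not share state
```

This is why a new `sampler(n)` should be created for every parse: reusing an old one would continue from its previous count and results.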

And then we're ready to go:

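library(XML)    # provides xmlTreeParse() and xmlValue()
# xmlTreeParse() returns the handler list here, so get() is available on the result
products <- xmlTreeParse("foo.xml", handlers=sampler(1000))
as.data.frame(products$get())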
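For a self-contained way to try the approach, the sketch below writes a small toy XML file and samples two fields per product. It assumes the XML package is installed. The factory name `sampler2`, the toy file, and its values are all hypothetical; `sku` and `productId` are the element names shown in the question's excerpt.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical variant of sampler() that keeps two fields per product.
sampler2 <- function(n) {
    force(n)
    i <- 0L
    sku <- character(n)
    productId <- character(n)
    product <- function(p) {
        i <<- i + 1L
        # Slot to write to: i while filling, a random slot with
        # probability n / i afterwards, otherwise 0 (skip).
        slot <- if (i <= n) i else if (runif(1) < n / i) sample(n, 1) else 0L
        if (slot > 0L) {
            sku[[slot]] <<- xmlValue(p[["sku"]])
            productId[[slot]] <<- xmlValue(p[["productId"]])
        }
        NULL  # returning NULL drops the node instead of keeping it in a tree
    }
    get_result <- function() {
        k <- seq_len(min(i, n))  # handles files with fewer than n products
        data.frame(sku = sku[k], productId = productId[k],
                   stringsAsFactors = FALSE)
    }
    list(product = product, get = get_result)
}

# Toy input: 50 products with made-up values.
toy <- tempfile(fileext = ".xml")
writeLines(c('<?xml version="1.0" encoding="UTF-8"?>', "<products>",
             sprintf("  <product><sku>%d</sku><productId>%d</productId></product>",
                     1:50, 1001:1050),
             "</products>"), toy)

set.seed(1)
handlers <- xmlTreeParse(toy, handlers = sampler2(10))
sampled <- handlers$get()
sampled
```

Storing each field in its own pre-allocated vector keeps the handler cheap; a real file would need one vector per field of interest and a decision on how to treat empty elements such as `<listingId/>`.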

Once the number of nodes processed i gets large relative to n, this will scale linearly with the size of the file, so you can get a sense for whether it performs well enough by starting with subsets of the original file.
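One way to see why the cost ends up dominated by the pass over the file: record i (for i > n) is stored with probability n / i, so the expected number of stores grows only logarithmically with the number of records. The base-R sketch below compares that expectation with a simulation. The values n = 1000 and N = 1e6 are arbitrary assumptions chosen for illustration, not figures from the question.

```r
n <- 1000L   # sample size (assumed for illustration)
N <- 1e6     # number of records in the file (assumed for illustration)

later <- (n + 1):N   # indices of the records that arrive after the first n

# The first n records are always stored; record i > n is stored with probability n / i.
expected_stores <- n + n * sum(1 / later)

set.seed(1)
# Vectorised simulation of the same keep/skip decisions.
simulated_stores <- n + sum(runif(length(later)) < n / later)

c(expected = expected_stores, simulated = simulated_stores)
```

With these settings the expectation is about 7,900 stores out of a million records, so almost all of the work is visiting each node once.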
