Question
In my research work, I normally deal with big 4D arrays (20–200 million elements). I'm trying to improve the computational speed of my calculations, looking for an optimal trade-off between speed and simplicity. I've already made some progress thanks to SO (see here and here).
Now, I'm trying to exploit the latest packages like data.table and plyr.
Let's start with something like this:
D = c(100, 1000, 8) #x,y,t
d = array(rnorm(prod(D)), dim = D)
I'd like to get, for each x (first dimension) and y (second dimension), the values of t that are above the 90th percentile. Let's do that with base R:
system.time(
  q1 <- apply(d, c(1, 2), function(x) {
    return(x >= quantile(x, .9, names = F))
  })
)
On my Macbook it takes about ten seconds, and I get back an array:
> dim(q1)
[1] 8 100 1000
(apply strangely changes the order of the dimensions; anyway, I don't care about that for now.) Now I can melt (reshape2 package) my array and use it with data.table:
> d_m = melt(d)
> colnames(d_m) = c('x', 'y', 't', 'value')
> d_t = data.table(d_m)
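As an aside (my addition, not from the original post), the dimension reordering that apply introduces can be undone with base R's aperm, which permutes array dimensions. A scaled-down sketch:

```r
# Scaled-down array so the example runs quickly; dims are x, y, t.
D <- c(10, 20, 8)
d <- array(rnorm(prod(D)), dim = D)

# apply() over margins c(1, 2) returns the function's output as the
# FIRST dimension, so q1 has dims (t, x, y) = (8, 10, 20).
q1 <- apply(d, c(1, 2), function(x) x >= quantile(x, .9, names = FALSE))

# Permute back to the original (x, y, t) order.
q1_xyt <- aperm(q1, c(2, 3, 1))
dim(q1_xyt)  # 10 20 8
```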
Then I do some data.table "magic":
system.time({
  q2 = d_t[, q := quantile(value, .9, names = F), by = "x,y"][, ev := value > q]
})
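As a sanity check on this result (my addition, not from the post): with t = 8 values per (x, y) cell and no ties in the random data, the type-7 90th percentile lies strictly between the 7th and 8th order statistics, so exactly one value per cell should satisfy value > q:

```r
library(reshape2)
library(data.table)

# Scaled-down version of the arrays used in the question.
set.seed(42)
D <- c(10, 20, 8)  # x, y, t
d <- array(rnorm(prod(D)), dim = D)

d_t <- data.table(melt(d, varnames = c("x", "y", "t")))
d_t[, q := quantile(value, .9, names = FALSE), by = .(x, y)][, ev := value > q]

# Each (x, y) cell flags exactly one exceedance.
counts <- d_t[, .(n_ev = sum(ev)), by = .(x, y)]
all(counts$n_ev == 1)  # TRUE
```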
Now the computation takes slightly less than ten seconds. Next, let's try plyr and ddply:
system.time({
  q3 <- ddply(d_m, .(x, y), summarise, q = quantile(value, .9, names = F))
})
Now it takes 60 seconds. If I move to dplyr, I can do the same calculation in about ten seconds.
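The dplyr code isn't shown in the post; a sketch of what it might look like (my assumption, using group_by and mutate on the melted data frame):

```r
library(reshape2)
library(dplyr)

# Scaled-down version of the melted array from the question.
D <- c(10, 20, 8)  # x, y, t
d <- array(rnorm(prod(D)), dim = D)
d_m <- melt(d, varnames = c("x", "y", "t"))

# Per-(x, y) group: compute the 90th percentile and flag exceedances.
q4 <- d_m %>%
  group_by(x, y) %>%
  mutate(q = quantile(value, .9, names = FALSE),
         ev = value > q) %>%
  ungroup()
```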
However, my question is the following: what would you do to perform the same calculation faster? If I consider a larger matrix (say 20 times bigger), data.table gives me a faster computation than the apply function, but still of the same order of magnitude (14 minutes vs 10 minutes). Any comment is really appreciated...
EDIT
I've implemented the quantile function in C++ using Rcpp, speeding up the computation by a factor of eight.
Answer
As suggested by @roland, one possible way to speed up the code is to implement a faster version of the quantile function. I spent one hour learning how to do that using Rcpp, and the running time decreased by a factor of eight. I implemented the type 7 version of the quantile algorithm (the default choice). We are still far from MATLAB performance (discussed here), but in my case this is an impressive step forward. I am not proud of the Rcpp code I have written so far; I didn't have the time to polish it. Anyway, it works (I checked the results against the R function), so if you are interested you can download it from here.
This concludes the post on multidimensional array manipulation in R: apply vs data.table vs plyr (parallel); hopefully the answer above is helpful.