How to optimize reading and writing subsections of a matrix in R (possibly using data.table)

Problem description

TL;DR

This is my first Stack Overflow question. I greatly appreciate your time in taking a look, and I apologize if I've left anything out. I'm working on an R package where I have a performance bottleneck from subsetting and writing to portions of a matrix (NB for statisticians: the application is updating sufficient statistics after processing each data point). The individual operations are incredibly fast, but the sheer number of them requires each one to be as fast as possible.
The simplest version of the idea is a matrix of dimension K by V, where K is generally between 5 and 1000 and V can be between 1000 and 1,000,000.

    set.seed(94253)
    K <- 100
    V <- 100000
    mat <- matrix(runif(K*V), nrow=K, ncol=V)

We then end up performing a calculation on a subset of the columns and adding the result into the full matrix. Naively, it looks like this:

    Vsub <- sample(1:V, 20)
    toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))
    mat[,Vsub] <- mat[,Vsub] + toinsert
    library(microbenchmark)
    microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert)

Because this is done so many times, it can be quite slow as a result of R's copy-on-change semantics (but see the lessons learned below: modification can actually happen in place in some circumstances).

For my problem the object need not be a matrix (and I'm sensitive to the difference, as outlined in "Assign a matrix to a subset of a data.table"). I always want the full column, so the list structure of a data frame is fine.

My solution was to use Matthew Dowle's awesome data.table package. The write can be done extraordinarily quickly using set(). Unfortunately, getting the values is somewhat more complicated: we have to select the columns with with=FALSE, which slows things down dramatically.

    library(data.table)
    DT <- as.data.table(mat)
    set(DT, i=NULL, j=Vsub, DT[,Vsub,with=FALSE] + as.numeric(toinsert))

Within the set() function, using i=NULL to reference all rows is incredibly fast, but (presumably due to the way things are stored under the hood) there is no comparable option for j. @Roland notes in the comments that one option would be to convert to a triple representation (row number, column number, value) and use data.table's binary search to speed up retrieval. I tested this manually, and while it is quick, it approximately triples the memory requirements for the matrix. I would like to avoid this if possible.

This follows up on a related question: "Time in getting single elements from data.table and data.frame objects".
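The triple representation that @Roland suggests can be sketched roughly as follows. This is only an illustration under assumptions: the table name `long` and the column names `row`, `col`, and `value` are made up here, the dimensions are kept small, and the index columns are stored as integers, so the memory overhead will not necessarily match the figure quoted above.

```r
library(data.table)

set.seed(94253)
K <- 5    # rows (kept small for illustration)
V <- 50   # columns
mat <- matrix(runif(K * V), nrow = K, ncol = V)

# Long ("triple") form: one row per matrix cell. as.vector() walks the
# matrix column by column, so the row index cycles fastest.
long <- data.table(
  row   = rep(seq_len(K), times = V),
  col   = rep(seq_len(V), each = K),
  value = as.vector(mat)
)
setkey(long, col, row)  # a sorted key enables binary search on col

Vsub <- sample(seq_len(V), 4)
toinsert <- matrix(runif(K * length(Vsub)), nrow = K, ncol = length(Vsub))

# Read: a keyed join pulls every cell of the requested columns,
# in the order given by Vsub and sorted by row within each column.
sub <- matrix(long[.(Vsub)]$value, nrow = K)
stopifnot(isTRUE(all.equal(sub, mat[, Vsub])))

# Write: update the matched rows by reference with `:=`.
long[.(Vsub), value := value + as.numeric(toinsert)]
```

The trade-off is the one described above: every cell now carries its own row and column index, so the table is considerably larger than the original matrix.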
For a single column index, Hadley Wickham's answer to that related question gives an incredibly fast solution:

    Vone <- Vsub[1]
    toinsert.one <- toinsert[,1]
    set(DT, i=NULL, j=Vone, (.subset2(DT, Vone) + toinsert.one))

However, since .subset2(DT, i) is just DT[[i]] without the method dispatch, there is no way (to my knowledge) to grab several columns at once, although it certainly seems like it should be possible. As in the previous question, it seems that since we can overwrite the values quickly, we should be able to read them quickly.

Any suggestions? Also, please let me know if there is a better solution than data.table for this problem. I realize it's not really the intended use case in many respects, but I'm trying to avoid porting the whole series of operations to C.

Here is a sequence of timings for the operations discussed. The first two operate on all the columns in Vsub; the last two operate on just one column.

    microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
                   set(DT, i=NULL, j=Vsub, DT[,Vsub,with=FALSE] + as.numeric(toinsert)),
                   mat[,Vone] <- mat[,Vone] + toinsert.one,
                   set(DT, i=NULL, j=Vone, (.subset2(DT, Vone) + toinsert.one)),
                   times=1000L)

    Unit: microseconds
                      expr      min       lq   median       uq       max neval
                    Matrix   51.970   53.895   61.754   77.313   135.698  1000
                Data.Table 4751.982 4962.426 5087.376 5256.597 23710.826  1000
         Matrix Single Col    8.021    9.304   10.427   19.570 55303.659  1000
     Data.Table Single Col    6.737    7.700    9.304   11.549    89.824  1000

Answer and Lessons Learned:

Solution

Fun with Rcpp:

You can use Eigen's Map class to modify an R object in place.

    library(RcppEigen)
    library(inline)

    # Type aliases: Map wraps existing R memory instead of copying it
    incl <- '
    using Eigen::Map;
    using Eigen::MatrixXd;
    using Eigen::VectorXi;
    typedef Map<MatrixXd> MapMatd;
    typedef Map<VectorXi> MapVeci;
    '

    # Add column i of BB to column ind[i] of AA (ind is 1-based)
    body <- '
    MapMatd A(as<MapMatd>(AA));
    const MapMatd B(as<MapMatd>(BB));
    const MapVeci ix(as<MapVeci>(ind));
    const int mB(B.cols());
    for (int i = 0; i < mB; ++i) {
      A.col(ix.coeff(i)-1) += B.col(i);
    }
    '

    funRcpp <- cxxfunction(signature(AA = "matrix", BB = "matrix", ind = "integer"),
                           body, "RcppEigen", incl)

    set.seed(94253)
    K <- 100
    V <- 100000
    mat2 <- mat <- matrix(runif(K*V), nrow=K, ncol=V)

    Vsub <- sample(1:V, 20)
    toinsert <- matrix(runif(K*length(Vsub)), nrow=K, ncol=length(Vsub))

    # Same update done twice: in plain R on mat, in place via Rcpp on mat2
    mat[,Vsub] <- mat[,Vsub] + toinsert
    invisible(funRcpp(mat2, toinsert, Vsub))
    all.equal(mat, mat2)
    #[1] TRUE

    library(microbenchmark)
    microbenchmark(mat[,Vsub] <- mat[,Vsub] + toinsert,
                   funRcpp(mat2, toinsert, Vsub))
    # Unit: microseconds
    #                                  expr    min     lq  median      uq       max neval
    # mat[, Vsub] <- mat[, Vsub] + toinsert 49.273 49.628 50.3250 50.8075 20020.400   100
    #         funRcpp(mat2, toinsert, Vsub)  6.450  6.805  7.6605  7.9215    25.914   100

I think this is basically what @Joshua Ulrich proposed. His warnings regarding breaking R's functional paradigm apply.

I do the addition in C++, but it is trivial to change the function to only do assignment.

Obviously, if you can implement your whole loop in Rcpp, you avoid repeated function calls at the R level and will gain performance.
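As an illustration of the assignment-only variant mentioned in the answer, here is a minimal sketch. It is not the answer's code: it swaps in plain Rcpp (`cppFunction` with `NumericMatrix`) for the Eigen `Map` approach, the helper name `setCols` is hypothetical, and it assumes `A` and `B` are double matrices so that Rcpp wraps the existing memory rather than coercing and copying.

```r
library(Rcpp)

# Hypothetical helper: overwrite the columns of A listed in ind (1-based)
# with the corresponding columns of B, without adding.
cppFunction('
void setCols(NumericMatrix A, NumericMatrix B, IntegerVector ind) {
  const int nr = A.nrow();
  const int nc = B.ncol();
  for (int j = 0; j < nc; ++j) {
    const int col = ind[j] - 1;          // R indices are 1-based
    for (int i = 0; i < nr; ++i) {
      A(i, col) = B(i, j);               // writes into the memory of A
    }
  }
}')

set.seed(94253)
K <- 100
V <- 1000
mat <- matrix(runif(K * V), nrow = K, ncol = V)
Vsub <- sample(seq_len(V), 20)
toinsert <- matrix(runif(K * length(Vsub)), nrow = K, ncol = length(Vsub))

# Reference result via ordinary R assignment. The assignment triggers
# copy-on-modify, so `expected` no longer shares memory with `mat`.
expected <- mat
expected[, Vsub] <- toinsert

setCols(mat, toinsert, Vsub)   # mat is changed in place, nothing is returned
stopifnot(identical(mat, expected))
```

The same caveat about R's functional paradigm applies here: any other variable that still shares memory with `mat` at the time of the call would see the change as well.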