
Question

I have to work with big.matrix objects, but I can't compute some functions on them. Consider the following big.matrix:

# create big.matrix object
x <- as.big.matrix(
      matrix( sample(1:10, 20, replace=TRUE), 5, 4,
           dimnames=list( NULL, c("a", "b", "c", "d")) ) )

> x
An object of class "big.matrix"
Slot "address":
<pointer: 0x00000000141beee0>

The corresponding matrix object is:

# create matrix object

x2<-x[,]

> x2
     a b  c  d
[1,] 6 9  5  3
[2,] 3 6 10  8
[3,] 7 1  2  8
[4,] 7 8  4 10
[5,] 6 3  6  4

If I compute this operation with the matrix object, it works:

sqrt(slam::col_sums(x2*x2))

> sqrt(slam::col_sums(x2*x2))
       a        b        c        d 
13.37909 13.82027 13.45362 15.90597 

But if I use the big.matrix object (which is what I actually have to use), it doesn't work:

sqrt(biganalytics::colsum(x*x))

There are two problems: the * operation (used to square each element of the matrix) produces an error, and so does the sqrt function.

How can I compute this operation with big.matrix objects?

Answer

With big.matrix objects, I found two solutions that offer good performance:

  • Code a function in Rcpp for what you specifically need. Here, two nested for loops would do the trick. However, you can't recode everything you need this way.
  • Apply an R function to blocks of columns of your big.matrix and aggregate the results. This is easy to do and uses R code only.

In your case, with 10,000 times more columns:

require(bigmemory)

x <- as.big.matrix(
  # draw one value per cell (5 * 40000) so that no values are recycled
  matrix( sample(1:10, 5 * 40000, replace=TRUE), 5, 40000,
          dimnames=list( NULL, rep(c("a", "b", "c", "d"), 10000) ) ) )

print(system.time(
  true <- sqrt(colSums(x[,]^2))
))

print(system.time(
  test1 <- biganalytics::apply(x, 2, function(x) {sqrt(sum(x^2))})
))
print(all.equal(test1, true))

So colSums is very fast but needs the whole matrix in RAM, whereas biganalytics::apply is slow but memory-efficient. A compromise is to use something like this:

CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb

  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))

  cbind(lower, upper, size)
}

seq2 <- function(lims) seq(lims["lower"], lims["upper"])

require(foreach)
big_aggregate <- function(X, FUN, .combine, block.size = 1e3) {
  intervals <- CutBySize(ncol(X), block.size)

  foreach(k = 1:nrow(intervals), .combine = .combine) %do% {
    FUN(X[, seq2(intervals[k, ])])
  }
}

print(system.time(
  test2 <- big_aggregate(x, function(X) sqrt(colSums(X^2)), .combine = 'c')
))
print(all.equal(test2, true))
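To see what the block-wise approach does, here is a small base-R sketch. It repeats CutBySize so that it runs on its own, and it uses an ordinary matrix as a stand-in for the big.matrix; the toy data and the name norms_by_block are made up for the illustration.

```r
# Same helper as above, repeated so that this snippet is self-contained.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

# 10 columns with block.size = 4 give 3 blocks: columns 1-3, 4-7 and 8-10.
blocks <- CutBySize(10, 4)
print(blocks)

# Hypothetical toy data: an ordinary matrix standing in for the big.matrix.
set.seed(1)
m <- matrix(sample(1:10, 50, replace = TRUE), 5, 10)

# Process one block of columns at a time and concatenate the results.
norms_by_block <- unlist(lapply(seq_len(nrow(blocks)), function(k) {
  cols <- blocks[k, "lower"]:blocks[k, "upper"]
  sqrt(colSums(m[, cols, drop = FALSE]^2))
}))

# The block-wise result matches the all-at-once computation.
print(all.equal(norms_by_block, sqrt(colSums(m^2))))
```

Only one block of columns is held as an ordinary matrix at any time, so block.size trades memory use against the number of calls to FUN.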

Edit: this is now implemented in the bigstatsr package:

print(system.time(
  test2 <- bigstatsr::big_apply(x, a.FUN = function(X, ind) {
    sqrt(colSums(X[, ind]^2))
  }, a.combine = 'c')
))
print(all.equal(test2, true))
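One caveat, stated as an assumption about your installed version: recent releases of bigstatsr work on the package's own FBM (Filebacked Big Matrix) class rather than on big.matrix objects, so the call above may need an FBM as input. A minimal sketch with made-up toy data, assuming bigstatsr is installed:

```r
library(bigstatsr)

# Hypothetical toy data; as_FBM() makes a file-backed copy of an R matrix.
set.seed(1)
m <- matrix(sample(1:10, 200, replace = TRUE), 5, 40)
X <- as_FBM(m)

# big_apply() passes the FBM and the column indices of the current block.
norms <- big_apply(X, a.FUN = function(X, ind) {
  sqrt(colSums(X[, ind]^2))
}, a.combine = 'c', block.size = 10)

print(all.equal(norms, sqrt(colSums(m^2))))
```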
