r - Vectorizing a for loop that contains a which statement and a function

Here is a reproducible example of the code I am trying to vectorize.

cutOffs <- seq(1,10,0.2)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

for(plotPoint in 1:length(cutOffs))
{
  plotOutput[plotPoint, "x"] <-
    nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])
}

plotOutput

What I specifically want to find out is whether there is a way to vectorize this part:

nrow(iris[ which(iris$Sepal.Length > cutOffs[plotPoint] &
                   iris$Sepal.Width > cutOffs[plotPoint]), ])

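I assume that using the plyr library or some form of apply would probably not give much of a speed-up, and speed is really what I am after. Fundamentally, I am trying to see whether there is some vectorization technique that I overlooked or managed to miss while searching.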
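For comparison, this is roughly what an apply-style rewrite of the loop could look like. It is only a sketch: the helper name countAbove is made up for illustration, and the code assumes the built-in iris data set. It still scans the whole data frame once per cut-off, which is why it is not expected to be much faster than the for loop.

```r
# Sketch only: the same counts as the for loop, written with vapply.
# `countAbove` is a hypothetical helper name, not part of the original post.
cutOffs <- seq(1, 10, 0.2)

countAbove <- function(cutOff) {
  # Identical subsetting logic to the loop body, applied to one cut-off.
  nrow(iris[which(iris$Sepal.Length > cutOff & iris$Sepal.Width > cutOff), ])
}

# One call per cut-off, so the work is still a full scan per cut-off.
x_apply <- vapply(cutOffs, countAbove, integer(1))
```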

Update:

Unit: milliseconds
  expr         min          lq        mean      median          uq         max neval
  op() 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700 33663.39700     1
  jr()  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088  3976.53088     1
  dd()  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050  4253.21050     1
 exp()  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331  5085.45331     1
 nic()  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043  8719.82043     1
  sg()    16.66177    16.66177    16.66177    16.66177    16.66177    16.66177     1

A more realistic approximation of what I am actually doing is:

# generate data
numObs <- 1e5
iris <- data.frame( Sepal.Length = sample(1:numObs), Sepal.Width = sample(1:numObs) )

cutOffs <- 1:(numObs*0.01)

plotOutput <- matrix(nrow=length(cutOffs), ncol=2)
colnames(plotOutput) <- c("x","y")
plotOutput[,"y"] <- cutOffs

Then pick whichever specific method you prefer.

In general, it will be used on data sets with 50,000-200,000 points.

Using the following gave a big improvement:

sum(Sepal.Length > cutOffs[plotPoint] & Sepal.Width > cutOffs[plotPoint])

This was a much better approach that I had been missing in the first place.
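A minimal sketch of why sum() can stand in for the nrow(...[which(...), ]) construction, assuming the built-in iris data set and an arbitrary cut-off of 3 chosen purely for illustration. Summing a logical vector counts its TRUE values, so no subsetted data frame has to be built.

```r
# Sketch only: counting TRUE values directly instead of subsetting.
cutOff <- 3  # arbitrary illustrative value
keep <- iris$Sepal.Length > cutOff & iris$Sepal.Width > cutOff

n_subset <- nrow(iris[which(keep), ])  # original approach: build a subset, count rows
n_sum    <- sum(keep)                  # TRUE counts as 1, FALSE as 0

stopifnot(n_subset == n_sum)
```

One difference to keep in mind: which() silently drops NA, whereas sum() returns NA unless na.rm = TRUE is passed. The iris data has no missing values, so the two agree here.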

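However, by far the best answer is sgibb's sg(). The key is realising that only the lowest of the two values in each row matters. Once that mental leap is made, there is only a single vector left to deal with, and vectorizing it is fairly straightforward.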

# cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
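The equivalence behind this step can be checked on a toy example. The vectors len and wid and the cut-off below are made-up values for illustration only: both values in a row exceed the cut-off exactly when the smaller of the two does.

```r
# Sketch only: "both above the cut-off" is the same test as "row minimum above the cut-off".
len <- c(5.1, 4.9, 6.3)  # illustrative values
wid <- c(3.5, 3.0, 2.5)
cutOff <- 2.8

both   <- len > cutOff & wid > cutOff  # two comparisons per row
m      <- pmin(len, wid)               # element-wise minimum: 3.5, 3.0, 2.5
viaMin <- m > cutOff                   # one comparison per row

stopifnot(identical(both, viaMin))
```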

Best answer

I would like to add another answer:

sg <- function() {
  # cutOff should be lower than the lowest of Sepal.Length & Sepal.Width
  m <- pmin(iris$Sepal.Length, iris$Sepal.Width)
  ms <- sort.int(m)
  # use `findInterval` to count, for each cut-off, how many sorted values
  # are less than or equal to it; the remaining values lie strictly above it
  plotOutput[,"x"] <- length(ms)-findInterval(cutOffs, ms)
  plotOutput
}

This approach avoids a for loop or outer, and is 4 times faster than @nicola's approach:

microbenchmark(sg(), nic(), dd())
#Unit: microseconds
#  expr     min       lq     mean   median       uq      max neval
#  sg()  88.726 104.5805 127.3172 123.2895 144.2690  232.441   100
# nic() 474.315 526.7780 625.0021 602.3685 706.7530  997.412   100
#  dd() 669.841 736.7800 887.4873 847.7730 976.6445 2800.930   100

identical(sg(), dd())
# [1] TRUE
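As a small illustration of the counting step inside sg(), the sketch below runs findInterval on a tiny sorted vector and checks the result against a brute-force count. The values in ms and cutOffs are made up for the example. For a sorted vector, findInterval(x, ms) returns how many elements of ms are less than or equal to x, so subtracting from length(ms) leaves the number strictly above x.

```r
# Sketch only: findInterval as a vectorized "how many values are <= x" lookup.
ms <- sort.int(c(2.5, 3.0, 3.0, 3.5, 4.1))  # illustrative row minima, sorted
cutOffs <- c(2, 3, 3.2, 5)                  # illustrative cut-offs

below <- findInterval(cutOffs, ms)  # 0 3 3 5
above <- length(ms) - below         # 5 2 2 0

# Brute-force check: count the values strictly above each cut-off.
brute <- vapply(cutOffs, function(cut) sum(ms > cut), integer(1))
stopifnot(all(above == brute))
```

Because ms is sorted once and each lookup is a binary search, the cost grows with the number of cut-offs times the logarithm of the number of rows, rather than with their product.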

Regarding r - Vectorizing a for loop that contains a which statement and a function, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30054878/