本文介绍了确定向量中是否存在元素的最有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有几种算法,这些算法取决于确定向量中元素是否存在的效率.在我看来,%in%(等效于is.element())应该是最有效的,因为它只返回一个布尔值.在测试了几种方法之后,令我惊讶的是,这些方法到目前为止是效率最低的.以下是我的分析(随着向量大小的增加,结果会变得更糟):

EfficiencyTest <- function(n, Lim) {

    samp1 <- sample(Lim, n)
    set1 <- sample(Lim, Lim)

    print(system.time(for(i in 1:n) {which(set1==samp1[i])}))
    print(system.time(for(i in 1:n) {samp1[i] %in% set1}))
    print(system.time(for(i in 1:n) {is.element(samp1[i], set1)}))
    print(system.time(for(i in 1:n) {match(samp1[i], set1)}))
    a <- system.time(set1 <- sort(set1))
    b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
    print(a+b)
}

> EfficiencyTest(10^3, 10^5)
user  system elapsed
0.29    0.11    0.40
user  system elapsed
19.79    0.39   20.21
user  system elapsed
19.89    0.53   20.44
user  system elapsed
20.04    0.28   20.33
user  system elapsed
0.02    0.00    0.03

其中BinVecCheck是我编写的二进制搜索算法,该算法返回TRUE/FALSE.请注意,我包括了使用最终方法对向量进行排序所花费的时间.这是二进制搜索的代码:

BinVecCheck <- function(tar, vec) {
    if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)}
    size <- length(vec)
    size2 <- trunc(size/2)
    dist <- (tar - vec[size2])
    if (dist > 0) {
        lower <- size2 - 1L
        upper <- size
    } else {
        lower <- 1L
        upper <- size2 + 1L
    }
    while (size2 > 1 && !(dist==0)) {
        size2 <- trunc((upper-lower)/2)
        temp <- lower+size2
        dist <- (tar - vec[temp])
        if (dist > 0) {
            lower <- temp-1L
        } else {
            upper <- temp+1L
        }
    }
    if (dist==0) {return(TRUE)} else {return(FALSE)}
}

平台信息:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

问题

是否有更有效的方法来确定元素是否存在于R中的向量中?例如,是否有与Python set 函数等效的R函数,改进这种方法?另外,为什么%in%之类的效率低下,即使与提供更多信息的which函数相比,效率也是如此(不仅确定存在,而且还给出所有真实账户的索引)?

解决方案

我的测试并不能满足您的所有要求,但这(?)似乎是由于跨平台差异所致(这使问题更加严重了)神秘的事物,可能值得尝试使用r-devel@r-project.org,尽管由于下面的fastmatch解决方案仍然占据主导地位,所以也许不值得...)

 n <- 10^3; Lim <- 10^5
 set.seed(101)
 samp1 <- sample(Lim,n)
 set1 <- sample(Lim,Lim)
 library("rbenchmark")

 library("fastmatch")
 `%fin%` <- function(x, table) {
     stopifnot(require(fastmatch))
     fmatch(x, table, nomatch = 0L) > 0L
 }
 benchmark(which=sapply(samp1,function(x) which(set1==x)),
           infun=sapply(samp1,function(x) x %in% set1),
           fin= sapply(samp1,function(x) x %fin% set1),
           brc= sapply(samp1,BinVecCheck,vec=sort(set1)),
           replications=20,
    columns = c("test", "replications", "elapsed", "relative"))

##    test replications elapsed relative
## 4   brc           20   0.871    2.329
## 3   fin           20   0.374    1.000
## 2 infun           20   6.480   17.326
## 1 which           20  10.634   28.433

这表示%in%的速度大约是which的两倍-您的BinVecCheck功能要好7倍,但是fastmatch解决方案/32934933/faster-in-operator>这里的另一个因素是2.我不知道专门的Rcpp实现是否可以做得更好或更好...实际上,即使在运行代码时,我也会得到不同的答案:

##    user  system elapsed   (which)
##   0.488   0.096   0.586
##    user  system elapsed   (%in%)
##   0.184   0.132   0.315
##    user  system elapsed  (is.element)
##   0.188   0.124   0.313
##    user  system elapsed  (match)
##   0.148   0.164   0.312
##    user  system elapsed  (BinVecCheck)
##   0.048   0.008   0.055

更新:在r-devel上,Peter Dalgaard通过指向R 新闻条目:

sessionInfo()
## R Under development (unstable) (2015-10-23 r69563)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu precise (12.04.5 LTS)

I have several algorithms that depend on the efficiency of determining whether an element exists in a vector or not. It seems to me that %in% (which is equivalent to is.element()) should be the most efficient as it simply returns a Boolean value. After testing several methods, to my surprise, those methods are by far the most inefficient. Below is my analysis (the results get worse as the size of the vectors increase):

EfficiencyTest <- function(n, Lim) {

    samp1 <- sample(Lim, n)
    set1 <- sample(Lim, Lim)

    print(system.time(for(i in 1:n) {which(set1==samp1[i])}))
    print(system.time(for(i in 1:n) {samp1[i] %in% set1}))
    print(system.time(for(i in 1:n) {is.element(samp1[i], set1)}))
    print(system.time(for(i in 1:n) {match(samp1[i], set1)}))
    a <- system.time(set1 <- sort(set1))
    b <- system.time(for (i in 1:n) {BinVecCheck(samp1[i], set1)})
    print(a+b)
}

> EfficiencyTest(10^3, 10^5)
user  system elapsed
0.29    0.11    0.40
user  system elapsed
19.79    0.39   20.21
user  system elapsed
19.89    0.53   20.44
user  system elapsed
20.04    0.28   20.33
user  system elapsed
0.02    0.00    0.03

Where BinVecCheck is a binary search algorithm that I wrote that returns TRUE/FALSE. Note that I include the time it takes to sort the vector with the final method. Here is the code for the binary search:

BinVecCheck <- function(tar, vec) {
    if (tar==vec[1] || tar==vec[length(vec)]) {return(TRUE)}
    size <- length(vec)
    size2 <- trunc(size/2)
    dist <- (tar - vec[size2])
    if (dist > 0) {
        lower <- size2 - 1L
        upper <- size
    } else {
        lower <- 1L
        upper <- size2 + 1L
    }
    while (size2 > 1 && !(dist==0)) {
        size2 <- trunc((upper-lower)/2)
        temp <- lower+size2
        dist <- (tar - vec[temp])
        if (dist > 0) {
            lower <- temp-1L
        } else {
            upper <- temp+1L
        }
    }
    if (dist==0) {return(TRUE)} else {return(FALSE)}
}

Platform Info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Question

Is there a more efficient way of determining whether an element exists in a vector in R? For example, is there an equivalent R function to the Python set function, that greatly improves on this approach? Also, why is %in%, and the like, so inefficient even when compared to the which function which gives more information (not only does it determine existence, but it also gives the indices of all true accounts)?

解决方案

My tests aren't bearing out all of your claims, but that seems (?) to be due to cross-platform differences (which makes the question even more mysterious, and possibly worth taking up on r-devel@r-project.org, although maybe not since the fastmatch solution below dominates anyway ...)

 n <- 10^3; Lim <- 10^5
 set.seed(101)
 samp1 <- sample(Lim,n)
 set1 <- sample(Lim,Lim)
 library("rbenchmark")

 library("fastmatch")
 `%fin%` <- function(x, table) {
     stopifnot(require(fastmatch))
     fmatch(x, table, nomatch = 0L) > 0L
 }
 benchmark(which=sapply(samp1,function(x) which(set1==x)),
           infun=sapply(samp1,function(x) x %in% set1),
           fin= sapply(samp1,function(x) x %fin% set1),
           brc= sapply(samp1,BinVecCheck,vec=sort(set1)),
           replications=20,
    columns = c("test", "replications", "elapsed", "relative"))

##    test replications elapsed relative
## 4   brc           20   0.871    2.329
## 3   fin           20   0.374    1.000
## 2 infun           20   6.480   17.326
## 1 which           20  10.634   28.433

This says that %in% is about twice as fast as which -- your BinVecCheck function is 7x better, but the fastmatch solution from here gets another factor of 2. I don't know if a specialized Rcpp implementation could do better or not ...In fact, I get different answers even when running your code:

##    user  system elapsed   (which)
##   0.488   0.096   0.586
##    user  system elapsed   (%in%)
##   0.184   0.132   0.315
##    user  system elapsed  (is.element)
##   0.188   0.124   0.313
##    user  system elapsed  (match)
##   0.148   0.164   0.312
##    user  system elapsed  (BinVecCheck)
##   0.048   0.008   0.055

update: on r-devel Peter Dalgaard explains the platform discrepancy (which is an R version difference, not an OS difference) by pointing to the R NEWS entry:

sessionInfo()
## R Under development (unstable) (2015-10-23 r69563)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Ubuntu precise (12.04.5 LTS)

这篇关于确定向量中是否存在元素的最有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

09-10 23:51
查看更多