Problem description
Suppose you have a list of values
x <- list(a=c(1,2,3), b = c(2,3,4), c=c(4,5,6))
I would like to find the unique values across all list elements combined. So far, the following code has done the trick:
unique(unlist(x))
Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.
Recommended answer
This solution, suggested by Marek, is the best answer to the original question. See below for a discussion of other approaches and why Marek's is the most useful.
> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6
Discussion
A faster alternative to your original is to compute unique() on the components of your x first and then do a final unique() on those results. This will only work if the components of the list have the same number of unique values, as they do in both examples below.
First your version, then my double unique approach:
> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6
We have to call unique.default because there is a matrix method for unique that keeps one margin fixed; calling the default method directly is fine, as a matrix can be treated as a vector.
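To make the dispatch issue concrete, here is a minimal sketch using the x from the question. The variable names m, by_rows and as_vector are illustrative only and do not come from the original answer.

```r
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

# Each component has three unique values, so sapply() simplifies to a 3 x 3 matrix
m <- sapply(x, unique)
is.matrix(m)

# unique() dispatches to the matrix method, which removes duplicated rows
# and hands back a matrix rather than a flat vector of values
by_rows <- unique(m)
dim(by_rows)

# unique.default() treats the matrix as a plain vector of values
as_vector <- unique.default(m)
as_vector
```

Calling unique(as.vector(m)) should be an equivalent way to sidestep the matrix method, if naming the default method explicitly feels awkward.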
Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to set the use.names argument of unlist to FALSE, which results in a faster solution than the double unique version above. For the simple x of Roman's post we get
> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6
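As a rough illustration of what use.names changes, the sketch below compares the two calls on the x from the question. The variable names with_names and without_names are made up for this example; the timing effect itself is only shown in the benchmarks further down.

```r
x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

# By default unlist() builds a name for every element from the component names
with_names <- unlist(x)
names(with_names)

# With use.names = FALSE no names are constructed
without_names <- unlist(x, use.names = FALSE)
names(without_names)

# unique() returns the same values either way, so skipping the names loses nothing here
identical(unique(with_names), unique(without_names))
```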
Marek's solution will work even when the number of unique elements differs between components.
Here is a larger example with some timings of all three methods:
## Create a large list (1000 components of length 1000 each)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE),
                                ncol = 1000)))
Here are the results for the three approaches using DF:
> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed
 12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed
  0.510   0.000   0.512
This shows that the double unique approach (applying unique() to the individual components and then unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but this speed-up is purely due to the names on the list DF. If we tell unlist not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly and is quicker than the workaround, it is the preferred solution.
The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix, which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail.
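A small sketch of that failure mode, using a hypothetical ragged list y that is not part of the original post. As far as I can tell, the failure shows up as a list being returned rather than as an error. The last lines show one possible repair of the double unique idea, which is my own assumption and not something the answer proposes.

```r
# Hypothetical ragged list: component a has two unique values, b and c have three
y <- list(a = c(1, 1, 2), b = c(2, 3, 4), c = c(4, 5, 6))

# sapply() cannot simplify results of different lengths, so it returns a list
per_component <- sapply(y, unique)
is.list(per_component)

# unique.default() then compares whole list elements, not the values inside them
double_unique <- unique.default(per_component)
is.list(double_unique)

# Marek's approach is unaffected by the ragged lengths
marek <- unique(unlist(y, use.names = FALSE))
marek

# One possible repair: flatten explicitly before the final unique()
repaired <- unique(unlist(lapply(y, unique), use.names = FALSE))
identical(marek, repaired)
```

Using lapply() with an explicit unlist() avoids relying on sapply() simplification, at the cost of being little more than Marek's solution with an extra pass over the data.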