Problem description
Consider the following data frame:
df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))
# X1 X2 X3 X4 X5
#1 7 9 8 4 10
#2 2 4 9 4 9
#3 2 7 8 8 6
#4 8 9 6 6 4
#5 5 2 1 4 6
#6 8 2 2 1 7
#7 3 8 6 1 6
#8 3 8 5 9 8
#9 6 2 3 10 7
#10 2 7 4 2 9
Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2?
Something that would mimic a hypothetical filter_each(funs(. >= 2)).
Right now I am doing:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)
Which is equivalent to:
df %>% filter(!rowSums(. < 2))
Note: if I wanted to filter on only the first 4 columns, I would do:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2)
Or:
df %>% filter(!rowSums(.[-5] < 2))
Is there a more efficient alternative?
Sub-question
How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?
Benchmark
Since I have to run this on a large dataset, I benchmarked the suggestions.
library(dplyr)
library(microbenchmark)
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
times = 50
)
The results:
#Unit: milliseconds
# expr min lq mean median uq max neval
# Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458 50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669 50
# Docendo 874.0247 933.1399 983.5435 985.3697 1026.901 1053.407 50
Recommended answer
Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.
df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
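To make the integer-versus-logical distinction concrete, here is a minimal sketch on a small data frame. The seed and the intermediate name keep are illustrative assumptions, not part of the original answer.

```r
library(dplyr)

set.seed(42)  # assumed seed, only so the sketch is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

# Logical vector: TRUE for rows where none of the first four columns is below 2
keep <- !rowSums(df[-5] < 2)

by_filter <- df %>% filter(keep)        # filter() takes the logical vector directly
by_slice  <- df %>% slice(which(keep))  # slice() needs integer row positions

# Both calls keep the same rows
stopifnot(nrow(by_filter) == nrow(by_slice))
```

which() is what converts the logical vector into the row positions that slice expects.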
What I like about this approach is that, because we use select inside rowSums, you can make use of all the special functions that select supplies, such as matches.
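As a small illustration of that flexibility, the sketch below picks the same four columns in three different ways and checks that the filtered result is identical. The seed and the result names are assumptions made for the example; num_range is one of the helpers that select accepts.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

# Drop X5 with a regular-expression helper, as in the answer
via_matches <- df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))

# Keep X1..X4 by prefix and numeric range
via_range <- df %>% slice(which(!rowSums(select(., num_range("X", 1:4)) < 2L)))

# Drop X5 by its bare name
via_name <- df %>% slice(which(!rowSums(select(., -X5) < 2L)))

# All three selections describe the same columns, so the results agree
stopifnot(identical(via_matches, via_range), identical(via_matches, via_name))
```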
Let's see how it compares to the other answers:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.304216 1.290695 1.290127 1.288473 1.290609 50
# Richard 1.139796 1.146942 1.124295 1.159715 1.160689 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).
Following a comment that base R would be as fast as the slice approach (the comment did not specify which base R approach was meant), I decided to update my answer with a comparison to base R, using almost the same approach as in my answer. For base R I used:
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
Benchmark:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ],
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.265692 1.279057 1.298513 1.279167 1.203794 50
# Richard 1.124045 1.160075 1.163240 1.169573 1.076267 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
# base 2.784058 2.769062 2.710305 2.669699 2.576825 50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090 50
Neither of these two base R approaches matches the slice approach here: by median, base is about 2.7 times slower and base_which about 1.45 times slower.
Edit note #2: added a benchmark with base R options.
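For readers on a newer dplyr, the hypothetical filter_each from the question can be approximated with if_all(). This is a sketch that assumes dplyr 1.0.4 or later, which postdates this thread. It was not part of the benchmarks above, so no claim is made about its speed.

```r
library(dplyr)

set.seed(7)  # assumed seed, only so the sketch is reproducible
df <- data.frame(replicate(5, sample(1:10, 10, replace = TRUE)))

# Keep rows where every column is >= 2
all_cols <- df %>% filter(if_all(everything(), ~ .x >= 2))

# Keep rows where every column except X5 is >= 2
first_four <- df %>% filter(if_all(-X5, ~ .x >= 2))

# Cross-check against the rowSums formulation from the question
expected <- df[!rowSums(df[-5] < 2), ]
stopifnot(nrow(first_four) == nrow(expected))
```

The column specification inside if_all() uses the same selection syntax as select, so helpers such as matches work there as well.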