Problem description
In a specific column, I have several categories. I want to randomly thin out (remove) some rows in only one category. I've seen sample_n used with group_by, but its size argument keeps the same number of rows for every category in the grouping variable. I want to specify a different size for each group.
Second, I'd like to do it "in place": the result should be the same original data frame, only with fewer rows in the specific category I want to "dilute".
library(tidyverse)
set.seed(123)
df <-
  tibble(
    color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
    value = sample(0:750, size = 1000, replace = TRUE)
  )
df
## # A tibble: 1,000 x 2
## color value
## <chr> <int>
## 1 yellow 251
## 2 yellow 389
## 3 blue 742
## 4 blue 227
## 5 yellow 505
## 6 brown 47
## 7 green 381
## 8 red 667
## 9 blue 195
## 10 yellow 680
## # ... with 990 more rows
Tallying by color gives:
df %>% count(color)
color n
<chr> <int>
1 blue 204
2 brown 202
3 green 191
4 red 203
5 yellow 200
Now let's say I want to reduce the number of rows only for the red color, keeping just 10 rows where color == "red". Simply using sample_n doesn't get me there, obviously:
df %>%
  group_by(color) %>%
  sample_n(10) %>%
  count(color)
color n
<chr> <int>
1 blue 10
2 brown 10
3 green 10
4 red 10
5 yellow 10
How can I specify that only color == "red" ends up with 10 rows while the other colors remain untouched?
I've seen some similar questions (like this one), but wasn't able to adapt the answers to my case.
Recommended answer
We can write a function that filters the rows for specific colors, samples them, and binds the result back together with the untouched rows of the original data:
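library(dplyr)

# Keep n randomly chosen rows for each color in col_to_change;
# rows of every other color are returned unchanged.
sample_for_color <- function(data, col_to_change, n) {
  data %>%
    filter(color %in% col_to_change) %>%
    group_by(color) %>%
    slice_sample(n = n) %>%
    ungroup() %>%
    # Append the untouched rows (the sampled rows come first in the result)
    bind_rows(data %>% filter(!color %in% col_to_change))
}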
new_df <- df %>% sample_for_color('red', 10)
new_df %>% count(color)
# color n
# <chr> <int>
#1 blue 204
#2 brown 202
#3 green 191
#4 red 10
#5 yellow 200
new_df <- df %>% sample_for_color(c('red', 'blue'), 10)
new_df %>% count(color)
# color n
# <chr> <int>
#1 blue 10
#2 brown 202
#3 green 191
#4 red 10
#5 yellow 200
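The function above applies a single n to every listed color and places the sampled rows ahead of the untouched ones. As a minimal sketch, assuming a different size per color and the original row order are both wanted, a grouped filter could be used instead. The helper name thin_groups and the named vector sizes are hypothetical, introduced here only for illustration:

```r
library(dplyr)

set.seed(123)
df <- tibble(
  color = sample(c("red", "blue", "yellow", "green", "brown"), size = 1000, replace = TRUE),
  value = sample(0:750, size = 1000, replace = TRUE)
)

# Hypothetical helper (not part of dplyr).
# `sizes` is a named vector: names are colors, values are the rows to keep.
thin_groups <- function(data, sizes) {
  data %>%
    group_by(color) %>%
    filter({
      # Rows to keep in this group: the requested size if the color is listed,
      # otherwise the whole group.
      keep <- if (color[1] %in% names(sizes)) sizes[[color[1]]] else n()
      # sample(n()) is a random permutation of 1..n(), so exactly
      # min(keep, n()) rows satisfy the condition.
      sample(n()) <= keep
    }) %>%
    ungroup()
}

new_df <- df %>% thin_groups(c(red = 10, blue = 50))
new_df %>% count(color)
```

Because filter() leaves the surviving rows in their original positions, the unlisted colors come back exactly as they were and the remaining rows keep their original order, which is closer to the "in place" behavior asked for in the question.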