Problem description
I want to draw a random sample from my dataset, using a different proportion for each value of a factor variable and sampling weights stored in another column. A dplyr solution using pipes is preferred, since it can be dropped easily into a longer pipeline.
Take the iris dataset as an example. The Species column has three values with 50 rows each. Assume the sampling weights are stored in the Sepal.Length column. If I sample the same proportion (or the same number of rows) from every species, the problem is easy to solve:
library(tidyverse)
iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)
# A tibble: 15 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.4 3.7 1.5 0.2 setosa
2 5.3 3.7 1.5 0.2 setosa
3 5.7 4.4 1.5 0.4 setosa
4 5 3.5 1.6 0.6 setosa
5 4.8 3.1 1.6 0.2 setosa
6 6.1 2.9 4.7 1.4 versicolor
7 6.7 3.1 4.7 1.5 versicolor
8 5 2 3.5 1 versicolor
9 7 3.2 4.7 1.4 versicolor
10 5.7 2.9 4.2 1.3 versicolor
11 7.2 3.2 6 1.8 virginica
12 6.7 2.5 5.8 1.8 virginica
13 6.4 2.8 5.6 2.1 virginica
14 6.3 3.3 6 2.5 virginica
15 7.2 3 5.8 1.6 virginica
But I get stuck when I need a different proportion for each species, say 10%, 20%, and 25% respectively.
iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)
#Error: `prop` must be a single number
or
iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0
Please help.
Recommended answer
If I understand correctly:
iris %>%
group_split(Species) %>%
map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
[[1]]
# A tibble: 5 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.9 3 1.4 0.2 setosa
2 4.8 3 1.4 0.1 setosa
3 5.2 4.1 1.5 0.1 setosa
4 5 3.5 1.6 0.6 setosa
5 5.2 3.5 1.5 0.2 setosa
[[2]]
# A tibble: 10 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.3 2.5 4.9 1.5 versicolor
2 5.5 2.6 4.4 1.2 versicolor
3 6.9 3.1 4.9 1.5 versicolor
4 6.6 2.9 4.6 1.3 versicolor
5 6.1 3 4.6 1.4 versicolor
6 5.7 2.8 4.5 1.3 versicolor
7 6.7 3.1 4.4 1.4 versicolor
8 5.1 2.5 3 1.1 versicolor
9 5.7 3 4.2 1.2 versicolor
10 7 3.2 4.7 1.4 versicolor
[[3]]
# A tibble: 12 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 6.4 3.2 5.3 2.3 virginica
2 7.2 3.2 6 1.8 virginica
3 6.3 3.3 6 2.5 virginica
4 6.2 2.8 4.8 1.8 virginica
5 7.6 3 6.6 2.1 virginica
6 5.7 2.5 5 2 virginica
7 4.9 2.5 4.5 1.7 virginica
8 6.7 3.1 5.6 2.4 virginica
9 7.7 2.8 6.7 2 virginica
10 6.7 3.3 5.7 2.5 virginica
11 6 3 4.8 1.8 virginica
12 5.6 2.8 4.9 2 virginica
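Note that the call above samples uniformly within each species; the weight_by argument from the question is not passed. A minimal sketch of the same map2 approach with the weights added back, assuming Sepal.Length is the weight column as in the question (set.seed is there only to make the sketch reproducible):

```r
library(dplyr)
library(purrr)

set.seed(1)  # reproducibility of this sketch only

weighted <- iris %>%
  group_split(Species) %>%
  # .x is one species' tibble, .y is the matching proportion
  map2(c(0.1, 0.2, 0.25),
       ~ slice_sample(.x, prop = .y, weight_by = Sepal.Length))

# Number of rows drawn from each species
map_int(weighted, nrow)
```

The proportions are matched to groups by position, so this relies on group_split returning the groups in factor-level order (setosa, versicolor, virginica).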
Just change map2 to map2_df if you want a data frame returned:
iris %>%
group_split(Species) %>%
map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
# A tibble: 27 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.7 3.8 1.7 0.3 setosa
2 4.8 3.1 1.6 0.2 setosa
3 5.1 3.8 1.5 0.3 setosa
4 4.9 3.6 1.4 0.1 setosa
5 4.8 3.4 1.6 0.2 setosa
6 5.7 2.8 4.1 1.3 versicolor
7 6.6 3 4.4 1.4 versicolor
8 6.8 2.8 4.8 1.4 versicolor
9 5.8 2.7 4.1 1 versicolor
10 6.4 3.2 4.5 1.5 versicolor
# ... with 17 more rows
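As an alternative, the proportions can be looked up by species name inside a grouped pipeline, which avoids depending on group order and returns a single data frame directly. This is a sketch rather than the answerer's method: props is a hypothetical named vector introduced for illustration, and Sepal.Length is again assumed to hold the weights.

```r
library(dplyr)

# Hypothetical lookup: sampling proportion for each species
props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)

set.seed(1)  # reproducibility of this sketch only

sampled <- iris %>%
  group_by(Species) %>%
  # .x is the group's rows, .y is a one-row tibble holding the group key
  group_modify(~ slice_sample(.x,
                              prop = props[[as.character(.y$Species)]],
                              weight_by = Sepal.Length)) %>%
  ungroup()

# Rows drawn per species
count(sampled, Species)
```

With 50 rows per species, this draws 5, 10, and 12 rows (12.5 is rounded down), for 27 rows in total, the same sizes as the output above. In recent purrr releases the map2_df family is marked as superseded, so a grouped approach like this, or map2 followed by bind_rows, may be worth considering in new code.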