This article covers how to draw a random sample from a dataset in R using a different proportion for each value of a factor variable.

Problem description

I want to draw a random sample from my dataset, using a different proportion for each value of a factor variable and weighting rows by the values stored in another column. A dplyr solution that works in a pipe is preferred, since it can be dropped into a longer chain of code easily.

Take the iris dataset as an example. The Species column has three values with 50 rows each. Assume the sampling weights are stored in the Sepal.Length column. If I sample the same proportion (or the same number of rows) from each species, the problem is easy to solve:

library(tidyverse)

iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)

# A tibble: 15 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.4         3.7          1.5         0.2 setosa
 2          5.3         3.7          1.5         0.2 setosa
 3          5.7         4.4          1.5         0.4 setosa
 4          5           3.5          1.6         0.6 setosa
 5          4.8         3.1          1.6         0.2 setosa
 6          6.1         2.9          4.7         1.4 versicolor
 7          6.7         3.1          4.7         1.5 versicolor
 8          5           2            3.5         1   versicolor
 9          7           3.2          4.7         1.4 versicolor
10          5.7         2.9          4.2         1.3 versicolor
11          7.2         3.2          6           1.8 virginica
12          6.7         2.5          5.8         1.8 virginica
13          6.4         2.8          5.6         2.1 virginica
14          6.3         3.3          6           2.5 virginica
15          7.2         3            5.8         1.6 virginica

But I got stuck when I need to sample a different proportion for each species, say 10%, 20%, and 25% respectively.

iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)

#Error: `prop` must be a single number

iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0

Please help.

Recommended answer

If I understand correctly:

iris %>%
  group_split(Species) %>%
  map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))

[[1]]
# A tibble: 5 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          4.9         3            1.4         0.2 setosa
2          4.8         3            1.4         0.1 setosa
3          5.2         4.1          1.5         0.1 setosa
4          5           3.5          1.6         0.6 setosa
5          5.2         3.5          1.5         0.2 setosa

[[2]]
# A tibble: 10 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          6.3         2.5          4.9         1.5 versicolor
 2          5.5         2.6          4.4         1.2 versicolor
 3          6.9         3.1          4.9         1.5 versicolor
 4          6.6         2.9          4.6         1.3 versicolor
 5          6.1         3            4.6         1.4 versicolor
 6          5.7         2.8          4.5         1.3 versicolor
 7          6.7         3.1          4.4         1.4 versicolor
 8          5.1         2.5          3           1.1 versicolor
 9          5.7         3            4.2         1.2 versicolor
10          7           3.2          4.7         1.4 versicolor

[[3]]
# A tibble: 12 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          6.4         3.2          5.3         2.3 virginica
 2          7.2         3.2          6           1.8 virginica
 3          6.3         3.3          6           2.5 virginica
 4          6.2         2.8          4.8         1.8 virginica
 5          7.6         3            6.6         2.1 virginica
 6          5.7         2.5          5           2   virginica
 7          4.9         2.5          4.5         1.7 virginica
 8          6.7         3.1          5.6         2.4 virginica
 9          7.7         2.8          6.7         2   virginica
10          6.7         3.3          5.7         2.5 virginica
11          6           3            4.8         1.8 virginica
12          5.6         2.8          4.9         2   virginica
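Note that the call above samples uniformly within each species, while the question also asked for weights taken from Sepal.Length. A minimal sketch of the same map2 pattern with the weights added might look like the following. The names props and weighted_samples and the seed value are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(purrr)

# Proportions listed in factor-level order, because group_split() returns
# the groups sorted by the levels of Species (setosa, versicolor, virginica).
props <- c(0.1, 0.2, 0.25)  # hypothetical name

set.seed(1)  # arbitrary seed, only to make the draw repeatable

weighted_samples <- iris %>%
  group_split(Species) %>%
  # .x is one species' rows, .y is the matching proportion;
  # weight_by makes rows with a larger Sepal.Length more likely to be drawn.
  map2(props, ~ slice_sample(.x, prop = .y, weight_by = Sepal.Length))

# Rows drawn per species: 5, 10 and 12 (12.5 is rounded down)
map_int(weighted_samples, nrow)
```

The row counts match the unweighted output above; only the selection probabilities within each species change.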

If you want a single data frame returned instead of a list, just change map2 to map2_df:

iris %>%
  group_split(Species) %>%
  map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))

# A tibble: 27 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>
 1          5.7         3.8          1.7         0.3 setosa
 2          4.8         3.1          1.6         0.2 setosa
 3          5.1         3.8          1.5         0.3 setosa
 4          4.9         3.6          1.4         0.1 setosa
 5          4.8         3.4          1.6         0.2 setosa
 6          5.7         2.8          4.1         1.3 versicolor
 7          6.6         3            4.4         1.4 versicolor
 8          6.8         2.8          4.8         1.4 versicolor
 9          5.8         2.7          4.1         1   versicolor
10          6.4         3.2          4.5         1.5 versicolor
# ... with 17 more rows
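Pairing proportions with groups by position relies on group_split() returning the species in factor-level order. As an alternative sketch, the proportions can be looked up by species name inside a single grouped pipe with group_modify(). This is one possible variation rather than the answer's own method; the named vector props, the result name sampled, and the seed are assumptions made for illustration.

```r
library(dplyr)

# Proportion for each species, matched by name rather than by position
props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)  # hypothetical name

set.seed(1)  # arbitrary seed, only to make the draw repeatable

sampled <- iris %>%
  group_by(Species) %>%
  # In group_modify(), .x holds the rows of one group (without Species)
  # and .y is a one-row tibble containing that group's Species value.
  group_modify(~ slice_sample(
    .x,
    prop = props[[as.character(.y$Species)]],
    weight_by = Sepal.Length
  )) %>%
  ungroup()

# Check how many rows were drawn for each species
count(sampled, Species)
```

Because the lookup is by name, the result should not change if the order of the entries in props changes, and the pipe returns one ungrouped data frame that can feed directly into further dplyr steps.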

