我正在尝试根据值将数据框分类为分位数,然后为每一行分配子组。这些子组应该是随机分配的,但要使每个子组的大小尽可能相等。例如,如果我有一个 300 行的数据框,并且想要 5 个分位数,每个分位数有 3 个子组,我希望分位数 1 子组 1 中有 20 行,分位数 1 子组 2 中有 20 行,等等。

有没有一种漂亮而优雅的方法来做到这一点,最好可以使用 %>% 管道 dplyr

我当前代码的一个例子:

library(dplyr)

# Construct dataframe.
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
    29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
    45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
    61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
    77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
    93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
    107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119,
    120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132,
    133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145,
    146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158,
    159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
    172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184,
    185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197,
    198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210,
    211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
    224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236,
    237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249,
    250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262,
    263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275,
    276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288,
    289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300),
    value = c(26550.87, 37212.39, 57285.34, 90820.78, 20168.19,
    89838.97, 94467.53, 66079.78, 62911.4, 6178.63, 20597.46,
    17655.68, 68702.28, 38410.37, 76984.14, 49769.92, 71761.85,
    99190.61, 38003.52, 77744.52, 93470.52, 21214.25, 65167.38,
    12555.51, 26722.07, 38611.41, 1339.03, 38238.8, 86969.08,
    34034.9, 48208.01, 59956.58, 49354.13, 18621.76, 82737.33,
    66846.67, 79423.99, 10794.36, 72371.09, 41127.44, 82094.63,
    64706.02, 78293.28, 55303.63, 52971.96, 78935.62, 2333.12,
    47723.01, 73231.37, 69273.16, 47761.96, 86120.95, 43809.71,
    24479.73, 7067.9, 9946.62, 31627.17, 51863.43, 66200.51,
    40683.02, 91287.59, 29360.34, 45906.57, 33239.47, 65087.05,
    25801.68, 47854.52, 76631.07, 8424.69, 87532.13, 33907.29,
    83944.04, 34668.35, 33377.49, 47635.12, 89219.83, 86433.95,
    38998.95, 77732.07, 96061.8, 43465.95, 71251.47, 39999.44,
    32535.22, 75708.71, 20269.23, 71112.12, 12169.19, 24548.85,
    14330.44, 23962.94, 5893.44, 64228.83, 87626.92, 77891.47,
    79730.88, 45527.45, 41008.41, 81087.02, 60493.33, 65472.39,
    35319.73, 27026.01, 99268.41, 63349.33, 21320.81, 12937.23,
    47811.8, 92407.45, 59876.1, 97617.07, 73179.25, 35672.69,
    43147.37, 14821.16, 1307.76, 71556.61, 10318.42, 44628.43,
    64010.1, 99183.86, 49559.36, 48434.95, 17344.23, 75482.09,
    45389.55, 51116.98, 20754.51, 22865.81, 59571.2, 57487.22,
    7706.44, 3554.06, 64279.55, 92861.52, 59809.24, 56090.07,
    52602.77, 98509.52, 50764.18, 68278.81, 60154.12, 23886.87,
    25816.59, 72930.96, 45257.08, 17512.68, 74669.83, 10498.76,
    86454.49, 61464.5, 55715.95, 32877.73, 45313.14, 50044.1,
    18086.64, 52963.06, 7527.57, 27775.59, 21269.95, 28479.05,
    89509.41, 44623.53, 77998.49, 88061.9, 41312.42, 6380.85,
    33548.75, 72372.59, 33761.53, 63041.41, 84061.46, 85613.17,
    39135.93, 38049.39, 89544.54, 64431.58, 74107.86, 60530.34,
    90308.16, 29373.02, 19126.01, 88645.09, 50333.95, 87705.75,
    18919.36, 75810.31, 72449.89, 94372.48, 54764.66, 71174.39,
    38890.51, 10087.31, 92730.21, 28323.25, 59057.32, 11036.06,
    84050.7, 31796.37, 78285.13, 26750.82, 21864.53, 51679.68,
    26895.06, 18116.83, 51857.61, 56278.29, 12915.69, 25636.76,
    71793.53, 96140.99, 10014.08, 76322.27, 94796.64, 81863.47,
    30829.23, 64957.95, 95335.55, 95373.27, 33997.92, 26247.41,
    16545.39, 32216.81, 51012.52, 92396.85, 51095.97, 25762.13,
    4646.09, 41785.63, 85400.15, 34723.07, 13144.23, 37448.69,
    63142.02, 39007.89, 68962.78, 68941.34, 55490.06, 42962.44,
    45272.01, 30644.33, 57835.39, 91037.03, 14260.41, 41504.76,
    21092.58, 42875.04, 13269, 46009.64, 94295.71, 76197.39,
    93290.98, 47067.85, 60358.81, 48498.97, 10880.63, 24772.68,
    49851.45, 37286.67, 93469.14, 52398.61, 31714.47, 27796.6,
    78754.05, 70246.25, 16502.76, 6445.75, 75470.56, 62041, 16957.68,
    6221.41, 10902.93, 38171.64, 16931.09, 29865.25, 19220.95,
    25717, 18123.18, 47731.37, 77073.7, 2778.71, 52731.08, 88031.91,
    37306.34, 4795.91, 13862.82, 32149.21, 15483.16, 13222.82,
    22130.59, 22638.08, 13141.65, 98156.35, 32701.37, 50693.95,
    68144.25, 9916.91, 11890.26, 5043.97, 92925.39)), class = "data.frame", row.names = c(NA,
    -300L))


目前我正在使用此代码:
groups <- 5
subgroups <- 3

set.seed(30)
result.df <- df %>%
  mutate(group = ntile(x = value, n = groups),
         subgroup = replicate(nrow(df),
                              sample(1:subgroups, 1, replace = T))
         )


但是当我检查子组的分布时,我发现它并没有达到应有的平等。
table(result.df$group, result.df$subgroup)

# Returns:
#    1  2  3
# 1 21 22 17
# 2 22 20 18
# 3 24 17 19
# 4 25 15 20
# 5 23 19 18


由于有 300 行、5 个组和 3 个子组,在这种情况下,我希望为每个组分配 20 行。无论分位数和子组的数量如何,如何以始终如一的“随机”方式实现这一点,但也实现尽可能相等的子组大小?

谢谢你。

最佳答案

这是一种方法:

首先对数据集进行 shuffle,然后创建 group 并按 group 排序。最后创建 subgroup 作为 subgroups 的行号模数:

result.df <- sample_frac(df) %>%
  mutate(group = ntile(x = value, n = groups)) %>%
  arrange(group) %>%
  mutate(subgroup = row_number(group) %% subgroups)

table(result.df$group, result.df$subgroup)
     0  1  2
  1 20 20 20
  2 20 20 20
  3 20 20 20
  4 20 20 20
  5 20 20 20

关于在每个分位数内随机分配相同大小的子分组,我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/57558457/

10-12 19:52