在 SQL 中使用每组不同的样本大小进行无替换采样

本文介绍了在 SQL 中使用每组不同的样本大小进行无替换采样的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

使用提供的表格，我想每天随机抽样用户.要抽样的用户数在 to_sample 列中指定，并由另一个查询填充.在这个例子中，我想在第一天抽取 1 个观察样本，在第二天抽取 2 个观察样本(但这会随着查询的每次执行而改变，所以不要把你的注意力放在这些数字上).我希望分配到不同日期的用户不同(没有重叠分配).

Using the provided table I would like to randomly sample users per day. The number of users to be sampled is specified in the to_sample column and it is filled by another query. In this example I would like to sample 1 observation for day one and 2 observations for day two (but this will change with every execution of the query, so don't set your mind to these numbers). I would like the users assigned to different days to be different (no overlapping assignment).

drop table if exists test;

create table test (
user_id int,
day_of_week int,
to_sample int);

insert into test values (1, 1, 1);
insert into test values (1, 2, 2);
insert into test values (2, 1, 1);
insert into test values (2, 2, 2);
insert into test values (3, 1, 1);
insert into test values (3, 2, 2);
insert into test values (4, 1, 1);
insert into test values (4, 2, 2);
insert into test values (5, 1, 1);
insert into test values (5, 2, 2);
insert into test values (6, 1, 1);
insert into test values (6, 2, 2);

预期结果如下:

create table results (
user_id int,
day_of_week int);

insert into results values (1, 1);
insert into results values (3, 2);
insert into results values (6, 2);

正如我所说，每次采样的用户数量都会有所不同，这应该从测试表中的to_sample列中获取.此外，我将运行它 7 天，这里有 2 天以保持示例简单.

As I said, the number of users to be sampled will be different every time, as should be taken from the to_sample column in the test table. Also I will run it for 7 days, here there are 2 to keep the example simple.

with day_1 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      fromtest t where t.day_of_week = 1
     ) t
where t.seqnum <= (select distinct to_sample fromtest where day_of_week = 1)
)
, day_2 as(
select t.user_id, t.day_of_week
from (select t.*, row_number() over (partition by day_of_week order by randomint(100)) as seqnum
      from test t where t.user_id not in (select distinct user_id from day_1) and t.day_of_week = 2
     ) t
where t.seqnum <= (select distinct to_sample from test where day_of_week = 2)
)
select * from day_1 union all select * from day_2

我尝试根据一些答案创建一个粗暴的解决方案，但仍然有一些重复的用户，即使我从 day_2 中删除了 day_1 中已经使用的 user_id.

I tried creating a brute solution based on some of the answers, but still there are some repeated user, even though I remove the user_id that is already used in day_1 from day_2.

user_id | day_of_week
---------+-------------
       4 |           1
       4 |           2
       1 |           2

推荐答案

如果我没听错，请尝试下一个:(实际上是@BHouse 的改进方案)

If I got you correctly, so try next:(actually its a improved solution of @BHouse)

SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample ORDER BY randomint(max(user_id) + 1)) AS RN
    FROM
        test
    GROUP BY
        user_id,
        day_of_week,
        to_sample
    ORDER BY
        to_sample
    ) AS T
WHERE
    T.RN <= T.to_sample;

所提供数据的输出示例:

第一次执行:

 user_id | day_of_week
---------+-------------
       1 |           1
       3 |           2
       2 |           2

第二次执行:

 user_id | day_of_week
---------+-------------
       1 |           1
       1 |           2
       4 |           2

第三次执行:

 user_id | day_of_week
---------+-------------
       5 |           1
       4 |           2
       2 |           2

因此，可以保证一定的随机性.

So, some randomness is guaranteed.

或者试试这个:

 SELECT
    T.user_id,
    T.day_of_week
FROM (
    SELECT
        user_id,
        day_of_week,
        to_sample,
        row_number() OVER (PARTITION BY to_sample) AS RN,
        randomint(42) AS RANDOM_ORDER /* <<-- here is main problem, number should be >= max(user_id) + 1 */
    FROM
        test
    ORDER BY
        to_sample,
        RANDOM_ORDER
    ) AS T
WHERE
    T.RN <= T.to_sample;

第二个选项更快，但我没有针对危急情况对其进行测试.

A second option is more faster, but I didn't testes it for critical cases.

这篇关于在 SQL 中使用每组不同的样本大小进行无替换采样的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持！