Random sampling without replacement in longitudinal data

Problem description

My data is longitudinal.

VISIT ID   VAR1
1     001  ...
1     002  ...
1     003  ...
1     004  ...
...
2     001  ...
2     002  ...
2     003  ...
2     004  ...

Our end goal is to pick 10% of the IDs at each visit to run a test. I tried PROC SURVEYSELECT with simple random sampling (SRS) without replacement, using VISIT as the strata variable, but the final sample contains duplicated IDs. For example, ID=001 might be selected in both VISIT=1 and VISIT=2.

Is there any way to do this with SURVEYSELECT or another procedure (R is also fine)? Thanks a lot.
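
For reference, the stratified attempt described above would look roughly like the sketch below. The input dataset name have, the output name srs_sample and the seed are assumptions for illustration and are not taken from the question. Because each stratum is sampled independently, nothing prevents the same ID from being drawn in more than one visit.

/*Stratified SRS without replacement: 10% of the rows within each visit.
  The input is assumed to be sorted by visit, as STRATA requires.
  Dataset names and the seed are hypothetical.*/
proc surveyselect data = have method = srs samprate = 0.1 seed = 1 out = srs_sample;
  strata visit;
run;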

Recommended answer

This is possible with some fairly creative data step programming. The code below uses a greedy approach: it processes the visits in order and, within each visit, samples only from ids that have not been sampled at an earlier visit. If more than 90% of the ids for a visit have already been sampled, fewer than 10% are output. Note that the 10% is taken from the ids still available at each visit (available_per_visit * samprate, rounded), not from all ids at that visit, so visits with many previously sampled ids contribute less than 10% of their ids. In the extreme case, when every id for a visit has already been sampled, no rows are output for that visit.

/*Create some test data*/
data test_data;
  call streaminit(1);
  do visit = 1 to 1000;
    do id = 1 to ceil(rand('uniform')*1000);
      output;
    end;
  end;
run;


data sample;
  /*Create a hash object to keep track of unique IDs not sampled yet*/
  if 0 then set test_data;
  /*A seed of 0 is not reproducible across runs; use a positive seed for a repeatable sample*/
  call streaminit(0);
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('available');
    rc = h.definedone();
  end;
  /*Find out how many not-previously-sampled ids there are for the current visit*/
  do ids_per_visit = 1 by 1 until(last.visit);
    set test_data;
    by visit;
    if h.find() ne 0 then do;
      available = 1;
      rc = h.add();
    end;
    available_per_visit = sum(available_per_visit,available);
  end;
  /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
  samprate = 0.1;
  number_to_sample = round(available_per_visit * samprate,1);
  do _n_ = 1 to ids_per_visit;
    set test_data;
    if available_per_visit > 0 then do;
      rc = h.find();
      if available = 1 then do;
        if rand('uniform') < number_to_sample / available_per_visit then do;
          available = 0;
          rc = h.replace();
          samples_per_visit = sum(samples_per_visit,1);
          output;
          number_to_sample = number_to_sample - 1;
        end;
        available_per_visit = available_per_visit - 1;
      end;
    end;
  end;
run;
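
The shortfall mentioned in the answer (visits that yield fewer than 10% of their ids, or none at all) can be inspected by comparing the number of ids in each visit with the number sampled from it. The following is a minimal sketch that assumes the test_data and sample datasets created above; the output name visit_summary and the column names n_ids and n_sampled are introduced here for illustration only.

/*Per-visit summary: ids in the visit vs. ids sampled from it.
  Visits with no sampled rows appear with n_sampled = 0.*/
proc sql;
  create table visit_summary as
  select t.visit,
         t.n_ids,
         coalesce(s.n_sampled, 0) as n_sampled
  from (select visit, count(*) as n_ids from test_data group by visit) as t
  left join
       (select visit, count(*) as n_sampled from sample group by visit) as s
  on t.visit = s.visit
  order by t.visit;
quit;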

/*Check that there are no duplicate IDs: the log should report that
  0 observations with duplicate key values were deleted*/
proc sort data = sample out = sample_dedup nodupkey;
by id;
run;
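
As an alternative to reading the log, the same check can be sketched as a single query, again assuming the sample dataset produced above. The column names n_rows and n_unique_ids are illustrative. If no ID was sampled more than once, the two counts are equal.

/*Compare the row count with the number of distinct ids in the sample*/
proc sql;
  select count(*) as n_rows,
         count(distinct id) as n_unique_ids
  from sample;
quit;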


05-28 18:11