SAS - Grouping pairs

Problem description

I have two variables ID1 and ID2. They are both the same kinds of identifiers. When they appear in the same row of data it means they are in the same group. I want to make a group identifier for each ID. For example, I have

ID1   ID2
1     4
1     5
2     5
2     6
3     7
4     1
5     1
5     2
6     2
7     3

Then I would want

ID   Group
1     1
2     1
3     2
4     1
5     1
6     1
7     2

Because 1,2,4,5,6 are paired by some combination in the original data they share a group. 3 and 7 are only paired with each other so they are a new group. I want to do this for ~20,000 rows. Every ID that is in ID1 is also in ID2 (more specifically if ID1=1 and ID2=2 for an observation, then there is another observation that is ID1=2 and ID2=1).

I've tried merging them back and forth but that doesn't work. I also tried call symput and trying to make a macro variable for each ID's group and then updating it as I move through rows, but I couldn't get that to work either.

Recommended answer

I have used Haikuo Bian's answer as a starting point to develop a slightly more complex algorithm that seems to work for all the test cases I have tried so far. It could probably be optimised further, but it copes with 20000 rows in under a second on my PC while using only a few MB of memory. The input dataset does not need to be sorted in any particular order, but as written it assumes that every row is present at least once with id1 < id2.
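
The two-phase idea described above can be sketched outside SAS as well. Below is a minimal, illustrative Python re-implementation (not part of the original answer; the `pairs` sample data and all variable names are my own) of the greedy first pass followed by the repair passes:

```python
# Illustrative Python sketch of the two-phase algorithm described above:
# a greedy first pass assigns provisional group numbers, then repair
# passes merge groups until no pair links two different group numbers.
# The sample data mirrors the original test case (rows with id1 < id2).

pairs = [(1, 4), (1, 5), (2, 5), (2, 6), (3, 7)]

group = {}      # id -> provisional group number (the hash table in SAS)
max_group = 0

# Greedy initial pass: reuse the lowest existing group of either id,
# or open a new group if neither id has been seen yet.
for a, b in pairs:
    seen = [group[i] for i in (a, b) if i in group]
    if seen:
        g = min(seen)
    else:
        max_group += 1
        g = max_group
    group[a] = group[b] = g

# Repair passes: whenever a pair spans two groups, move both ids into
# the lower-numbered group; repeat until a full pass changes nothing
# (this is what the hash-iterator loop does in the SAS step).
changed = True
while changed:
    changed = False
    for a, b in pairs:
        ga, gb = group[a], group[b]
        if ga != gb:
            group[a] = group[b] = min(ga, gb)
            changed = True

print(group)    # ids 1, 2, 4, 5, 6 share one group; 3 and 7 share another
```

As in the SAS version, the repair loop is guaranteed to terminate because each change strictly lowers some id's group number.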

Test cases:

/* Original test case */
data have;
input id1 id2;
cards;
1     4
1     5
2     5
2     6
3     7
4     1
5     1
5     2
6     2
7     3
;
run;

/* Revised test case - all in one group with connecting row right at the end */
data have; 
input ID1 ID2; 
/*Make sure each row has id1 < id2*/
if id1 > id2 then do;
t_id2 = id2;
id2   = id1;
id1   = t_id2;
end;
drop t_id2;
cards; 
2 5 
4 8 
2 4 
2 6 
3 7 
4 1 
9 1 
3 2 
6 2 
7 3
;
run;

/*Full scale test case*/
data have;
    call streaminit(1); /* seed once, before the loop */
    do _N_ = 1 to 20000;
        id1 = int(rand('uniform')*100000);
        id2 = int(rand('uniform')*100000);
        if id1 < id2 then output;
        t_id2 = id2;
        id2   = id1;
        id1   = t_id2;
        if id1 < id2 then output;
    end;
    drop t_id2; 
run;

Code:

option fullstimer;

data _null_;
    length id group 8;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('id');        
    rc = h.definedata('group');
    rc = h.definedone();

    array ids(2) id1 id2;
    array groups(2) group1 group2;

    /*Initial group guesses (greedy algorithm)*/
    do until (eof);
        set have(where = (id1 < id2)) end = eof;
        match = 0;
        call missing(min_group);
        do i = 1 to 2;
            rc = h.find(key:ids[i]);
            match + (rc=0);
            if rc = 0 then min_group = min(group,min_group);
        end;
        /*If neither id was in a previously matched group, create a new one*/
        if not(match) then do;
            max_group + 1;
            group = max_group;
        end;
        /*Otherwise, assign both to the matched group with the lowest number*/
        else group = min_group;
        do i = 1 to 2;
            id = ids[i];
            rc = h.replace();
        end;
    end;

    /*We now need to work through the whole dataset multiple times
      to deal with ids that were wrongly assigned to a separate group
      at the end of the initial pass, so load the table into a 
      hash object + iterator*/
    declare hash h2(dataset:'have(where = (id1 < id2))');
    rc = h2.definekey('id1','id2');
    rc = h2.definedata('id1','id2');
    rc = h2.definedone();
    declare hiter hi2('h2');

    change_count = 1;
    do while(change_count > 0);
        change_count = 0;
        rc = hi2.first();
        do while(rc = 0);
            /*Get the current group of each id from 
              the hash we made earlier*/
            do i = 1 to 2;
                rc = h.find(key:ids[i]);
                groups[i] = group;
            end;
            /*If we find a row where the two ids have different groups, 
              move the id in the higher group to the lower group*/
            if groups[1] < groups[2] then do;
                id = ids[2];
                group = groups[1];
                rc = h.replace();
                change_count + 1;           
            end;
            else if groups[2] < groups[1] then do;
                id = ids[1];
                group = groups[2];
                rc = h.replace();       
                change_count + 1;           
            end;
            rc = hi2.next();
        end;
        pass + 1;
        put pass= change_count=; /*For information only :)*/
    end;    

    rc = h.output(dataset:'want');

run;

/*Renumber the groups sequentially*/
proc sort data = want;
    by group id;
run;

data want;
    set want;
    by group;
    if first.group then new_group + 1;
    drop group;
    rename new_group = group;
run;
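
The renumbering step closes any gaps the repair passes leave in the group numbers (e.g. only groups 1 and 5 may survive). A small Python sketch of the same idea, with hypothetical data containing such a gap:

```python
# Hypothetical sketch of the sequential renumbering: groups are numbered
# 1, 2, 3, ... in order of first appearance after sorting by (group, id),
# mirroring the first.group logic in the BY-group DATA step.

raw = {1: 1, 4: 1, 5: 1, 3: 5, 7: 5}     # id -> group, with a gap (no 2..4)

new_number = {}                           # old group -> sequential group
for _id in sorted(raw, key=lambda i: (raw[i], i)):
    g = raw[_id]
    if g not in new_number:               # "first.group" in SAS terms
        new_number[g] = len(new_number) + 1
    raw[_id] = new_number[g]

print(raw)    # {1: 1, 4: 1, 5: 1, 3: 2, 7: 2}
```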

/*Summarise by # of ids per group*/
proc sql;
    select a.group, count(id) as FREQ 
        from want a
        group by a.group
        order by freq desc;
quit;   

Interestingly, the suggested optimisation of not checking the group of id2 during the initial pass if id1 is already matched actually slows things down a little in this extended algorithm, because it means that more work has to be done in the subsequent passes if id2 is in a lower numbered group. E.g. output from a trial run I did earlier:

With the "optimisation":

 pass=0 change_count=4696
 pass=1 change_count=204
 pass=2 change_count=23
 pass=3 change_count=9
 pass=4 change_count=2
 pass=5 change_count=1
 pass=6 change_count=0

 NOTE: DATA statement used (Total process time):
       real time           0.19 seconds
       user cpu time       0.17 seconds
       system cpu time     0.04 seconds
       memory              9088.76k
       OS Memory           35192.00k

Without it:

 pass=0 change_count=4637
 pass=1 change_count=182
 pass=2 change_count=23
 pass=3 change_count=9
 pass=4 change_count=2
 pass=5 change_count=1
 pass=6 change_count=0

 NOTE: DATA statement used (Total process time):
       real time           0.18 seconds
       user cpu time       0.16 seconds
       system cpu time     0.04 seconds
