sql - SAS 中的高效滚动总和(窗口聚合)

我有两个表:

tb_payments : contract_id, payment_date, payment_value

tb_reference : contract_id, reference_date

对于 tb_reference 中的每个 (contract_id, reference_date)，我想创建一列 sum_payments 作为 tb_payments 的 90 天滚动总和。我可以用下面的查询来完成这个(非常低效):

%let window=90;
proc sql;
    create index contract_id on tb_payments;
quit;
proc sql;
    create table tb_rolling as
    select a.contract_id,
           a.reference_date,
           (select sum(b.payment_value)
            from tb_payments as b
            where a.contract_id = b.contract_id
                  and a.reference_date - &window. < b.payment_date
                  and b.payment_date <= a.reference_date
           ) as sum_payments
    from tb_reference as a;
quit;

如何使用 proc sql 或 SAS 数据步骤重写它以减少时间复杂度？

编辑更多信息:

我随意选择了90天作为窗口，但是我会针对几个窗口进行计算。可以同时为多个窗口执行计算的解决方案将是理想的

两个表都可以有10+百万行，数据完全任意。我的 SAS 服务器虽然

非常强大

Contract_ids 可以在两个表中重复

(contract_id, reference_date) 和 (contract_id, payment_date) 对是唯一的

使用示例数据进行编辑:

%let seed=1111;
data tb_reference (drop=i);
    call streaminit(&seed.);
    do i = 1 to 10000;
        contract_id = round(rand('UNIFORM')*1000000,1);
        output;
    end;
run;
proc surveyselect data=tb_reference out=tb_payments n=5000 seed=&seed.; run;
data tb_reference(drop=i);
    format reference_date date9.;
    call streaminit(&seed.);
    set tb_reference;
    do i = 1 to 1+round(rand('UNIFORM')*4,1);
        reference_date = '01jan2016'd + round(rand('UNIFORM')*1000,1);
        output;
    end;
run;
proc sort data=tb_reference nodupkey; by contract_id reference_date; run;
data tb_payments(drop=i);
    format payment_date date9. payment_value comma20.2;
    call streaminit(&seed.);
    set tb_payments;
    do i = 1 to 1+round(rand('UNIFORM')*20,1);
        payment_date = '01jan2015'd + round(rand('UNIFORM')*1365,1);
        payment_value = round(rand('UNIFORM')*3333,0.01);
        output;
    end;
run;
proc sort data=tb_payments nodupkey; by contract_id payment_date; run;

更新:
我将我的天真解决方案与 Quentin 和 Tom 的两个建议进行了比较。

合并方法非常快，在 n=10000 时实现了 10 倍以上的加速。它也非常强大，正如汤姆在他的回答中精美地展示的那样。

哈希表非常快，并且实现了超过 500 倍的加速。因为我的数据集很大，这是要走的路，但有一个问题:它们需要适合 RAM。

如果有人需要完整的测试代码，请随时给我发送消息。

最佳答案

这是散列方法的示例。由于您的数据已经排序，我认为哈希方法比 Tom 的合并方法没有太大好处。

大体思路是将所有的支付数据读入一个哈希表中(如果你的真实数据太大，你可能会耗尽内存)，然后通读引用日期的数据集。对于每个引用日期，您查找该 contract_id 的所有付款，并遍历它们，测试付款日期是否在 reference_date 之前
应该明显比您问题中的 SQL 方法快，但可能会输给 MERGE 方法。如果数据没有提前排序，这可能会超过对两个大数据集进行排序然后合并的时间。它可以在同一天处理多次付款。

data want;
  *initialize variables for hash table ;
  call missing(payment_date,payment_value) ;

  *Load a hash table with all of the payment data ;
  if _n_=1 then do ;
    declare hash h(dataset:"tb_payments", multidata: "yes");
    h.defineKey("contract_ID");
    h.defineData("payment_date","payment_value");
    h.defineDone() ;
  end ;

  *read in the reference dates ;
  set tb_reference (keep=contract_id reference_date) ;

  *for each reference date, look up all the payments for that contract_id ;
  *and iterate through them.  If the payment date is < 90 days before reference date then ;
  *increment sum_payments ;

  sum_payments=0 ;
  rc=h.find();
  do while (rc = 0); *found a record;
    if 0<=(reference_date-payment_date)<90 then sum_payments = sum_payments + payment_value ;
    rc=h.find_next();
  end;
run ;

关于sql - SAS 中的高效滚动总和(窗口聚合)，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/52022373/