我有两个表:
对于 tb_reference 中的每个 (contract_id, reference_date),我想创建一列 sum_payments 作为 tb_payments 的 90 天滚动总和。我可以用下面的查询来完成这个(非常低效):
%let window=90;
proc sql;
create index contract_id on tb_payments;
quit;
proc sql;
create table tb_rolling as
select a.contract_id,
a.reference_date,
(select sum(b.payment_value)
from tb_payments as b
where a.contract_id = b.contract_id
and a.reference_date - &window. < b.payment_date
and b.payment_date <= a.reference_date
) as sum_payments
from tb_reference as a;
quit;
如何使用 proc sql 或 SAS 数据步骤重写它以减少时间复杂度?
编辑更多信息:
使用示例数据进行编辑:
%let seed=1111;
data tb_reference (drop=i);
call streaminit(&seed.);
do i = 1 to 10000;
contract_id = round(rand('UNIFORM')*1000000,1);
output;
end;
run;
proc surveyselect data=tb_reference out=tb_payments n=5000 seed=&seed.; run;
data tb_reference(drop=i);
format reference_date date9.;
call streaminit(&seed.);
set tb_reference;
do i = 1 to 1+round(rand('UNIFORM')*4,1);
reference_date = '01jan2016'd + round(rand('UNIFORM')*1000,1);
output;
end;
run;
proc sort data=tb_reference nodupkey; by contract_id reference_date; run;
data tb_payments(drop=i);
format payment_date date9. payment_value comma20.2;
call streaminit(&seed.);
set tb_payments;
do i = 1 to 1+round(rand('UNIFORM')*20,1);
payment_date = '01jan2015'd + round(rand('UNIFORM')*1365,1);
payment_value = round(rand('UNIFORM')*3333,0.01);
output;
end;
run;
proc sort data=tb_payments nodupkey; by contract_id payment_date; run;
更新:
我将我的天真解决方案与 Quentin 和 Tom 的两个建议进行了比较。
如果有人需要完整的测试代码,请随时给我发送消息。
最佳答案
这是散列方法的示例。由于您的数据已经排序,我认为哈希方法比 Tom 的合并方法没有太大好处。
大体思路是将所有的支付数据读入一个哈希表中(如果你的真实数据太大,你可能会耗尽内存),然后通读引用日期的数据集。对于每个引用日期,您查找该 contract_id 的所有付款,并遍历它们,测试付款日期是否在 reference_date 之前
应该明显比您问题中的 SQL 方法快,但可能会输给 MERGE 方法。如果数据没有提前排序,这可能会超过对两个大数据集进行排序然后合并的时间。它可以在同一天处理多次付款。
data want;
*initialize variables for hash table ;
call missing(payment_date,payment_value) ;
*Load a hash table with all of the payment data ;
if _n_=1 then do ;
declare hash h(dataset:"tb_payments", multidata: "yes");
h.defineKey("contract_ID");
h.defineData("payment_date","payment_value");
h.defineDone() ;
end ;
*read in the reference dates ;
set tb_reference (keep=contract_id reference_date) ;
*for each reference date, look up all the payments for that contract_id ;
*and iterate through them. If the payment date is < 90 days before reference date then ;
*increment sum_payments ;
sum_payments=0 ;
rc=h.find();
do while (rc = 0); *found a record;
if 0<=(reference_date-payment_date)<90 then sum_payments = sum_payments + payment_value ;
rc=h.find_next();
end;
run ;
关于sql - SAS 中的高效滚动总和(窗口聚合),我们在Stack Overflow上找到一个类似的问题:https://stackoverflow.com/questions/52022373/